Spellcheck and autocorrect are two of the coolest features of search engines. They are not only cool they are also key for a smooth search experience. For the context of this post, I am going to stick with English language searches. Peter Norvig, a Google scientist has broken this subject down beautifully here to show the intuition behind an industry grade spellchecker with some code. Several other follow up posts on this subject made it more accessible, especially for people like me who doesn’t have any formal education on the subject matter of Natural language processing (NLP) or Search engines.
But every time I look at Google’s spell checker and autocorrect I was not only in awe of it but also realized that spell check requires covering several use cases and that those weren't covered by Peter (or maybe he left it to the readers to figure out). I just wanted to capture a couple of those outlying cases in this post, In fact, with my limited abilities, I am only speculating at best how Google might have handled those cases. Links to the accompanying code is included at the end of the post.
First, let's look at few cases the post covers.
Single word typos
Here is the simplest case, where the word socket is misspelled sucket.This is handled by the simple notion of “edits” based on Damerau–Levenshtein distance. Intuition behind it is very simple, if any word in the search query is not a valid word as per the “source-of-truth” (assume an English dictionary, for now, more on this later) “edit” the letters in the word until you get a valid word (one might say it's an oversimplification and I would agree).
Let’s break it down: “Edit” is nothing but a fancy way to say changing the letters in the original word. Now if you wonder what constitutes a change: It could be an Insert, Delete, Substitute or a Transpose of one or more letters in the original word. If you change one character then it is said that the edit distance between the original word and changed word is 1 and so on. In this case, if you generate substitution edits on “sucket” with all the 26 English alphabets a-z, we would end up substituting each of the 26 characters to each of 6 characters positions of the word “sucket” i.e. 26 * 6 permutations.
Here the possible list of substitution edits, with a lot of unreal words and with possible duplicates.
Out of all permutations, one would be replacing the 2nd letter of the sucket with ‘O’ from the alphabets which would yield socket. If you looked up the dictionary socket would be a valid a word. Now how many changes you had to do to arrive at socket from sucket is 1 substitution, then edit distance = 1. (Mays et al (1991) observed that 80% of misspellings derived from single instances of insertion, deletion, or substitution). Also as you can notice Google spell checker is confident and it uses “showing results” instead of “did you mean” with an option to go back to original query. This is called immediate refinement as the autocorrect needed only a single edit and “sucket” probably appeared to Google as an infrequently searched phrase based on its query log and query ranking. (There is a lot more to this )
Multi-word expressions and typos
Before getting into multi-word typos and spell check let's examine a couple of query rewriting techniques and understand them to see how they can be important for spellchecking and autocorrection.
Multiword expression detection or Collocation detection
The English language has special words called compounds words, i.e. words made of two or more individual words like waterproof or weatherproof that are to be spelled without a whitespace in between. There is another type in this category of words like New York for instance has to be spelled with a whitespace in between.
Let’s take weatherproof, if a user inadvertently adds a whitespace between words weather and proof Google spell checker considers it as a typo comes back with a suggestion.
Let’s take New york, when the user types it, it isn’t considered as a typo but it is recognized as a single word and relevant results are shown for New york city as a whole in spite of the white space.
This technique is called collocation detection or identifying multi-word expressions. To achieve this Google maintains a list of possible co-occurring word pairs to lookup to. A dictionary or thesaurus cannot provide this, so Google should have amassed these word pairs from millions of websites they get to crawl. Technically these pairs are called bigrams. If the search phrase contains possible bigrams it would be considered as a single word instead of 2 different words. In fact Google has not only bigrams but it maintains lists of 1-grams all the way up to n-grams along with the frequency of occurrences. Peter has been kind enough to share some sample *-gram files here. So a bigram frequency corpus would look like below
Another nuance in English multi-word expressions is need for word segmentation. In natural language searches users tend to inadvertently miss adding a white space between legitimate words. Like for instance Google automatically recognizes machine and gun are two different words when user types machinegun. Same is the case with the phrase fashionsneakers. This is case of word segmentation – 2 or more legitimate words but not necessarily bigrams or n-grams are to be recognised , segmented and searched appropriately for accurate results.
Let’s think about the logic behind this. Naively we could take one character at a time from the entire phrase and check against a “source-of-truth” to see if it is a legitimate English word. If yes, segment it out and continue the process from that point on until the end of the phrase. Let’s take the example of machinegun and it should look something like below.
Similar to previous case the “source-of-truth” cannot be a mere english dictionary as natural language searches can include expressions like slangs or acronyms in the search phrase which might have to be segmented. Consider the following case:
A dictionary may not recognise GST as a valid word, so Google uses its overwhelmingly large set of words amassed from crawling millions of websites over the period of years. Technically it is a word frequency table like the once used for n-grams except here it will be just single words and frequencies. The frequency table would look something like this
So the logic is pretty simple, split the words with different possible prefix and suffix permutations. For instance, machinegun would probably split into several instances as below
Check if all prefix and suffix parts of a split for an instance exist in the table. Say, “M” and “achinegun” and get the probability of both parts and multiply their probabilities. Here is a refresher on how to get probability from a frequency table. Once we got this product of probabilities for all splits, get the max of all the probabilities obtained and that will be the most likely segmentation for the given phrase.
The algorithm would return Machine + Gun as the most likely word segmentation because p(‘Machine’) x p(‘gun’) would have been the maximum when compared to the product of the probability of all the other split instances. (p(word) is the probability for a word occurring in frequency table). Now you know why data science and machine learning needs statistics and probability as an essential pre-requisite, in one of the future posts will show you why linear algebra and vector calculus is also an essential prerequisite as well.
Multi-word spelling mistake with need for segmentation
All is well till this point, but things can get complicated if rethink a key assumption. Word segmentation assumes possible word segments are spelled correctly and hence checks individual parts of the phrase directly against the frequency table. What if a possible multi-word like fashnaasneekers is spelled incorrectly and it needs a word segmentation for extracting right results? This case is not covered by Peter and I wanted to take a stab at it. The idea is very simple: combine spell check logic and word segmentation logic.
Here is the code and link to the corpus file used in the code. As you can see I tried the code with victirusecreet to see if the logic can correct and segment to victoria secret and runningsooes to running shoes. Again, accuracy entirely depends on the word density of the corpus file you use.