Spelling Correction

Peter Norvig’s spelling corrector is an interesting example of using some statistical techniques for the very practical purpose of spelling correction, inspired by a conversation on the Google ‘Did You Mean’ spelling suggestion functionality. There’s an excellent explanation of the background in his article, so I’ll skim over the ideas and how you might implement them in PHP.

Read More

Simple Collaborative Filtering

Collaborative filtering is a way of trying to present more relevant information to users, by choosing what to show based on how other users have acted. The “You might also like” boxes on Amazon and similar are probably the most popular example of this kind of technology, but it’s applicable in recommending all kind of content outside of ecommerce, particular anything that often has user ratings, such as video or games.

Read More

Shingling - Near Duplicate Detection

Determining whether two documents are exactly the same is pretty easy, just use some suitably sized hash and look for a match. A document will generally only hash to the same as another though if they are identical - the smallest change, or the same content on another site with a different header and footer, for example, will cause the hash to be quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one process for relatively cheaply detecting these duplicates.

Read More

Text Classification (And Twitter)

Classification techniques are used for spam filters, author identification, intrusion detection and a host of other applications. They can be used to help organise data into a structure, or to add tags to allow users to find documents. While the latest classification algorithms are at the cutting edge of machine learning, there are still thousands of systems using simpler algorithms to great effect.

Read More

Tries And Wildcards

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren’t sure whether the title you’re trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.

Read More

Simple Search: Phrases

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even ‘natural’ searches like that, assuming that a search like php information retrieval is “look for any document containing the words php AND information AND retrieval”, but sometimes the user is searching for that specific phrase in that specific order.

Read More

Simple Search: The Vector Space Model

One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there’s no real way of saying which are ‘better’. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results by the ones that were the best match for the query. This is the idea that forms the basis for the vector space model.

Read More

Tokenisation

Taking a string and separating it into tokens is one of those smaller problems in search that seems initially simple - split on spaces - but can quickly become overwhelmed with edge cases. Ignoring the problem of other languages, some of which don’t even necessarily use a space, the exceptions tend to fall into two categories, punctuation related and normalisation.

Read More

Block Based External Sort

Memory isn’t something that we have to worry about very much in PHP, as memory management is handled for us by the Zend engine. However, when it does become an issue it becomes a very big one - most PHP script are limited as to how much memory they can consume. While this makes a lot of sense for web processes, and is in general not a problem, when you have a lot of data to deal with it can make life difficult.

Read More