Friday, July 20, 2007

Spellchecking and text matching

As a web-based system, PHP primarily works with text in various languages: message boards, IRC chats, web mail, etc. All of these lend themselves well to spellchecking, so it should be no surprise that PHP has an extension specifically for it. The extension, Pspell, is based on the Aspell library (don’t ask; the names of spelling libraries have caused much confusion in the past!), so you need to download and install that before you start.

Once you have PHP configured with Pspell support, you can start playing around with it. Although the extension has quite a few functions (see the PHP manual for more information), you really only need three: pspell_new(), pspell_check(), and pspell_suggest().

int pspell_new ( string language [, string spelling [, string jargon [, string encoding [, int mode]]]])
bool pspell_check ( int dictionary_link, string word)
array pspell_suggest ( int dictionary_link, string word)

Here’s a basic example of pspell_new() and pspell_check() in action. It creates a new English dictionary and checks if the word “Baboon” exists:

<?php
$pspell = pspell_new("en");

if (pspell_check($pspell, "Baboon")) {
    echo "You can spell!";
} else {
    echo "Back to school for you...";
}
?>

Note that you need to store the return value from pspell_new() in order to use it in later functions. The pspell_check() function takes the opened Pspell dictionary as its first parameter and the word to check as its second parameter. It then returns true if the word was found or false otherwise.
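
One thing the script above glosses over is that pspell_new() returns false if the dictionary cannot be opened, for example when no English word list is installed. Here is a minimal sketch of a guard around it. The open_dictionary() helper is my own invention for illustration and is not part of the Pspell extension:

```php
<?php
// Hypothetical helper (not part of Pspell): open a dictionary,
// or return null when spellchecking is unavailable.
function open_dictionary(string $language)
{
    if (!function_exists('pspell_new')) {
        return null; // PHP was built without Pspell support
    }

    // pspell_new() returns false on failure; @ hides the warning it raises
    $dictionary = @pspell_new($language);

    return $dictionary === false ? null : $dictionary;
}

$pspell = open_dictionary("en");

if ($pspell === null) {
    echo "Spellchecking is not available.\n";
} elseif (pspell_check($pspell, "Baboon")) {
    echo "You can spell!\n";
} else {
    echo "Back to school for you...\n";
}
?>
```

Returning null rather than false is just a design choice here: it keeps the "no dictionary" case clearly separate from the boolean results of pspell_check().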

If we work in pspell_suggest() we can also have Pspell recommend alternatives to wrongly spelt words. You use pspell_suggest() in the same way as pspell_check(), except that it returns an array of suggestions rather than a boolean. Let’s extend that previous script a little:

<?php
$pspell = pspell_new("en");
$word = "babooon";

if (pspell_check($pspell, $word)) {
    echo "You can spell!";
} else {
    $suggestions = pspell_suggest($pspell, $word);
    if (count($suggestions)) {
        echo "I didn't understand '$word'. You probably meant one of these:\n";

        foreach($suggestions as $suggestion) {
            echo " $suggestion\n";
        }
    } else {
        echo "Back to school for you...";
    }
}
?>

Note that this time I’ve used “babooon” as the test word, which shouldn’t exist in the dictionary. This time, when pspell_check() fails, we run the word through pspell_suggest() and, if there are suggestions, print them out. Easy, huh?
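
Depending on the dictionary, pspell_suggest() can hand back a long list, so you may want to show only the first few entries. The sketch below assumes the closest matches come first in the array. The format_suggestions() helper and the sample suggestion list are made up for illustration, so the snippet runs even without Pspell installed:

```php
<?php
// Hypothetical helper: turn a list of suggestions into a short message.
function format_suggestions(string $word, array $suggestions, int $limit = 3): string
{
    if (count($suggestions) === 0) {
        return "No suggestions for '$word'.";
    }

    // Keep only the first $limit suggestions
    $top = array_slice($suggestions, 0, $limit);

    return "Did you mean " . implode(", ", $top) . "?";
}

// Hard-coded sample data; in the script above this array would
// come from pspell_suggest($pspell, $word) instead.
$sample = ["baboon", "baboons", "bassoon", "balloon"];

echo format_suggestions("babooon", $sample), "\n";
// prints: Did you mean baboon, baboons, bassoon?
?>
```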

Of course, the problem here is that you can only check one word at a time, which is pretty useless. What we really want to do is have our script take a string of text and break it up into individual words, checking each of them. Fortunately it’s not much harder than the previous script!

<?php
$pspell = pspell_new("en");
$sentence = "The quik brown fox jumpd over the lazyyy dog";
$words = explode(" ", $sentence);

foreach($words as $word) {
    if (pspell_check($pspell, $word)) {
        // this word is fine; print as-is
        echo $word, " ";
    } else {
        // this word is bad; look for suggestions
        $suggestions = pspell_suggest($pspell, $word);

        if (count($suggestions)) {
            // we have suggestions for this word; print them out
            echo "$word [", implode(", ", $suggestions), "] ";
        } else {
            // no suggestions; just print the word
            echo $word, " ";
        }
    }
}
?>
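
Splitting on spaces works for the test sentence, but real text contains punctuation, and a token such as "dog." would be passed to the dictionary with the full stop attached. Here is a rough sketch of a more forgiving approach that pulls out runs of letters with a regular expression. The extract_words() and find_misspellings() helpers, and the tiny stand-in word list, are assumptions of mine for illustration, not part of Pspell:

```php
<?php
// Hypothetical helper: extract words, ignoring punctuation and digits.
function extract_words(string $text): array
{
    // \p{L}+ matches a run of letters; the optional group keeps
    // apostrophes inside a word, as in "don't"
    preg_match_all("/\p{L}+(?:'\p{L}+)*/u", $text, $matches);

    return $matches[0];
}

// Hypothetical helper: return every word that $isCorrect rejects.
function find_misspellings(string $text, callable $isCorrect): array
{
    $bad = [];

    foreach (extract_words($text) as $word) {
        if (!$isCorrect($word)) {
            $bad[] = $word;
        }
    }

    return $bad;
}

// Stand-in checker with a tiny word list, so the sketch runs without Pspell
$known = ["the", "quick", "brown", "fox"];
$check = function (string $word) use ($known): bool {
    return in_array(strtolower($word), $known, true);
};

print_r(find_misspellings("The quik, brown fox!", $check));
// the resulting array contains a single entry: "quik"
?>
```

To use the real dictionary, pass a closure that wraps pspell_check() instead, such as function ($word) use ($pspell) { return pspell_check($pspell, $word); }.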
