# Identifying Misspelled Words in your Dataset with Hunspell

## Brandon Rozek

January 22, 2018

I assume in this article that you have hunspell and its Python integration installed. If not, please refer to the article mentioned above and follow the prerequisite steps.

### Background Information

Misspelled words are common when dealing with survey data or data where humans type in the responses manually. In the Dress Attributes Dataset this is apparent when looking at the sleeve lengths of the different dresses.

    dresses_data['SleeveLength'].value_counts()

Ouch, so many misspelled words. This is when my brain is racking up all the ways I can automate this problem away. Hence my stumbling upon Markus' post.

### Automagically Correcting Data

First, I decided to completely ignore what Markus warns in his post and automatically correct all the words in that column. To begin the code, let's import and create an instance of the spellchecker:

    from hunspell import HunSpell
    spellchecker = HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')


I modified his correct_words function so that it only corrects one word and so I can apply it along the SleeveLength column.

    def correct_word(checker, word, add_to_dict=()):
        "Takes in a hunspell object and a word and corrects the word if needed"
        # Add custom words to the dictionary
        for new_word in add_to_dict:
            checker.add(new_word)

        corrected = ""
        # Check to see if it's a string
        if isinstance(word, str):
            # Check the spelling
            ok = checker.spell(word)
            if not ok:
                # Grab suggestions for misspelled word
                suggestions = checker.suggest(word)
                if suggestions:
                    # Grab the best suggestion
                    best = suggestions[0]
                    corrected = best
                else:
                    # There are no suggestions for misspelled word, return the original
                    corrected = word
            else:
                # Word is spelled correctly
                corrected = word
        else:
            # Not a string. Return original
            corrected = word
        return corrected
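
Since the first suggestion is not always the right one, it can help to preview which values would change before overwriting anything. Here is a minimal sketch of that idea. The `StubChecker` class and the `preview_corrections` helper are hypothetical names I made up for illustration, and the toy words are not taken from the dataset. The stub only mimics the `spell` and `suggest` methods used above so the example runs without a dictionary installed.

```python
class StubChecker:
    """Hypothetical stand-in that mimics HunSpell's spell() and suggest()."""

    def __init__(self, known_words, suggestions):
        self.known_words = set(known_words)
        self.suggestions = dict(suggestions)

    def spell(self, word):
        # True when the word is in the toy dictionary
        return word in self.known_words

    def suggest(self, word):
        # Candidate corrections, best first; empty list if there are none
        return self.suggestions.get(word, [])


def preview_corrections(checker, values):
    """Map each misspelled string in `values` to its top suggestion."""
    changes = {}
    for value in set(values):
        # Skip missing values and other non-strings
        if not isinstance(value, str):
            continue
        # Skip words the checker already accepts
        if checker.spell(value):
            continue
        suggestions = checker.suggest(value)
        if suggestions:
            changes[value] = suggestions[0]
    return changes


# Toy data, made up for illustration
checker = StubChecker(
    known_words={"sleeveless", "short", "full"},
    suggestions={"sleevless": ["sleeveless"], "shrot": ["short", "shot"]},
)
column = ["sleeveless", "sleevless", "shrot", "full", None, "sleevless"]
changes = preview_corrections(checker, column)
print(changes)
```

With the real setup, the same helper should work by passing `spellchecker` and something like `dresses_data['SleeveLength'].unique()`, assuming the checker exposes `spell` and `suggest` as used in `correct_word`.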


Now let’s apply the function over the SleeveLength column of the dataset:

    dresses_data['SleeveLength'] = dresses_data['SleeveLength'].apply(
        lambda x: correct_word(spellchecker, x))


Doing so creates the following series: