Brandon Rozek

Thoughts on Web Development, Statistics, and Linux

Posts

Identifying Misspelled Words in your Dataset with Hunspell

This article is based on one written by Markus Konrad: https://datascience.blog.wzb.eu/2016/07/13/autocorrecting-misspelled-words-in-python-using-hunspell/

I assume in this article that you have hunspell and its Python integration installed. If not, please refer to the article mentioned above and follow the prerequisite steps.

This article is inspired by the need to correct misspelled words in the Dress Attributes Dataset. I'll share with you my initial pitfall, and what I ended up doing instead.

Background Information

Misspelled words are common when dealing with survey data or data where humans type in the responses manually. In the Dress Attributes Dataset this is apparent when looking at the sleeve lengths of the different dresses.

dresses_data['SleeveLength'].value_counts()
Word Frequency
sleevless 223
full 97
short 96
halfsleeve 35
threequarter 17
thressqatar 10
sleeveless 5
sleeevless 3
capsleeves 3
cap-sleeves 2
half 1
Petal 1
urndowncollor 1
turndowncollor 1
sleveless 1
butterfly 1
threequater 1

Ouch, so many misspelled words. This is when my brain starts racking up all the ways I could automate the problem away, which is how I stumbled upon Markus' post.

Automagically Correcting Data

First, I decided to completely ignore what Markus warns about in his post and automatically correct all the words in that column.

To begin the code, let's import and create an instance of the spellchecker:

from hunspell import HunSpell
spellchecker = HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')

I modified his correct_words function so that it only corrects one word and so I can apply it along the SleeveLength column.

def correct_word(checker, word, add_to_dict=()):
    "Takes in a hunspell object and a word and corrects the word if needed"
    # Add custom words to the dictionary
    for w in add_to_dict:
        checker.add(w)

    # Not a string (for example a missing value). Return the original
    if not isinstance(word, str):
        return word
    # Word is spelled correctly
    if checker.spell(word):
        return word
    # Grab suggestions for the misspelled word
    suggestions = checker.suggest(word)
    # Return the best (first) suggestion, or the original if there are none
    return suggestions[0] if suggestions else word

Now let's apply the function over the SleeveLength column of the dataset:

dresses_data['SleeveLength'] = dresses_data['SleeveLength'].apply(
    lambda x: correct_word(spellchecker, x))

Doing so creates the following series:

Word Frequency
sleeveless 232
full 97
short 96
half sleeve 35
three quarter 17
throatiness 10
cap sleeves 3
cap-sleeves 2
Petal 1
butterfly 1
turndowncollor 1
half 1
landownership 1
forequarter 1

As you might be able to tell, this process didn't go as intended. landownership isn't even a length of a sleeve!

Reporting Misspelled Items and Allowing User Intervention

This is when I have to remember that technology isn't perfect. Instead, we should rely on ourselves to decide how each word should be spelled.

Keeping that in mind, I modified the function again to take in a list of the data and return a dictionary whose keys are the misspelled words and whose values are the lists of suggestions for each.

def list_word_suggestions(checker, words, echo=True, add_to_dict=()):
    "Takes in a list of words and returns a dictionary with misspelled words as keys and suggestions as a list. Also prints it out"
    # Add custom words to the dictionary
    for w in add_to_dict:
        checker.add(w)

    suggestions = {}
    for word in words:
        # Skip non-string values such as missing entries
        if isinstance(word, str):
            ok = checker.spell(word)
            # Only look up each misspelled word once
            if not ok and word not in suggestions:
                suggestions[word] = checker.suggest(word)
                if not suggestions[word] and echo:
                    print(word + ": No suggestions")
                elif echo:
                    print(word + ": " + "[", ", ".join(repr(i) for i in suggestions[word]), "]")
    return suggestions

With that, I can use the function on my data. To do so, I convert the pandas values to a list and pass it to the function:

s = list_word_suggestions(spellchecker, dresses_data['SleeveLength'].values.tolist())

These are the suggestions it produces:

sleevless: [ 'sleeveless', 'sleepless', 'sleeves', 'sleekness', 'sleeve', 'lossless' ]
threequarter: [ 'three quarter', 'three-quarter', 'forequarter' ]
halfsleeve: [ 'half sleeve', 'half-sleeve', 'sleeveless' ]
turndowncollor: No suggestions
threequater: [ 'forequarter' ]
capsleeves: [ 'cap sleeves', 'cap-sleeves', 'capsules' ]
sleeevless: [ 'sleeveless', 'sleepless', 'sleeves', 'sleekness', 'sleeve' ]
urndowncollor: [ 'landownership' ]
thressqatar: [ 'throatiness' ]
sleveless: [ 'sleeveless', 'levelness', 'valveless', 'loveless', 'sleepless' ]

From here, you can analyze the output and do the replacements yourself:

dresses_data['SleeveLength'] = dresses_data['SleeveLength'].replace('sleevless', 'sleeveless')
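
When several spellings need fixing, the reviewed corrections can be collected in one dictionary and applied in a single pass. Here is a minimal sketch on a plain Python list. The `corrections` mapping and the `apply_corrections` helper are my own illustration (they are not part of hunspell), and the choices in the mapping are assumptions about what each misspelling was meant to be.

```python
# Hand-reviewed corrections: misspelling -> intended label.
# These choices are assumptions made for illustration.
corrections = {
    "sleevless": "sleeveless",
    "sleeevless": "sleeveless",
    "sleveless": "sleeveless",
    "threequater": "threequarter",
    "thressqatar": "threequarter",
}

def apply_corrections(values, corrections):
    """Return a new list with every value found in `corrections` replaced."""
    # Non-string values (such as missing entries) pass through unchanged
    return [corrections.get(v, v) if isinstance(v, str) else v for v in values]

sleeves = ["sleevless", "full", "thressqatar", None, "sleeevless"]
cleaned = apply_corrections(sleeves, corrections)
print(cleaned)  # ['sleeveless', 'full', 'threequarter', None, 'sleeveless']
```

With pandas, the same dictionary can be passed directly to the column's replace method, as in dresses_data['SleeveLength'].replace(corrections).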

What's the Benefit?

This is where you ask "What's the difference if it doesn't automatically fix my data?"

When you have a large dataset, it can be hard to spot every misspelled item by eye. This method gives you a list of all the misspelled items along with candidate fixes, so you can deal with them in a systematic way.
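
One way to make this systematic is to split the suggestions into words with an obvious fix and words that need a human decision. The sketch below assumes you already know which labels are valid for the column. Both `valid_labels` and the `triage` helper are hypothetical names of my own. It works on a dictionary shaped like the one list_word_suggestions returns, and the sample entries are copied from the suggestions printed earlier.

```python
def triage(suggestions, valid_labels):
    """Split misspelled words into (auto, manual).

    auto:   word -> the single suggestion that is a known valid label
    manual: words with zero or several valid candidates
    """
    auto, manual = {}, []
    for word, candidates in suggestions.items():
        matches = [c for c in candidates if c in valid_labels]
        if len(matches) == 1:
            auto[word] = matches[0]
        else:
            manual.append(word)
    return auto, manual

# Sample entries copied from the suggestions printed earlier
suggestions = {
    "sleevless": ["sleeveless", "sleepless", "sleeves", "sleekness", "sleeve", "lossless"],
    "threequater": ["forequarter"],
    "urndowncollor": ["landownership"],
    "turndowncollor": [],
}
# Hypothetical set of labels considered valid for this column
valid_labels = {"sleeveless", "full", "short", "half sleeve", "three quarter"}

auto, manual = triage(suggestions, valid_labels)
print(auto)    # {'sleevless': 'sleeveless'}
print(manual)  # ['threequater', 'urndowncollor', 'turndowncollor']
```

Anything that lands in the manual list still needs a person to look at it, which is the whole point of not trusting the first suggestion blindly.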

Obtaining Command Line Input in Java

To obtain console input for your program, you can use the Scanner class.

First, import the class:

import java.util.Scanner;

Then create a variable to hold the Scanner object:

Scanner input;
input = new Scanner(System.in);

The argument in the parentheses, System.in, binds the Scanner to standard input, which by default is the console.

The new variable input can now read from the console. To do so, use any of the following methods:

Method        What it Returns
next()        The next whitespace-separated token, as a String
nextInt()     The next token parsed as an int
nextDouble()  The next token parsed as a double
nextFloat()   The next token parsed as a float
nextLine()    The rest of the current line, up to the next newline character
hasNext()     true if there is another token in the input
close()       Nothing; closes the Scanner and its underlying input

Here is an example program where we get the user's first name:

import java.util.Scanner;

public class GetName {
  public static void main(String[] args) {
    Scanner input = new Scanner(System.in);
    System.out.print("Please enter your name: ");
    String firstName = input.next();
    System.out.println("Your first name is " + firstName); 
  }
}

Escape Sequences in Java

Sometimes you want to format your output. This is a quick cheatsheet of the different escape sequences:

Character      Escape Sequence
Newline        \n
Tab            \t
Backspace      \b
Double Quote   \"
Single Quote   \'
Backslash      \\

Java Swing Components

This post will, over time, serve as a reference for myself and others on the different UI components available in the Swing library. It assumes a general familiarity with setting up a basic Swing application and focuses only on the individual components.

Using System Themes In Java Swing

The default theme for Java Swing components is a cross-platform theme called "Metal". I use the Adapta theme for GTK on Linux, and Metal does not match the look of my other GUI applications at all. So here, I will describe a simple way to utilize existing system themes in Java Swing applications.
