Urban Dictionary Analysis Tool

A fun little exercise I’ve been working on is a statistical and language analysis tool for Urban Dictionary. The idea for the project came about when it was pointed out that my own name was on UD, and I realised that many of the definitions were of a sexual nature or offered praise to the holder of the name. I suspect that people are adding definitions of their own names, or those of their partners or relatives. I thought it would be fun to programmatically analyse the various definitions and group them by content, maybe also ranking the most popular keywords or producing other interesting statistics.

The finished product (though I’ll add to it over time) is available here: https://www.acarrick.com/urban_stats

[Screenshot of the Urban Stats tool]

Continue reading for some technical details…

The project consists of a few Python scripts:

  • one script which does most of the analysis work and also functions as a command line version
  • one script which uses Flask to provide a basic web interface
  • a second version of the Flask app which is designed for running within AWS

I used the following libraries and tools:

  • Flask and Jinja2 – a high-level web framework and an HTML templating engine, both written in and for Python
  • Zappa.io – a tool to package and deploy WSGI-based apps onto AWS Lambda
  • GitHub and Git – version control and repository hosting

Code

The code is available on GitHub.

The basic workflow I came up with is as follows:

  1. Upon receiving a new word (via the command line or the Flask app), check whether we already have that result set cached and whether it is still in date (I didn’t want to keep downloading new definitions if I last fetched them less than 30 days ago).
  2. If not, fetch the definitions from Urban Dictionary’s unofficial API.
  3. Process the definitions and create a Named Tuple out of the results. This step currently consists of:
    1. Remove the common words (“that”, “the”, “and”, etc.) from the results before analysing them
    2. Work out the top 10 most frequently used words across all definitions
    3. Based on that, work out the whole sentences that contain those words, ordered by how many of the top 10 words each sentence contains
    4. The words longer than 9 characters
    5. The words shorter than 4 characters
    6. Naughty definitions
    7. Clean definitions
    8. (Plus a couple of other pieces of data to simplify the Jinja2 templates)
  4. Return this Named Tuple and then pass it to a Jinja2 template to display its contents in HTML (or just print its contents at the command line).
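Steps 1 and 2 could be sketched roughly as below. This is a minimal illustration rather than the project's actual code: the cache directory, the JSON file layout and the `fetch` callable are all assumptions, and a real version would also need to sanitise the word before using it as a file name.

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("cache")              # hypothetical cache location
MAX_AGE_SECONDS = 30 * 24 * 60 * 60    # 30 days, as in step 1


def is_fresh(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the cache file exists and is younger than max_age."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < max_age


def get_definitions(word, fetch, cache_dir=CACHE_DIR):
    """Return cached definitions, calling fetch(word) only when the cache is stale."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{word.lower()}.json"
    if is_fresh(path):
        return json.loads(path.read_text(encoding="utf-8"))
    definitions = fetch(word)  # stand-in for the call to the unofficial API
    path.write_text(json.dumps(definitions), encoding="utf-8")
    return definitions
```

Passing the fetcher in as an argument keeps the caching logic testable without touching the network.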

For step 3, I use a bunch of short methods that accept strings and return a list of words, or accept a list of words (or sentences) and return dictionaries of those words and their relative importance. Using short methods like this enables unit testing on the analysis code. I wanted to do this for two reasons:
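To give a flavour of what such short methods might look like, here is a sketch using only the standard library. The function names, the tiny stop-word list and the fields of the named tuple are my own placeholders for illustration, not the names used in the repository.

```python
import re
from collections import Counter, namedtuple

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "that", "the", "to"}

Analysis = namedtuple(
    "Analysis", "top_words ranked_sentences long_words short_words"
)


def tokenize(text):
    """Lower-case the text and return its words, minus stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_words(definitions, n=10):
    """Return the n most frequent words across all definitions as {word: count}."""
    counts = Counter(w for d in definitions for w in tokenize(d))
    return dict(counts.most_common(n))


def rank_sentences(definitions, keywords):
    """Return sentences containing a keyword, most distinct keyword hits first."""
    scored = []
    for definition in definitions:
        for sentence in re.split(r"(?<=[.!?])\s+", definition.strip()):
            hits = sum(1 for w in set(tokenize(sentence)) if w in keywords)
            if hits:
                scored.append((hits, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]


def analyse(definitions):
    """Bundle the individual results into one named tuple for the template."""
    words = {w for d in definitions for w in tokenize(d)}
    top = top_words(definitions)
    return Analysis(
        top_words=top,
        ranked_sentences=rank_sentences(definitions, top),
        long_words=sorted(w for w in words if len(w) > 9),
        short_words=sorted(w for w in words if len(w) < 4),
    )
```

Because each function takes plain strings or lists and returns plain data, each one can be tested on its own with a couple of hand-written sentences.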

  • It’s good practice to have unit tests, particularly if I want others to collaborate
  • If I’m building an analysis solution manually, I need to know that it works. I tried to use TDD here so I could verify that my code works as I write it. To prove that the naughtiness detection is working correctly, I wrote a test that should pass if the given sentence is classed as naughty and one where I know the given sentence should be classed as clean. With unit testing, I can also ensure that my code still does the same thing (or at least doesn’t break) if I try to optimise it.
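A pair of tests along those lines might look something like this. The `is_naughty` function and its two-word list are stand-ins so the example runs on its own; the real classifier and word lists live in the project.

```python
import unittest

# Stand-in word list and classifier, purely for illustration.
NAUGHTY_WORDS = {"darn", "heck"}


def is_naughty(sentence):
    """Return True if any word in the sentence is on the naughty list."""
    words = sentence.lower().split()
    return any(w.strip(".,!?") in NAUGHTY_WORDS for w in words)


class NaughtinessTests(unittest.TestCase):
    def test_naughty_sentence_is_flagged(self):
        self.assertTrue(is_naughty("Well, heck!"))

    def test_clean_sentence_is_not_flagged(self):
        self.assertFalse(is_naughty("A kind and generous person."))
```

Saved in a test module, these can be run with `python -m unittest`, and they keep passing (or fail loudly) whenever the classifier is rewritten or optimised.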

To determine whether a definition is clean or naughty, I used some lists of naughty words I found and assumed that a definition is naughty if it contains a word in one of those lists. This method has some problems though, mainly to do with false positives.
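One common source of false positives with word lists is matching on substrings rather than whole words. I'm not claiming that is the exact cause here; the sketch below just illustrates the difference, with a placeholder word list and hypothetical function names.

```python
import re

NAUGHTY_WORDS = {"damn", "hell"}  # placeholder list for illustration


def naughty_substring(definition):
    """Naive check: flags any definition containing a listed word as a substring."""
    text = definition.lower()
    return any(word in text for word in NAUGHTY_WORDS)


def naughty_whole_word(definition):
    """Stricter check: only flags definitions containing a listed word as a token."""
    tokens = set(re.findall(r"[a-z']+", definition.lower()))
    return not tokens.isdisjoint(NAUGHTY_WORDS)


def split_definitions(definitions, classifier=naughty_whole_word):
    """Return (naughty, clean) lists according to the given classifier."""
    naughty = [d for d in definitions if classifier(d)]
    clean = [d for d in definitions if not classifier(d)]
    return naughty, clean
```

With the substring version, "Say hello to the shell" is flagged because "hell" appears inside both "hello" and "shell"; the whole-word version leaves it alone. Whole-word matching still cannot tell when a listed word is used innocently, so it reduces false positives rather than eliminating them.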

Presentation

It’s my intention to present this at Hacks/Hackers Brisbane so I’ve made (started?) a presentation on the process which is available on Google Slides.

 

Posted by Anthony