Getting Started with Natural Language Processing, for Developers (Part 1)
Not very long ago, NLP was limited to mathematical equations, research papers and very large organizations. Today, however, an abundance of open source software, libraries, corpora and learning material is available for the masses to harness. I had a hard time getting started with natural language processing, and I got lost in the piles of information I found online.
So I decided to put together a series of articles on NLP for developers. I have skipped mathematics, calculus and analytical geometry, along with rocket science, unless absolutely necessary. The series will focus on applications rather than theories. However, I do encourage readers to get a technical understanding of how and why things work the way they do. Let’s get started.
The Setup (for now)
First of all, we will get our feet wet with holy NLP water through our friend TextBlob. It is a Python library for NLP. Why Python? Because Python is very popular for NLP, AI, ML and scientific computing in general. TextBlob has a simple and expressive API, which makes it easy to grasp and use.
You have probably heard of NLTK and how it saved the world from impending doom. So why are we not using NLTK? We are. TextBlob is a wrapper over NLTK, abstracting away many fine-tuned details. We are not here to get intimidated by NLTK; we are here to experience NLP in action.
So, we will be needing:
A computer running Mac OS X, BSD or a variant of Linux (I don’t know how things work in Windows)
Working knowledge of Python 2.x or 3.x
Gallons, buckets, containers or an entire ship-shipping ship full of patience
Start off by installing TextBlob.
pip install textblob
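We are done (for now). If TextBlob later complains about missing corpora, running python -m textblob.download_corpora fetches the NLTK data it needs.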
Hello, NLP!
Let’s get into processing some sentences.
from textblob import TextBlob
text = "Aniruddha did not attend class because he was sick. The lecturer marked him absent."
blob = TextBlob(text)
Okay, let’s break things down. The first line imports a class, nothing fancy. The second line declares a string, nothing fancy. Ah, the third line: we are creating a TextBlob object from our text string. Our blob of text is now ready for some cool NLP! Let’s break the whole text down into sentences.
Breaking Down (aka Tokenization)
Let’s try this. (BTW, we are in the Python shell.)
blob.sentences
Output:
[Sentence("Aniruddha did not attend class because he was sick."), Sentence("The lecturer marked him absent.")]
There are two sentences as we can see, broken down or tokenized just as we wanted. Let’s try something more granular: let’s break down the sentences into words.
blob.sentences[0].words
That worked out quite well! Let’s get back to some grammar lessons from elementary school.
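Before moving on, here is a rough feel for what tokenization involves. The snippet below is a naive sketch using only regular expressions. It is not how TextBlob or NLTK do it, and the function names naive_sentences and naive_words are made up for illustration.

```python
import re

text = ("Aniruddha did not attend class because he was sick. "
        "The lecturer marked him absent.")

def naive_sentences(text):
    # Split wherever ., ! or ? is followed by whitespace (an assumption
    # that breaks on abbreviations such as "Dr. Smith").
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def naive_words(sentence):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

sentences = naive_sentences(text)
words = naive_words(sentences[0])
print(sentences)
print(words)
```

It handles our two sentences, but abbreviations, decimals and quotes would trip it up, which is a good reason to lean on a library.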
Parts of Speech Tagging
Parts of Speech (PoS) are the categories that the words of a sentence are classified into, such as nouns, pronouns, adjectives, adverbs, prepositions etc. Identifying the PoS of each word is an essential task in NLP. Let’s see how it all works.
blob.sentences[0].pos_tags
Output:
[('Aniruddha', u'NNP'),
('did', u'VBD'),
('not', u'RB'),
('attend', u'VB'),
('class', u'NN'),
('because', u'IN'),
('he', u'PRP'),
('was', u'VBD'),
('sick', u'JJ')]
Weirdo acronyms? Well, ('Aniruddha', 'NNP') wants to tell us that Aniruddha is an NNP, AKA a singular proper noun. Similarly, JJ indicates an adjective. How do I know?
import nltk
nltk.help.upenn_tagset('NNP')
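That call may fail if the NLTK tagset data has not been downloaded. As a fallback, a small hand-written lookup covers the tags we have seen so far. This is only a sketch: the TAG_MEANINGS table and the describe helper are my own, and the table lists just the Penn Treebank tags from the output above.

```python
# Tagged words copied from the output above.
tagged = [
    ("Aniruddha", "NNP"), ("did", "VBD"), ("not", "RB"),
    ("attend", "VB"), ("class", "NN"), ("because", "IN"),
    ("he", "PRP"), ("was", "VBD"), ("sick", "JJ"),
]

# Hand-written descriptions, limited to the tags that appear above.
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "RB": "adverb",
    "VB": "verb, base form",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "PRP": "personal pronoun",
    "JJ": "adjective",
}

def describe(tagged_words):
    # Pair each word with a readable description of its tag.
    return [(word, TAG_MEANINGS.get(tag, "unknown tag"))
            for word, tag in tagged_words]

described = describe(tagged)
for word, meaning in described:
    print("{:10} {}".format(word, meaning))
```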
Now you can figure out the mysterious acronyms as well. Let’s try some transformations.
Transformers!
Let’s try something even cooler. Remember our sentences[1], Sentence("The lecturer marked him absent.")? Let’s try to make lecturer plural.
lecturer_word = blob.sentences[1].words[1]
lecturer_word.pluralize()
TextBlob has similar operations, such as singularize() and lemmatize(). Try them, and read the documentation for the rest.
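To appreciate what pluralize() has to deal with, here is a naive rule-based sketch. It is an illustration only, not TextBlob’s actual algorithm, and naive_pluralize is a hypothetical helper that knows nothing about irregular nouns like "child" or "mouse".

```python
def naive_pluralize(word):
    vowels = "aeiou"
    # Words ending in a hissing sound take "es": class -> classes.
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    # Consonant + y becomes "ies": city -> cities (but day -> days).
    if word.endswith("y") and len(word) > 1 and word[-2] not in vowels:
        return word[:-1] + "ies"
    # Everything else just gets an "s".
    return word + "s"

plurals = [naive_pluralize(w) for w in ["lecturer", "class", "city", "day"]]
print(plurals)
```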
Synonyms and related words
With a Python interpreter at hand, you could build a very simple dictionary of synonyms. Let’s take a look at the synonyms of marked.
marked_word = blob.sentences[1].words[2]
marked_word.get_synsets()
Output:
[Synset('tag.v.01'),
Synset('mark.v.02'),
Synset('distinguish.v.03'),
Synset('commemorate.v.01'),
Synset('mark.v.05'),
Synset('stigmatize.v.01'),
Synset('notice.v.02'),
...]
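Those names follow the WordNet convention of lemma, part of speech and sense number, separated by dots. As a minimal sketch, the snippet below pulls the related words out of the names shown above using plain string handling. The parse_synset_name helper and the POS_LABELS table are my own, not part of TextBlob.

```python
# Synset names copied from the output above.
names = ["tag.v.01", "mark.v.02", "distinguish.v.03", "commemorate.v.01",
         "mark.v.05", "stigmatize.v.01", "notice.v.02"]

# WordNet's single-letter part-of-speech codes.
POS_LABELS = {"n": "noun", "v": "verb", "a": "adjective",
              "s": "adjective satellite", "r": "adverb"}

def parse_synset_name(name):
    # Split from the right so a lemma containing dots stays in one piece.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, POS_LABELS.get(pos, pos), int(sense)

parsed = [parse_synset_name(name) for name in names]

# Unique lemmas, sorted: a tiny list of words related to "marked".
related = sorted({lemma for lemma, _, _ in parsed})
print(related)
```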
That will be enough for this part. Let me know in the comments if you liked TextBlob, and/or whether you want more NLP material. Thanks for dropping by.