Python for Linguists
A Gentle Introduction to the Python Language
By Deepak Kumar

Part 4: A Tour of NLTK Taggers

 

>>> import nltk
>>> from nltk.corpus import brown
>>> btw = brown.tagged_words(categories="news")
>>> bts = brown.tagged_sents(categories="news")
>>> bs = brown.sents(categories="news")

The commands above give us access to all the tagged words (over 100K) in the Brown news corpus (btw), the tagged sentences (bts, over 4600 of them), and the untagged sentences (bs). We will use these for the following examples. Additionally, let's have a test sentence:

>>> s = "A man, a plan, a canal, Panama."

Next, let's use NLTK's built-in tokenizer to tokenize it into words:

>>> words = nltk.word_tokenize(s)
>>> words
['A', 'man', ',', 'a', 'plan', ',', 'a', 'canal', ',', 'Panama', '.']
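To get a feel for what a tokenizer does, here is a rough approximation written with Python's re module. It is only a sketch for illustration: simple_tokenize is a hypothetical helper of our own, and NLTK's word_tokenize is considerably more sophisticated (it handles contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    # Hypothetical helper, not part of NLTK: a token is either a run of
    # word characters, or a single character that is neither a word
    # character nor whitespace (i.e. a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("A man, a plan, a canal, Panama.")
print(tokens)
```

On this particular sentence the sketch produces the same 11 tokens as shown above, but do not expect the two to agree on harder input.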

The NLTK Tagger

Like the built-in tokenizer, shown above, NLTK has a built-in tagger. Below we show how it is used:

>>> nltk.pos_tag(words)
[('A', 'DT'), ('man', 'NN'), (',', ','), ('a', 'DT'), ('plan', 'NN'), (',', ','), 
('a', 'DT'), ('canal', 'JJ'), (',', ','), ('Panama', 'NNP'), ('.', '.')]

As you can see from the above, the tagger assigns each token its most likely tag, based on the tagging knowledge built into the NLTK tagger. Notice the representation of the tagged words: it is the same as the one you get from the tagged corpus, where each tagged word is a pair of the form (word, tag). We can use this tagger to tag any given sentence. Below we do so for the first five sentences in the Brown news corpus:

>>> for sent in bs[:5]:
         print(nltk.pos_tag(sent))

You should enter it yourself and observe the output. Next, let us examine the Brown news corpus of tagged words (btw) and see what the most frequent tag is:

>>> tags = [tag for (wrd, tag) in btw]
>>> tDist = nltk.FreqDist(tags)
>>> tDist.max()
'NN'

First, we created a list of all the tags (i.e. discarded the words), then built a frequency distribution from it, and finally asked for the tag with the largest count. It should come as no surprise that it is "NN", i.e. noun. If you plot the distribution (using the command tDist.plot(15) to plot the 15 most frequent tags), you will notice that roughly 13,000 words in the corpus carry the "NN" tag.
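If you are curious what a frequency distribution amounts to, the same counting can be sketched with Python's standard collections.Counter. The sample list below is a tiny hand-made stand-in for btw (an assumption for illustration, not corpus data), and this is not how nltk.FreqDist is necessarily implemented.

```python
from collections import Counter

# A tiny hand-made sample in the same (word, tag) form as the tagged corpus.
sample = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
          ("election", "NN"), ("was", "BEDZ"), ("conducted", "VBN"),
          ("by", "IN"), ("the", "AT"), ("city", "NN"), ("committee", "NN")]

# Keep only the tags, then count how often each one occurs.
tag_counts = Counter(tag for (word, tag) in sample)

# The most frequent tag and its count, plus its share of all tokens.
top_tag, top_count = tag_counts.most_common(1)[0]
share = 100.0 * top_count / len(sample)
print(top_tag, top_count, round(share, 1))
```

The share computed on the last lines is the same kind of figure as the percentage we will meet again when evaluating taggers.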

Default Tagger

Given that the "NN" tag alone accounts for about 13% of the corpus, we can create a first-approximation tagger: one that assigns every word the tag "NN". Admittedly this is not going to be very accurate, but it is still worth building for several reasons. So indulge us for the moment, and we will learn how to set up a tagger, use it to tag sentences, and evaluate its accuracy.

NLTK provides a DefaultTagger class that instantiates a tagger that, given a specific tag, will use that tag to tag every word it encounters. This is how you create it:

>>> dTagr = nltk.DefaultTagger("NN")

Now, dTagr is our tagger that will tag every word with the "NN" tag. Let's try it:

>>> dTagr.tag(words)
[('A', 'NN'), ('man', 'NN'), (',', 'NN'), ('a', 'NN'), ('plan', 'NN'), 
(',', 'NN'), ('a', 'NN'), ('canal', 'NN'), (',', 'NN'), ('Panama', 'NN'),
('.', 'NN')]

As expected, when given a list of words (i.e. a tokenized sentence), it assigns each word the "NN" tag. In addition to the tag(...) method, the tagger class provides an evaluate(...) method that can be used to assess the accuracy of the tagger. Accuracy is judged against correctly tagged sentences (also called the gold standard): the tagger first assigns its own tags to the words of each sentence, then compares them against the supplied tags and computes the fraction that match.

For the sake of the example, let's compute the accuracy of the dTagr tagger against the tags we obtained from the NLTK pos_tag function, which are reproduced below:

>>> nltk.pos_tag(words) 
[('A', 'DT'), ('man', 'NN'), (',', ','), ('a', 'DT'), ('plan', 'NN'), 
(',', ','), ('a', 'DT'), ('canal', 'JJ'), (',', ','), ('Panama', 'NNP'),
('.', '.')]

Notice that, in the above, only two words received the "NN" tag (although you could argue that 'canal' should have received it too). Regardless, if we treat this output as our gold standard, we can measure the accuracy of dTagr against it, which is simply:

accuracy = (number of tags correctly assigned / total number of tags) * 100

That is, the number of correctly assigned tags as a percentage of the total number of tags. In our example, since dTagr assigned only two tags correctly (those for 'man' and 'plan') out of a total of 11, we have:

>>> accuracy = 100.0*(2.0/11.0)
>>> accuracy
18.181818181818183

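Or, roughly, a little over 18%. NLTK provides a built-in method, called evaluate, that enables us to do this over a large corpus. (In recent NLTK releases the same method is also available under the name accuracy, and evaluate is marked as deprecated.) It reports the accuracy as a fraction between 0 and 1 rather than as a percentage: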

>>> dTagr.evaluate(bts)
0.13089484257215028

That is, given the entire corpus of Brown tagged news sentences (over 4600 of them), what fraction of tags would be correctly assigned by the default tagger (dTagr), which assigns the tag "NN" to every word? We can see that it is roughly 13%, which is exactly the share of words in the corpus that, as we saw earlier, carry the "NN" tag. You can also give the evaluate function your own 'gold standard' tags. For example, for the test sentence we have been using:

>>> dTagr.evaluate([nltk.pos_tag(words)])
0.18181818181818182

And we get the same 18.18% result. Notice above that we had to wrap the tagged sentence in a list ([...]), since the evaluate function expects a list of tagged sentences.
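To make the accuracy computation concrete, here is a minimal plain-Python sketch of a default tagger and an accuracy function. The names default_tag and accuracy are our own (hypothetical) helpers, not NLTK's implementation; the gold list is the pos_tag output for our test sentence shown earlier.

```python
def default_tag(words, tag="NN"):
    # Assign the same tag to every token, like a default tagger.
    return [(w, tag) for w in words]

def accuracy(tag_fn, gold_sents):
    # gold_sents is a list of tagged sentences; each sentence is a
    # list of (word, tag) pairs.
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [w for (w, t) in gold]      # strip the gold tags
        predicted = tag_fn(words)           # let the tagger tag the words
        for (_, p), (_, g) in zip(predicted, gold):
            if p == g:
                correct += 1
            total += 1
    return float(correct) / total           # a fraction, not a percentage

gold = [('A', 'DT'), ('man', 'NN'), (',', ','), ('a', 'DT'), ('plan', 'NN'),
        (',', ','), ('a', 'DT'), ('canal', 'JJ'), (',', ','), ('Panama', 'NNP'),
        ('.', '.')]

score = accuracy(default_tag, [gold])       # note the list around gold
print(score)
```

Only 'man' and 'plan' match, so the score is 2 out of 11, in agreement with the figure computed by hand above.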

In what follows, you will notice that every kind of tagger we define and use provides the same two methods: tag(...) and evaluate(...).

Thus, while the default tagger was not so great, we were able to see how to use the tagger functions to (1) define the tagger, (2) tag a given sentence, and (3) evaluate its accuracy against a gold standard. There is yet another use for such a default tagger, which we will see later.

Regular Expression Tagger

Another tagging scheme can be defined based on patterns of word morphology or inflection. The NLTK Regular Expression tagger accepts a list of regular expression patterns, each paired with a tag, and uses them to tag words. For example, below we show a small set of such patterns:

patterns = [
    (r'.*ing$', 'VBG'),              # gerunds
    (r'.*ed$', 'VBD'),               # simple past
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
    (r'.*\'s$', 'NN$'),              # possessive nouns
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                    # nouns (default)
]

These simple patterns can be supplied to create a customized Regular Expression tagger as follows:

>>> reTagr = nltk.RegexpTagger(patterns)
>>> reTagr.tag(bs[1011])
[('U.S.', 'NN'), ('Dist.', 'NN'), ('Judge', 'NN'), ('Charles', 'VBZ'), 
('L.', 'NN'), ('Powell', 'NN'), ('denied', 'VBD'), ('all', 'NN'),
('motions', 'NNS'), ('made', 'NN'), ('by', 'NN'), ('defense', 'NN'),
('attorneys', 'NNS'), ('Monday', 'NN'), ('in', 'NN'), ("Portland's", 'NN$'),
('insurance', 'NN'), ('fraud', 'NN'), ('trial', 'NN'), ('.', 'NN')]

Above, we also show how to tag any given sentence (sentence #1011 in the Brown news corpus). The correct tags for it are shown below:

>>> bts[1011]
[('U.S.', 'NP-TL'), ('Dist.', 'NN-TL'), ('Judge', 'NN-TL'), ('Charles', 'NP'), 
('L.', 'NP'), ('Powell', 'NP'), ('denied', 'VBD'), ('all', 'ABN'),
('motions', 'NNS'), ('made', 'VBN'), ('by', 'IN'), ('defense', 'NN'),
('attorneys', 'NNS'), ('Monday', 'NR'), ('in', 'IN'), ("Portland's", 'NP$'),
('insurance', 'NN'), ('fraud', 'NN'), ('trial', 'NN'), ('.', '.')]

Again, we can evaluate the accuracy of this tagger over the entire Brown news tagged corpus:

>>> reTagr.evaluate(bts)
0.20326391789486245

So this one is about 20% accurate. You may want to define your own rules and expand the set of patterns to improve performance. The ordering of the patterns is important: they are tried in order on each word, and the first pattern that matches supplies the tag. This is why 'Charles' was tagged "VBZ" above: it matches the '.*es$' pattern before any later one is tried. In our example, any word that matches none of the earlier patterns receives the "NN" tag from the final catch-all pattern.
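The first-match rule is easy to see in a few lines of plain Python. The sketch below uses a subset of the patterns from above and a hypothetical helper, regexp_tag, built on the standard re module; it illustrates the idea and is not NLTK's RegexpTagger itself.

```python
import re

# A subset of the patterns from the text, in the same order.
patterns = [
    (r'.*ing$', 'VBG'),     # gerunds
    (r'.*ed$', 'VBD'),      # simple past
    (r'.*es$', 'VBZ'),      # 3rd singular present
    (r'.*\'s$', 'NN$'),     # possessive nouns
    (r'.*s$', 'NNS'),       # plural nouns
    (r'.*', 'NN'),          # nouns (default)
]

def regexp_tag(word, patterns):
    # Try each pattern in order; the first one that matches wins.
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None             # only reached if there is no catch-all pattern

for w in ['Charles', "Portland's", 'denied', 'motions', 'trial']:
    print(w, regexp_tag(w, patterns))

# Swapping the order changes the answer: with the plural pattern first,
# 'Charles' is caught by '.*s$' before '.*es$' is ever tried.
reordered = [(r'.*s$', 'NNS'), (r'.*es$', 'VBZ'), (r'.*', 'NN')]
print(regexp_tag('Charles', reordered))
```

Try reordering the patterns yourself and watch how the tags for 'Charles' and "Portland's" change.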

Unigram Tagger

The Unigram tagger assigns each word its most frequent tag. It must first be supplied with a gold standard tagged corpus, from which it builds its internal model of the most frequent tag for each word (essentially a frequency count). It is used as shown below:

>>> uTagr = nltk.UnigramTagger(bts)
>>> uTagr.tag(bs[1011])
[('U.S.', 'NP'), ('Dist.', 'NN-TL'), ('Judge', 'NN-TL'), ('Charles', 'NP'), 
('L.', 'NP'), ('Powell', 'NP'), ('denied', 'VBN'), ('all', 'ABN'),
('motions', 'NNS'), ('made', 'VBN'), ('by', 'IN'), ('defense', 'NN'),
('attorneys', 'NNS'), ('Monday', 'NR'), ('in', 'IN'), ("Portland's", 'NP$'),
('insurance', 'NN'), ('fraud', 'NN'), ('trial', 'NN'), ('.', '.')]
>>> uTagr.evaluate(bts)
0.9349006503968017

Study the above results carefully. Notice that on the Brown news tagged corpus (the same corpus it was "trained on") it performs with 93.49% accuracy.
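The "frequency count" behind a unigram model can be sketched in plain Python as follows. The helper train_unigram and the toy training sentences are our own inventions for illustration; NLTK's UnigramTagger is more elaborate, but the core idea is the same.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    # For every word, count how often each tag was assigned to it.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Keep only the most frequent tag for each word.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy training data (hand-made, not from the Brown corpus).
toy_train = [
    [('the', 'AT'), ('run', 'NN'), ('ended', 'VBD')],
    [('they', 'PPSS'), ('run', 'VB')],
    [('we', 'PPSS'), ('run', 'VB')],
]

model = train_unigram(toy_train)

# Tag a new sentence by simple lookup; unseen words get None.
tagged = [(w, model.get(w)) for w in ['they', 'run', 'home']]
print(tagged)
```

Here 'run' receives "VB" because that tag was seen twice against one "NN", and 'home' receives None because it never occurred in the training data. Keep that None in mind; it becomes important shortly.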

Separating Training and Testing

Of course, training and testing on the same data will (and ought to) result in good performance. Given a tagged gold standard corpus, statistical taggers (like the Unigram Tagger) are typically trained on a subset of the corpus and then tested against a set of previously unseen sentences to evaluate accuracy.

Thus, let us train the tagger on 90% of the corpus, and then test it against the remaining 10%.

>>> N = int(len(bts)*0.9)
>>> N
4160
   
>>> train = bts[:N]
>>> test = bts[N:]
   
>>> len(train)
4160
>>> len(test)
463
>>> uTagr = nltk.UnigramTagger(train)
>>> uTagr.evaluate(test)
0.81371474135353339

From the above, the Unigram tagger is about 81% accurate on unseen sentences. Will it make a difference if you choose the split differently, for example using the last 90% of the sentences to train and the first 10% to test? Try it out.
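If you want to experiment with different splits, a small helper makes it less error-prone. The sketch below is plain Python; split_corpus and its train_first parameter are hypothetical names of our own, and a list of integers stands in for the 4623 tagged sentences (4160 + 463, as counted above).

```python
def split_corpus(sents, train_fraction=0.9, train_first=True):
    # n is the number of training sentences.
    n = int(len(sents) * train_fraction)
    if train_first:
        # First part for training, remainder for testing.
        return sents[:n], sents[n:]
    # Otherwise hold out the first part for testing instead.
    held_out = len(sents) - n
    return sents[held_out:], sents[:held_out]

sents = list(range(4623))          # stand-in for bts

train_a, test_a = split_corpus(sents)
train_b, test_b = split_corpus(sents, train_first=False)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

Whatever split you choose, the point is that no test sentence may also appear in the training set.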

Bigram Taggers

Beyond the Unigram tagger, NLTK also provides N-gram taggers, which take some of the preceding context into account. Below we show how to define and use the Bigram tagger:

>>> bTagr = nltk.BigramTagger(train)
>>> bTagr.evaluate(test)
0.10305990232233629

Can you guess why the performance is so low?

Cascading Taggers with Backoff

Let us do some probing into the reasons for the bad performance of the Bigram tagger. Let's tag a sentence and observe the output:

>>> bTagr.tag(bs[111])
[('Rep.', 'NN-TL'), ('Charles', 'NP'), ('E.', 'NP'), ('Hughes', 'NP'), 
('of', 'IN'), ('Sherman', 'NP'), (',', ','), ('sponsor', 'NN'), ('of', 'IN'),
('the', 'AT'), ('bill', 'NN'), (',', ','), ('said', 'VBD'), ('a', 'AT'),
('failure', 'NN'), ('to', 'TO'), ('enact', 'VB'), ('it', 'PPO'), ('would', 'MD'),
('amount', 'VB'), ('``', '``'), ('to', 'TO'), ('making', None), ('a', None),
('gift', None), ('out', None), ('of', None), ('the', None), ("taxpayers'", None),
('pockets', None), ('to', None), ('banks', None), (',', None),
('insurance', None), ('and', None), ('pipeline', None), ('companies', None),
("''", None), ('.', None)]

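Notice that a number of words were assigned None instead of a tag. The Bigram tagger chooses a tag based on the current word together with the tag of the preceding word, and when that combination never occurred in the training data it has nothing to go on, so it assigns None. Worse, once a word is tagged None, the context for the next word (a previous tag of None) was never seen in training either, so the rest of the sentence is tagged None as well. This is why even common words like 'a', 'of', and 'the' go untagged above. Words that did not occur in the training data at all are called out-of-vocabulary (OOV) words.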
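The way a single failure spreads through the rest of a sentence can be reproduced with a toy bigram model in plain Python. The helpers train_bigram and bigram_tag, the '<s>' start-of-sentence marker, and the training sentences are all our own assumptions for this sketch; it mimics the behaviour described above rather than reproducing NLTK's code.

```python
from collections import Counter, defaultdict

START = '<s>'   # marker for "no previous tag" at the start of a sentence

def train_bigram(tagged_sents):
    # Count tags for each (previous tag, current word) context.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(model, words):
    tagged, prev = [], START
    for word in words:
        tag = model.get((prev, word))   # None if the context is unseen
        tagged.append((word, tag))
        prev = tag                      # a None here poisons the next context
    return tagged

toy_train = [[('the', 'AT'), ('bill', 'NN'), ('passed', 'VBD')],
             [('the', 'AT'), ('vote', 'NN'), ('passed', 'VBD')]]
model = train_bigram(toy_train)

seen = bigram_tag(model, ['the', 'vote', 'passed'])
unseen = bigram_tag(model, ['the', 'tax', 'passed'])
print(seen)
print(unseen)
```

In the second sentence 'tax' is the only unknown word, yet 'passed' is left untagged too, even though it appeared in both training sentences: its context is now (None, 'passed'), which the model has never seen.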

NLTK provides a useful mechanism for cascading taggers: a tagger can be given another tagger to fall back on whenever its own training model has no tag to recommend. This is done as follows:

>>> bdTagr = nltk.BigramTagger(train, backoff=dTagr)
>>> bdTagr.evaluate(test)
0.72012359214591848 

Using the backoff option in creating the tagger allows you to specify another tagger to use as a back up tagger. In this case, we're using our Default Tagger from above, which tags as "NN" every word the Bigram tagger cannot tag. This improved the accuracy to 72%. Below, we try using the Unigram Tagger as the back up instead.

>>> buTagr = nltk.BigramTagger(train, backoff=uTagr)
>>> buTagr.evaluate(test)
0.8229841522974185

Obviously you can cascade multiple taggers in this way. Try to specify the Default Tagger as a back up for the Unigram Tagger, which is in turn a back up for the Bigram Tagger. Does it improve the accuracy? By how much?
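Before trying the exercise with NLTK, it may help to see the cascading idea in miniature. The plain-Python sketch below uses hand-made toy models and a hypothetical helper, tag_with_backoff; it is an illustration of the principle under those assumptions, not NLTK's implementation.

```python
def tag_with_backoff(words, lookups):
    # lookups is a list of functions (prev_tag, word) -> tag or None.
    # For each word they are tried in order until one returns a tag.
    tagged, prev = [], '<s>'
    for word in words:
        tag = None
        for lookup in lookups:
            tag = lookup(prev, word)
            if tag is not None:
                break
        tagged.append((word, tag))
        prev = tag
    return tagged

# Hand-made toy models (assumptions for illustration only).
bigram_model = {('<s>', 'the'): 'AT', ('AT', 'bill'): 'NN',
                ('NN', 'passed'): 'VBD'}
unigram_model = {'the': 'AT', 'bill': 'NN', 'passed': 'VBD', 'tax': 'NN'}

chain = [
    lambda prev, w: bigram_model.get((prev, w)),   # most specific first
    lambda prev, w: unigram_model.get(w),          # then word alone
    lambda prev, w: 'NN',                          # default: always answers
]

result = tag_with_backoff(['the', 'tax', 'passed', 'quickly'], chain)
print(result)
```

Notice that 'tax' is rescued by the unigram model, and because it now carries a real tag, the bigram model can handle 'passed' again. The unknown word 'quickly' falls all the way through to the default.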

Look at the NLTK documentation for other N-gram taggers and experiment to see how to achieve the maximum accuracy. There is generally a trade-off in how much context and/or how many cascaded taggers you use. Be on the lookout for that.

NLTK also provides several other taggers. See the documentation for more info.

Python for Linguists, Part 5... coming soon.