Bryn Mawr College
CS 325: Computational Linguistics
Lab Assignment #5
Due in class on Thursday, November 3, 2011
Description: Using the NLTK tagging methods presented in class, combined by cascading, build a tagger that performs well on any tokenized text. Report your tagger's accuracy on at least two different texts (pick your own). For the provided test text, show the tagger's output and highlight the words that were mistagged.
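One possible way to report accuracy and highlight mistagged words is sketched below. It assumes you have a hand-checked (gold) tagging and your tagger's output, both as lists of (word, tag) pairs over the same tokens. The helper names tag_accuracy and mark_mistagged, the tiny sample data, and the ** marker convention are illustrative choices, not part of NLTK or the assignment.

```python
def tag_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


def mark_mistagged(gold, predicted):
    """Render word/tag pairs, wrapping mistagged words in ** with the gold tag."""
    pieces = []
    for (word, g), (_, p) in zip(gold, predicted):
        if g == p:
            pieces.append(f"{word}/{p}")
        else:
            pieces.append(f"**{word}/{p}({g})**")
    return " ".join(pieces)


# Made-up sample data for illustration only.
gold = [('the', 'AT'), ('dog', 'NN'), ('is', 'BEZ'), ('barking', 'VBG')]
predicted = [('the', 'AT'), ('dog', 'NN'), ('is', 'NNS'), ('barking', 'VBG')]

accuracy = tag_accuracy(gold, predicted)
report = mark_mistagged(gold, predicted)
print(f"accuracy = {accuracy:.2f}")
print(report)
```

Here one of four tokens is mistagged, so the accuracy is 0.75 and the word "is" is the only one marked in the report.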
Notes
You can use the NLTK tokenizer if needed. For stochastic taggers, it is a good idea to tag one sentence at a time (i.e., sentence boundaries are treated as new contexts). You may want to combine hand-based and program-based tokenization.
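A minimal sketch of sentence-at-a-time tagging with a cascade follows. It assumes that cascading means chaining NLTK taggers through their backoff argument, and that the text is already tokenized with sentence-final punctuation as separate tokens. The helper split_sentences and the three hand-tagged training sentences are made up for illustration; a real tagger would be trained on a tagged corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger


def split_sentences(tokens, enders=('.', '!', '?')):
    """Group a flat token list into sentences, ending each at a token in enders."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in enders:
            sentences.append(current)
            current = []
    if current:  # keep any trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


# Toy training data (an assumption for this sketch).
train_sents = [
    [('they', 'PPSS'), ('run', 'VB'), ('.', '.')],
    [('the', 'AT'), ('run', 'NN'), ('ended', 'VBD'), ('.', '.')],
    [('we', 'PPSS'), ('run', 'VB'), ('.', '.')],
]

# Cascade: bigram tagger, backing off to unigram, backing off to a default tag.
default_tagger = DefaultTagger('NN')
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

tokens = ['the', 'run', '.', 'they', 'run', '.']
sentences = split_sentences(tokens)

# tag_sents tags each sentence separately, so the tag history
# starts empty at every sentence boundary.
tagged = bigram_tagger.tag_sents(sentences)
for sentence in tagged:
    print(sentence)
```

In this toy setup the unigram tagger alone would tag every "run" as VB, while the bigram tagger uses the preceding AT tag to choose NN in the first sentence.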
Notes for Regular Expression Tagger
import nltk
# RE Tagger
# Patterns are tried in order and the first match wins, so the more
# specific suffixes ('s, es) must come before the general plural rule (s).
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'^(The|the|A|a|An|an)$', 'AT'),   # articles
    (r'.*able$', 'JJ'),                 # adjectives
    (r'.*ness$', 'NN'),                 # nouns formed from adjectives
    (r'.*ly$', 'RB'),                   # adverbs
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*ould$', 'MD'),                 # modals
    (r".*'s$", 'NN$'),                  # possessive nouns
    (r'.*es$', 'VBZ'),                  # 3rd person singular present
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'.*', 'NN'),                      # default: noun
]
reTagr = nltk.RegexpTagger(patterns)
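The following sketch shows one way a regular expression tagger might serve as the last stage of a cascade, here as the backoff for a unigram tagger. The shortened pattern list, the variable names, and the single hand-tagged training sentence are assumptions made for illustration only.

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Shortened pattern list for illustration; the first matching pattern wins.
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'^(The|the|A|a|An|an)$', 'AT'),   # articles
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'.*', 'NN'),                      # default: noun
]
regexp_tagger = RegexpTagger(patterns)

# Toy hand-tagged training sentence (an assumption for this sketch).
train_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('is', 'BEZ'), ('barking', 'VBG')],
]

# Words seen in training are tagged by the unigram tagger;
# everything else falls through to the regular expression tagger.
unigram_tagger = UnigramTagger(train_sents, backoff=regexp_tagger)

tokens = ['the', 'dog', 'is', 'chasing', '3', 'cats']
regexp_only = regexp_tagger.tag(tokens)
cascaded = unigram_tagger.tag(tokens)

print(regexp_only)
print(cascaded)
```

On its own, the regular expression tagger labels "is" as NNS because of the final s. In the cascade, the unigram tagger supplies BEZ from the training data, and the patterns still handle unseen words such as "chasing", "3", and "cats".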