Bryn Mawr College
CS 325: Computational Linguistics
Lab Assignment #5
Due in class on Thursday, November 3, 2011

Description: Using the NLTK tagging methods presented in class, and cascading them, build a good tagging system that can tag any tokenized text. Report the accuracy of your tagger on at least two different texts (pick your own). For the provided test text, show the output of the tagger and highlight the words that were mistagged.
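As a starting point, cascading can be sketched with NLTK's backoff mechanism: each tagger hands a word to the next one when it has no answer. This is a minimal sketch, not a finished solution. The training and test sentences are tiny made-up examples with Brown-style tags, and the helper accuracy() is a hypothetical name introduced here so that the score does not depend on which NLTK version you have installed.

```python
import nltk

# Tiny hand-tagged training set (made-up data, Brown-style tags).
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ"), (".", ".")],
    [("a", "AT"), ("dog", "NN"), ("sees", "VBZ"), ("the", "AT"), ("cat", "NN"), (".", ".")],
]

# Cascade: bigram -> unigram -> default.
# A tagger defers to its backoff when it has no entry for the current context.
default_tagger = nltk.DefaultTagger("NN")
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)


def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the hand-assigned tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)  # one sentence at a time
        for (_, guess), (_, truth) in zip(predicted, gold):
            if guess == truth:
                correct += 1
            total += 1
    return correct / total


# Hand-tagged test sentences (also made up).
test_sents = [
    [("the", "AT"), ("cat", "NN"), ("barks", "VBZ"), (".", ".")],
    [("a", "AT"), ("bird", "NN"), ("sings", "VBZ"), (".", ".")],
]
print(accuracy(bigram_tagger, test_sents))
```

On this toy data the only error is "sings", which never appears in training and so falls through to the default tag NN. With real texts you would train on a tagged corpus and hold out separate sentences for testing.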

Notes

  1. You can use the NLTK tokenizer if needed. For stochastic taggers it is a good idea to tag one sentence at a time (i.e., each sentence boundary starts a new context). You may want to combine hand tokenization with program-based tokenization.

  2. Run your program on the texts provided here and here.
  3. Work incrementally to accomplish the task.
  4. Try to document your thought process at each step.
  5. Once done, write down the process by which you arrived at the final solution.
  6. For the two texts of your own choosing, use simple and small texts (similar to the ones provided above).
  7. Hand in a report containing the outcome of your analyses of the four texts, your well-commented program(s), and sample outputs as requested above. Also, write a final section with your own reflections on the exercise, the process, and how you arrived at the solution(s).
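Note 1 suggests tagging one sentence at a time. One way to do that without any extra NLTK data downloads is sketched below. The function names split_sentences() and tag_text() are hypothetical helpers introduced for illustration, the sentence splitter is deliberately naive (it will break on abbreviations such as "Dr."), and the DefaultTagger is only a placeholder for whatever tagger you build.

```python
import re

import nltk
from nltk.tokenize import RegexpTokenizer


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# A token is a word (optionally with an internal apostrophe, e.g. "It's")
# or a single punctuation character.
word_tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")


def tag_text(text, tagger):
    """Tokenize and tag each sentence separately, so that an n-gram
    tagger never uses context from the previous sentence."""
    return [tagger.tag(word_tokenizer.tokenize(s)) for s in split_sentences(text)]


# Placeholder tagger; substitute your own cascaded tagger here.
tagger = nltk.DefaultTagger("NN")
for sent in tag_text("The dog barks. It's loud!", tagger):
    print(sent)
```

Where the splitter or tokenizer gets something wrong, correcting the token list by hand before tagging is a reasonable fallback for small texts.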

Notes for Regular Expression Tagger

    
    import nltk

    # RE Tagger. Patterns are tried in order and the first match wins,
    # so a specific suffix must come before a general one that overlaps it
    # (e.g. 's and es before s).
    patterns = [
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
        (r'^(The|the|A|a|An|an)$', 'AT'),   # articles
        (r'.*able$', 'JJ'),                 # adjectives
        (r'.*ness$', 'NN'),                 # nouns formed from adjectives
        (r'.*ly$', 'RB'),                   # adverbs
        (r'.*ing$', 'VBG'),                 # gerunds
        (r'.*ed$', 'VBD'),                  # simple past
        (r'.*ould$', 'MD'),                 # modals
        (r'.*\'s$', 'NN$'),                 # possessive nouns
        (r'.*es$', 'VBZ'),                  # 3rd person singular present
        (r'.*s$', 'NNS'),                   # plural nouns
        (r'.*', 'NN'),                      # default: noun
    ]
    reTagr = nltk.RegexpTagger(patterns)
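
To see how a regular expression tagger behaves, and to mark the words it gets wrong, something like the following may help. It is only a sketch: the pattern list is shortened, the example sentence and its reference tags are made up, and highlight_errors() is a hypothetical helper, not part of NLTK.

```python
import nltk

# Shortened pattern list for illustration; the first matching pattern wins.
patterns = [
    (r'^(The|the|A|a|An|an)$', 'AT'),  # articles
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'.*', 'NN'),                     # default: noun
]
tagger = nltk.RegexpTagger(patterns)

# Hand-tagged reference sentence (made-up example, Brown-style tags).
gold = [("The", "AT"), ("dogs", "NNS"), ("barked", "VBD"), ("at", "IN"), ("us", "PPO")]


def highlight_errors(tagger, gold):
    """Return the tagged sentence as a string, wrapping each mistagged
    word in asterisks and appending the expected tag in brackets."""
    words = [word for word, _ in gold]
    pieces = []
    for (word, guess), (_, truth) in zip(tagger.tag(words), gold):
        if guess == truth:
            pieces.append(f"{word}/{guess}")
        else:
            pieces.append(f"*{word}/{guess}*[{truth}]")
    return " ".join(pieces)


print(highlight_errors(tagger, gold))
# The/AT dogs/NNS barked/VBD *at/NN*[IN] *us/NNS*[PPO]
```

The two errors show the limits of suffix rules: "at" falls through to the noun default, and "us" is caught by the plural rule. A regular expression tagger is therefore usually more useful as a backoff for unknown words than as the first tagger in a cascade.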
  
