Bryn Mawr College
CS 325: Computational Linguistics - Fall 2024
Assignment #2
Due before class on Wednesday, October 2
Description: This assignment is an exercise in word and sentence segmentation.
First, you will write a program to access the contents of a given text and then analyze it as follows:
- Separate the contents into sentences and words.
- Count the number of unique words used in the text.
- Count the number of sentences in the text.
- Compute the average word-length of a sentence.
- Compute the length of the shortest and the longest sentences.
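Once segmentation is done, the counts themselves are straightforward. The sketch below is only an illustration: the variable name `sentences` and the toy data are made up, and it interprets "average word-length of a sentence" as the average number of words per sentence (write down your own interpretation, as the notes below ask).

```python
# Sketch: compute the required statistics from a list of sentences,
# where each sentence is a list of word tokens.
# The variable `sentences` and its toy contents are illustrative only.
sentences = [
    ["This", "is", "a", "short", "sentence"],
    ["Here", "is", "another", "one"],
    ["Short"],
]

# Flatten into one list of tokens, lower-cased for the vocabulary count.
words = [w.lower() for sent in sentences for w in sent]
unique_words = set(words)

num_sentences = len(sentences)          # number of sentences
num_unique_words = len(unique_words)    # vocabulary size

# Average sentence length, measured in words per sentence.
avg_sentence_length = sum(len(s) for s in sentences) / num_sentences

# Shortest and longest sentence, again measured in words.
shortest = min(len(s) for s in sentences)
longest = max(len(s) for s in sentences)

print(num_unique_words)     # 8
print(num_sentences)        # 3
print(avg_sentence_length)  # 10/3 = 3.33...
print(shortest, longest)    # 1 5
```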
NLTK's corpora and how to access them are described in the Corpora Lab and in Chapter 2 of the NLTK book. Texts to try:
1. From the gutenberg corpus, try to process the file austen-emma.txt (Jane Austen's Emma).
To access the gutenberg corpus:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
Then, calling
text = gutenberg.raw("austen-emma.txt")
will return the raw text of Emma.
This is the raw text you should use to do your word and sentence segmentation. Also available, for comparison purposes...
gutenberg.words("austen-emma.txt")
will return a list of words
gutenberg.sents("austen-emma.txt")
will return a list of sentences (also broken into words)
2. From the inaugural corpus, try to process the file 2021-Biden.txt (as in the Corpora Lab) using your word and sentence segmentation from above.
inaugural.raw("2021-Biden.txt")
will return the raw text of President Biden's 2021 inaugural address
inaugural.words("2021-Biden.txt")
will return a list of words
inaugural.sents("2021-Biden.txt")
will return a list of sentences (also broken into words)
For more information on these and other details about the NLTK corpora, see the Corpora Lab and Chapter 2 of Natural Language Processing with Python (Bird, Klein, and Loper 2009).
Description of task(s):
- Design your own tokenization/segmentation scheme. For example, here are some suggestions:
Remove all linebreaks ('\n') from the text. You can do this using re.sub('(\n)+', ' ', rawtext) # there may be more than one consecutive blank line...
For sentence boundaries: the characters . (period), ! (exclamation), : (colon), and ? (question mark) mark sentence boundaries. You can split the text on these by replacing each of them with a '\n' using a regular expression (use re.sub()).
Next, replace all occurrences of - (dash) with a SPACE. Then eliminate all occurrences of , (comma), ; (semi-colon), ' (single quote), and " (double quote) by replacing them with NOTHING.
Next split the text into sentences (use <string>.split('\n') ). Now you will have a list of sentences. Sentence segmentation is complete.
Do your counts.
To extract words/tokens from this text: all you have to do now is separate each sentence into words by splitting at a SPACE. You will then have a list of all tokens.
To create a vocabulary from the list of words you may want to normalize case (to upper or lower) before creating the vocabulary.
This is ONE possible way to do word and sentence segmentation. It will suffice for this exercise.
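The suggested scheme above can be sketched end-to-end as follows. This is a minimal illustration: the sample string is made up, and (as the notes below warn) the scheme is imperfect — for instance, the period in an abbreviation like "Mr." would be treated as a sentence boundary.

```python
import re

# A made-up sample text, standing in for the raw corpus text.
rawtext = 'Hello, world! How are you?\n\nFine--thanks.'

# 1. Collapse linebreaks (possibly several in a row) into a single space.
text = re.sub('(\n)+', ' ', rawtext)

# 2. Mark sentence boundaries: replace . ! : ? with a newline.
text = re.sub(r'[.!:?]', '\n', text)

# 3. Replace dashes with a SPACE; delete commas, semicolons, and quotes.
text = re.sub(r'-', ' ', text)
text = re.sub(r'[,;\'"]', '', text)

# 4. Split into sentences, dropping any empty strings left behind.
sentences = [s for s in text.split('\n') if s.strip()]

# 5. Split each sentence into words at whitespace.
tokenized = [s.split() for s in sentences]

# 6. Normalize case before building the vocabulary.
vocabulary = set(w.lower() for sent in tokenized for w in sent)

print(sentences)   # 3 sentences
print(vocabulary)  # 7 unique words
```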
- Once you have the number of unique words, sentences, etc., compare your results with those from the pre-segmented data already provided by NLTK (see above).
You can also use the NLTK tokenizer to tokenize words and sentences, given a raw text:
wrds = nltk.word_tokenize(<rawtextString>)
sentences = nltk.sent_tokenize(<rawtextString>)
So, now you will have three sets of data: from your own segmentation, from that provided by pre-segmented texts in NLTK, and from what you obtain by using the tokenizers.
Perform a comparison of these. Use the same Notebook to do this. At the end of the Notebook, summarize your findings.
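One simple way to organize the comparison is to compute the same summary statistics for each of the three data sets and print them side by side. The sketch below uses tiny stand-in lists so it is self-contained; in your Notebook, the three entries would hold your own segmenter's output, NLTK's pre-segmented lists, and the tokenizers' output. The helper name `summarize` and the labels are illustrative only.

```python
# Sketch: tabulate token/vocabulary/sentence counts from three
# segmentations side by side. The toy data sets below are stand-ins
# for the three real data sets described above.
def summarize(words, sents):
    """Return (token count, vocabulary size, sentence count)."""
    return len(words), len(set(w.lower() for w in words)), len(sents)

datasets = {
    "my segmentation": (["The", "cat", "sat"], [["The", "cat", "sat"]]),
    "pre-segmented":   (["The", "cat", "sat", "."], [["The", "cat", "sat", "."]]),
    "nltk tokenizers": (["The", "cat", "sat", "."], ["The cat sat."]),
}

for label, (words, sents) in datasets.items():
    n_tokens, n_vocab, n_sents = summarize(words, sents)
    print(f"{label}: {n_tokens} tokens, {n_vocab} unique, {n_sents} sentences")
```

Note how even on this toy example the token counts differ (the tokenizers keep the final period as a token, while the hand-rolled scheme deletes it) — differences of exactly this kind are what your summary at the end of the Notebook should explain.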
Notes
- You should run the program on two texts of your choosing. For instance, you can get electronic texts from Project Gutenberg's website.
- In order to test your program, create a small text file of your own. Run the program on the larger texts only after making sure that it is complete and correct. You will need to eliminate the 'added' text downloaded from Project Gutenberg.
- What is a word? Think about this before you do anything. Arrive at a decision and write it down. Then encode it in your program. Same for what is a sentence. Then you can use the suggestions provided above.
- Work incrementally to accomplish the task.
- Remember that in this domain, the problems generally tend to be ill-defined and solutions also tend to be imperfect.
- This exercise is designed to help you face that reality and still explore and come up with your own solution(s) to the problem. In this particular instance, though, you can get help from your text or other sources.
- Try to document your thought process at each step in your Colab Notebook.
- At the end of the Notebook, describe the outcome of your analyses of the two texts (how they compare with the pre-segmented data in NLTK, and with NLTK's segmentation facilities), and include your well-commented program(s) and a sample output (just the number of words, sentences, etc.).
- Also, write a final section on your own reflections on the exercise, the process, and how you arrived at the solution(s). Is your solution general enough? For example, would it be able to extract the same information from another similar source? What changes/modifications would you require for another source?
WHAT TO SUBMIT
Once completed, send/share the link to your Notebook with the instructor via e-mail. To do this, click on the "Share" icon/button (see the top right of the window); in the pop-up window, change the access to "Anyone with link", copy the link, and paste it into the e-mail.