Bryn Mawr College
CS 325: Computational Linguistics - Fall 2024
Assignment #3
Due before class on Wednesday, October 23
Description:
Part 1: You are given a set of hashtags. Your goal is to segment them into tokens using an English dictionary and the hashtag segmentation algorithm. For example:
#computerscience -> computer science
Use the hashtag segmentation algorithm discussed in class. You will need to provide a lexicon of words. Here are three lexicons to try:
1. NLTK words
    import nltk
    nltk.download('words')  # needed once, if the corpus is not already installed
    NLTKdict = nltk.corpus.words.words()
2. Linux dictionary of words: available on your Linux computer at /usr/share/dict/words, or download it here.
3. A lexicon published by a company that rhymes with frugal. Click here to download/use.
Try the segmentation with all three lexicons on the hashtags provided in testHashtags.txt.
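The following is only a rough sketch of one common approach, assuming the algorithm from class is greedy longest-match segmentation (often called MaxMatch). If the class version differs, follow the class version. The function name max_match and the toy lexicon are made up for illustration.

```python
def max_match(hashtag, lexicon):
    """Greedy longest-match segmentation (assumed algorithm, see note above).

    `lexicon` is any container of lowercase words; a set gives fast lookups.
    """
    text = hashtag.lstrip("#").lower()
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for end in range(len(text), start, -1):
            if text[start:end] in lexicon:
                break
        else:
            # No lexicon word starts here: emit a single character and move on.
            end = start + 1
        tokens.append(text[start:end])
        start = end
    return tokens


# Toy lexicon, purely illustrative.
toy_lexicon = {"computer", "science", "comp", "sci"}
print(" ".join(max_match("#computerscience", toy_lexicon)))  # computer science
```

Because the match is greedy, the result depends heavily on which words the lexicon contains, which is why the three lexicons can disagree on the same hashtag.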
HINT: You will need to convert the lexicon into lowercase before using.
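One possible way to follow the hint is sketched below. The helper names load_lexicon and load_lexicon_file are hypothetical, and the sketch assumes a lexicon file with one word per line, which you should check against the files you actually download.

```python
def load_lexicon(words):
    """Return a set of lowercase words, skipping blank entries."""
    return {w.strip().lower() for w in words if w.strip()}


def load_lexicon_file(path):
    """Read a one-word-per-line file (assumed format) into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return load_lexicon(f)


# Works on any iterable of strings, e.g. the list returned by NLTK.
sample = load_lexicon(["Computer\n", "SCIENCE", "", "computer"])
print(sorted(sample))  # ['computer', 'science']
```

Storing the lexicon as a set rather than a list keeps each membership test fast, which matters because the segmenter looks up many candidate substrings per hashtag.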
Next, study your output from hashtags in testHashtags.txt and answer the following question:
When the algorithm is incorrect, what types of failures do you see and what causes each? List the types you observe with an example for each. In your report, identify at least 3 different types of errors.
Part 2: We now want to objectively evaluate the performance of the hashtag segmentation algorithm (and the lexicon).
You can use the minimum edit distance algorithm (from class) as a metric to compute a Word Error Rate (WER) between your guesses and the correct answers. WER is defined as the length-normalized minimum edit distance (i.e., minimum edit distance divided by the length of the correct segmentation string). You may use the MED Python implementation shared with you in class.
Create a working WER function that takes the output from your hashtag segmentation algorithm along with the correct answer (see file testWithAnswers.txt) and returns the WER for that pair. Show the WER for each of the test hashtags (in testHashtags.txt) with each of the three lexicons. Additionally, compute the average WER across all inputs (for each lexicon).
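A minimal sketch of the WER computation follows. It is not the MED implementation shared in class: the edit distance here is a generic stand-in, and the substitution cost (1 by default, exposed as sub_cost) is an assumption you should match to the class version. The names min_edit_distance, wer, and average_wer are illustrative. The sketch compares the two segmentations as character strings, following the "length of the correct segmentation string" wording. Passing lists of tokens instead would give a token-level rate.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance with insert/delete cost 1 (sub_cost is assumed)."""
    m = len(target)
    prev = list(range(m + 1))  # distances from the empty source prefix
    for i in range(1, len(source) + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            curr[j] = min(
                prev[j] + 1,                             # deletion
                curr[j - 1] + 1,                         # insertion
                prev[j - 1] + (0 if same else sub_cost), # match or substitution
            )
        prev = curr
    return prev[m]


def wer(guess, gold, sub_cost=1):
    """Edit distance normalized by the length of the correct answer."""
    return min_edit_distance(guess, gold, sub_cost) / len(gold)


def average_wer(pairs, sub_cost=1):
    """Mean WER over (guess, gold) pairs, e.g. all hashtags for one lexicon."""
    scores = [wer(guess, gold, sub_cost) for guess, gold in pairs]
    return sum(scores) / len(scores)


print(wer("computer science", "computer science"))  # 0.0
print(wer("computers cience", "computer science"))  # 0.125
```

Calling average_wer once per lexicon, with that lexicon's guesses paired against the answers, yields the three averages requested above.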
At the end of the notebook, write a short summary with your results.
ANOTHER DICTIONARY
In case you want to try yet another dictionary, here is a source: click here. See the README.md file. You may want to use either words.txt or words_alpha.txt.
WHAT TO SUBMIT
Once completed, share the link to your Notebook with the instructor via e-mail. To do this, click on the "Share" icon/button (see top right of window). In the pop-up window, change the access to "Anyone with link", then copy the link and paste it into the e-mail.