Bryn Mawr College
CS 325: Computational Linguistics
Lab Assignment #2
Due in class on Thursday, September 29, 2011
Description: Write a program to access the contents of a
given text and then analyze it as follows:
- Separate the contents into sentences and words.
- Count the number of unique words used in the text.
- Count the number of sentences in the text.
- Compute the average word-length of a sentence.
- Compute the length of the shortest and the longest sentences.
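As a starting point only, the steps above can be sketched in Python as follows. This is one possible reading of the task, not a reference solution. It assumes that a sentence ends at '.', '!' or '?' followed by whitespace, that a word is a run of letters (optionally joined by internal apostrophes or hyphens), and that "average word-length of a sentence" means the average number of words per sentence. The function names split_sentences, split_words and analyze are made up for this illustration. Abbreviations such as "Mr." will break the sentence rule, which is exactly the kind of imperfection you should expect and document.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    # Assumption: a word is a run of letters, optionally joined by
    # internal apostrophes or hyphens ("don't", "well-known").
    # Words are lowercased so "The" and "the" count as one unique word.
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", sentence.lower())


def analyze(text):
    # Tokenize each sentence and drop "sentences" that contain no words.
    tokenized = [split_words(s) for s in split_sentences(text)]
    tokenized = [words for words in tokenized if words]
    lengths = [len(words) for words in tokenized]
    unique = {w for words in tokenized for w in words}
    return {
        "sentences": len(lengths),
        "unique_words": len(unique),
        # Assumption: average sentence length is measured in words.
        "avg_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran away! Why?"
    print(analyze(sample))
```

Running it on a tiny hand-made sample like the one in the main block lets you check every count by hand before moving on to a full text.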
Notes
- You should run the program on two texts of your choosing. For instance,
you can get electronic texts from Project Gutenberg's website.
- In order to test your program, create a small text file of your own. Run
the programs on the larger texts only after making sure that your program
is complete and correct.
- You will need to eliminate the 'added' text downloaded from Project Gutenberg.
- What is a word? Think about this before you do anything. Arrive at a decision
and write it down. Then encode it in your program. Do the same for what counts
as a sentence.
- Work incrementally to accomplish the task.
- Remember that in this domain, the problems generally tend to be ill-defined
and solutions also tend to be imperfect.
- This exercise is designed to help you face the above reality and yet
explore and come up with your own solution(s) to the problem.
In this particular instance, though, you can get help from your text or other
sources.
- Try to document your thought process at each step.
- Once done, write down the process by which you arrived at the final solution.
- Hand in a report containing the outcome of your analyses of the two texts,
your well-commented program(s), and a sample output. Also, write a final
section on your own reflections on the exercise, the process, and how you
arrived at the solution(s). Is your solution general enough? For example,
would it be able to extract the same information from another similar source?
What changes/modifications would you require for another source?
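On the note above about eliminating the text added by Project Gutenberg: one possible approach is sketched below in Python. It assumes that the file marks the body with lines beginning with "***" that contain "START OF" and "END OF", which is common in Project Gutenberg plain-text files but not guaranteed for every one. The function name strip_gutenberg is made up for this illustration. Check the files you actually download and adjust the markers if they differ.

```python
def strip_gutenberg(text):
    # Assumption: the body sits between a line starting with "***" that
    # contains "START OF" and a later line starting with "***" that
    # contains "END OF". If a marker is missing, that side is left as is.
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        marker = line.strip().upper()
        if not marker.startswith("***"):
            continue
        if "START OF" in marker:
            start = i + 1  # body begins on the line after the start marker
        elif "END OF" in marker:
            end = i  # body stops just before the end marker
            break
    return "\n".join(lines[start:end]).strip()


if __name__ == "__main__":
    sample = (
        "Header added by the site\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "First body line.\n"
        "Second body line.\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "License text\n"
    )
    print(strip_gutenberg(sample))
```

A text with no such markers passes through unchanged apart from leading and trailing whitespace, so the same function can be applied to your own small test file.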