CS 330: Algorithms: Design & Practice

Homework3: Finding Anagrams using the algorithm described by Bentley, applied to whole texts
Due: In class on Monday, February 13, 2012

Presentation By: CCC & DDD

Now that you have a way of finding anagrams from a dictionary of words we can try and find out how many anagram classes are used in some of the well known texts. To accomplish this you will need:

  1. An electronic version of the chosen text that you can process.
  2. Extracting the dictionary of words used in that text.
  3. Finding the anagram classes.

A huge number of texts are available on the web from the site for Project Gutenberg (www.gutenberg.net). You are free to pick any three texts of your choice for this assignment:

For development and debugging, you may want to use a smaller text, like Lincoln's Gettysburg Address (also available from the above site).

What to hand in

1. A description of the entire processing scheme, including any new program(s) you had to write.

2. The size of the Gettysburg Address, the number of words in it, and a listing of all the anagram classes found.

3. The size of the text, the number of words in the text, the number of unique words in the text, the number of anagram classes in each text, and a prinout of the anagram classes that are words of 6 characters or longer. Also include the total time it took to do the processing.

Notes:

Unix has several useful utilities than can help here. For example the translate utility (tr) that you may have used earlier to convert uppercase letters to lowercase can also be used as follows:

tr -cs '[:alnum:]' '[\n*]'

This command will convert a text into one word per line. You will end up with extra blank lines. To eliminate them you can do:

tr -s '\n'

Also, the Unix utilities 'sort' and 'uniq' will come in handy. Use the 'man' command and the 'info' commands for more information on these commands.

man tr
info tr

Presentation:

CCC & DDD will give presentations on this exercise in class on Monday, February 13. The presentation should include a description of the problem and the overall solution. It should also include all the other things you had to do in the implementation process.


Back to CS330 Materials