Homework4: Finding Similar Sounding Names
Due: In class on Thursday, February 21, 2008
Presentation By: Caitlin & Lauren
The idea of forming signatures to process records/lines of data is quite common in many computing applications. You saw how Bentley used this idea to discover anagram classes in a dictionary of words. In this exercise we will use this idea to explore another application: finding similar sounding names and words. The Soundex algorithm has been put to use to identify similar sounding, yet differently spelled names or words. The Soundex algorithm has several flaws. In 1990 Lawrence Phillips designed a better phonetic encoding scheme that seems to be much better at encoding not just names but any english words based on phonetic encoding rules.
Both, Soundex and Metaphone, algorithms are described using several cases or rules that are used in the encoding. Similar to the problems discussed in Bentley's Chapter on Data Structures Algorithms, the programs to implement these algorithms can be written much better by using encoding schemes for the rules or cases of the algorithm. Thus these two algorithms embody two important programming patterns: using some kind of signature encoding, and the use of properties of the algorithm itself to encode its rules.
1) Study the Python implementations provided for Soundex and Metaphone algorithms carefully and make sure you understand them. Here is the implementation: soundex.py
Next, we will use these algorithms to find all similar sounding names in a database of names (see file FemaleNames provided in ~dkumar/Homework4 for a set of 10000 names). Here is what you have to do:
2) Use the Soundex algorithm (Python function: soundex) to create (signature, name) pairs of every name in the given data file. Pipe the putout of this to Bentley's anagram finding program to find all anagram classes, retaining only those classes that have two or more names in them).
3) Repeat (2) above but this time use the Metaphone algorithm (Pyhton function: metaphone).
Answer the following questions for each algorithm:
What to hand in
1. A printout of your answers to all the questions posed in (3)
2. A printout of any programs that you wrote that was not already provided. Explain how the programs were used in the exercise.
3. A short discussion comparing the two algorithms.
Extra Credit:
Would it be possible to use the programs above to process a dictionary of English words to find all homonyms (or homophones)? Give a printout of 25 most interesting homonyms found (you decide what is interesting).
Notes:
Refer to the previous exercise for all hints and programs. They are still usable here.
Presentation:
Caitlin & Lauren will give presentations on this exercise in class on thursday, February 21. The presentation should include a description of the problem and the overall solution. It should also include all the other things you had to do in the implementation process.