CS 337: Lab 3

Dynamic Programming and String Transformations

ID

Do not put you name on the report. Instead put 2 non-human pokemon characters. (I choose charizard and squirtle.)

Overview

When a spell checker detects an error it often suggests an alternative. This lab explores one way in which that alternative can be determined; specifically dynamic programming. As an aside, this is perhaps the most misleading name for an algorithm in computer science, since it uses a static program to develop an answer. Even so, dynamic programming has a wide range of other applications. For example, Cormen notes that can be applied in computational biology to the problem of DNA sequence alignment.

Data

In part 3 of the assignment you will find the lowest cost transformation of words given on the command line to actual english words. You will obtain these words from a simple file (albeit large) that has exactly one word per line. The file is avail on Unix from:
/home/gtowell/Public/CS337/Lab03/words

Here is a complete Java class for reading the contents of a file (formatted like the "words" file) into an ArrayList. You are not required to use Java in this lab, I simply make this code available as an example. (If you use this class as a starting point, it will need adjustments depending on how you choose to use it.) You can find similar code for other PLs on the web. (Just be sure, if you copy someone else's code, that you acknowledge their work.)
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    
    public class GGReader {
        public ArrayList readFile(String fileName) {
            ArrayList aaa = new ArrayList<>();
            try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = br.readLine()) != null) {
                    aaa.add(line.trim());
                }
                return aaa;
            } catch (Exception ee) {
                System.err.println(ee);
                return aaa;
            }
        }
        public static void main(String[] args) {
            GGReader myReader = new GGReader();
            ArrayList words = myReader.readFile(args[0]);
            System.out.println(words);
        }
    }   

Part 1

Implement the string transformation algorithm as described by Cormen on pages 121-129 of Chapter 7. Follow Cormen's algorithm exactly; including using the costs he describes. Your program should take two command line arguments: the word to be transformed and the word to transform into. The output of your program should be
    The cost table
    The "op" or actions table
    A listing of the operations taken on a minimum cost transformation (Per Cormen, the minimum cost transformation may not be unique)
    The cost
For instance: given the arguments "abcdcba" and "abracadabra" your output should look like:
Costs:
    0     2     4     6     8    10    12    14    16    18    20    22
    2    -1     1     3     5     7     9    11    13    15    17    19
    4     1    -2     0     2     4     6     8    10    12    14    16
    6     3     0    -1     1     1     3     5     7     9    11    13
    8     5     2     1     0     2     2     2     4     6     8    10
   10     7     4     3     2    -1     1     3     3     5     7     9
   12     9     6     5     4     1     0     2     4     2     4     6
   14    11     8     7     4     3     0     1     1     3     3     3
Actions:
         ia    ib    ir    ia    ic    ia    id    ia    ib    ir    ia
   da    ca    ib    ir    ca    ic    ca    id    ca    ib    ir    ca
   db    db    cb    ir    ia    ic    ia    id    ia    cb    ir    ia
   dc    dc    dc rc->r rc->a    cc    ia    id    ia    ib    ir    ia
   dd    dd    dd rd->r rd->a rd->c rd->a    cd    ia    ib    ir    ia
   dc    dc    dc rc->r rc->a    cc    ia rc->d rc->a rc->b rc->r rc->a
   db    db    cb rb->r rb->a    db rb->a rb->d rb->a    cb    ir    ia
   da    ca    da ra->r    ca    da    ca ra->d    ca    ib ra->r    ca

min cost actions: ca cb ir ia cc ia cd rc->a cb ir ca 
Cost: 3
in the Actions table above, the first letter indicates the action and the second letter indicates what to do. For example, "da" means delete the letter 'a'; "cb" means copy the letter 'b'; "ic" means insert the letter 'c'; and "rc->a" means replace the letter 'c' with the letter 'a'. Your "op" table may look somewhat different than mine depending on how you choose to abbreviate things, but the operations should be the same.

Part 2

When considering misspellings, the costs of letter replacements are not constant. That is, some mistakes are very common on a keyboard and therefore should have low cost for replacement, and others are uncommon and so should have higher costs. For this assignment we will use a very simple, and wildly incorrect, replacement cost function as follows:
Given: chA: a character
       chB: a character 
Return: an integer indicating the cost of transforming chA into chB 

func cost(chA, chB)
    let iA = int(chA)  // int gives the ASCII value of a character 
        iB = int(chB) 
    return sqrt(abs(iA-iB))+1
In addition, change the costs of the other transformations to:
    Copy = 0
    Insert = 3
    Delete = 3
Given this reformulation, below is my output for transforming "abqdddaba" into "abracadabra".
    0     3     6     9    12    15    18    21    24    27    30    33
    3     0     3     6     9    12    15    18    21    24    27    30
    6     3     0     3     6     9    12    15    18    21    24    27
    9     6     3     6     8    10    13    16    19    22    25    28
   12     9     6     9     8    10    12    13    16    19    22    25
   15    12     9    12    11    10    12    12    15    18    21    24
   18    15    12    15    14    13    12    12    14    17    20    23
   21    18    15    18    15    16    13    15    12    15    18    20
   24    21    18    21    18    19    16    18    15    12    15    18
   27    24    21    24    21    22    19    21    18    15    18    15

         ia    ib    ir    ia    ic    ia    id    ia    ib    ir    ia
   da    ca    ib    ir    ca    ic    ca    id    ca    ib    ir    ca
   db    db    cb    ir    ia    ic    ia    id    ia    cb    ir    ia
   dq    dq    dq    dq rq->a rq->c    ia rq->d    ia rq->b    ir    ia
   dd    dd    dd    dd rd->a rd->c rd->a    cd    ia    ib    ir    ia
   dd    dd    dd    dd rd->a rd->c rd->a    cd rd->a rd->b    ir rd->a
   dd    dd    dd    dd rd->a rd->c rd->a    cd rd->a rd->b    ir rd->a
   da    ca    da    da    ca    da    ca    da    ca    ib    ir    ca
   db    db    cb    db    db    db    db    db    db    cb    ir    ia
   da    ca    da    da    ca    da    ca    da    ca    da    da    ca

actions: ca cb ir rq->a rd->c rd->a cd ca cb ir ca 
Cost: 15

Part 3

In this part we will actually try to find the lowest cost misspellings. To do this, do the following
Read the words file
For each (possibly incorrectly spelled) word given on the command line
    find within the words file the lowest cost transformation 
    print the lowest cost transformation
For example, here is my output for a program written in Go:
    go run transfDict.go Needier moare ars onw comend liine
    Needier -> Seeder with cost 6
    moare -> mare with cost 3
    ars -> art with cost 2
    onw -> on with cost 3
    comend -> commend with cost 3
    liine -> ligne with cost 2
Clearly my cost function could use some work -- as could my typing since I was aiming for "Need more args on the command line". Note that the lowest cost solutions are not necessarily unique so your answers may differ from mine, but the costs should be identical.

What to hand in

Write a report describing the algorithms used in this lab and their likely application to this spelling correction problem. In the discussion of the algorithms, focus on dynamic programming; be sure that your explanation conveys a clear understanding of dynamic programming and how it is used in this Lab. Also be sure to discuss how you implementation could be enhanced to improve its application to the spelling correction problem.

The target audience of this report should be someone familiar with programming but not dynamic programming or imprecise matching.

Include, in appendices, your implementations for each of the parts 1, 2 and 3 above.

Include your results for, at least, the following inputs:

  1. Part 1: thefact factotum
  2. Part 2: thesis theorem
  3. Part 3: repurt derrrbing teh algorms usig fololwing inupts
You are certainly welcome to show the actions of you program on other inputs as well.