CS 206 -- Assignment 4
Building a sorted list
Suppose that you have a long, possibly infinite, stream of data coming to you in which the data is like the following:
3 | abbeys
|
1 | abbiamo
|
1 | abbotcy
|
3 | abbots
|
4 | abbott
|
7 | abbr
|
3 | abbreviate
|
3 | abbreviated
|
3 | abbreviates
|
3 | abbreviating
|
4 | abbreviation
|
4 | abbreviations
|
1 | abdications
|
5 | abdomen
|
PRINTWORDSASCENDING
|
3 | abdomens
|
11 | abc
|
4 | abcs
|
16 | abdominal
|
3 | abduct
|
4 | abducted
|
1 | abducting
|
6 | abduction
|
3 | abductions
|
7 | abby
|
4 | abbyy
|
PRINTCOUNTS
|
2 | abductor
|
2 | abductors
|
5 | abducts
|
PRINTWORDSDESCENDING
|
That is, ignoring for a moment the "PRINT..." lines, the data consists of rows and each row has two columns. The first is an integer and the second is a word. You may interpret this data as a word and the number of times the word occurs in some text. The problem is that the words occur in no particular. The integers will repeat many times, but each word occurs only once. Hence the number of integers will be far smaller than the number of words.
The problem is to read and store this data such that, whenever you see a line containing "PRINTWORDS" or "PRINTCOUNTS", you can quickly print out a information concerning all of the data you have seen.
There are three special lines in the dataset.
- PRINTWORDSASCENDING
- On this command, print out, in ascending order by their count, the list of words you have seen. So, in the data above, when you hit PRINTWORDSACENDING you would print:
1 | abbiamo
|
1 | abbotcy
|
1 | abdications
|
3 | abbeys
|
3 | abbots
|
3 | abbreviate
|
3 | abbreviated
|
3 | abbreviates
|
3 | abbreviating
|
4 | abbott
|
4 | abbreviation
|
4 | abbreviations
|
5 | abdomen
|
7 | abbr
|
- PRINTCOUNTS
- On the command, print the number of times each frequency has been seen. So, in the data above when you hit PRINTCOUNTS you would print
1 | 4
|
3 | 9
|
4 | 6
|
5 | 1
|
6 | 1
|
7 | 2
|
11 | 1
|
16 | 1
|
- PRINTWORDSDESCENDING
- On this command, print out, in descending order by their count, the list of words you have seen.
Your system must satisfy the following constraints:
- The time required for "PRINTWORDS..." must be O(n) where n is the number of words you have seen.
- The time required for "PRINTCOUNT" must be O(m) where m is the number of integers you have seen (m << n).
- The time to add a new word must be O(m) where m is the number of integers you have seen (m << n).
- The total amount of space required to store the data must be O(n+). (Note that when you assume that m << n then O(n+m) is the same as O(n).
- For each of the above statements, the constant terms should be close to 1. So, for instance, a solution that adds words in O(m) time but in which the constant is 10000 is not acceptable.
- The only limitation on your system should be the physical memory of the machine.
- The program must be able to read the data from a file.
- The program must accept the name of the file containing the data to be processed as a command line argument.
- In all of the above O() estimates, this should really be the worst case. It is unacceptable for the program to work on O(n) time almost all of the time but occasionally need O(n2) time.
To aid your debugging, there are several sample datafiles in /home/gtowell/cs206/a4/words? where '?' is a unix-style wildcard.
Hand in:
- a paper copy of your program
- an electronic submission of your program
Due:
March 17, 2005