CS 206 - Introduction to Data Structures

Assignment 4: Due 11:59PM Friday October 4


Read, and follow, program design principles (http://cs.brynmawr.edu/cs206/design.pdf) and code formatting standards (http://cs.brynmawr.edu/cs206/style.html) carefully. You are expected to adhere to all stated standards.

In this assignment, we'll be exploring linked lists and more complex custom-designed classes.

Input File Format

We'll be taking input from files containing lines in the following format:
rank,male-name,male-number,female-name,female-number
where the comma-separated fields have the following meanings:
rankthe ranking of the names in this file
male-namea male name of this rank
male-number number of males with this name
female-name a female name of this rank
female-numbernumber of females with this name

This is the format of database files obtained from the U.S. Social Security Administration of the top 1000 registered baby names. Each line begins with a rank, followed by the male name at that rank, followed by the number of males with that name, etc. Here is an example showing data from the year 2002:

1,Jacob,30568,Emily,24463
2,Michael,28246,Madison,21773
3,Joshua,25986,Hannah,18819
4,Matthew,25151,Emma,16538
5,Ethan,22108,Alexis,15636
6,Andrew,22017,Ashley,15342
7,Joseph,21891,Abigail,15297
8,Christopher,21681,Sarah,14758
9,Nicholas,21389,Samantha,14662
10,Daniel,21315,Olivia,14630
...
996,Ean,157,Johana,221
997,Jovanni,157,Juana,221
998,Alton,156,Juanita,221
999,Gerard,156,Katerina,221
1000,Keandre,156,Amiya,220
As you can see from the above, in 2002, there were 30,568 male babies named Jacob and 24,463 female babies named Emily, making them the most popular names used in that year. Similarly, going down the list, we see that there were 220 newborn females named Amiya, making it the 1000th most popular female baby name.

The entire data set contains a file for each year from 1990 to 2017, named names1990.csv, ..., names2017.csv. These files are in /home/gtowell/Public206/data/a4. You should use the files directly from this location. Do not make a local copy. Grading will not use local copies. (As always, for development you may do what is convenient, but this rule applies to the version you hand in.)

Specific Tasks

You will be building two linked lists to store the baby names found in a file, one for the male names and one for the female names. The linked lists should be kept in alphabetically sorted order by name. The program needs to be able to look up a name and report the following statistics:
  • For each year for which you have data
    1. rank - the rank of the name that year (for that gender)
    2. number - the number of babies given that name that year (for that gender)
    3. percentage - the percentage of babies given that name in that year (for that gender)
    4. Alphabetical rank. For instance, in 1990 Aaron is 1, Abdul is 2 and Abel is 3 for boys. In 1999, Aaliyah is 1, Abagail is 2 and Abbey is 3 for girls. (The full sorted orders are in boysort1990.csv and girlsort1999.csv. These files should be used strictly to check your program.)
    For example, for the name ``Mary'' (female), if all 28 files have been read, then you would have the following output:
    1990
    Mary: 35, 8666, 0.005432, 734
    ...
    1999
    Mary: 38, 8760, 0.005596, 727
    ...
    2017
    Mary: 126, 2381, 0.001877, 720
    
    You should design a Name class that stores all the relevant stats for a particular name. The two linked lists you are building should be from scratch; you are not allowed to use Java's built-in LinkedList class. You may use code from lectures notes or the book. Also, although generic linked lists have many advantages, your code will be simplified if you use non-generic linked lists that are locked to the Name class.

    Computing the yearly percentages requires additional data not stored in the linked lists. Consider what you need and decide where and how to store the information carefully.

    You may find it convenient to have two linked lists for every year for which you have data. Taking this approach, it might be convenient to store the linked lists in an two array lists, one for boys and one for girls.) As an alternative, you could just have one liked list for boys and one for girls. This approach make some aspects or the task easier, but some are much harder.

    Look-up via Command-line Arguments

    Your program should take command-line arguments (as you worked with in Lab 4, exercise 1). The command line arguments are as follows:.
    1. A flag, either -m or -f, which indicates a male name or a female name to look up, respectively.
    2. A name, for instance Dianna. Capitalization of the name shoudl not matter. "Dianna" should give the same results as "dianna" and "diaNNa", etc.
    3. Either the -m/-f flag or the name of a file. If -m or -f then go back to 2. If not, then this is a file name
    4. (optional) additional file names
    For example:
    java -cp bin Main -f Dianna /home/gtowell/Public206/data/a4/names1990.csv /home/gtowell/Public206/data/a4/names2000.csv
    
    will print out the ranks (alphabetic and numeric), number and percentages (as explained above) of the female name Dianna used in 1990 and 2000.

    Other possible command line input include (but are not limited to):

    java -cp bin Main -f Mary -f Amie -m DaviD /home/gtowell/Public206/data/a4/names1991.csv
    java -cp bin Main -f Devon -m Devon /home/gtowell/Public206/data/a4/names1993.csv /home/gtowell/Public206/data/a4/names1994.csv /home/gtowell/Public206/data/a4/names1995.csv /home/gtowell/Public206/data/a4/names1991.csv
    	
    You many assume that the list of filenames is always last. That is once you see a file name you will not see a person name or a -f/-m. All subsequent arguments should be assumed to be data file names. Make sure you error-check your arguments thoroughly, i.e. illegal/badly-formated/missing options. Your program should behave rationally no matter how unreasonable the input.

    If you are using eclipse for development, and your is named Assignment4, then you should be able to run your program from the command line by doing the following. First open a terminal by selecting "Applications / System Tools / MATE Terminal" from the menus in the upper left. Then

    	cd /home/YOU/cs206/Assignment4
    
    After this you should be able to use commands like the one above.

    Suggested steps:

    1. Rather than accepting command line arguments, use an array of strings that you define within the Main class. For instance, your code could look like:
      public static void main(String[] args) {
          String[] myArgs = {"-f", "Dianna", "/home/gtowell/Public206/data/a4/names1990.csv"};
          args = myArgs;
          ...
      }
      
    2. Read one file into two lists of unique names in sorted order. If you are having trouble debugging the sorted order, I suggest creating a smaller input set. One way to do this is to simply stop reading the file after 10 or 20 names.
    3. Expand your Name class to provide storage for number and rank and (if needed) alphabetic rank.
    4. Compute all the necessary totals to enable yearly percentage reporting and storing them reasonably.
    5. Enable single name lookup
    6. Enable single name lookup on a multiple files (adjusting myArgs as necessary).
    7. Enable multiple name lookup on multiple files.
    8. Enable the use of command line arguments (for good sets of arguments) as described next -- i.e., eliminate the use of myArgs.
    9. Enable checking of command line arguments to handle bad user input. Form instance, unknown files, missing arguments, etc.
    Grading will follow this order. So to maximize you grade on an incomplete assignment, be sure have step N done and correct before moving on to step N+1.

    Write-up

    Please include a write-up in your README that explains your class design and algorithms. In particular, address the following questions:
    1. What instance variables do you have in your Name class?
    2. How to you store the linked lists and how many links lists do you use?
    3. How do you organize the storage of the yearly statistics per name?
    4. How do you keep the linked lists in alphabetically sorted order?
    5. Suppose that you are required to also print out overall statistics. So the output of you program might look like:
      		1990
      		Mary: 35, 8666, 0.005432, 734
      		2017
      		Mary: 126, 2381, 0.001877, 720
      		TOTAL:
      		Mary: 110, 12477, 0.002231, 700
      		
      (I made up the total numbers). What additional data would you need to keep. In particular, suggest what you would need to do to get the total numeric rank and total alphabetic rank. (Simple average is not correct.) Do not implement this, just suggest what you would need to do to extend you program in this manner.

    Electronic Submissions

    Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. DO NOT INCLUDE:
    Data files that are read from the class site.

    The following steps for submission assume you are using Eclipse, and that you created a project named Assignment4 in the directory /home/YOU/cs206/

    1. Put the README file into the project directory (/home/YOU/cs206/Assignment4)
    2. Put your writeup in this same directory.
    3. Go to the directory /home/YOU/cs206
    4. Enter submit -c 206 -p 4 -d Assignment4 For more on using the submit script click here