CS 151 - Introduction to Data Structures

Linked lists of baby names

In this assignment you will read, store and merge one or more CSV files, each containing 2000 of the most common baby names for one year. Your program must store the baby names in a linked list that is sorted alphabetically.

The program needs to be able to look up a name and report the following statistics:
  1. name whose stats are reported
  2. number - the total number of babies given that name for all years
  3. The number of times the name appears in the lists over all years. For many common names this will be exactly the number of years of data scanned. Uncommon names may appear only once or twice over all of the years. Some names may appear twice in a given year (if they are used in both the first and second data columns), so the number of times the name appears can be greater than the number of years.
  4. Alphabetical rank for all names seen in all years.
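As a minimal sketch of the alphabetical rank statistic, the position can be found by walking the sorted list and counting nodes. The names RankNode, RankLookupSketch and alphabeticPosition are hypothetical, and counting positions from 1 is an assumption, not something the assignment states.

```java
// Hypothetical minimal node; a real node would also hold the usage counts.
class RankNode {
    final String name;
    RankNode next;

    RankNode(String name, RankNode next) {
        this.name = name;
        this.next = next;
    }
}

class RankLookupSketch {
    // Returns the 1-based position of name in an alphabetically sorted list,
    // or -1 if the name is not present. Comparison ignores case (an assumption).
    static int alphabeticPosition(RankNode head, String name) {
        int position = 1;
        for (RankNode cur = head; cur != null; cur = cur.next) {
            int cmp = cur.name.compareToIgnoreCase(name);
            if (cmp == 0) {
                return position;
            }
            if (cmp > 0) {
                return -1; // already past the spot where the name would be
            }
            position++;
        }
        return -1;
    }
}
```

Because the list is sorted, the walk can stop as soon as it passes the place where the name would have been.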
Your program should run based on input from the command line and interactive user queries. For example, assuming that the class Main contains the main method to be run:
UNIX> java Main  /home/gtowell/Public/151/A08/names2000.csv /home/gtowell/Public/151/A08/names2001.csv

Name:   (q to quit) aaron

Usage
Aaron
Total Babies       :19088
Times in top 1000  :2
Percentage         : 0.30284856619735295
Alphabetic Position: 2

Name:   (q to quit) sydney

Usage
Sydney
Total Babies       :19879
Times in top 1000  :2
Percentage         : 0.31539850416162923
Alphabetic Position: 1849

Name:   (q to quit) sidney

Usage
Sidney
Total Babies       :2922
Times in top 1000  :4
Percentage         : 0.04636020067208012
Alphabetic Position: 1807

Name:   (q to quit) Marlen

Usage
Marlen
Total Babies       :212
Times in top 1000  :1
Percentage         : 0.003363573765393903
Alphabetic Position: 1415

Name: q
The above formatting is meant as an example, not a requirement. Formatting of the output is up to you. That said, the output must contain the required information.
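One possible shape for the interactive part of the session above is a loop that prompts until the user enters q. This is only a sketch: QueryLoopSketch, run and the lookup parameter are hypothetical names, and the loop takes its input and output streams as parameters so it can be exercised without a keyboard.

```java
import java.io.InputStream;
import java.io.PrintStream;
import java.util.Scanner;
import java.util.function.Function;

class QueryLoopSketch {
    // Prompts for names until "q" (or end of input) and prints one report per name.
    // lookup is a hypothetical function that turns a name into report text.
    // Returns the number of names that were looked up.
    static int run(InputStream in, PrintStream out, Function<String, String> lookup) {
        int queries = 0;
        Scanner scanner = new Scanner(in);
        while (true) {
            out.print("Name:   (q to quit) ");
            if (!scanner.hasNextLine()) {
                break; // input ended without an explicit q
            }
            String name = scanner.nextLine().trim();
            if (name.equalsIgnoreCase("q")) {
                break;
            }
            if (name.isEmpty()) {
                continue; // ignore blank lines and prompt again
            }
            out.println(lookup.apply(name));
            queries++;
        }
        return queries;
    }
}
```

In a real program the call might look like QueryLoopSketch.run(System.in, System.out, name -> ...), with the lambda replaced by whatever lookup your list provides.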

Input Files

Input files are available at
	/home/gtowell/Public/151/A08/namesYEAR.csv
where YEAR is a number in the range 1990-2017.

These files contain lines in the following format:
rowNumber,name1,number1,name2,number2
where the comma-separated fields have the following meanings:
  1. rowNumber: ignore this column
  2. name1: a name given to babies assigned male at birth
  3. number1: the number of times name1 was used
  4. name2: a name given to babies assigned female at birth
  5. number2: the number of times name2 was used
This is the format of database files obtained from the U.S. Social Security Administration. Here is an example showing data from the year 2002:
1,Jacob,30568,Emily,24463
2,Michael,28246,Madison,21773
3,Joshua,25986,Hannah,18819
4,Matthew,25151,Emma,16538
5,Ethan,22108,Alexis,15636
6,Andrew,22017,Ashley,15342
7,Joseph,21891,Abigail,15297
8,Christopher,21681,Sarah,14758
9,Nicholas,21389,Samantha,14662
10,Daniel,21315,Olivia,14630
...
996,Ean,157,Johana,221
997,Jovanni,157,Juana,221
998,Alton,156,Juanita,221
999,Gerard,156,Katerina,221
1000,Keandre,156,Amiya,220
From the above, in 2002, the most popular baby names were Jacob, Michael, Joshua, Matthew and Emily (in that order).
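A line in this format can be split on commas to recover the two (name, number) pairs. The following is a minimal sketch; NameCount, LineParserSketch and parseLine are hypothetical names, and it assumes that names never contain commas, which holds for the sample data shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one (name, number) pair from a data line.
class NameCount {
    final String name;
    final int count;

    NameCount(String name, int count) {
        this.name = name;
        this.count = count;
    }
}

class LineParserSketch {
    // Splits "rowNumber,name1,number1,name2,number2" into two NameCount pairs.
    static List<NameCount> parseLine(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + line);
        }
        List<NameCount> pairs = new ArrayList<>();
        // fields[0] is the row number, which is ignored
        pairs.add(new NameCount(fields[1].trim(), Integer.parseInt(fields[2].trim())));
        pairs.add(new NameCount(fields[3].trim(), Integer.parseInt(fields[4].trim())));
        return pairs;
    }
}
```

For the first sample line, parseLine("1,Jacob,30568,Emily,24463") yields the pairs (Jacob, 30568) and (Emily, 24463).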

The entire data set contains a file for each year from 1990 to 2017, named names1990.csv, ..., names2017.csv.

Requirements

Thoughts

Computing the overall percentage requires some additional data not stored in the linked list. Consider what you need and decide where and how to store the information carefully.
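The percentage is not defined explicitly here. The sample output appears consistent with 100 times the babies given a name, divided by all babies counted across every file read, so the sketch below assumes that definition. PercentageSketch and its methods are hypothetical names; the point is only that a running grand total has to live somewhere outside the individual list nodes.

```java
class PercentageSketch {
    // Running total of every count read from every file (assumed denominator).
    private long grandTotal;

    // Call once for each (name, number) pair as it is read.
    void record(int count) {
        grandTotal += count;
    }

    long grandTotal() {
        return grandTotal;
    }

    // Percentage of all recorded babies that received a name with nameTotal uses.
    double percentageOf(long nameTotal) {
        if (grandTotal == 0) {
            return 0.0; // nothing read yet; avoid dividing by zero
        }
        return 100.0 * nameTotal / grandTotal;
    }
}
```

Keeping the total as a field of the list (or of a small companion object like this one) avoids re-walking the whole list for every query.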

When a name is read from a data file, you must handle the case where the name is already in the list. In such cases, rather than inserting a new item into the linked list, you should add information (the usage count and the appearance count) to the existing item.
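The insert-or-update behavior described above can be sketched as a single walk over the sorted list. This is one possible approach under stated assumptions, not the required design: NameNode, SortedNameListSketch and addOrUpdate are hypothetical names, and names are compared ignoring case.

```java
// Hypothetical node type; the field names are illustrative only.
class NameNode {
    final String name;
    int totalBabies;   // sum of the counts seen for this name
    int appearances;   // how many times the name was read from the data
    NameNode next;

    NameNode(String name, int count) {
        this.name = name;
        this.totalBabies = count;
        this.appearances = 1;
    }
}

class SortedNameListSketch {
    private NameNode head;
    private int size;

    // Inserts name in alphabetical order, or updates the existing node if present.
    void addOrUpdate(String name, int count) {
        NameNode prev = null;
        NameNode cur = head;
        // Stop at the first node whose name is not alphabetically before name.
        while (cur != null && cur.name.compareToIgnoreCase(name) < 0) {
            prev = cur;
            cur = cur.next;
        }
        if (cur != null && cur.name.compareToIgnoreCase(name) == 0) {
            cur.totalBabies += count; // name already present: update in place
            cur.appearances++;
            return;
        }
        NameNode node = new NameNode(name, count);
        node.next = cur;
        if (prev == null) {
            head = node;      // new first element
        } else {
            prev.next = node; // splice between prev and cur
        }
        size++;
    }

    int size() {
        return size;
    }

    // One line per node: name, total babies, appearances.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (NameNode cur = head; cur != null; cur = cur.next) {
            sb.append(cur.name).append(' ').append(cur.totalBabies)
              .append(' ').append(cur.appearances).append('\n');
        }
        return sb.toString();
    }
}
```

Because the search for the insertion point and the search for an existing name are the same walk, no separate lookup pass is needed before inserting.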

Suggested Steps

  1. For your development purposes, rather than always typing command-line arguments, fall back to an array of strings defined within the main method when no arguments are given. For instance, your code could look like:
    public static void main(String[] args) {
        // Default input, used only when no command-line arguments are given.
        String[] myArgs = {"/home/gtowell/Public/151/A08/names1990.csv"};
        if (args.length == 0) {
            args = myArgs;
        }
        ...
    }
    
    This has been discussed previously. Doing something like this will make development far quicker (if only because it will allow the use of the run button in VSC). To test with actual command line input you need only provide input on the command line.
  2. Initially just work with one input file.
  3. Write a toString method that prints out the contents of the linked list. You will use this toString method a lot in the next step.
  4. Set up your system so you can specify the number of lines to use. Then, while working on getting your linked list in sorted order, use only the first 2 lines (4 names). When your sorted linked list is correct for 4 names, do 3 lines (6 names), then 4, ... This may seem painful, but it is a time-honored approach. The idea is that by working with a really small data set you can easily spot (and repair) problems (presumably using the interactive debugger).
  5. Expand your class that holds names to provide storage for usage counts and appearance frequency. Also allow for updates of these quantities.
  6. Work on the user interaction.
  7. Create a system for looking up names.
  8. Expand your system to work with multiple input files.
  9. Document your code and make the user interactions clear (hopefully better than my sample above).
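For step 4 above, one way to limit how many lines are used is to stop reading once a cap is reached. The sketch below is an illustration only; LimitedReaderSketch and readLines are hypothetical names, and treating a negative cap as "no limit" is an assumption made for convenience.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LimitedReaderSketch {
    // Reads at most maxLines lines from source; a negative maxLines means no limit.
    static List<String> readLines(Reader source, int maxLines) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((maxLines < 0 || lines.size() < maxLines)
                    && (line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause in this sketch.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With a file, the source could be a java.io.FileReader opened on one of the names files; starting with a cap of 2 and raising it gradually matches the approach described in step 4.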

What to Hand in

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer.

The following steps for submission assume that you created a project named AssignmentN in the directory /home/YOU/cs151/.

  1. For this assignment N=8
  2. Put the README file into the project directory
  3. Go to the directory /home/YOU/cs151
  4. Enter /home/gtowell/bin/submit -c 151 -p N -d AssignmentN

For more on using the submit script click here