CS 151 - Introduction to Data Structures
Linked lists of baby names
In this assignment you will read, store and merge one or more csv files containing 2000 of the most common baby names by year. Your program must store the baby names in a linked list that is sorted alphabetically.
The program needs to be able to look up a name and report the following statistics:
- name whose stats are reported
- number - the total number of babies given that name for all years
- The number of times in which the name appears in the lists over all years. For many common names this will be exactly the number years of data scanned. Uncommon names may appear only once or twice over all of the years. Some names may appear twice in a given year (if they are used in both the first and second data column) so the number of times the name appears can be greater than the number of years.
- Alphabetical rank for all names seen in all years.
Your program should run based on input from the command line and interactive user queries. For example assuming that the class Main contains the main method to be run:
UNIX> java Main /home/gtowell/Public/151/A08/names2000.csv /home/gtowell/Public/151/A08/names2001.csv
Name: (q to quit) aaron
Usage
Aaron
Total Babies :19088
Times in top 1000 :2
Percentage : 0.30284856619735295
Alphabetic Position: 2
Name: (q to quit) sydney
Usage
Sydney
Total Babies :19879
Times in top 1000 :2
Percentage : 0.31539850416162923
Alphabetic Position: 1849
Name: (q to quit) sidney
Usage
Sidney
Total Babies :2922
Times in top 1000 :4
Percentage : 0.04636020067208012
Alphabetic Position: 1807
Name: (q to quit) Marlen
Usage
Marlen
Total Babies :212
Times in top 1000 :1
Percentage : 0.003363573765393903
Alphabetic Position: 1415
Name: q
The above formatting is meant as an example; not a requirement. Formating of the output is up to you. That said, the output must contain the required information.
Input Files
Input files are available at
/home/gtowell/Public/151/A08/namesYEAR.csv
where YEAR is a number in the range 1990-2017
These files contain lines in the following format:
rowNumber,name1,number1,name2,number2
where the comma-separated fields have the following meanings:
- ignore this column
- name1: a name; to be complete, name1 is a name used for an assigned gender at birth, male
- number1: the number of times name1 was used
- name2: a name; to be complete, name2 is a name used for an assigned gender at birth, female
- number2: the number of times name2 was used.
This is the format of database files obtained from the U.S. Social Security Administration. Here is an example showing data from the year 2002:
1,Jacob,30568,Emily,24463
2,Michael,28246,Madison,21773
3,Joshua,25986,Hannah,18819
4,Matthew,25151,Emma,16538
5,Ethan,22108,Alexis,15636
6,Andrew,22017,Ashley,15342
7,Joseph,21891,Abigail,15297
8,Christopher,21681,Sarah,14758
9,Nicholas,21389,Samantha,14662
10,Daniel,21315,Olivia,14630
...
996,Ean,157,Johana,221
997,Jovanni,157,Juana,221
998,Alton,156,Juanita,221
999,Gerard,156,Katerina,221
1000,Keandre,156,Amiya,220
From the above, in 2002, the most popular baby names were Jacob, Michael, Joshua, Matthew and Emily (in that order).
The entire data set contains a file for each year from 1990 to 2017, named names1990.csv, ..., names2017.csv.
Requirements
- The linked list must be kept in alphabetically sorted order by name, case insensitive. The linked list must be in sorted order at all times. (You should be able to use code you wrote in Lab.)
- The linked lists you use should be from scratch; do not use Java's built-in LinkedList class. You may use any code discussed in class or lab or in the textbook. Your linked lists may be either singly or doubly linked.
- The files to be read must be given on the command line as shown in the example above.
- The program must use a class of your design that stores all the relevant stats for a particular name.
- Your name holding class must implement the Comparable interface.
- Your linked list class must use the fact that the data items implement the Comparable interface to keep the linked items in sorted order.
- The files to be searched may only be read once.
- The program should never crash. It is OK for the program to choose to die on unexpected input or other issues. However, that should be a decision you, as the programmer made, rather than something that just happens. Use good exception handling. (For instance, suppose on the command line you give the name of a file that does not exist.)
- The interactive component of the program should have a text interface akin to that above. You need not reproduce my data format precisely.
- All code should be well-commented and follow the standard style guidelines
- Your linked list implementation must be completely generic. It should, without any code revision, be able to (for example) build a sorted list of Rabbits. (Note that the class(es) that use the linked list will certainly need a lot of code specific to handling the baby name problem. However, this should not affect your linked list code.)
- User interactions should be case insensitive
- You must have a graceful way for the user to quit the program.
- You must show the performance of your program in a "script" file that shows the output of your program on data for the years 1995-2008 (inclusive). The script file should show user interaction for the following names: ARI, Quinn, Renatta, AminA, Micah, dylan (A subset of the names of your classmates). Be sure to use capitalization as it is here.
- Answer the following question: What is the advantage of making the class that contains the baby name information implement its own compareTo method rather than just doing something that directly uses the compareTo method of the name string. (I can think of at least 3 reasons; one is that you need to do this to have a generic solution -- you should defintiely say why this is the case.) If you can't think of any other, speculate wildly. I am looking for a short paragraph.
Thoughts
Computing the overall percentage requires some additional data not stored in the linked list. Consider what you need and decide where and how to store the information carefully.
When a name is read from a data file, you must be able to handle that the name is already in the list. In such cases, rather than inserting a new item into the linked list, you should add information (the frequence of usage count and the appearance count) to the existing item.
Suggested Steps
- For your development purposes, rather than accepting command line arguments, use an array of strings defined within the main function. For instance, your code could look like:
public static void main(String[] args) {
String[] myArgs = {"/home/gtowell/Public/151/A08/names1990.csv"};
if (args.length == 0)
args = myArgs;
...
}
This has been discussed previously. Doing something like this will make development far quicker (if only because it will allow the use of the run button in VSC). To test with actual command line input you need only provide input on the command line.
- Initially just work with one input file
- Write a toString method that prints out the the contents of the linked list. You will use this toString method a lot in the next step.
- Set up your system so you can specify the number of lines to use. Then while working on getting your linked list in sorted order use only the first 2 lines (4 names). When your sorted linked list is correct for 4 names, do 3 lines (6 names), then 4, ... This may seem painful but it is a time honored approach. The idea is that by working with a really small data set you can easily spot (and repair) problems (presumably using the interactive debugger).
- Expand your class that holds names to provide storage for usage counts and appearance frequency. Also allow for updates of these quantities
- Work on the user interaction
- Create a system for looking up names
- Expand your system to work with multiple input files
- Document your code and make the user interactions clear (hopefully better than my sample above).
What to Hand in
- All code, well, documented
- The script file containing the interaction as described in the requirements section
- Readme file with the standard contents
- Answer to the question posed in the requirements section. This may be included in your readme. If not in a readme, then the only acceptable formats are text file or PDF.
- DO NOT INCLUDE: baby name data files
Electronic Submissions
Your program will be graded based on how it runs on the
department’s Linux server, not how it runs on your computer.
The following steps for submission assume that you created a project named AssignmentN in the directory /home/YOU/cs151/
- For this assignment N=8
- Put the README file into the project directory
- Go to the directory /home/YOU/cs151
- Enter /home/gtowell/bin/submit -c 151 -p N -d AssignmentN
For more on using the submit script click here