CS 206 - Introduction to Data Structures
Linked lists of baby names
In this assignment you will read, store and merge one or more cvs files containing the 1000 most common baby names for boys and girls by year. Your program should store the baby names in two independent linked lists (one for boys and one for girls) that are sorted alphabetically.
The program needs to be able to look up a name and report the following statistics:
- The sex of the name whose stats are reported
- number - the total number of babies given that name (for that gender) for all years
- percentage - the percentage of babies given that name (for that gender) for all years
- The years in which the name appeared in the list.
- Alphabetical rank (for that gender) for all names seen in all years.
Explained in more detail below, your program should run based on input from the command line and interactive user queries. For example assuming that the class Main contains the main method to be run:
UNIX> java Main /home/gtowell/Public/206/a9/names2000.csv /home/gtowell/Public/206/a9/names2001.csv
Name (q to quit): Mary
Used as a girls name
Mary
Total Babies :11921
Years in top 1000 :[2000, 2001]
Percentage : 0.004186338230563605
Alphabetic Position: 293
Name (q to quit): Devon
Used as a Boys Name
Devon
Total Babies :5695
Years in top 1000 :[2000, 2001]
Percentage : 0.0016482288847264317
Alphabetic Position: 759
Used as a girls name
Devon
Total Babies :665
Years in top 1000 :[2000, 2001]
Percentage : 2.3353031820525102E-4
Alphabetic Position: 741
Name (q to quit): Mark
Used as a Boys Name
Mark
Total Babies :9976
Years in top 1000 :[2000, 2001]
Percentage : 0.002887222362428601
Alphabetic Position: 336
Name (q to quit): Marlen
Used as a girls name
Marlen
Total Babies :212
Years in top 1000 :[2000]
Percentage : 7.444876309701235E-5
Alphabetic Position: 298
Name: q
The above formatting is meant as an example; not a requirement. Formating of the output is up to you. That said, the output must contain the required information. Note, in the example above the names are sorted in reverse alphabetic order.
Input File Format
The input files contain lines in the following format:
rank,male-name,male-number,female-name,female-number
where the comma-separated fields have the following meanings:
- rank the ranking of the names in this file (you can ignore this)
- male-name a male name of this rank
- male-number number of males with this name
- female-name a female name of this rank
- female-number number of females with this name
This is the format of database files obtained from the U.S. Social Security Administration. Here is an example showing data from the year 2002:
1,Jacob,30568,Emily,24463
2,Michael,28246,Madison,21773
3,Joshua,25986,Hannah,18819
4,Matthew,25151,Emma,16538
5,Ethan,22108,Alexis,15636
6,Andrew,22017,Ashley,15342
7,Joseph,21891,Abigail,15297
8,Christopher,21681,Sarah,14758
9,Nicholas,21389,Samantha,14662
10,Daniel,21315,Olivia,14630
...
996,Ean,157,Johana,221
997,Jovanni,157,Juana,221
998,Alton,156,Juanita,221
999,Gerard,156,Katerina,221
1000,Keandre,156,Amiya,220
From the above, in 2002, the most popular baby names were Jacob with 30,568 male babies, and Emily with 24,463 female babies. Similarly, going down the list, there were 220 newborn females named Amiya, making it the 1000th most popular female baby name.
The entire data set contains a file for each year from 1990 to 2017, named names1990.csv, ..., names2017.csv. These files are in /home/gtowell/Public/206/a9.
Requirements
- The program must use 2 separate linked lists (at least): one for the male names and one for the female names.
- The linked lists must be kept in alphabetically sorted order by name, case insensitive. The linked list must be in sorted order at all times. (You should use a variation on insertion sort.)
- The two linked lists you are building should be from scratch; you are not allowed to use Java's built-in LinkedList class. You are allowed to use any code discussed in class. Your linked lists may be either singly or doubly linked.
- The program must use a class of your design that stores all the relevant stats for a particular name.
- Your name holding class must implement the Comparable interface.
- Your linked list class should use the fact that the data items implement the Comparable interface to keep the linked items in sorted order.
- The program must accept the list of files to be searchable from the command line.
- The files to be searched may only be read once.
- The program should never crash. It is OK for the program to choose to die on unexpected input or other issues. However, that should be a decision you, as the programmer made, rather than something that just happens. Use good exception handling.
- The interactive component of the program should have a text interface akin to that above.
- All code should be well-commented and follow the standard style guidelines
- In your README, be sure to answer to question posed below.
- Your linked list implementation must be completely generic. It should, without any code revision, be able to (for example) build a sorted list of Rabbits. (Note that the class(es) that use the linked list will certainly need a lot of code specific to handling the baby name problem. However, this should not affect your linked list code.)
Thoughts
Computing the overall percentages requires some additional data not stored in the linked lists. Consider what you need and decide where and how to store the information carefully.
When a name is read from a data file, you must be able to handle that the name is already in the list. In such cases, rather than inserting a new item into the linked list, you should add information (the count and the year) to the existing item.
When implementing sorting -- via insertion sort -- first make the class holding the baby name information implement the Comparable interface. Then use the compareTo method when determining where to insert. Note that your compareTo method might be only one line long as all it (probably) needs to do is invoke the compareTo method of the underlying name.
Suggested Steps
- For your development purposes, rather than accepting command line arguments, use an array of strings defined within the main function. For instance, your code could look like:
public static void main(String[] args) {
String[] myArgs = {"/home/gtowell/Public/206/a9/names1990.csv"};
if (args.length == 0)
args = myArgs;
...
}
This has been discussed previously. Doing something like this will make development far quicker. To test with actual command line input you need only provide input on the command line.
- Initially just work with one input file
- Write a toString method that prints out the the contents of a the linked lists. You will use this toString method a lot in the next step.
- Read one file into two lists of unique names in sorted order. If you are having trouble debugging the sorted order, I suggest creating a smaller input set. One way to do this is to simply stop reading the file after 2 names. When correct with 2 names, do 3, then 4, ... This may seem painful but it is a time honored approach. It is usually easier to identify and correct bugs using this procedure than any other.
- Expand your class that holds names to provide storage for usage counts and years
- Figure out a way to extract years from the file name. The file name is the only place that this information appears.
- Work on the user interaction
- Create a system for looking up names
- Expand your system to work with multiple input files
- Document your code and make the user interactions clear (better than my sample above).
Electronic Submissions
Your program will be graded based on how it runs on the
department’s Linux server, not how it runs on your computer.
The submission should include the following items:
- README: This file should follow the format of this sample README (https://cs.brynmawr.edu/cs206/README.txt)
- Within the README file, include a brief (max 1 paragraph) reflection on what went well and/or poorly and why
- In your readme add a short paragraph on the following topic: What is the advantage of making the class that contains the baby name information implement its own compareTo method rather than just doing something that directly uses the compareTo method of the name string. (I can think of at least 3 reasons; one is that you need to do this to have a generic solution -- you should defintiely say why this is the case.) If you can't think of any other, speculate wildly.
- Source files: Every .java file used in the final version of every part of your project (including the imported files)
- Unique Data files used: If any (there should not be any)
The following steps for submission assume that you created a project named AssignmentN in the directory /home/YOU/cs206/
- For this assignment N=9
- Put the README file into the project directory
- Go to the directory /home/YOU/cs206
- Enter /home/gtowell/bin/submit -c 206 -p N -d AssignmentN
For more on using the submit script click here