Rock-a-Bye Baby Names

CS 206 - Introduction to Data Structures

Homework 10

Trees of Baby Names

Due Friday, Dec 11 prior to 11:59PM

Remember: Read, and follow, program design principles (http://cs.brynmawr.edu/cs206/design.pdf) and code formatting standards (http://cs.brynmawr.edu/cs206/style.html) carefully.

Overview

Recall that a Binary Search Tree has the property that all items in the right subtree are greater than the root of the tree and that all items in the left subtree are less than the root of the tree.

In this assignment you will be building and updating a trees so that, at all times, they have the property of being binary search trees.

Data

The data you will be workiing with is the baby name data from the previous assignment. However, for this assignment the data files have been reworked. Specifically, the data is now simply a name and a number on each line, with no gender attribution. In addition, the order of the names has been randomized. Hence the data looks like

Vivian,1176
Keshawn,380
Camryn,1753
Savanah,438
Katelynn,1388
Kailee,536
David,19767
Montana,432
Gerard,194
Paloma,245
Trenton,2063
Tyree,466

Note that since gender attribution has been removed, some names may appear in twice within the file for a given year. As in the previous assignment, the data is in files by year, from 1990 through 2017.

The data files are in /home/gtowell/Public/206/a10/snames*.csv

Task

The central task in this assignment is to build a single binary search tree that holds all of the baby name data. Unlike in the previous assignment, the only data of concern in this assignment is the name and the total number of times it has been used. Hence, if you were to see the following lines (presumably appearing in three data files):

Devon,2972
Devonte,456
Devon,371
Devon,294
Devon,2723
Devonte,369

then the data you are storing should only have two records: Devon, 6360 and Devonte, 825.

The binary search tree (BST) must be for the number of times a name has been used in the data you have already read. The BST must be correct after every update (reading each line of data). To keep it correct every time you update an intem already in the tree you must determine if the update made the tree not a BST. If it is not a BST, you need to repair the tree. (For instance, it is not a BST if the in-order successor is less then the current node. There may be other conditions, this is certainly one.) There are several ways of updating the tree to make it a BST. I recommend just deleting the updated node and then adding it back into the tree. This is rather inefficient, but it is by far easier than other approaches. (As a warning, be careful in your handling of ties.)

The fact that the BST is built for the number of name uses will make updating items in the tree awkward since you will not know that number when you read an item. Rather you will know the name and the number of new uses. Since the tree is not built to enable search by name there is little you can do oher than an exhustive search if you use only a BST based on usage count. To make this search easier, you may use any additional data structure you like. (For instance, you could have a second BST, this one ordered by name.) You are not required to use an additional data structure, but you may. Note that if you use an additional data structure you need to be careful to keep both structures up-to-date.

Specific Task

Your program should take command line input of the form:

    java Main numberToShow linesToRead file1 ... fileN

where:

Main is the name of the java class containing your main method
numberToShow is the number of names to show in order of usage. If numberToShow is greater than zero this should show names in desceding order starting with the most common. If numberToShow is less than zero, this should show names in ascending order, startng from the smallest.
linesToRead is the maximum number of lines of baby names to read from the provided files. For instance, if linesToRead is 100000 then you would read every line of every file provided since the total number of lines in all of the name files is less than 100000. On the other hand if the number is 3000, then the entire first file would be read and half of the second (each file has 2000 lines). So even if 5 file names are provided, only names from the first 2 files would be read.
file1 .. fileN the names of the files to read (Note that, as described above, given linesToRead you may not actually read anything from files other than file1.)

Here are some examples and outputs (Note that, especially when showing low counts, your results may differ slightly due to differences in handling ties)

java Main 10 7000 snames2000.csv snames2001.csv snames2002.csv
Jacob,97580
Michael,89962
Matthew,80529
Joshua,79537
Emily,75471
Christopher,69735
Nicholas,68883
Andrew,68062
Joseph,66720
Daniel,64611

java Main -5 7000 snames2000.csv snames2001.csv snames2002.csv
Dayne,148
Francesco,148
Dillion,149
Vincenzo,149
Nicklaus,150

java Main -5 1000 snames2000.csv snames2001.csv snames2002.csv
Donnell,149
Treyvon,149
Vincenzo,149
Braiden,150
Nicklaus,150

java Main 5 500 snames2000.csv snames2001.csv snames2002.csv 
Jacob,34471
Joseph,22825
Tyler,21503
William,20659
Madison,19967

java Main 5 5000 snames2000.csv snames2001.csv snames2002.csv
Matthew,80529
Emily,75471
Nicholas,68883
Jacob,67012
Daniel,64611

java Main 5 5000 snames2002.csv snames2001.csv snames2000.csv
Jacob,97580
Michael,89962
Matthew,80529
Joshua,79537
Emily,75471

Some Suggestions

Be careful of how you handle ties
It you get too frustrated with Java Generics, you may write everything in you tree code to be specific to the class you write to hold the name/value pairs. Doing so will reduce the possible grade by 7 points.
Start by just building your binary tree. Make sure insert is working perfectly. Then make sure that remove is working perfectly.
When working on updating the tree, start by reading about 3 lines from a single file, then read exactly those 3 lines again. Make sure that updates occur correctly for just those three items. Then do about 10 items, again, making sure that everything is precisely correct with 10 items. By working with really small data sets you can check everything by hand.
Be sure that you code accepts the correct number of inputs from the command line. You might ignore some inputs (see the next 2 points) but those inputs should be present.
Only after all of this is working, consider the problem of limiting the number of items shown to some number. (Limiting printout to N items is worth 5 points.)
Only after everything else is working should you work on the ascending / descending sort order. (Ascending and descending sort orders is worth 5 points.)

Electronic Submissions

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer.

README: This file should follow the format of this sample README (https://cs.brynmawr.edu/cs206/HW/README.txt)
Source files: Every .java file used in the final version of your project
Unique Data files used: This should be blank as the only data files you should use are as above

DO NOT INCLUDE:

Data files that are read from the class site.

The following steps for submission assume that you created a folder named Assignment10 in the directory /home/YOU/cs206/ and that all of your code, along with the README file, is inn this directory.

Put the README file into the project directory (/home/YOU/206/Assignment10)
Go to the directory /home/YOU/cs206
Enter submit -c 206 -p 10 -d Assignment9

For more on using the submit script click here