Hashtables, Sorting and Complexity

Due Thursday, December 12, before midnight.

In this assignment, we'll explore hashtables and take a further look at sorting. To motivate this exploration, we will compare the efficiency of different methods of removing duplicated data. (Rather than actually removing duplicates, we will count the number of times data is duplicated.)

Data Deduplication

One important, and potentially time-consuming, task when handling data is data deduplication. The goal of data deduplication, as the name suggests, is to make sure that each item in a data set appears only once. We'll explore three different ways this can be done and determine which one is most efficient.

The Data: Stop, Question, and Frisk

The data set we will be working with in this assignment is from the police department of the city of New York.

The New York City police department follows a practice of regularly stopping, questioning, and frisking people on the street. A 2011 lawsuit against this practice by the NYCLU successfully curbed the police department's frequent use of it, though concerns remain. The NYCLU's description of this practice is below:

	The NYPD's stop-and-frisk practices raise serious concerns over how the police treat the people they're supposed to protect. The Department's own reports on its stop-and-frisk activity confirm what many people in communities of color across the city have long known: The police continue to target, stop, and frisk people of color, especially young black and Latino men and boys, the vast majority of whom have done nothing wrong.

The NYCLU's 2011 lawsuit, along with other litigation, helped curb the NYPD's stop-and-frisk program. Today, the number of stops is just a fraction of what it was at its peak, though troubling racial disparities remain. The NYCLU continues to monitor, analyze, and shape stop-and-frisk practices through data analyses, advocacy for legislation to bring more transparency to policing practices, and work in coalition with community partners.

More information about the data can be found via the NYCLU's website or via the NYPD.

We'll be looking at the data collected in 2011, at the height of the NYPD's practice of stop-and-frisk. Each row in the data represents a single stop by a police officer, and the attributes (columns) include information about the stop (e.g., whether a weapon was found), information about the person stopped (e.g., age, race, etc.), and information about the stop location (e.g., the address). The dataset is in /home/gtowell/Public206/a10/2011.csv or on the NYPD website, along with the codebook describing what is included in the data.

Column definitions for the csv file are in /home/gtowell/Public206/a10/2011SQFFileSpec.[xlsx,pdf], in Excel and PDF forms.

Determining Equality

The deduplication goal we will consider here is to collect a list of individuals who were stopped during the year. One way to think about this is that we would like to know which people were stopped multiple times by the NYPD in 2011.

In order to determine whether two rows refer to the same person, we need to develop a function that compares the information that should be unique to each person in order to determine equality. We need to choose this information carefully, or deduplication will fail.
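One way this equality check might look in Java is a class that overrides equals and hashCode over a chosen subset of attributes. This is only a sketch: the field names below are hypothetical stand-ins, not the actual SQF column names, which you should take from the codebook.

```java
import java.util.Objects;

// Sketch of identity-based equality. The fields used here (race, sex,
// dob, height) are hypothetical; choose real columns from the codebook.
public class StoppedPerson {
    final String race;
    final String sex;
    final String dob;     // date of birth
    final String height;

    public StoppedPerson(String race, String sex, String dob, String height) {
        this.race = race;
        this.sex = sex;
        this.dob = dob;
        this.height = height;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StoppedPerson)) return false;
        StoppedPerson p = (StoppedPerson) o;
        // Two rows are "the same person" iff all chosen attributes match.
        return race.equals(p.race) && sex.equals(p.sex)
                && dob.equals(p.dob) && height.equals(p.height);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: built from the same fields.
        return Objects.hash(race, sex, dob, height);
    }
}
```

Keeping hashCode consistent with equals matters later: the HashMap-based method relies on equal objects hashing to the same bucket.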

Requirements

Data Storage

You'll need a class to store the full data set.

Requirements

Deduplication Methods

You will apply three data deduplication methods to the data set class you created in the previous section. Each of these methods should return a count of the number of duplicate items; in other words, the number of people who were frisked more than once. For instance, consider the following list of integers:
	1,2,3,4,5,2,3,1,2,3,4,5,4,5,6,7,8,9
Fourteen items in this list have a value that appears more than once, but only 5 distinct values (1, 2, 3, 4, 5) appear more than once. We are interested in the second number.
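The distinction between the two counts can be illustrated with a quick tally. This sketch uses a HashMap purely to make the definitions concrete; the assignment's three methods are described below.

```java
import java.util.HashMap;
import java.util.Map;

public class DupCounts {
    // Tally how many times each value occurs in the list.
    static Map<Integer, Integer> tally(int[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : data) {
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }

    // Total occurrences of values that appear more than once.
    static int totalDuplicated(int[] data) {
        int total = 0;
        for (int c : tally(data).values()) {
            if (c > 1) total += c;
        }
        return total;
    }

    // Number of distinct values that appear more than once
    // (the number the assignment asks for).
    static int uniqueDuplicated(int[] data) {
        int unique = 0;
        for (int c : tally(data).values()) {
            if (c > 1) unique++;
        }
        return unique;
    }
}
```

For the example list, totalDuplicated returns 14 and uniqueDuplicated returns 5.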

All Pairs Deduplication

One way to find the number of duplicates in a data set is to compare each item to every other item, counting duplicates as you go. Make sure not to count the comparison of an item with itself as a duplicate. We will call this the "all pairs" method of data deduplication: start with item 1 and go through all of the other items looking for duplicates; then do the same for items 2, 3, 4, and so on. Be careful not to count individuals twice (or more), and try to be efficient. For instance, you need not search for duplicates of an item that has already been identified as a duplicate.
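Using integers in place of data records, the all-pairs method might be sketched as below. The boolean array is one possible way to implement the "skip items already identified as duplicates" optimization mentioned above.

```java
public class AllPairs {
    // O(n^2) all-pairs duplicate count: for each item, scan the items
    // after it; skip items already marked as duplicates of an earlier item.
    static int uniqueDuplicated(int[] data) {
        boolean[] alreadyDup = new boolean[data.length];
        int dup = 0;
        for (int i = 0; i < data.length; i++) {
            if (alreadyDup[i]) continue; // counted via an earlier item
            boolean found = false;
            for (int j = i + 1; j < data.length; j++) {
                if (data[i] == data[j]) {
                    alreadyDup[j] = true;
                    found = true;
                }
            }
            if (found) dup++; // count each duplicated value once
        }
        return dup;
    }
}
```

In the real program the inner `==` test would be replaced by the equality function over the chosen attributes.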

Hash Table Deduplication

Another way to find the number of duplicates is to create a key for each item in the data set and insert these items into a hashtable, with the count of the number of times this item is seen as the value. The choice of key will determine whether two items are considered to be duplicates, so choose it carefully.

Rather than implementing your own hashtable, use the Java HashMap class for this task.
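A sketch of this approach, assuming the key is a string built from the identity attributes (the attribute names here are hypothetical; substitute the columns you chose):

```java
import java.util.HashMap;
import java.util.Map;

public class HashDedup {
    // Build one key per record from the chosen identity attributes and
    // tally keys in a HashMap; the value is how often that key was seen.
    // keys[i] stands in for "the key built from row i of the data set".
    static int repeatedIndividuals(String[] keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        int repeated = 0;
        for (int c : counts.values()) {
            if (c > 1) repeated++; // keys seen more than once
        }
        return repeated;
    }
}
```

Note that the choice of key plays the same role here as the equality function in the all-pairs method: two records are duplicates exactly when they produce the same key.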

Sorting for Deduplication

The last method for counting duplicates that we'll look at involves sorting the data in order to check for duplicates. Note that the comparison method you use will play an important role in determining whether you correctly find the duplicates. First, we'll need some functions that allow us to sort and search. Steps:
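The idea, sketched below on integers, is that after sorting all duplicates sit next to each other, so a single linear scan over runs of equal values finds them. (This sketch uses Arrays.sort for brevity; the assignment may require your own sorting function.)

```java
import java.util.Arrays;

public class SortDedup {
    // Sort a copy of the data, then scan runs of equal adjacent values.
    // Each run longer than one contributes a single duplicated item.
    static int uniqueDuplicated(int[] data) {
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        int dup = 0;
        int i = 0;
        while (i < sorted.length) {
            int runStart = i;
            while (i + 1 < sorted.length && sorted[i + 1] == sorted[runStart]) {
                i++; // extend the run of equal values
            }
            if (i > runStart) dup++; // run length > 1 => duplicated value
            i++;
        }
        return dup;
    }
}
```

This is why the comparison method matters: the sort must order records so that all records considered equal end up adjacent.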

Efficiency Analysis

Using the methods you developed, you will, in this section, consider their efficiency both experimentally and theoretically. The answers to these questions should be given in your README file.

Requirements

Answer the following questions:

Command Line Input

Your program should take exactly two inputs from the command line. The first input should be a 1, 2, or 3, selecting which of the three deduplication methods to run. The second input should be the name of a data file. The data file used in testing may not be the same as the data file you have been provided; the testing data will exactly follow the format of the sample data but may be otherwise completely different. For instance, the data may be from a different year. The output of the program should be:
Records in data set:2000
Attributes checked:X,Y,Z 
Number of individuals who were frisked more than once: 356
Note that "Attributes checked" is a statement about which attributes your program used for the equality determination.
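The argument handling might be structured as follows. The class and method names are placeholders, and the mapping of 1/2/3 to the three methods is an assumption; only the "exactly two inputs, method number then file name" contract comes from the assignment.

```java
public class Dedup {
    // Parse and validate the method-selection argument (must be 1, 2, or 3).
    static int parseMethod(String arg) {
        int m = Integer.parseInt(arg);
        if (m < 1 || m > 3) {
            throw new IllegalArgumentException("method must be 1, 2, or 3");
        }
        return m;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java Dedup <method 1|2|3> <datafile>");
            System.exit(1);
        }
        int method = parseMethod(args[0]);
        String dataFile = args[1];
        // ... load dataFile, run the selected method, then print the
        // three required report lines (records, attributes, count) ...
    }
}
```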

What to submit

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items:

The following steps for submission assume you are using Eclipse, and that you created a project named AssignmentN in the directory /home/YOU/cs206/

  1. For this assignment N=10
  2. Put the README file into the project directory (/home/YOU/cs206/AssignmentN)
  3. Go to the directory /home/YOU/cs206
  4. Enter submit -c 206 -p N -d AssignmentN

For more on using the submit script, see the course's submit documentation.