Hashtables, Sorting and Complexity

Due Thursday, December 12, before midnight.

In this assignment, we'll explore hashtables and take a further look at sorting. To motivate this exploration, we will compare the efficiency of different methods of removing duplicated data. (Rather than actually removing duplicates, we will count the number of times data is duplicated.)

Data Deduplication

One important, and potentially time-consuming, task when handling data is data deduplication. The goal of data deduplication, as the name suggests, is to make sure that each item in a data set appears only once. We'll explore three different ways this can be done and determine which one is most efficient.

The Data: Stop, Question, and Frisk

The data set we will be working with in this assignment is from the police department of the city of New York.

The New York City police department follows a practice of regularly stopping, questioning, and frisking people on the street. A 2011 lawsuit against this practice by the NYCLU successfully curbed the police department's frequent use of it, though concerns remain. The NYCLU's description of this practice is below:

	The NYPD's stop-and-frisk practices raise serious concerns over how the police treat the people they're supposed to protect. The Department's own reports on its stop-and-frisk activity confirm what many people in communities of color across the city have long known: The police continue to target, stop, and frisk people of color, especially young black and Latino men and boys, the vast majority of whom have done nothing wrong.

The NYCLU's 2011 lawsuit, along with other litigation, helped curb the NYPD's stop-and-frisk program. Today, the number of stops is just a fraction of what it was at its peak, though troubling racial disparities remain. The NYCLU continues to monitor, analyze, and shape stop-and-frisk practices through data analyses, advocacy for legislation to bring more transparency to policing practices, and work in coalition with community partners.

More information about the data can be found via the NYCLU's website or via the NYPD.

We'll be looking at the data collected in 2011, at the height of the NYPD's practice of stop-and-frisk. Each row in the data represents a single stop by a police officer, and the attributes (columns) include information about the stop (e.g., whether a weapon was found), information about the person stopped (e.g., age, race, etc.), and information about the stop location (e.g., the address). The dataset is in /home/gtowell/Public206/a10/2011.csv or on the NYPD website, along with the codebook describing what is included in the data.

Column definitions for the csv file are in /home/gtowell/Public206/a10/2011SQFFileSpec.[xlsx,pdf], in Excel and PDF forms.

Determining Equality

The deduplication goal we will consider here is to collect a list of individuals who were stopped during the year. One way to think about this is that we would like to know which people were stopped multiple times by the NYPD in 2011.

In order to determine whether two rows refer to the same person, we need to develop a function that compares the information that should be unique to each person in order to determine equality. We need to choose this information carefully, or deduplication will fail.
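One way this equality check might look in Java is a class that overrides equals and hashCode over a chosen subset of attributes. This is only a sketch: the field names below are hypothetical stand-ins, not the actual SQF column names, which you should take from the codebook.

```java
import java.util.Objects;

// Sketch of identity-based equality. The fields used here (race, sex,
// dob, height) are hypothetical; choose real columns from the codebook.
public class StoppedPerson {
    final String race;
    final String sex;
    final String dob;     // date of birth
    final String height;

    public StoppedPerson(String race, String sex, String dob, String height) {
        this.race = race;
        this.sex = sex;
        this.dob = dob;
        this.height = height;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StoppedPerson)) return false;
        StoppedPerson p = (StoppedPerson) o;
        // Two rows are "the same person" iff all chosen attributes match.
        return race.equals(p.race) && sex.equals(p.sex)
                && dob.equals(p.dob) && height.equals(p.height);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: built from the same fields.
        return Objects.hash(race, sex, dob, height);
    }
}
```

Keeping hashCode consistent with equals matters later: the HashMap-based method relies on equal objects hashing to the same bucket.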

Requirements

Data Storage

You'll need a class to store the full data set.

Requirements

Deduplication Methods

You will apply three data deduplication methods to the data set class you created in the previous section. Each of these methods should return a count of the number of duplicate items; in other words, the number of people who were frisked more than once. For instance, consider the following list of integers:
	1,2,3,4,5,2,3,1,2,3,4,5,4,5,6,7,8,9
Fourteen items in this list have a value that appears more than once, but only 5 distinct values (1, 2, 3, 4, 5) appear more than once. We are interested in the second number.
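The distinction between the two counts can be illustrated with a quick tally. This sketch uses a HashMap purely to make the definitions concrete; the assignment's three methods are described below.

```java
import java.util.HashMap;
import java.util.Map;

public class DupCounts {
    // Tally how many times each value occurs in the list.
    static Map<Integer, Integer> tally(int[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : data) {
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }

    // Total occurrences of values that appear more than once.
    static int totalDuplicated(int[] data) {
        int total = 0;
        for (int c : tally(data).values()) {
            if (c > 1) total += c;
        }
        return total;
    }

    // Number of distinct values that appear more than once
    // (the number the assignment asks for).
    static int uniqueDuplicated(int[] data) {
        int unique = 0;
        for (int c : tally(data).values()) {
            if (c > 1) unique++;
        }
        return unique;
    }
}
```

For the example list, totalDuplicated returns 14 and uniqueDuplicated returns 5.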

All Pairs Deduplication

One way to find the number of duplicates in a data set is to compare each item to every other item, counting duplicates as you go. Make sure not to count the comparison of an item with itself as a duplicate. We will call this the "all pairs" method of data deduplication: start with item 1 and go through all of the other items looking for duplicates; then do the same for items 2, 3, 4, and so on. Be careful not to count individuals twice (or more), and try to be efficient. For instance, you need not search for duplicates of an item that has already been identified as a duplicate.
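Using integers in place of data records, the all-pairs method might be sketched as below. The boolean array is one possible way to implement the "skip items already identified as duplicates" optimization mentioned above.

```java
public class AllPairs {
    // O(n^2) all-pairs duplicate count: for each item, scan the items
    // after it; skip items already marked as duplicates of an earlier item.
    static int uniqueDuplicated(int[] data) {
        boolean[] alreadyDup = new boolean[data.length];
        int dup = 0;
        for (int i = 0; i < data.length; i++) {
            if (alreadyDup[i]) continue; // counted via an earlier item
            boolean found = false;
            for (int j = i + 1; j < data.length; j++) {
                if (data[i] == data[j]) {
                    alreadyDup[j] = true;
                    found = true;
                }
            }
            if (found) dup++; // count each duplicated value once
        }
        return dup;
    }
}
```

In the real program the inner `==` test would be replaced by the equality function over the chosen attributes.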

Hash Table Deduplication

Another way to find the number of duplicates is to create a key for each item in the data set and insert these items into a hashtable, with the count of the number of times this item is seen as the value. The choice of key will determine whether two items are considered to be duplicates, so choose it carefully.

Rather than implementing your own hashtable, use the Java HashMap class for this task.
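A sketch of this approach, assuming the key is a string built from the identity attributes (the attribute names here are hypothetical; substitute the columns you chose):

```java
import java.util.HashMap;
import java.util.Map;

public class HashDedup {
    // Build one key per record from the chosen identity attributes and
    // tally keys in a HashMap; the value is how often that key was seen.
    // keys[i] stands in for "the key built from row i of the data set".
    static int repeatedIndividuals(String[] keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        int repeated = 0;
        for (int c : counts.values()) {
            if (c > 1) repeated++; // keys seen more than once
        }
        return repeated;
    }
}
```

Note that the choice of key plays the same role here as the equality function in the all-pairs method: two records are duplicates exactly when they produce the same key.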

Sorting for Deduplication

The last method for counting duplicates that we'll look at involves sorting the data in order to check for duplicates. Note that the comparison method you use will play an important role in determining whether you correctly find the duplicates. First, we'll need some functions that allow us to sort and search. Steps:
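The idea, sketched below on integers, is that after sorting all duplicates sit next to each other, so a single linear scan over runs of equal values finds them. (This sketch uses Arrays.sort for brevity; the assignment may require your own sorting function.)

```java
import java.util.Arrays;

public class SortDedup {
    // Sort a copy of the data, then scan runs of equal adjacent values.
    // Each run longer than one contributes a single duplicated item.
    static int uniqueDuplicated(int[] data) {
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        int dup = 0;
        int i = 0;
        while (i < sorted.length) {
            int runStart = i;
            while (i + 1 < sorted.length && sorted[i + 1] == sorted[runStart]) {
                i++; // extend the run of equal values
            }
            if (i > runStart) dup++; // run length > 1 => duplicated value
            i++;
        }
        return dup;
    }
}
```

This is why the comparison method matters: the sort must order records so that all records considered equal end up adjacent.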

Efficiency Analysis

Using the methods you developed, you will, in this section, consider their efficiency both experimentally and theoretically. The answers to these questions should be given in your README file.

Requirements

Answer the following questions:

Command Line Input

Your program should take exactly two inputs from the command line. The first input should be a 1, 2, or 3, selecting which of the three deduplication methods to run. The second input should be the name of a data file. The data file used in testing may not be the same as the data file you have been provided; the testing data will exactly follow the format of the sample data but may be otherwise completely different. For instance, the data may be from a different year. The output of the program should be:
Records in data set:2000
Attributes checked:X,Y,Z 
Number of individuals who were frisked more than once: 356
Note that "Attributes checked" is a statement about which attributes your program used for the equality determination.
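The argument handling might be structured as follows. The class and method names are placeholders, and the mapping of 1/2/3 to the three methods is an assumption; only the "exactly two inputs, method number then file name" contract comes from the assignment.

```java
public class Dedup {
    // Parse and validate the method-selection argument (must be 1, 2, or 3).
    static int parseMethod(String arg) {
        int m = Integer.parseInt(arg);
        if (m < 1 || m > 3) {
            throw new IllegalArgumentException("method must be 1, 2, or 3");
        }
        return m;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java Dedup <method 1|2|3> <datafile>");
            System.exit(1);
        }
        int method = parseMethod(args[0]);
        String dataFile = args[1];
        // ... load dataFile, run the selected method, then print the
        // three required report lines (records, attributes, count) ...
    }
}
```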

What to submit

Your program will be graded based on how it runs on the department’s Linux server, not how it runs on your computer. The submission should include the following items:

The following steps for submission assume you are using Eclipse, and that you created a project named AssignmentN in the directory /home/YOU/cs206/

  1. For this assignment N=10
  2. Put the README file into the project directory (/home/YOU/cs206/AssignmentN)
  3. Go to the directory /home/YOU/cs206
  4. Enter submit -c 206 -p N -d AssignmentN

For more on using the submit script, see the course's submit documentation.