CS 330: Algorithms: Design & Practice

Homework1: Sorting large data files
Due: In class on Tuesday, January 28, 2008

Presentation By: Jessica Billings & Priscy Pais

For each of the programs provided in this exercise (bitsortgen.c, qsortints.c, sortints.c, and the system sort), carefully study them first and make sure you understand them well. This will also help you learn (or re-learn) the languages and programming techniques used. You can download/copy the programs from ~dkumar/cs330/Homework1/...

1. Use the bitsortgen.c program (See files in ~dkumar/cs330/Homework1) to generate datasets of sizes 100000, 300000, 500000, 1000000, 2000000, 3000000, 4000000, 5000000 non repeating positive integers in the range [0, 5000000). Store these numbers in individual data files: say data100k, data300k, ..., data5M.

Time each run, record it and plot the times on a graph.

2. Sort the data generated above using (a) The system provided sort program (sort -n), (b) A program that uses the quicksort algorithm (see file qsortints.c), (c) A program that uses the C++ <set> library module (see file sortints.cpp). For all these runs, make sure the data is output to a file.

Sort the data sets generated in (1) using the three schemes (a), (b), and (c). Time the runs and record the times in a table.

 

Dataset Size
bitsortgen
sort(a)
sort(b)
sort(c)
100K
       
300k
       
500k
       
1M
       
2M
       
3M
       
4M
       
5M
       

You may need to run the program several times to get an average time for each size. Record the average times only for each size and then plot them.

Extra Credit

Rewrite bitsortgen and sortints in Python and provide timing data as above. How do these compare to the results of the C programs?

What to hand in

1. The times recorded in 1. and a plotof the times. What can you say about the time complexity of the number generating algorithm?

2. Plot on a separate graph a comparison of all the methods: (a), (b), and (c). What can you say about the time complexity of each of these methods?

Notes:

To compile C programs do:

> gcc file.c -o file

To run the resulting executable:

> ./file

To compile a C++ program do:

> g++ file.cpp -o file

You should only record the user cpu times. To get the CPU time used by a program you can use the GNU time command as follows:

/usr/bin/time <some unix command>

By default this prints out a number of statistics (see the manual pages for more information). In order to just get the CPU time you should first set the value of an environment variable TIME, as follows:

export TIME="CPU %U"

Then use the time command as described above. The time reported is in seconds.

Notice that the files will get rather large. So you may not want to store all the files all at the same time. Work with each data set in batches, removing older data sets before moving on to the next one.

Presentation:

The designated person(s) will give a presentation on this exercise in class on Tuesday, January 23. Check the presentation schedule to ensur ethat you know when you will be presenting. Details of the presentation will be provided in class. Your instructor can provide further guidance in preparing your presentation.

Your presentation should include a description of the three sorting techniques used. Also, it should include a description of the algorithm for generating unique random numbers. Your presentation should discuss the performance of all the methods.


Back to CS330 Materials