Bryn Mawr College
CMSC 380: Information Retrieval
Prof. Geoffrey Towell
Lecture Hours: MW 1:10-2:30
Room: Park 336
Lab: Park 231 Mo 2:40PM - 4:00PM (Attendance in Lab is REQUIRED)
Office Hours: T 10-11AM, W 3-4pm, or by appointment. Also, if I am in my office and the door is open, you are welcome to come in.
- Introduction to Information Retrieval by Manning, Raghavan and Schütze. Cambridge, 2008. Should be available in the campus bookstore.
Supporting Text (may be on reserve in Collier)
- Managing Gigabytes by Witten, Moffat and Bell. Morgan Kaufmann, 1999.
Information Retrieval (IR) is the process of retrieving relevant text-based information in response to a user's textual query. IR was one of the first and remains one of the most important problems in the domain of natural language processing. Web search is the application of information retrieval to the web. It is the way in which most people interact with IR systems. In this course, we will cover basic and advanced techniques for building text-based information systems, including the following topics: efficient text indexing; Boolean and vector-space retrieval models; evaluation and interface issues; IR techniques for the web, including crawling, link-based algorithms, and metadata usage; document clustering and classification; and approaches to ranking retrieved texts.
Assignments may be written in any programming language. As a general rule I will not closely grade program code. I will, however, read it and expect to be able to understand what I read. Therefore, the code should be commented to the level that an independent, intelligent, and motivated person can review and understand what was done and -- potentially -- extend or fix the program. As a general rule, comments should be written at the level such that, if you picked up your own code 2 years from now you could understand what you did and how the program works.
There will be two introductory assignments that are intended to get everyone on a common footing with respect to the topic area. After that, the class will be broken into groups, each of which may get a different assignment. Each group will present its work to the class on completion of its assignment. There will be 2-3 of these group assignments.
- Homework 1 Due prior to midnight Feb 2.
- Homework 2 Text revised slightly Feb 15 (I intended this change to be made on Feb 11 but failed) Due -- in paper -- before noon Feb 18. Questions and answers Last updated, noon, Feb 16.
- Homework 3 Due March 16. Not topics, but some guidelines and expectations for the oral and written reports.
- Homework 4 There is a set of "due dates" described in the document. The first, for topic selection, is April 8.
Please note that while you have a 15 minute slot, your presentation should be 5 to 10 minutes. If you only have 5 minutes of stuff to say, only speak for 5 minutes. (I expect that the solo crawler projects will tend towards 5 minutes)
Pre-recorded presentations are acceptable. Use the procedure below for getting the pre-recorded presentation to me. If you pre-record, you must be available during your timeslot for questions. I will play your recording.
- copy the recording to the CS servers
- make a directory (e.g., HW4) and put the presentation into that directory
- UNIX> /home/gtowell/bin/submit -c 380 -p 4 -d HW4
- Jan 27 Lab 1
- Jan 27 Lab 2 For this lab you should not write any new code. The idea is to go through your corpus by hand, writing stemming rules.
- Feb 10 Lab 3 This lab was just creation of groups and selection of topics for the first group assignment.
- Feb 17 Lab 4 Hand Crawling.
- Feb 24 Lab 5 Robot exclusions and forms.
- March 2 Lab 6 A simple shopping cart.
- March 24 Lab 7 Testing remote access and copying using ssh and scp
I would prefer to have all of my lecture materials linked here. However, I may copy materials without proper attribution. Therefore, I cannot make them web available. However, most will be available through the department servers at:
In that directory lecture slides will be available with obvious names.
- Jan 27: Slides for lectures 1 and 2 are available in the above directory. For lecture 2, the class finished at slide 18. We will pick up from there in lecture 3. filename:1--Intro.pdf, 2--Index.pdf
- Jan 29: Slides for lecture 3 are available. I finished today at slide 35. filename:3--RetrievalModels.pdf
- Feb 3: Slides for lecture 4 are available. I finished today at slide 13. (2/5 update: the slides really are there now) filename:4--Retrieval.pdf
- Feb 5: Slides for lecture 5 are available. I finished today at slide 24. filename:5--BeyondVS.pdf
- Feb 10: Slides for lecture 6 are available. I finished today at slide 17. filename:6--Large.pdf
- Feb 12: Slides for lecture 7 are available. I finished today at slide 17. filename:7--Evaluating.pdf
- Feb 14: In addition to the slides, we discussed parts of Brin and Page's article
- Feb 17 Introduction to spidering the web. filename:8--WEB.pdf
- Feb 19: Before class, read How much of the internet is fake? Lecture slides are available at: 9--Spidering.pdf. We finished today at slide 14.
- Feb 24: Slides for lecture 10 are available. filename:10--Crawling.pdf
- Feb 26: Slides for lecture 11 are available. filename:11--connections.pdf
- March 2: Slides for lecture 12 are available. filename:12--links.pdf
- March 23: Lecture 13 is available on Moodle. Slides for lecture 13 are available. filename:TnT.pdf
- March 25: Lecture 14 is available on Moodle. This lecture is on hierarchical clustering. There are many videos online about this if you think I am less than clear. Also, this lecture follows the text, chapters 17.1 -- 17.5.
- Slides for lecture 14 are available. filename:14--Cluster1.pdf
- Data for the mini homework is available here
- Full example of complete link merging here
- Full example of single link merging here
- The dendrograms for the complete link and single link examples
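The difference between single link and complete link merging can be sketched in a few lines. This is a minimal illustration, not the linked examples or the mini homework data: the 1-D points, the function names `cluster_distance` and `agglomerate`, and the naive closest-pair search are all assumptions made for the sketch.

```python
from itertools import combinations


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of indices into points)."""
    pair_distances = [abs(points[i] - points[j]) for i in a for j in b]
    # Single link uses the closest pair, complete link the farthest pair.
    return min(pair_distances) if linkage == "single" else max(pair_distances)


def agglomerate(points, linkage="single"):
    """Merge the two closest clusters until one remains; return each merged cluster."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Naive search over all cluster pairs for the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], points, linkage
            ),
        )
        merged = clusters[x] + clusters[y]
        merges.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Hypothetical 1-D data, chosen so that no two candidate merges tie.
points = [0.0, 1.0, 2.5, 4.5, 8.0]
print(agglomerate(points, "single"))    # chains points onto one growing cluster
print(agglomerate(points, "complete"))  # builds two pairs before joining them
```

On this data single link keeps attaching the nearest point to one growing cluster, while complete link first forms two compact pairs. That is the chaining behavior the two dendrograms are meant to contrast.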
- March 30: Slides for lecture 15 are available. filename:15--Cluster2.pdf
- April 1: Slides for lecture 16 are available. filename:16--Cluster2.pdf For more information about Scatter/Gather click here
- April 6: Slides for lecture 17 are available. filename:17--Senses.pdf For more information about word sense disambiguation click here
- April 8: Slides for lecture 18 are available. filename:18--Senses.pdf For more information about word sense disambiguation for this lecture click here
Zaynab accepted my challenge and wrote a proof that the L-infinity norm is indeed equal to the max absolute value of the difference between two vectors. Here is her proof.
No sooner had I published Zaynab's proof than I received a second one, this one from Kejing. Here it is.
Further extra credit is available to anyone who can find a flaw or otherwise improve on either of these proofs. (I did not find flaws but that does not mean flaws do not exist.)
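For anyone who wants to see the claim numerically before reading the proofs, here is a small sketch. It is an illustration, not a proof, and it is not taken from either student's work. It assumes the claim is the usual one: that the L-p distance between two vectors approaches the maximum absolute component difference as p grows. The function names and the example vectors are made up.

```python
def lp_distance(x, y, p):
    """L-p distance: (sum over i of |x_i - y_i|^p) raised to the power 1/p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """L-infinity distance: the largest absolute component difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical vectors; the component differences are 2, 3 and 0.5.
x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 2.5]

for p in (1, 2, 4, 16, 100):
    print(p, lp_distance(x, y, p))
print("max |x_i - y_i| =", linf_distance(x, y))
```

As p increases, the largest difference dominates the sum, so the printed distances shrink toward 3, the value of `linf_distance`.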
- April 13: Slides for lecture 19 are available. filename:19--Categorization.pdf
- April 15: Slides for lecture 20 are available. filename:20--CategorizationB.pdf. No mini homework today. Just work on Homework 4.
- April 20: Slides for lecture 21 are available. filename:21--InformationExtraction.pdf. Again, no mini homework today. Just work on Homework 4 and sign up for a time slot.
- April 22: Slides for lecture 22 are available. filename:22--Recommenders.pdf. Again, no mini homework today; and since this is my last lecture, mini-homeworks are done! Just work on Homework 4 and sign up for a time slot (as of 9am on 4/22 I see 21 people have signed up for slots. That leaves 5 missing.)
Attendance and active participation are
expected in every class. Participation includes asking questions,
contributing answers, proposing ideas, and providing constructive
As you will discover, I am a proponent of two-way communication
and I welcome feedback during the semester about the course. I am available to answer questions, listen to concerns, and
talk about any course-related topic (or otherwise!). Come to
office hours! This helps me get to know you. You are welcome to
stop by and chat.
Stay in touch with me, particularly if you feel stuck on a topic
or assignment and can't figure out how to proceed. Often a quick
e-mail, or face-to-face conference can reveal solutions
to problems and generate renewed creative and scholarly energy. It
is essential that you begin assignments early.
At the end of the semester, final grades will be calculated as a
weighted average of all grades according to the following weights. (These weights are subject to change, without notice.)
Exam 1: 20%
Exams will be in class (or possibly take-home). If take-home then the time to complete will be no more than 2 hours. Closed book, closed notes, no electronic devices unless otherwise instructed.
Exam 2: 20%
Lab Attendance: 5%
Many assignments will be done in small groups (2-3) and will finish with a 10-15 minute presentation in class. The report will be a significant portion of the assignment grade. Moreover, the presentation's share of the grade will vary depending on its quality. That is, an average presentation will not change the grade. An outstanding presentation could improve the project grade a lot. Conversely, a poor presentation will significantly reduce the grade.
Incomplete grades will be given only for verifiable medical
illness or other such dire circumstances.
ALL work submitted for grading should be entirely YOUR OWN (or that of a group if you are working in a group). Sharing of programs, code snippets, etc. is not permitted under ANY circumstances.
Submission, Late Policy, and Making Up Past Work
No assignment will be
accepted after it is past due.
No past work can be "made up" after it is due.
No regrade requests will be entertained more than one week after the graded work is returned in class.
There will be two exams in this course. The exams will be
closed-book and closed-notes (unless otherwise instructed). The exams
will cover material from lectures, homeworks, and assigned
I encourage you to discuss the material and work together to
understand it. Here are some thoughts on collaborating with other
If you have any questions as to what types of collaborations are
allowed, please feel free to ask.
- The readings and lecture topics can be group work. Please discuss
the readings and associated topics with each other. Work
together to understand the material. I highly recommend forming
a reading group to discuss the material -- I will explore many
ideas and it helps to have multiple people working together to
- It is fine to discuss the topics covered in the homeworks, to
discuss approaches to problems, and to sketch out general
solutions. However, you MUST write up the homework answers,
solutions, and programs individually without sharing specific
solutions, mathematical results, program code, etc. If you
made any notes or worked out something on a white board with
another person while you were discussing the homework, you
shouldn't use those notes while writing up your answer.
- Under ABSOLUTELY NO circumstances should you share computer
code with another student, printed, electronic, or otherwise. Similarly, you are not
permitted to use or consult code found on the internet for any
of your assignments.
- Exams, of course, must be your own individual work.
Students requesting accommodations in this course because of the impact of disability
are encouraged to meet with me privately early in the semester with a verification letter.
Students not yet approved to receive accommodations should also contact Deb Alder,
Coordinator of Accessibility Services, at 610-526-7351 in Guild Hall, as soon as possible,
to verify their eligibility for reasonable accommodations. Early contact will help avoid
unnecessary inconvenience and delays.
This class may be recorded.
Created in January 2020. Subject to constant revision.