CS 380 -- Syllabus

CS 380 -- Information Retrieval and Web Search

The topics covered, in their approximate order, will be:

Introduction: Chapter 1.
Goals and history of IR. The impact of the web on IR.
IR Models, Indexing: Chapters 1 & 6.
Reading: an article on the history of search engines.
Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
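As a preview of the TF-IDF and cosine-similarity material, here is a minimal sketch. It assumes the plain tf * log(N/df) weighting; the textbook covers several variants, and the function names and toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection, invented for this example.
docs = [
    "web search engines index the web".split(),
    "search engines rank web pages".split(),
    "cats chase mice".split(),
]
vecs = tf_idf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents share several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents share no terms
```

Documents that share no terms get a similarity of exactly zero, while documents with overlapping vocabulary get a positive score.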
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval: Chapters 2, 4, 5, 6.
Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors.
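To give a sense of what an inverted index looks like, here is a small sketch. The stop-word list, the tokenizer, and the document IDs are all assumptions chosen for illustration, and stemming is omitted.

```python
from collections import defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to a {doc_id: term_frequency} postings dict."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

# Toy documents, invented for this example.
docs = {1: "The history of the web.", 2: "Web search and the history of IR."}
index = build_inverted_index(docs)
```

Looking up a term then returns only the documents that contain it, which is what makes query processing over sparse vectors efficient.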
Experimental Evaluation of IR: Chapter 8.
Performance metrics: recall, precision, and F-measure; evaluations on benchmark text collections.
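The three metrics can be sketched for a single query as follows. This assumes set-based (unranked) evaluation and the balanced F-measure (F1); the function name and the document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four documents retrieved, three relevant, two in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here precision is 2/4, recall is 2/3, and F1 is 4/7.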
Query Operations and Languages: Chapters 9 and 3.
Relevance feedback; query expansion; query languages.
Text Representation: Section 5.1 and Chapter 10.
Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages.
Text Clustering: Chapters 16 & 17.
Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM). Applications to web search and information organization.
Web Search: Chapters 19, 20, & 21.
Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank).
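As a preview of link analysis, the following is a minimal power-iteration sketch of PageRank. It assumes a damping factor of 0.85, a fixed number of iterations, and that every link target also appears as a key in the graph; the four-page graph is invented for illustration and this is not how a production search engine computes ranks.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [outlinked pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web graph: C is linked to by A, B, and D; nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph C ends up with the highest rank and D with the lowest, and the ranks sum to 1.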
Text Categorization: Chapters 13 & 14.
Categorization algorithms: Rocchio, nearest neighbor, and naive Bayes. Applications to information filtering and organization.
Synonymy and Polysemy. Read "Corpus-Based Statistical Sense Resolution". I will also use the considerably longer paper "Disambiguating Highly Ambiguous Words".
Information Extraction and Integration:
Extracting data from text; semantic web; collecting and integrating specialized information on the web.