CS 380 -- Information Retrieval and Web Search

The topics covered, in their approximate order, will be:
  1. Introduction: Chapter 1.

    Goals and history of IR. The impact of the web on IR.

  2. IR Models, Indexing: Chapters 1 & 6.

    Reading: article on the history of search engines.

    Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.

  3. Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval: Chapters 2, 4, 5, 6.

    Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors.

  4. Experimental Evaluation of IR: Chapter 8.

    Performance metrics: recall, precision, and F-measure; evaluation on benchmark text collections.

  5. Query Operations and Languages: Chapters 9 and 3.

    Relevance feedback; query expansion; query languages.

  6. Text Representation: Section 5.1 and Chapter 10.

    Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages.

  7. Text Clustering: Chapters 16 & 17.

    Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM). Applications to web search and information organization.

  8. Web Search: Chapters 19, 20, & 21.

    Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank).

  9. Text Categorization: Chapters 13 & 14.

    Categorization algorithms: Rocchio, nearest neighbor, and naive Bayes. Applications to information filtering and organization.

  10. Synonymy and Polysemy:

    Read "Corpus-Based Statistical Sense Resolution". I will also use the considerably longer paper "Disambiguating Highly Ambiguous Words".

  11. Information Extraction and Integration:

    Extracting data from text; semantic web; collecting and integrating specialized information on the web.