Goals and history of IR. The impact of the web on IR.
Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
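As a rough illustration of the TF-IDF weighting and cosine similarity named in this unit, the sketch below builds sparse document vectors and compares them. The function names (`tf_idf_vectors`, `cosine`), the toy documents, and the choice of raw term frequency times `log(N / df)` are assumptions made for the example; many other weighting variants exist.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log inverse document frequency.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, already tokenized by whitespace.
docs = [
    "web search engines rank pages".split(),
    "search engines index web pages".split(),
    "cats chase mice".split(),
]
vecs = tf_idf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Documents with no terms in common score zero, and a document compared with itself scores one (up to rounding).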
Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors.
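A minimal sketch of an inverted index, assuming lowercase whitespace tokenization and no stop-word removal or stemming. The names `build_inverted_index` and `boolean_and` and the sample documents are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical toy collection keyed by document id.
sample_docs = {
    1: "web search engines",
    2: "web crawlers",
    3: "search engines rank pages",
}
sample_index = build_inverted_index(sample_docs)
hits = boolean_and(sample_index, ["search", "engines"])
```

Only the postings of the query terms are touched, which is why an inverted index avoids scanning every document at query time.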
Performance metrics: recall, precision, and F-measure; evaluation on benchmark text collections.
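The three metrics in this unit can be sketched for a single query as follows, assuming set-based (unranked) evaluation. The function name `precision_recall_f`, the `beta` parameter, and the document ids are illustrative assumptions.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for one query, treating results as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean; beta = 1 gives the balanced F1 score.
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical result list and relevance judgments.
p, r, f1 = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so precision is 1/2, recall is 2/3, and F1 is 4/7.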
Relevance feedback; query expansion; query languages.
Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages.
Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM). Applications to web search and information organization.
Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank).
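As one illustration of link analysis, the sketch below computes PageRank by power iteration on a small link graph. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the four-page graph are all assumptions made for the example rather than details of any production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same baseline share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's damped rank equally among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling of dangling pages: spread rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
```

The ranks sum to one, and the page with the most in-links from well-ranked pages ends up on top.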
Categorization algorithms: Rocchio, nearest neighbor, and naive Bayes. Applications to information filtering and organization.
Extracting data from text; semantic web; collecting and integrating specialized information on the web.