COMPSCI 590R Applied Information Retrieval

Information Retrieval (IR) is the theory and practice that underlies technologies such as search engines. It deals with models and methods for representing, indexing, searching, browsing, and summarizing information in response to a person's information need. You should be able to program in Java (or some other closely related language).

Prerequisites: COMPSCI 320 and either COMPSCI 383, 446, or 585, all with a grade of C or better. 3 credits.

This page lists the course topics and the textbook, gives an overview of grading, discusses plagiarism and collaboration, and states the course policies.

Topics of the course

This course covers the design and implementation of Information Retrieval systems to solve a variety of problems, using a model-based computational framework. The framework is:

  1. Select a set of features that have computable values, such as the count of words appearing in a document
  2. Compose those feature values into a score using a model, such as adding up all of the feature values
  3. Using that score, draw some inference, such as that a higher score means a better document
  4. Using that inference, perform some action, such as presenting a document to a user in response to a web search
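
As an informal illustration of the four steps above, here is a minimal Java sketch. It is not taken from the course materials: the class name FrameworkSketch, its method names, and the sample documents are all hypothetical, and it assumes the simplest choices mentioned in the list (word counts as features, a plain sum as the model, and "higher score means a better document" as the inference).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the four-step framework; not course-provided code.
public class FrameworkSketch {

    // Step 1: features with computable values.
    // Here: how many times each query term appears in the document.
    static Map<String, Integer> queryTermCounts(String document, List<String> queryTerms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : queryTerms) {
            counts.put(term.toLowerCase(), 0);
        }
        // Split on runs of non-word characters (a deliberately crude tokenizer).
        for (String token : document.toLowerCase().split("\\W+")) {
            if (counts.containsKey(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 2: compose the feature values into a score using a model.
    // Here the model simply adds up all of the feature values.
    static int score(Map<String, Integer> featureValues) {
        int total = 0;
        for (int value : featureValues.values()) {
            total += value;
        }
        return total;
    }

    // Steps 3 and 4: infer that a higher score means a better document,
    // then act on that inference by ordering documents from best to worst.
    static List<String> rank(List<String> documents, List<String> queryTerms) {
        List<String> ranked = new ArrayList<>(documents);
        ranked.sort(Comparator.comparingInt(
                (String d) -> score(queryTermCounts(d, queryTerms))).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
                "Search engines index documents.",
                "Retrieval models score documents; better models rank documents well.",
                "Bananas are yellow.");
        // Present the documents to the user in ranked order.
        for (String document : rank(documents, List.of("documents", "models"))) {
            System.out.println(document);
        }
    }
}
```

Swapping in a different feature or model under this sketch would mean changing only queryTermCounts or score; the ranking step stays the same.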

This is an applied, programming-intensive course.

Weekly Topics

  1. Overview of Information Retrieval and our computational framework
    Provide the basis for expressing a variety of information processing activities within the model-based computational framework. Includes vocabulary, definitions, and examples. Defines the framework, data structures, and associated algorithms.
  2. Text Processing
    Perform feature selection on text documents. Identify processing strategies for enabling easy storage of the values for selected features.
    Quiz (Indexing)
    Programming assignment on vocabulary lookup table storage.
  3. Indexing
    Building inverted list representations. Compressing data.
    Programming assignment on inverted list construction and storage
  4. Indexing
    Manipulating inverted lists. Other computable features and their storage.
    Programming assignment on using inverted lists to access document feature values
  5. Retrieval Models
    Computing count based models
    Quiz (Indexing)
  6. Retrieval Models
    Computing probabilistic models
    Programming assignment on retrieval model computations
  7. Structured Queries and Evidence Combination
    Computing more complex models using additional features, such as phrase operations. Computing combinations of probabilities to model different logical operations.
    Programming assignment on evidence combination
    Midterm exam scheduled this week
  8. IR Evaluation
    Evaluation metrics and their computation.
    Quiz (Retrieval Models)
  9. Learning to Rank
    Use evaluation metrics, other features, and machine learning techniques to reorder a ranked list of elements, such as documents retrieved for a web search.
    Quiz (Evaluation)
  10. Classification
    Classification and supervised machine learning techniques for sorting objects into bins
    Quiz (Learning to Rank)
  11. Clustering
    Clustering and unsupervised machine learning techniques for sorting objects into bins
    Programming assignment on clustering of documents using the framework
  12. Beyond Bag of Words
    Adding features, using more complex models, and applying the framework to other domains, such as images, video, and music.
  13. Wrap Up
    Tie it all together and review for the final
  14. Final exam scheduled in Finals week

Textbook

The following text is required for this course:

Grading

Your final grade in this class will be based upon projects, quizzes, a midterm exam, and a final exam. The relative contributions of the parts are:

All assignments are due as indicated on the assignment and the Web page. All assignments will be performed on the Moodle site for the class.

Late assignments will be accepted in accordance with University policy regarding excused absences.

Course policies

Accommodation Statement
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.

Academic Honesty Statement
Since the integrity of the academic enterprise of any institution of higher education requires honesty in scholarship and research, academic honesty is required of all students at the University of Massachusetts Amherst. Academic dishonesty is prohibited in all programs of the University. Academic dishonesty includes but is not limited to: cheating, fabrication, plagiarism, and facilitating dishonesty. Appropriate sanctions may be imposed on any student who has committed an act of academic dishonesty. Instructors should take reasonable steps to address academic misconduct. Any person who has reason to believe that a student has committed academic dishonesty should bring such information to the attention of the appropriate course instructor as soon as possible. Instances of academic dishonesty not related to a specific course should be brought to the attention of the appropriate department Head or Chair. Since students are expected to be familiar with this policy and the commonly accepted standards of academic integrity, ignorance of such standards is not normally sufficient evidence of lack of intent (http://www.umass.edu/dean_students/codeofconduct/acadhonesty/).

Collaboration, Plagiarism, and Intellectual honesty

Your work must be your own. For anything other than exams, you are welcome to discuss general issues with other students, but the answer, the writing, and the final result that you hand in must be your own effort. Discussing or sharing answers to specific problems is considered dishonest. If you have questions about what is honest, please ask! One suggestion is never to write down anything while you're talking with someone about class work since that will require you to come up with the result again on your own later. You are strongly encouraged to cite your sources if you received extraordinary help from any person or text (including the Web), other than lecture content or the textbook. Computer Science Department policy specifies that the penalty for cheating is (1) a final course grade of "F" and (2) possible referral to the Academic Dishonesty Committee.

For any material you hand in, you must appropriately indicate when you are using work of others. If you use verbatim or only slightly altered text, you must clearly indicate (quotation marks, indented text, etc.) that you are quoting another source and what that source is. If you refer to work done by others, even if you do not quote it, you should include a reference to the original source. It does not matter if that work was published or not: if it is work other than your own, you are obligated to make it clear that you are using that person's work. Plagiarism will not be tolerated in this class. Plagiarism is a type of cheating and will be treated accordingly: the penalty for cheating is (1) a final course grade of "F" and (2) possible referral to the Academic Dishonesty Committee. The campus writing program provides more information about plagiarism.

You may (but probably won't) be using copyright-protected software as part of the class. Federal law and license agreements between the University and various software producers prohibit copying this software for any purpose. Such activity will be regarded as a form of cheating and will be dealt with as such.