Text similarity spans a spectrum, with broad
topical similarity near one extreme and document identity at the other.
Detecting intermediate levels of similarity -- resulting from summarization,
paraphrasing, copying, and stronger forms of topical relevance -- is
useful for applications such as information flow analysis and
question-answering tasks.
In this paper, we explore mechanisms for measuring such intermediate
kinds of similarity, focusing on the task of identifying where a
particular piece of information originated.
We consider both sentence-to-sentence and document-to-document
comparison, and have incorporated algorithms for both into RECAP,
a prototype information flow analysis tool.
Our experimental results with RECAP indicate that new
mechanisms such as those we propose are likely to be better suited than
existing methods to identifying these intermediate forms of similarity.