
Search & Internet Marketing Manager SEO BLOG

Elias Kai Google-Kai.com

TFIDF + SEO

July 12th, 2008 by Elias Kai elias.kai

Good to read if you are in the SEO business:
Understanding Inverse Document Frequency (IDF) by Dr. E. Garcia

“IDF is simply neither a pure heuristic, nor the theoretical mystery
many have made it out to be. We have a pretty good idea why
it works as well as it does.” –Stephen E. Robertson

In 1972, the late Karen Sparck Jones (August 26, 1935 – April 4, 2007) published in the Journal of Documentation the global term weighting scheme that later became known as Inverse Document Frequency (IDF). In her words: “Then where there are N documents in the collection, the weight of a term which occurs n times is f(N) - f(n) + 1.”
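
To make the quoted weight concrete, here is a minimal Python sketch. The quote does not say what f is, so the sketch assumes f(n) is the smallest integer m with n <= 2^m (a rounded-up base-2 logarithm), and it reads "occurs n times" as "occurs in n documents". The function names are made up for illustration.

```python
def f(n):
    # Assumption: f(n) = smallest integer m such that n <= 2**m,
    # i.e. ceil(log2(n)), computed with exact integer arithmetic.
    return (n - 1).bit_length()


def sparck_jones_weight(n_docs, n_with_term):
    # Weight from the quote: f(N) - f(n) + 1.
    return f(n_docs) - f(n_with_term) + 1


# In a collection of 1024 documents, rarer terms get larger weights.
for n in (1, 16, 1024):
    print(n, sparck_jones_weight(1024, n))
```

Under that assumption the weight tracks log2(N/n) + 1, which is why a term found in only a few documents counts for much more than a term found in nearly all of them.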

A Comparison of Document, Sentence, and Term Event Spaces:

The vector-based information retrieval model identifies relevant documents by comparing query terms with terms from a document corpus. The most common corpus weighting scheme is term frequency (TF) x inverse document frequency (IDF), where TF is the number of times a term appears in a document, and IDF reflects the distribution of terms within the corpus (Salton and Buckley, 1988). Ideally, the system should assign the highest weights to terms with the most discriminative power.
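
As a rough illustration of TF x IDF, the sketch below weights every term in a tiny made-up corpus. It assumes raw counts for TF and log(N / df) for IDF, which is only one of several common variants; the function name and the sample documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF x log(N / df))."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


docs = [
    "seo tips for search engines",
    "search engines rank pages",
    "tfidf weights rank rare terms higher",
]
weights = tf_idf(docs)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

In the first document, "seo" appears in only one of the three documents and so outweighs "search", which appears in two. That is the discriminative-power idea in miniature.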

Findings:
As users continue to demand information systems that provide sub-document retrieval, the need to model language at the sub-document level becomes increasingly important. The key findings from this study are:
(1) The raw document frequencies are considerably different from the sentence and term frequencies. The lack of a direct correlation between the document and sub-document raw spaces, in particular around the areas of important terms, suggests that it would be difficult to identify a linear transformation between the document and sub-document spaces. In contrast, the raw term frequencies correlate well with the sentence frequencies.
(2) IDF, ISF and ITF are highly correlated; however, simply replacing IDF with the ISF or ITF would result in a weighting scheme where the corpus weight dominated the weights assigned to query and document terms.
(3) IDF was surprisingly stable with respect to random samples at 10% of the total corpus. The average IDF values based on only a 20% random stratified sample correlated almost perfectly to IDF values that considered frequencies in the entire corpus. This finding suggests that systems in a dynamic environment, such as the Web, need not update the global IDF values regularly (see (4)).
(4) IDF values based on different journal samples did not correlate well to the global IDF. Further work is required to understand when frequencies should consider alternative subsets of a corpus.
(5) The language used in abstracts appears to be systematically different from the language used in the body of a full-text scientific document across all three language models. This suggests that further work is required to understand how the corpus-weighting schemes that are well studied on abstracts will perform in a full-text setting.
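
Finding (3) can be tried out on a toy scale. The sketch below computes IDF on a full synthetic corpus and on a 20% sample, then correlates the two sets of values. Everything in it is an assumption for illustration: the corpus is randomly generated with a Zipf-like word distribution, the sample is a simple random sample rather than the stratified sample used in the study, and the correlation is plain Pearson.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df) over the given tokenized documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = [f"w{i}" for i in range(200)]
zipf_weights = [1 / (rank + 1) for rank in range(len(vocab))]

# Synthetic corpus: 2000 "documents" of 50 tokens each (not real data).
corpus = [rng.choices(vocab, weights=zipf_weights, k=50) for _ in range(2000)]
sample = rng.sample(corpus, k=len(corpus) // 5)  # 20% simple random sample

full_idf = idf_table(corpus)
sample_idf = idf_table(sample)

# Compare only terms seen in both, since IDF is undefined for unseen terms.
shared = sorted(set(full_idf) & set(sample_idf))
r = pearson([full_idf[t] for t in shared], [sample_idf[t] for t in shared])
print(f"terms compared: {len(shared)}, correlation: {r:.3f}")
```

A toy run like this says nothing about real journal text, and finding (4) is a reminder that a sample drawn from a single subset of the corpus can behave quite differently from a random one.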
