Search engines use various factors to determine which pages are most relevant to a search query. Term Frequency, Inverse Document (Page) Frequency, and Page Length are content-based factors, while Link Structure is a hyperlink-based factor.
Quick Definitions:
- Term Frequency (TF) measures how often a term (word) is found in a specific page. It is simply the number of times a given term appears in a page. The higher the TF of a page for the query terms, the higher the page’s score.
- Inverse Document Frequency (IDF) is a measure of the general importance of a term, or in other words, of how rare the term is in a page collection. In its simplest form it is obtained by dividing the number of all pages by the number of pages containing the term; in practice the logarithm of this ratio is commonly used. Very common terms ("the", "and", etc.) have a very low IDF and are therefore often ignored when a query is processed. These low-IDF words are commonly referred to as "stop words". The higher the IDF of a query term, the more it contributes to the score of the result pages.
- Page Length (number of words in a page) reflects the assumption that shorter pages are more likely to be relevant to a term than longer pages, because shorter pages tend to be more specific. All else being equal, the shorter a page, the higher its score.
- Link Structure (PageRank) is a useful factor for ranking search results, independent of the textual content of the page. It estimates the “importance” of a website or individual page on the Web. The importance of a page increases when other important pages point to it via hyperlinks. If a particular page A is linked to by an important page B, then A is likely to be important as well. The higher the PageRank of a page, the higher its score.
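To make the four definitions concrete, here is a minimal Python sketch that computes each factor on a toy collection. Everything in it is an assumption made for illustration: the three sample pages, the four-page link graph, the function names, the choice to take the logarithm of the IDF ratio, the way the content factors are combined into one score, and the damping value of 0.85. It is not how Google or any other search engine actually ranks pages.

```python
import math
from collections import Counter

# Hypothetical page collection, invented for this example.
pages = {
    "A": "google search ranking explained",
    "B": "the search for the best pizza in the city and the suburbs",
    "C": "the ranking of the teams",
}

def tokenize(text):
    """Split a page into lowercase words."""
    return text.lower().split()

def term_frequency(term, page_text):
    """TF: number of times the term appears in the page."""
    return Counter(tokenize(page_text))[term]

def inverse_document_frequency(term, collection):
    """IDF: log of (number of pages / number of pages containing the term)."""
    containing = sum(1 for text in collection.values() if term in tokenize(text))
    if containing == 0:
        return 0.0  # term appears nowhere, so it cannot affect any score
    return math.log(len(collection) / containing)

def content_score(query, page_text, collection):
    """Assumed combination: sum of TF * IDF over query terms, divided by page length."""
    total = sum(
        term_frequency(term, page_text) * inverse_document_frequency(term, collection)
        for term in tokenize(query)
    )
    return total / len(tokenize(page_text))

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    names = list(links)
    n = len(names)
    rank = {p: 1.0 / n for p in names}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in names}
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for q in names:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
link_graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

if __name__ == "__main__":
    for name, text in pages.items():
        print(name, round(content_score("google search", text, pages), 3))
    for name, value in sorted(pagerank(link_graph).items()):
        print(name, round(value, 3))
```

In this toy data the rare word "google" gets a higher IDF than the common word "the", the short page A outscores the longer page B for the query "google search", and page A ends up with a higher PageRank than B or D because the heavily linked page C points to it.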
Project Deliverables:
- Part A: Present evidence of the above four factors in Google searches or searches by other search engines in the form of a written report and possibly screenshots. For instance, the evidence for IDF can be an example (consisting of a query and a couple of results from Google) that supports the fact that infrequent query words are more dominant in the ranking of the result pages.
- Part B: Suggest two or more new factors that could be used to determine which pages are more relevant to a search query. Check if you can find evidence (through query examples) that Google (or another search engine) uses these factors.
Submission Instructions:
You will have two weeks to complete and submit the project. Please send your deliverables in a single file (txt, doc, docx or pdf) as an attachment to FIU-outreach@googlegroups.com by December 3rd, 2009.
The student with the best project report from each class will be awarded a brand new iPod Touch!