Collaboratively Searching the WebAgustin Schapira's Masters' Thesis
Old stuff (1999).-
It is the title of my Master's Thesis project at the Department of Computer Science, UMASS. My goal in this project is to build a search engine that improves the quality of its results over time. How? Simple. Just by remembering which documents the users select among those returned by the engine. At the beginning, the search engine will return a mixture of good and bad documents for a given query, like any other search engine on the Web. However, it will record which documents the users select among those results, and it will thus be able to learn which are the good ones. As it learns that, it will begin to return only the good documents and discard the others, thus improving the quality of the results.
OK. But why "collaborative"?
Because the search experience of one user is enhanced by the information provided by previous users. In that sense a user benefits from the actions of other users, and thus the process in collaborative.
My hypothesis is that given a certain query, the same subset of documents (among those returned by a search engine) will be systematically selected by most users. If this is true, then it should be possible to improve the search engine's precision by using relevance information provided by previous users. To test this hypothesis, I will build a search engine that records the users' responses to the documents retrieved with traditional IR methods and learns to return only relevant documents for a given query. A person using the system will benefit from the experience of previous users, and thus the search process will be collaborative.
1. Responding to user queries
To begin with, we should be able to retrieve documents from the Web for a given query. We thus need a module that is capable of receiving a query from the user, looking for pages that are related to that query, rank those pages according to their relevance to the query, and present the results to the user. In other words, we need to build a search engine. Or...
Or we could build a meta search engine, an engine that receives a query, submits it to an existing engine like Altavista or Yahoo, and then gives the user the results returned by those engines. Less work, indeed! As a matter of fact, this is the approach followed by many, including the popular Metacrawler, developed at the University of Washington by Oren Etzioni and his team.
2. Monitoring what the users select
Ok, so we have the results for a given query. We now need to monitor what documents (from those returned for the query) the users like best. We need this in order to learn which are gooddocuments and which are not so good.
One way to do this is to modify the way the documents in the result
pages are hyper-linked to the real documents in the way. All the
search engines give, for each document that is retrieved, a short
description of its content and its title, hyper-linked to the document
in the Web. For instance, if you type the query
"soccer" into Altavista, you get a results page
that contains the following reference:
History of Soccer Soccer and History. A site devoted to the academic study of the world's greatest game. Looking for scores, stats, news? Check out Futbol y Mas Futbol,...
Under the title of History of Soccer there is a link to
http://acs5.bu.edu:8001/~palegi/ , where the page
We could replace that link with a pointer to a script that runs on our own server; that link contains not only the address of the script, but also information about the document that should be presented after the user clicks there. Something like
log_selection is a script that logs the information
about the request and then returns a
Redirect command to
the user's browser, instructing it to go and fetch the page
http://acs5.bu.edu:8001/~palegi/ (which is what the user
wanted in the first place!)
This mechanism allows us to maintain a table of the documents that the users prefer for each query, and use that information to optimize the results in the next step.
3. Improving the quality of the results
Once we have a record of which documents are frequently selected for a query, the next step is to wait for someone to type the same query and then return more of those good documents instead of the bad, rarely-selected ones. There are many possible ways to do this, but probably the simplest is to just re-rank the documents according to that information.
To do that, we just go and grab that documents returned by the search engine that we are using, in step 1. Then for each document returned by the search engine, we check how often it has been selected in the context of the current query. If it has been widely chosen by the users, then we take it up in the list of results; if users don't seem to particularly like it, we take it to the bottom of the list. In more technical terms, we re-rank the documents with a function that takes into account the ranking in the page returned by the search engine and also the number of times the document has been selected.
4. Making it all flow graciously
The success of the system will depend heavily on the volume of its traffic. Of course, if nobody uses the system, then it won't be able to learn which documents should be returned for a given query. That means that we need to attract a lot of people to our site, and respond to their queries fast enough so that they keep coming to do their searches. How can we build a system that does all of the above and yet is responsive enough?
The first step is to use the resources efficiently. If, for each request that we receive, we spawn a new process (in the way that standard CGI scripting does), we'll be wasting a lot of time. Everytime a CGI script is executed, the Web Server has to ask the operating system to create a new process, a very expensive operation. If that process does it work very quickly and then exits, then a large percentage of the time is devoted simply to creating the process.
Instead of executing a CGI script, we could build the system using server-side scripting. Server-side scripting consists of code written in a very simple language and embedded inside the HTML code for each page. When the Web Server receives a request for a page that contains server-side scripts, it opens the page, executes the code, and returns the result of executing that code to the user. Of course, the server must understand the language in which the code was written. There are several such languages that different servers understand. A good combination in terms of efficiency and richness of the language for the Apache Web Server is PHP. Other options are server-side Perl, Tcl, or Python (all very simple languages, but more than enough for what you might want to do inside an HTML file!). For a complete discussion of server-side scripting and other related techonolgies, take a look at Sites That Are Really Programs, by Phil Greenspun.
A large part of the system will deal with storing and retrieving information about queries, documents, and user selections. We will need to access that information very efficiently, and thus it will be convenient to store it in a relational database. Since we'll be using PHP as our server-side language, we will use MySQL as our relational database.
To verify the hypothesis, I will implement the system proposed in previous sections and run experiments to answer the following set of questions:
The experiments outlined here will allow me to verify the hypothesis. If it is true, then I expect to observe a large number of re-rankings for a given query, a pattern where relevant documents crawl to the top of the page, and more and more documents selected by the users from the top of the results page. That will imply that the users are seeing some improvement in the ordering of the documents.
My thesis report is now available for online reading or for printing in PDF format.