Collaboratively Searching the Web

Collaboratively Searching the Web

Agustin Schapira's Masters' Thesis
Old stuff (1999).-

Collaboratively searching the Web? What is that?

It is the title of my Master's Thesis project at the Department of Computer Science, UMASS. My goal in this project is to build a search engine that improves the quality of its results over time. How? Simple. Just by remembering which documents the users select among those returned by the engine. At the beginning, the search engine will return a mixture of good and bad documents for a given query, like any other search engine on the Web. However, it will record which documents the users select among those results, and it will thus be able to learn which are the good ones. As it learns that, it will begin to return only the good documents and discard the others, thus improving the quality of the results.

OK. But why "collaborative"?

Because the search experience of one user is enhanced by the information provided by previous users. In that sense a user benefits from the actions of other users, and thus the process in collaborative.

Overview of the Project

With the currently available search engines, users are alone in their exploration of the WWW. To find information about a particular topic, they have to use their favorite search engine and then browse through long pages of results in the hope of finding relevant documents. It does not matter if hundreds of other users have already run the same query in the past and have already filtered the good documents from the bad ones; the search engine does not remember those judgments and blindly keeps returning the same mixture of relevant and non-relevant documents.

My hypothesis is that given a certain query, the same subset of documents (among those returned by a search engine) will be systematically selected by most users. If this is true, then it should be possible to improve the search engine's precision by using relevance information provided by previous users. To test this hypothesis, I will build a search engine that records the users' responses to the documents retrieved with traditional IR methods and learns to return only relevant documents for a given query. A person using the system will benefit from the experience of previous users, and thus the search process will be collaborative.

Architecture

How do we go about building such a system?

1. Responding to user queries

To begin with, we should be able to retrieve documents from the Web for a given query. We thus need a module that is capable of receiving a query from the user, looking for pages that are related to that query, rank those pages according to their relevance to the query, and present the results to the user. In other words, we need to build a search engine. Or...

Or we could build a meta search engine, an engine that receives a query, submits it to an existing engine like Altavista or Yahoo, and then gives the user the results returned by those engines. Less work, indeed! As a matter of fact, this is the approach followed by many, including the popular Metacrawler, developed at the University of Washington by Oren Etzioni and his team.

2. Monitoring what the users select

Ok, so we have the results for a given query. We now need to monitor what documents (from those returned for the query) the users like best. We need this in order to learn which are gooddocuments and which are not so good.

One way to do this is to modify the way the documents in the result pages are hyper-linked to the real documents in the way. All the search engines give, for each document that is retrieved, a short description of its content and its title, hyper-linked to the document in the Web. For instance, if you type the query "soccer" into Altavista, you get a results page that contains the following reference:

History of Soccer
Soccer and History. A site devoted to the academic study of the
world's greatest game. Looking for scores, stats, news?  Check out
Futbol y Mas Futbol,...

Under the title of History of Soccer there is a link to http://acs5.bu.edu:8001/~palegi/ , where the page resides.

We could replace that link with a pointer to a script that runs on our own server; that link contains not only the address of the script, but also information about the document that should be presented after the user clicks there. Something like

http://www.our-search-engine.com/log_selection?query=soccer&link=http://acs5.bu.edu:8001/~palegi/.

log_selection is a script that logs the information about the request and then returns a Redirect command to the user's browser, instructing it to go and fetch the page http://acs5.bu.edu:8001/~palegi/ (which is what the user wanted in the first place!)

This mechanism allows us to maintain a table of the documents that the users prefer for each query, and use that information to optimize the results in the next step.

3. Improving the quality of the results

Once we have a record of which documents are frequently selected for a query, the next step is to wait for someone to type the same query and then return more of those good documents instead of the bad, rarely-selected ones. There are many possible ways to do this, but probably the simplest is to just re-rank the documents according to that information.

To do that, we just go and grab that documents returned by the search engine that we are using, in step 1. Then for each document returned by the search engine, we check how often it has been selected in the context of the current query. If it has been widely chosen by the users, then we take it up in the list of results; if users don't seem to particularly like it, we take it to the bottom of the list. In more technical terms, we re-rank the documents with a function that takes into account the ranking in the page returned by the search engine and also the number of times the document has been selected.

4. Making it all flow graciously

The success of the system will depend heavily on the volume of its traffic. Of course, if nobody uses the system, then it won't be able to learn which documents should be returned for a given query. That means that we need to attract a lot of people to our site, and respond to their queries fast enough so that they keep coming to do their searches. How can we build a system that does all of the above and yet is responsive enough?

The first step is to use the resources efficiently. If, for each request that we receive, we spawn a new process (in the way that standard CGI scripting does), we'll be wasting a lot of time. Everytime a CGI script is executed, the Web Server has to ask the operating system to create a new process, a very expensive operation. If that process does it work very quickly and then exits, then a large percentage of the time is devoted simply to creating the process.

Instead of executing a CGI script, we could build the system using server-side scripting. Server-side scripting consists of code written in a very simple language and embedded inside the HTML code for each page. When the Web Server receives a request for a page that contains server-side scripts, it opens the page, executes the code, and returns the result of executing that code to the user. Of course, the server must understand the language in which the code was written. There are several such languages that different servers understand. A good combination in terms of efficiency and richness of the language for the Apache Web Server is PHP. Other options are server-side Perl, Tcl, or Python (all very simple languages, but more than enough for what you might want to do inside an HTML file!). For a complete discussion of server-side scripting and other related techonolgies, take a look at Sites That Are Really Programs, by Phil Greenspun.

A large part of the system will deal with storing and retrieving information about queries, documents, and user selections. We will need to access that information very efficiently, and thus it will be convenient to store it in a relational database. Since we'll be using PHP as our server-side language, we will use MySQL as our relational database.

Evaluation

As said before, the underlying hypothesis in this project is that, given a query, users will systematically select the same subset of documents among the results returned by a search engine. If that assumption is true, then it should be possible to improve the quality of a search engine by giving more importance to the systematically-selected documents than to the others. In other words, re-ranking the documents according to that relevance information should be a viable way to improve precision in a search engine, all based on the indirect collaboration among users.

To verify the hypothesis, I will implement the system proposed in previous sections and run experiments to answer the following set of questions:

Does the usage of relevance information really lead to a re-ranking of the documents? It might well be the case that the documents are already returned in the right order by the search engine. In that case, the relevance information would provide no advantage. To measure this, I will keep track of the number of times re-rankings occur. I will also analyze how that number evolves over time: does the system keep always re-ranking documents or does it eventually reach a stable state where it presents the same ordering of documents for a given query? We would expect a pattern where very few re-orderings are made at the beginning, then things tend to change as we collect more data, and they finally stabilize around a certain value.
The next question has to do with the effectiveness of re-ranking the documents. If re-rankings do actually occur, how good are they? We would expect a tendency to find relevant documents moving to the top of the list and non-relevant documents going to the bottom. I will analyze this in a controlled experiment, where I will select a query or a set of queries and run them several times, selecting the good documents, and checking how fast relevant documents crawl to the top of the page.
Finally, I will try to find out if good results in a controlled setting are also observed in the real world. It is possible that some "cross-talk" occurs: two groups of people run the same query and consider two disjoint sets of documents as relevant to their needs. I will investigate how well the system adapts to this situation. In experiments with the system running in the real world, it won't be possible to measure directly the improvement in the results. Consequently, I will measure the improvement through indirect variables:
- How many documents are selected that are at the top of the page and how many that are at the bottom? How does this evolve over time? As time goes by, we'd expect people to select more from the top ranked documents than the bottom ranked documents.
- How highly ranked is the first/second document that people select? How does this evolve over time? As time goes by, we'd expect people to select the highly ranked documents first.
- How many documents do users select for a given query? How does this evolve over time? If we assume that more and more relevant documents are presented to the user, then it might be reasonable to expect more selections/query. Although this might not be a very good measure, since some users could feel satisfied by just one relevant document and not select the others.

The experiments outlined here will allow me to verify the hypothesis. If it is true, then I expect to observe a large number of re-rankings for a given query, a pattern where relevant documents crawl to the top of the page, and more and more documents selected by the users from the top of the results page. That will imply that the users are seeing some improvement in the ordering of the documents.

Related information

My thesis report is now available for online reading or for printing in PDF format.