Hello, and welcome to Paper of the Day (Po’D): Probably Approximately Correct Search Edition. Today’s paper presents an analysis of an interesting approach to document search: I. J. Cox, R. Fu and L. K. Hansen, “Probably Approximately Correct Search”, Proc. of the Int. Conf. on Theoretical Information Retrieval, 2009. In the interest of extreme brevity, here is my one-line description of the work in this paper:

We can probably find several matches to a query by independently and randomly sampling a large number of documents over a large number of computers, thereby reducing both the complexity of the search and the maintenance of the system.

This paper proposes the idea of “non-deterministic search” (PACS), and analyzes its performance with respect to several parameters. The idea is to randomly distribute portions of a set of documents over many computers (the portions being neither necessarily disjoint nor complete). Each computer is then responsible for searching its own local index and responding to a query. This approach avoids problems that come with deterministic, “virtually centralized” systems, like Google, where adding more machines and documents involves non-trivial adjustments. Instead, the authors propose PACS, an information retrieval system where the search indexes are built randomly and computers are queried randomly, so responses to queries are random as well. While in the deterministic case the response to a query will always be the same for a given index, in the non-deterministic case each response will be different. The question is: over how many independent computers do we need to randomly distribute an index such that the results of our query are probably approximately correct? (The authors refer to the probably approximately correct learning theory of Leslie Valiant.)
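To make the scheme concrete, here is a toy sketch of my own (not the authors’ code, and with made-up, small-scale parameters): each machine independently samples a fraction of the corpus into its local index, and a query probes a random subset of machines.

```python
import random

random.seed(0)

N_DOCS = 100_000     # toy corpus size
N_MACHINES = 500     # machines in the system
SAMPLE_FRAC = 0.01   # fraction of the corpus each machine indexes
N_QUERIED = 100      # machines probed per query

corpus = range(N_DOCS)
sample_size = int(N_DOCS * SAMPLE_FRAC)

# Build the random local indexes: possibly overlapping, possibly
# incomplete, with no coordination between machines.
indexes = [set(random.sample(corpus, sample_size)) for _ in range(N_MACHINES)]

def pacs_query(doc_id):
    """Ask a random subset of machines whether any holds doc_id."""
    probed = random.sample(indexes, N_QUERIED)
    return any(doc_id in idx for idx in probed)

# Analytically, a given document is found with probability about
# 1 - (1 - SAMPLE_FRAC)**N_QUERIED, i.e. about 0.63 for these numbers;
# the empirical hit rate should come out close to that.
trials = random.sample(corpus, 1000)
hit_rate = sum(pacs_query(d) for d in trials) / len(trials)
print(hit_rate)
```

Each run with a different seed probes different machines and so can return different answers to the same query, which is exactly the non-determinism the paper embraces.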

Consider the WWW, where there are O(10^10) pages: if we have 300,000 computers and each independently samples 0.1% of the WWW at random, then the probability that a particular webpage is not among those sampled is e^-300 (see the paper). Thus, we can say our 300,000 computers have *probably* indexed the entire WWW. (Of course, this also means each computer contains 10 million documents, which can create some memory problems. If each document has about 1000 different terms, and each term consists of 20 bytes, then each machine has 191 GB of data to search.) Now, to address a particular query, we don’t want to communicate with all 300,000 machines, because that would require high data throughput. So PACS instead selects a subset of computers to query. If we select 1000 machines at random and query them for a particular document, then with probability .63 at least one of them will return it. If we are interested in the top-10 results, then with probability .88 at least five of the 10 documents returned by PACS will match the 10 returned in the deterministic setting. The probability that none of the 10 documents will match is only about .0001. Thus, here we have the *approximately correct* in PACS.
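These back-of-the-envelope numbers are easy to check, modeling each of the top-10 documents as an independent Bernoulli trial (my own re-derivation of the figures quoted above; the paper’s exact model may differ slightly):

```python
import math

frac = 1e-3             # each machine samples 0.1% of the web
n_machines = 300_000

# P(a given page is indexed by none of the 300,000 machines):
# (1 - frac)^n ~ e^-300, astronomically small.
p_missed = (1 - frac) ** n_machines

# P(a given document is held by at least one of 1000 queried machines):
p_hit = 1 - (1 - frac) ** 1000
print(p_hit)            # about 0.632, i.e. 1 - 1/e

# P(at least 5 of the top-10 deterministic results are retrieved),
# treating each of the 10 as an independent Bernoulli(p_hit) trial:
p_half = sum(math.comb(10, k) * p_hit**k * (1 - p_hit) ** (10 - k)
             for k in range(5, 11))
print(p_half)           # about 0.88

# P(none of the top 10 is retrieved):
p_none = (1 - p_hit) ** 10
print(p_none)           # ~ 4.5e-5 under this simple independence model
```

So the headline probabilities (.63, .88, and a vanishing chance of a total miss) all fall out of a single binomial calculation.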

This is an interesting way to think about and perform search.

Each run of the algorithm will return different results, but each will be approximately correct.

While PACS can use resources equivalent to Google’s, it can trade accuracy for throughput in an information retrieval system.

It also simplifies the procedures for adding machines to the system and adjusting the indexes on each machine.
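The accuracy-for-throughput trade-off has a simple shape: probing more machines per query costs throughput, but raises the retrieval probability 1 - (1 - f)^k, where f is the fraction of the corpus each machine samples and k the number of machines queried. A quick sweep (my own illustration, using the 0.1% sampling fraction from above):

```python
# Retrieval probability as a function of how many machines we query,
# with f = 0.001 as in the worked example above.
f = 1e-3
for k in (100, 500, 1000, 2000, 5000):
    p = 1 - (1 - f) ** k
    print(f"{k:>5} machines queried -> P(doc retrieved) = {p:.3f}")
```

Querying 5x more machines than the example's 1000 pushes the per-document retrieval probability from about 0.63 to above 0.99, so the operator can dial accuracy up or down against query cost.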

I see this, in some sense, as moving complexity from the “Google core” toward the network edges.

While the reduced accuracy might frighten away potential users of PACS, this paper shows a nice, straightforward application of probability to an interesting problem.