This Year's posts

Archive for September, 2007

on being a search engineer

Saturday, September 8th, 2007

One of the main reasons I like programming computers is that building nontrivial software requires significant amounts of creative problem solving. For me, analyzing, understanding, modeling, explaining and solving these problems is immensely fun. It’s why I do what I do.

Web Search, even if it isn’t search of the whole web (as my job at Technorati isn’t, since we only index blogs and related content), is a challenging problem because the domains and ranges of the functions are enormous. What do I mean by that? Let me explain a little bit about how I think about problems.

When approaching a problem that I’d like to solve, I try and attack it from several different approaches, which hopefully leads to understanding the problem later. For programming problems, this usually means trying to model the problem in several different paradigms.

For search I tend to think in terms of Functional Programming. When I refer to Functional Programming, I think of it in rather broad, ambiguous terms: data in lists, code is data, every operation is a function that maps inputs to outputs and there is no encapsulated state.

For Web Search, the overall function maps the Web and a query to a list. Let’s call the web search function search. In pseudocode, this would be:

results = search(Web, query)

This is the function I mentioned at the beginning. It has an enormous domain (input), the entire Web. The results list is even more complicated. For a given query across the Web, a user will likely want only 10 results. This means the number of possible results is a perumtation of the number of documents on the web and the number of results wanted, which would be:

P(n,r) = n! / (n - r)!

where n = number of documents on the web and r is the number of results. Assuming 20 billion documents on the web, and 10 results, this gives us

P(n,r) = 20,000,000,000! / 19,999,999,990!

which is:

204,799,999,436,800,000,675,839,999,535,360,000,201,
949,439,942,268,480,010,934,175,998,654,480,000,102,
028,607,995,748,544,000,072,576,000,000,000,000

or

2.048000 x 10^113

Which, you’ll notice, is larger than one Googol.

The function search can be decomposed. In functional programming, decomposition of functions is the act of breaking functions down in to smaller parts (functions), which, when chained together, give the same result. If, for example, you have the function foo, which can be broken down into two parts, bar and baz where baz operates on the result of bar, you could say that baz(bar()) is the decomposition of foo(). This gives you something like:

foo = baz(bar())

We need to decompose search because there are orthogonal parts of search which, for reasons of scale should be implemented separately. The first decomposition would be to filter the Web variable down to the set which you wish to operate on. For us at Technorati, this function would be essential “is this a blog which is also not spam?”. For Google it might just be the identity function (meaning it does no filtering).

Our pseudo code would now be:

results = search(filter(Web), query)

At this point our search function can be implemented with a match, which takes a list of pages and a query and returns all pages that match the query. So, we have:

results = match(filter(Web), query)

There’s another reduction we can do to the Web variable– to build an inverted index. But, before I explain what an inverted index is, or how it fits into this analysis, let’s explore a more naive approach.

One approach we could take to implementing search is this algorithm:

  1. get a list of all pages on the web
  2. apply our filter to each page
  3. apply match to each page
  4. sort the results

For reasons that I hope are obvious, this is a rather naive approach. Steps 1 and 2 amount to crawling the entire Web, which even with large amounts of infrastructure, would easily take days or weeks. We want results in milliseconds. Also, in addition to crawling the entire Web, we would have to apply match to every document returned by our filter function, which itself could take an enormous amount of time.

The typical solution is to build an inverted index. An inverted index is alot like an index in the back of a book. It’s a list of words (or in search-speak, terms) which point to a list of pages. The reason it’s called an inverted index is that building usually requires an actual inversion (you could think of it as a sparse matrix inversion)).

Assume we have a function analyze, which given a page returns the list of words on the page. Like so:

words = analyze(page)

If we apply analyze to a list of pages (using map) we’d be given a list of lists of words (one list for each page).

page_words_list = map(pages, analyze)

To build an inverted index we need to flip this around so that we have a list for each word, which points to a list of pages.

Imagine we have a function invert that looks like this (in ruby):


    def invert(pages)
      index = Hash.new(0)
      for page in pages
        for word in analzye(page)
          index[word] = [] unless index.has_key?(word)
          index[word] << page unless (index[word].include?(page))
        end
      end
      return index
    end

Now that we have an inverted index we can ditch the match function and replace it with an inverted index look up:

hits = lookup(index, query)

which, given a query will return all of the documents that match the query (called hits). The algorithm for doing this is relatively simple (assuming that we only implement simply OR queries):


def lookup(index, query)
  hits = []
  for term in analyze(query)
    if index.include?(term)
      hits += index[term]
    end
  end
  return hits.flatten.uniq
end

index here is essentially a hash table. It can easily be implemented in sublinear time.

In addition to returning the documents, lookup also returns a score for each document. Discussion of scoring is out of scope for this essay, maybe I’ll talk about that later. Anyway, the reason we need scores is that we need some way to sort the hits into a results list.

So, now we have sort that takes a list of hits and sorts them by score:

results = sort(lookup(index, query))

At this point, the overall picture looks something like this:


results = sort(
            lookup(
              invert(
                filter(
                  map(Web, analyze)
                )
              ),
              query
            )
          )

Of course, this entire exercise doesn’t really lead to an implementation, but it can give you hints about how to build a search engine. First off, it can help you figure out where parallelism can be naturally added.

For each of the parts in the process that apply a function to a list of data, filtering the list down to a shorter list can be parallelized by partioning the input list and then joining the resulting lists back together. This holds for lookup, invert and filter.

Another insight that this analysis can bring is that there are a number of places in the process that affect the quality of the results. If you don’t filter the Web down to the proper subset (this includes removing spam), you’ll have poor search results. If you can’t analyze your documents well, you won’t have good search results. I could go on, but I won’t. The point is that there are lots of knobs to turn in this system. And this is what I like about it.

The source of my enjoyment when working on search is that it’s a big problem which requires creative thinking to do well. There are also lots of places in the general process in which innovations can be introduced. And, finally, you don’t have to worry about theoretical scalability limits (as much as with other problems) for many parts, because they can be parallelized.