Archive for 2007

on being a search engineer

Saturday, September 8th, 2007

One of the main reasons I like programming computers is that building nontrivial software requires significant amounts of creative problem solving. For me, analyzing, understanding, modeling, explaining and solving these problems is immensely fun. It’s why I do what I do.

Web Search, even if it isn’t search of the whole web (as my job at Technorati isn’t, since we only index blogs and related content), is a challenging problem because the domains and ranges of the functions are enormous. What do I mean by that? Let me explain a little bit about how I think about problems.

When approaching a problem that I’d like to solve, I try and attack it from several different approaches, which hopefully leads to understanding the problem later. For programming problems, this usually means trying to model the problem in several different paradigms.

For search I tend to think in terms of Functional Programming. When I refer to Functional Programming, I think of it in rather broad, ambiguous terms: data in lists, code is data, every operation is a function that maps inputs to outputs and there is no encapsulated state.

For Web Search, the overall function maps the Web and a query to a list. Let’s call the web search function search. In pseudocode, this would be:

results = search(Web, query)

This is the function I mentioned at the beginning. It has an enormous domain (input), the entire Web. The results list is even more complicated. For a given query across the Web, a user will likely want only 10 results. This means the number of possible results is a perumtation of the number of documents on the web and the number of results wanted, which would be:

P(n,r) = n! / (n - r)!

where n = number of documents on the web and r is the number of results. Assuming 20 billion documents on the web, and 10 results, this gives us

P(n,r) = 20,000,000,000! / 19,999,999,990!

which is:

204,799,999,436,800,000,675,839,999,535,360,000,201,
949,439,942,268,480,010,934,175,998,654,480,000,102,
028,607,995,748,544,000,072,576,000,000,000,000

or

2.048000 x 10^113

Which, you’ll notice, is larger than one Googol.

The function search can be decomposed. In functional programming, decomposition of functions is the act of breaking functions down in to smaller parts (functions), which, when chained together, give the same result. If, for example, you have the function foo, which can be broken down into two parts, bar and baz where baz operates on the result of bar, you could say that baz(bar()) is the decomposition of foo(). This gives you something like:

foo = baz(bar())

We need to decompose search because there are orthogonal parts of search which, for reasons of scale should be implemented separately. The first decomposition would be to filter the Web variable down to the set which you wish to operate on. For us at Technorati, this function would be essential “is this a blog which is also not spam?”. For Google it might just be the identity function (meaning it does no filtering).

Our pseudo code would now be:

results = search(filter(Web), query)

At this point our search function can be implemented with a match, which takes a list of pages and a query and returns all pages that match the query. So, we have:

results = match(filter(Web), query)

There’s another reduction we can do to the Web variable– to build an inverted index. But, before I explain what an inverted index is, or how it fits into this analysis, let’s explore a more naive approach.

One approach we could take to implementing search is this algorithm:

  1. get a list of all pages on the web
  2. apply our filter to each page
  3. apply match to each page
  4. sort the results

For reasons that I hope are obvious, this is a rather naive approach. Steps 1 and 2 amount to crawling the entire Web, which even with large amounts of infrastructure, would easily take days or weeks. We want results in milliseconds. Also, in addition to crawling the entire Web, we would have to apply match to every document returned by our filter function, which itself could take an enormous amount of time.

The typical solution is to build an inverted index. An inverted index is alot like an index in the back of a book. It’s a list of words (or in search-speak, terms) which point to a list of pages. The reason it’s called an inverted index is that building usually requires an actual inversion (you could think of it as a sparse matrix inversion)).

Assume we have a function analyze, which given a page returns the list of words on the page. Like so:

words = analyze(page)

If we apply analyze to a list of pages (using map) we’d be given a list of lists of words (one list for each page).

page_words_list = map(pages, analyze)

To build an inverted index we need to flip this around so that we have a list for each word, which points to a list of pages.

Imagine we have a function invert that looks like this (in ruby):


    def invert(pages)
      index = Hash.new(0)
      for page in pages
        for word in analzye(page)
          index[word] = [] unless index.has_key?(word)
          index[word] << page unless (index[word].include?(page))
        end
      end
      return index
    end

Now that we have an inverted index we can ditch the match function and replace it with an inverted index look up:

hits = lookup(index, query)

which, given a query will return all of the documents that match the query (called hits). The algorithm for doing this is relatively simple (assuming that we only implement simply OR queries):


def lookup(index, query)
  hits = []
  for term in analyze(query)
    if index.include?(term)
      hits += index[term]
    end
  end
  return hits.flatten.uniq
end

index here is essentially a hash table. It can easily be implemented in sublinear time.

In addition to returning the documents, lookup also returns a score for each document. Discussion of scoring is out of scope for this essay, maybe I’ll talk about that later. Anyway, the reason we need scores is that we need some way to sort the hits into a results list.

So, now we have sort that takes a list of hits and sorts them by score:

results = sort(lookup(index, query))

At this point, the overall picture looks something like this:


results = sort(
            lookup(
              invert(
                filter(
                  map(Web, analyze)
                )
              ),
              query
            )
          )

Of course, this entire exercise doesn’t really lead to an implementation, but it can give you hints about how to build a search engine. First off, it can help you figure out where parallelism can be naturally added.

For each of the parts in the process that apply a function to a list of data, filtering the list down to a shorter list can be parallelized by partioning the input list and then joining the resulting lists back together. This holds for lookup, invert and filter.

Another insight that this analysis can bring is that there are a number of places in the process that affect the quality of the results. If you don’t filter the Web down to the proper subset (this includes removing spam), you’ll have poor search results. If you can’t analyze your documents well, you won’t have good search results. I could go on, but I won’t. The point is that there are lots of knobs to turn in this system. And this is what I like about it.

The source of my enjoyment when working on search is that it’s a big problem which requires creative thinking to do well. There are also lots of places in the general process in which innovations can be introduced. And, finally, you don’t have to worry about theoretical scalability limits (as much as with other problems) for many parts, because they can be parallelized.

Mercurial

Sunday, August 26th, 2007

Ed: To make up for the long drought in blogging, I’m gonna try and push out some drafts that I’ve written over the last year or so, but never published. This is the first of them.

CVS really annoys me and I think it’s outlived its usefulness. I’m not familiar with the environment it was built for, but I suspect that that environment existed sometime before I was born. Here are a few things that annoy me about CVS:

  • It doesn’t recursively update directories. This means, for example, that if up do cvs up in a working copy, it won’t update nested directories, nor will it create directories which have been added by someone else. You have to do cvs up -dR to get the expected behavior.
  • You can’t delete directories. The best you can do is to delete all of the entries from the repository, delete the directory then be careful to not recreate the directory in your working copy. This means that if you do the above (cvs up -dR), you will end up with the empty directories coming back over and over again.
  • Its slow. Really, it is. You won’t notice it until you use something else. The fact that it’s written in C does not guarantee that it will be fast.
  • It doesn’t understand changesets, only changes to individual files. Most changes I do to code involves little bits of change across several files. CVS has no way of associating those changes with each other. This also means that there is potential to create race conditions, where two people are checking in a series of files at once, which conflict with each other and potentially leave the repository in an inconsistent state. Source code management tools are supposed to help keep a consistent history of the code, not destroy it. This is s serious bug.
  • Merging with CVS is a special form of torture. I didn’t realize how painful this is until I started using a tool that is built as if merging is important.
  • CVS is difficult to install.
  • I can’t work offline. For example, if I’m writing some code on a plane, I have to wait until I land to do my checkins. No matter than I made a bunch of different changes that should all be their own changesets.

For a long time, I thought Subversion was the answer. That was until I tried Mercurial. Subversion offers only incremental improvements on CVS, while Mercurial and similar tools offer new ways of working that have made me more productive.

I tried Mercurial based on a suggestion from Dan Connolly. I was struggling to set up subversion on microformats.org and he’d begun using it on his own projects. I was converted almost immediately.

Subversion certainly has advantages over CVS– it can perform better, it’s web native and makes branching and merging easier. But, they both suffer from the same client-server model. The client server model doesn’t reflect real world practices when it comes to software development.

For example, when I was working on front end code at Technorati (I work on mostly infrastructure stuff these days), there would often be times when I needed to share a bit of work I was doing with another developer so that they could finish up a feature. My code wasn’t ready to go into the mainline of code because it was incomplete.

What we usually ended up doing was copying files across on the development machine. Then when the other engineer finished the feature, they’d check in all of the changes, which has some problems.

The problem is that my work is mixed in with their work. No longer do we have a clean change history.

If we had a tool for easily sharing changesets with each other without committing them to the central repository, we could work much more smoothly and cleanly.

So, enter Distributed Source Code Management systems, of which mercurial is an example. I’m not going to go into their design or history too deeply. Suffice it to say that with DSCMs, each working copy is a full repository and a peer to all other repositories. The peers can share changesets easily.

A result of this architecture is that essentially every change is a branch and every change must be merged. So, merging is really easy. Branching is even easier.

And this is the key difference. Once merging and branching became trivially easy, I found a million uses for it. Suddenly I could branch for every feature without pain. I now find that for a given project I usually have five or six branches lying around (each as their own working copy) and I have no problems keeping them in sync with the main line.

When it comes down to it, I like Mercurial for the freedom it gives me. I’m free to work offline, to branch and merge easily without subjecting others to the same organization and to easily create and share repositories.

If you believe that freedom helps people get their jobs done, then ditch CVS and Subversion.

Conversations

Thursday, August 9th, 2007

Conversations I seem to have all the time.

Scene: Anywhere.

  1. Them Where are you from?
  2. Me Kansas City.
  3. Them Ah, Kansas.
  4. Me Actually, Missouri.
  5. Them What?
  6. Me Kansas City is in Missouri
  7. Them Really? Wait, isn’t it pronounced Miz-ur-uh?
  8. Me Nevermind.

Scene: a loud place, like a coffee shop.

  1. Them What’s your name?
  2. Me Ryan
  3. Them Brian?
  4. Me Rrrryan
  5. Them Bryant?
  6. Me With an R.
  7. Them Brian with an R? ?
  8. Me No B
  9. Them Huh?
  10. Me Nevermind.

SxSW Chronicles 1

Saturday, March 10th, 2007

I got into Austin yesterday afternoon and have barely stopped since then. Geek Spring Break is in full action.

After checking into my hotel and picking up my badge I hit up the How to Rock SxSW panel. Despite finishing a bottle of Jack Daniels on stage (or maybe “because of”) the panel was fun. Not sure how much I learned, since I’m friends with most of the panelists.

On the party circuit last night, I hit up Break Bread with Brad. It moved to Buffalo Billiards this year (which is huge, by the way) and was a good time. Got to see a bunch of people that I hadn’t seen since this time last year.

Apparently I had a lot to drink last night (no surprises there), because this morning was a bit rough. Today’s panel action was not bad. I made it to “Under 18: Blogs, Wikis and Online Social Networks for Youth“, “Grids Are Good and How to Design with Them“, “Ruining the User Experience: When JavaScript and Ajax Go Bad” and Ghost in the Machine: Spirituality Online.

I’ve enjoyed the panels because I’ve been listening to topics that I’m not an expert on. I recommend that strategy to anyone.

T: “He’s not creepy, he’s just odd.”

Monday, February 19th, 2007

Creepiness is oddness integrated over time.