This Year's posts

Archive for February, 2008

Friday Five: Upbeat

Friday, February 29th, 2008

I picked these songs because they have a kind of upbeat energy to them. Sorta. Whatever, they just sound good together.


Track Listing

  1. Time to Pretend – MGMT – Oracular Spectacular
  2. So It Goes – The Broken West – I Can’t Go On I’ll Go On
  3. Easy on Yourself – The Drive-by Truckers – A Blessing and a Curse
  4. Ode to LRC – Band of Horses – Cease to Begin
  5. Narcocorrido – Okkervil River – Black Sheep Boy Appendix

Introducing Conveyor

Tuesday, February 26th, 2008

For the last month or so, I’ve been working, along with the guys at Minimal Loop (note: that website is blank), on a new open source project called Conveyor.

What is Conveyor? Well, that’s a good question.

One way of describing it is as a “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it never actually discards any data, you can rewind the queue to any point in the past. And because the “queues” are really just iterators over the same data, it can act as a group of virtual queues, appearing as a separate queue to each set of consumers.
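To make the “rewindable virtual queue” idea concrete, here’s a toy in-memory sketch: every item goes into an append-only log that is never truncated, and each consumer is just a cursor into that log. To be clear, none of these class or method names come from Conveyor itself; they’re invented for illustration.

```ruby
class RewindableQueue
  def initialize
    @log = []              # append-only storage; nothing is ever deleted
    @cursors = Hash.new(0) # one position per named consumer
  end

  def push(item)
    @log << item
  end

  # Pop the next item for one consumer without affecting the others.
  def pop(consumer)
    pos = @cursors[consumer]
    return nil if pos >= @log.length
    @cursors[consumer] += 1
    @log[pos]
  end

  # "Rewinding" just moves a consumer's cursor back; the data is still there.
  def rewind(consumer, to = 0)
    @cursors[consumer] = to
  end
end

q = RewindableQueue.new
q.push("a")
q.push("b")
q.pop("indexer")    # => "a"
q.pop("indexer")    # => "b"
q.rewind("indexer")
q.pop("indexer")    # => "a" again; a separate "archiver" consumer
                    #    would still start from the beginning
```

The point is that queue semantics and full history aren’t in tension: consumers see a queue, but the server keeps the whole stream.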

A good catchphrase is: “Like TiVo for your data”. It records, pauses, and rewinds a broadcast stream.

Here’s a bit of the motivation:

Many people in the web industry are coming to the realization that the era of one-size-fits-all databases is over, at least for large websites. The future of large websites’ data storage is likely a collection of special-purpose data stores: GFS/MapReduce for batch jobs, inverted indexes for search and fast retrieval of small result sets, and relational databases for smaller datasets that need online analysis. BigTable and SimpleDB-like things fit in there somewhere too.

The question remains though, how do you tie these together?

In my (limited) experience with storing the same data in multiple data stores, it’s useful to treat one of the stores as primary and the others as derivatives of it. So, for example, you might keep your primary data in MySQL, but build inverted indexes with Lucene. You can usually tolerate your search indexes being a little out of date with respect to the database, as long as they aren’t too far behind, in the same way that your MySQL slaves can be out of sync with the master, but not too far.

In this case, Conveyor can be used like an application-agnostic version of MySQL binlogs, which can be replayed to write data into multiple, diverse data stores.
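To make the binlog analogy concrete, here’s a rough sketch of replaying one primary event log into several derived stores. The event data and the writer callbacks are stand-ins I made up, not Conveyor’s actual API:

```ruby
# A primary, append-only log of changes (the "binlog").
events = [
  { id: 1, title: "hello" },
  { id: 2, title: "world" },
]

search_index = {} # stand-in for an inverted index (e.g. Lucene)
cache        = {} # stand-in for a key-value cache

# One writer per derived store; each consumes the same stream.
writers = [
  ->(e) { search_index[e[:title]] = e[:id] },
  ->(e) { cache[e[:id]] = e },
]

# Replay the whole log into every derived store. A store added later
# just replays from position zero (or from a snapshot) and catches up.
events.each { |e| writers.each { |w| w.call(e) } }
```

Because the log is never discarded, “add a new derived store” is just “attach a new consumer and replay”.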

Another use case is a multi-stage web crawler. One component fetches pages and stores them in a cache. Another component takes those pages out of the cache and parses them; the parsed output is passed to another stage that stores it in a database and writes a log of the changed data. See where I’m going with this?

Conveyor is useful at each stage of this architecture: you get queue semantics to distribute work, you get rewindability to deal with bugs in your code without re-running jobs from scratch, and the virtual queues mean that your stages can branch with very little overhead and without redoing any previous work. Want to add a new data store later? Just write a Conveyor client that starts at the beginning of the queue, or initialize it from a snapshot (making sure you know where in the stream of data that snapshot was taken), and let it catch up.
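Here’s a minimal sketch of the crawler wiring described above, with plain arrays standing in for the queues between stages (the stage logic and sample pages are invented for illustration):

```ruby
fetched = [] # stage 1 output: raw pages
parsed  = [] # stage 2 output: extracted data
stored  = [] # stage 3 output: "database" writes

# Stage 1: fetch (faked here with canned page bodies).
["<h1>a</h1>", "<h1>b</h1>"].each { |page| fetched << page }

# Stage 2: parse pages pulled off the previous stage's queue.
fetched.each { |page| parsed << page[/<h1>(.*)<\/h1>/, 1] }

# Stage 3: store the parsed results. Because every stage's queue keeps
# its data, a buggy parser can be fixed and stage 2 simply rewound and
# re-run without re-fetching anything.
parsed.each { |title| stored << { title: title } }
```

Each array here would be a (rewindable) Conveyor queue in practice, so stages can be rerun or branched independently.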

Anyway, Conveyor is still a rough work in progress. It’s very alpha and not many people are using it yet (read: there are probably undiscovered bugs).

If you’d like to try it, it’s as simple as sudo gem install conveyor (if you use rubygems), or you can browse over to the rubyforge page and download a tarball. Conveyor depends on Thin, daemons, and json.

Then to run it, you just do conveyor <data dir>, where <data dir> is the directory where you want Conveyor to store its data.

Update: I forgot to mention that there’s also a mailing list and an irc channel.

Friday Five: Round 2

Friday, February 8th, 2008

Round 2 of digital mixtaping. Enjoy.

Here it is: Round 2.

Track listing:

  1. This is not a Love Song – Nouvelle Vague – s/t
  2. I Never Wanted You – Headphones – s/t
  3. Baby, We’ll Be Fine – The National – Alligator
  4. Science vs. Romance – Rilo Kiley – Take Offs & Landings
  5. Parentheses – The Blow – Paper Television


Saturday, February 2nd, 2008

….validatin’ ur HTMLz

I’ve been working on a side project for a while. The basic idea is that you sign up, give it a few URLs you care about, and it’ll check them about every 24 hours to see if the HTML is valid. You can then get a feed of the results. A simple idea, so simple I’m surprised no one’s done it before.

For now, the site is somewhat limited: you can only have 5 URLs checked, and it will only check them on schedule (you can’t force an update). But I plan on lifting some of these restrictions and adding new types of checks in the future, and there’ll probably be a premium subscription version, too. I’m releasing this now because I want to get more feedback on the ideas and design.

So, go check it out, then tell me what you think.