For the last month or so, I’ve been working, along with the guys at Minimal Loop (note: that website is currently blank), on a new open source project called Conveyor.
What is Conveyor? Well, that’s a good question.
One way of describing it is as a “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it never actually discards data, you can rewind the queue to any point in the past. And because the “queues” are really just iterators over the same data, it can act as a group of virtual queues, appearing as a separate queue to each set of consumers.
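To make the “queues are really just iterators” idea concrete, here’s a toy in-memory model of the concept in Ruby. This is a sketch of the idea, not Conveyor’s actual implementation or API: an append-only log where each named consumer gets its own cursor, so nothing is ever removed and any consumer can rewind.

```ruby
# A toy model of a rewindable virtual queue: one append-only log,
# one independent cursor per named consumer.
class RewindableLog
  def initialize
    @entries = []           # the log itself; entries are never removed
    @cursors = Hash.new(0)  # each consumer's position, defaulting to 0
  end

  # Append an entry and return its position in the log.
  def post(entry)
    @entries << entry
    @entries.size - 1
  end

  # Hand the named consumer its next entry, or nil when it's caught up.
  def next_for(consumer)
    i = @cursors[consumer]
    return nil if i >= @entries.size
    @cursors[consumer] += 1
    @entries[i]
  end

  # Rewind a consumer to any earlier position; the data is still there.
  def rewind(consumer, position = 0)
    @cursors[consumer] = position
  end
end

log = RewindableLog.new
%w[a b c].each { |e| log.post(e) }

log.next_for("search")  # => "a"
log.next_for("search")  # => "b"
log.next_for("stats")   # => "a"  (independent cursor, same data)

log.rewind("search")    # replay from the beginning
log.next_for("search")  # => "a"
```

Because consumers only hold cursors, adding another “queue” over the same data costs nothing but a new cursor.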
A good catchphrase is: “Like TiVo for your data”. It records, it pauses and it rewinds a broadcast stream.
Here’s a bit of the motivation:
Many people in the web industry are coming to the realization that the era of one-size-fits-all databases is over, at least for large websites. The future for large websites’ data storage is likely a collection of special-purpose data stores: GFS/MapReduce for batch jobs, inverted indexes for search and fast retrieval of small result sets, and relational databases for smaller datasets that need online analysis. BigTable and SimpleDB-like things fit in there somewhere too.
The question remains though, how do you tie these together?
In my (limited) experience with storing the same data in multiple data stores, it’s useful to treat one of the data stores as primary and the others as derivative of that primary store. So, for example, you might keep your primary data in MySQL, but build inverted indexes with Lucene. You can usually tolerate your search indexes being a little out of date with the database, just so long as they aren’t too far out, in the same way that your MySQL slaves can be out of sync with the master, but not too far.
In this case, Conveyor can be used like an application-agnostic version of MySQL binlogs, which can be replayed to write data into multiple, diverse data stores.
Another use case is a multi-stage web crawler. You have a component that fetches pages and stores them in a cache. Another component takes those pages out of the cache and parses them; the parsed output is passed to another stage that stores it in a database and writes a log of the changed data. See where I’m going with this?
Conveyor is useful at each stage of this architecture: you get queue semantics to distribute work, you get rewindability to deal with bugs in your code without re-running jobs from scratch, and the virtual-ness of the queues means that your stages can branch with very little overhead and without redoing any previous work. Want to add a new data store later? Just write a Conveyor client that starts at the beginning of the queue, or initialize it from a snapshot (making sure you know where in the stream of data that snapshot came from), and let it catch up.
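Here’s a small sketch of that branching, again in plain Ruby rather than against Conveyor’s real API (the arrays stand in for Conveyor channels, and the stage names are made up for illustration). Each stage logs its output; a brand-new consumer added later just starts its own cursor at the beginning of an existing log and catches up, without redoing the earlier stages’ work.

```ruby
# Arrays stand in for the per-stage logs (Conveyor channels).
fetched = []  # stage 1 output: raw pages
parsed  = []  # stage 2 output: parsed records

# Stage 1: "fetch" pages (simulated here) and log them.
["<html>one</html>", "<html>two</html>"].each { |page| fetched << page }

# Stage 2: consume the fetched log and log the parsed results.
cursor = 0
while cursor < fetched.size
  parsed << fetched[cursor].gsub(/<[^>]+>/, "")  # crude "parsing"
  cursor += 1
end

# Later, a new consumer (say, a search indexer) branches off the
# parsed log: it starts its own cursor at 0 and catches up, without
# re-fetching or re-parsing anything.
index_cursor = 0
index = []
while index_cursor < parsed.size
  index << parsed[index_cursor].upcase
  index_cursor += 1
end

parsed  # => ["one", "two"]
index   # => ["ONE", "TWO"]
```

The point is that each downstream stage only needs a cursor into its upstream log, so adding a branch never disturbs the stages that came before it.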
Anyway, Conveyor is still a rough work in progress. It’s very alpha and not many people are using it yet (read: there are probably undiscovered bugs).
If you’d like to try it, it’s as simple as
sudo gem install conveyor (if you use rubygems) or you can browse over to the rubyforge page and download a tarball. Conveyor depends on Thin, daemons and json.
Then to run it you just do
conveyor <data dir> where <data dir> is the directory where you want Conveyor to store its data.
Update: I forgot to mention that there’s also a mailing list and irc channel.