Introducing Conveyor

For the last month or so, I’ve been working with the guys at Minimal Loop (note: that website is blank) on a new open source project called Conveyor.

What is Conveyor? Well, that’s a good question.

One way of describing it is as a “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it doesn’t actually get rid of any data, you can rewind the queue to any point in the past. You can also treat it like a group of virtual queues: it appears as a separate queue to each of several sets of consumers, because the “queues” are really just iterators.

A good catchphrase is: “Like TiVo for your data”. It records, it pauses and it rewinds a broadcast stream.
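
To make the “queues are really just iterators” idea concrete, here is a minimal in-memory sketch. It is not Conveyor’s code or its HTTP API; the class name `RewindableLog` and the methods `post`, `next_for` and `rewind` are made up for illustration. It only assumes what is described above: entries are never deleted, each consumer group has its own position, and a position can be moved back to an earlier sequence id.

```ruby
# Illustrative in-memory model only; not Conveyor's real code or HTTP API.
class RewindableLog
  def initialize
    @entries = []             # append-only: entries are never deleted
    @positions = Hash.new(0)  # iterator name => index of the next entry to read
  end

  # Append an entry and return its sequence id (1-based).
  def post(data)
    @entries << data
    @entries.size
  end

  # Return the next unread entry for the named iterator, or nil at the end.
  def next_for(name)
    pos = @positions[name]
    return nil if pos >= @entries.size

    @positions[name] = pos + 1
    @entries[pos]
  end

  # Move the named iterator so the entry with sequence id `id` is read next.
  def rewind(name, id)
    raise ArgumentError, "no such id: #{id}" unless id.between?(1, @entries.size)

    @positions[name] = id - 1
  end
end

log = RewindableLog.new
%w[a b c].each { |item| log.post(item) }

# Two "virtual queues" over the same data, each with its own position.
first_for_indexer  = log.next_for(:indexer)
second_for_indexer = log.next_for(:indexer)
first_for_archiver = log.next_for(:archiver)

# Rewinding one iterator does not affect the other.
log.rewind(:indexer, 1)
replayed_for_indexer = log.next_for(:indexer)
second_for_archiver  = log.next_for(:archiver)
```

Adding another “queue” in this model costs one more hash entry, which is why branching is cheap: the data is stored once and only the positions multiply.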

Here’s a bit of the motivation:

Many people in the web industry are coming to the realization that the era of one-size-fits-all databases is over, at least for large websites. The future for large websites’ data storage is likely a collection of special purpose data stores: GFS/MapReduce for batch jobs, inverted indexes for search and fast retrieval of small result sets, and relational databases for smaller datasets which need online analysis. BigTable and SimpleDB-like things fit in there somewhere too.

The question remains though, how do you tie these together?

In my (limited) experience with storing the same data in multiple data stores, it’s useful to treat one of the data stores as primary and the others as derivative of that primary store. So, for example, you might keep your primary data in MySQL, but build inverted indexes with Lucene. You can usually tolerate your search indexes being a little out of date with the database, just so long as they aren’t too far out, in the same way that your MySQL slaves can be out of sync with the master, but not too far.

In this case, Conveyor can be used like an application-agnostic version of MySQL binlogs, which can be replayed to write data into multiple, diverse data stores.

Another use case is a multi-stage web crawler. You have a component that fetches pages and stores them in a cache. Another component takes those pages out of the cache and parses them, then passes the results to another stage that stores them in a database and writes a log of the changed data. See where I’m going with this?

Conveyor is useful at each stage of this architecture: you get queue semantics to distribute work, you get rewindability to deal with bugs in your code without re-running jobs from scratch, and the virtual-ness of the queues means that your stages can branch with very little overhead and without redoing any previous work. Want to add a new data store later? Just write a Conveyor client that starts at the beginning of the queue, or initialize it from a snapshot (making sure you know where in the stream of data that snapshot came from), and let it catch up.
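
Here is a rough sketch of that “let it catch up” step, with a plain array standing in for the stream and hashes standing in for data stores. None of this is Conveyor’s API: `LogEntry`, `catch_up` and the sample keys are hypothetical. The point is only that a store started from the beginning and a store started from a snapshot (plus the id the snapshot was taken at) should converge on the same state.

```ruby
# Hypothetical sketch: an array plays the stream, hashes play the data stores.
LogEntry = Struct.new(:id, :key, :value)

# Apply every entry whose id is greater than last_applied_id to the store.
# Returns the id of the last entry applied, i.e. the store's new position.
def catch_up(log, store, last_applied_id)
  log.each do |entry|
    next if entry.id <= last_applied_id

    store[entry.key] = entry.value
    last_applied_id = entry.id
  end
  last_applied_id
end

log = [
  LogEntry.new(1, "page:1", "v1"),
  LogEntry.new(2, "page:2", "v1"),
  LogEntry.new(3, "page:1", "v2")
]

# Option 1: a new store that starts at the beginning of the stream.
from_scratch = {}
scratch_position = catch_up(log, from_scratch, 0)

# Option 2: a new store initialized from a snapshot taken after entry 2.
# The snapshot is only usable because we know which id it corresponds to.
snapshot = { "page:1" => "v1", "page:2" => "v1" }
snapshot_position = 2
from_snapshot = snapshot.dup
final_position = catch_up(log, from_snapshot, snapshot_position)
```

If the snapshot position were recorded wrong (say 3 instead of 2), the last update to `page:1` would be silently skipped, which is why the position has to travel with the snapshot.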

Anyway, Conveyor is still a rough work in progress. It’s very alpha and not many people are using it yet (read: there are probably undiscovered bugs).

If you’d like to try it, it’s as simple as sudo gem install conveyor (if you use rubygems) or you can browse over to the rubyforge page and download a tarball. Conveyor depends on Thin, daemons and json.

Then to run it you just do conveyor <data dir> where <data dir> is the directory where you want conveyor to store its data.

Update: I forgot to mention that there’s also a mailing list and irc channel.

10 Responses to “Introducing Conveyor”

  1. Internet Alchemy » links for 2008-02-27 Says:

    […] the ryan king » Introducing Conveyor A “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it doesn’t actually get rid of any data, you can rewind the queue to any point in the p (tags: open source queues scalability http) […]

  2. Otis Gospodnetic Says:

    Funny, earlier today I wondered what Ryan was up to. Now I know.

    Sounds *very* useful!

    Can you elaborate on “…a *group* of virtual queues…”? Meaning a “set”/several queues or? What makes them virtual?

    It speaks HTTP….good, will you be writing some Ruby/Java/Groovy/whatever clients, too?

    The storage file format is custom stuff, it seems?

    Is rewinding time stamp based? That is, how do you specify how far back you want to rewind to?

  3. ryan Says:

    Otis-

    I’m glad you think it sounds useful. I suspect it also sounds a bit familiar. :)

    The virtual-ness of the queues is in that they are really just iterators in a never-ending list. I started off calling them queues and queue groups, but I think it may be more clear to call them just iterators.

    There’s a Ruby client in the distribution already. Matt Ericson has a Java client. Hopefully we can get him to open-source it.

    Yeah, the storage is all custom and documented here.

    The rewinding is currently by id (sequence number), but you should be able to rewind to a timestamp soon (I’ll probably have that done by next week).

  4. Otis Gospodnetic Says:

    Oh, a plain-text storage, ok. Yes, it does sound familiar – didn’t want to say it. :)
    +1 for Matt!

  5. links for 2008-03-01 « Bloggitation Says:

    […] the ryan king » Introducing Conveyor distributed, rewindable, virtual queue server. Conveyor depends on Thin, daemons and json. (tags: ruby programming cluster database sqs) […]

  6. a work on process » links for 2008-03-03 Says:

    […] the ryan king » Introducing Conveyor "One way of describing it is as a “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it doesn’t actually get rid of any data, you can r (tags: conveyor http messagequeue) […]

  7. Nodalities » Blog Archive » This Week’s Semantic Web Says:

    […] Introducing Conveyor – a “distributed, rewindable, virtual queue server” […]
