Archive for the 'General' Category

Introducing Conveyor

Tuesday, February 26th, 2008

For the last month or so, I’ve been working, along with the guys at Minimal Loop (note: that website is blank), on a new open source project called Conveyor.

What is Conveyor? Well, that’s a good question.

One way of describing it is as a “distributed, rewindable, virtual queue server”. It speaks HTTP and will soon have a peer-to-peer replication mode. It can be treated like a queue, but because it doesn’t actually get rid of any data, you can rewind the queue to any point in the past. You can also treat it as a group of virtual queues: it appears as a queue to each of several sets of consumers, because the “queues” are really just iterators.
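To make the “queues are really just iterators” idea concrete, here is a minimal in-memory sketch in Python. It is not Conveyor’s implementation or its HTTP API; the class name, method names and consumer group names are all made up for illustration. The assumption is simply an append-only list plus one read position per group of consumers.

```python
class RewindableLog:
    """Hypothetical append-only log with one read cursor per consumer group."""

    def __init__(self):
        self._entries = []   # entries are only ever appended, never removed
        self._cursors = {}   # group name -> index of the next entry to hand out

    def append(self, item):
        """Store an item and return its position in the log."""
        self._entries.append(item)
        return len(self._entries) - 1

    def get_next(self, group):
        """Return the next unread entry for this group, or None if caught up."""
        position = self._cursors.get(group, 0)
        if position >= len(self._entries):
            return None
        self._cursors[group] = position + 1
        return self._entries[position]

    def rewind(self, group, position=0):
        """Move a group's cursor back (or forward) to an earlier position."""
        if not 0 <= position <= len(self._entries):
            raise IndexError("position is outside the log")
        self._cursors[group] = position


log = RewindableLog()
for page in ["page-1", "page-2", "page-3"]:
    log.append(page)

# Two groups read the same data independently of each other.
indexer_first = log.get_next("indexers")
indexer_second = log.get_next("indexers")
archiver_first = log.get_next("archivers")

# Rewinding only moves a cursor; no data was ever thrown away.
log.rewind("indexers", 0)
replayed = log.get_next("indexers")
```

Because reading only advances a cursor, adding another “queue” costs one more integer, which is roughly why the virtual queues are cheap.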

A good catchphrase is: “Like TiVo for your data”. It records, it pauses and it rewinds a broadcast stream.

Here’s a bit of the motivation:

Many people in the web industry are coming to the realization that the era of one-size-fits-all databases is over, at least for large websites. The future for large websites’ data storage is likely a collection of special-purpose data stores: GFS/MapReduce for batch jobs, inverted indexes for search and fast retrieval of small result sets, and relational databases for smaller datasets which need online analysis. BigTable and SimpleDB-like things fit in there somewhere too.

The question remains though, how do you tie these together?

In my (limited) experience with storing the same data in multiple data stores, it’s useful to treat one of the data stores as primary and the others as derivative of that primary store. So, for example, you might keep your primary data in MySQL, but build inverted indexes with Lucene. You can usually tolerate your search indexes being a little out of date with the database, just so long as they aren’t too far out, in the same way that your MySQL slaves can be out of sync with the master, but not too far.

In this case, Conveyor can be used like an application-agnostic version of MySQL binlogs, which can be replayed to write data into multiple, diverse data stores.

Another use case is a multi-stage web crawler. You have a component that fetches pages and stores them in a cache. Another component takes those pages out of the cache and parses them, and the parsed output is passed to another stage that stores it in a database and writes a log of the changed data. See where I’m going with this?

Conveyor is useful at each stage of this architecture: you get queue semantics to distribute work, you get rewindability to deal with bugs in your code without re-running jobs from scratch, and the virtual-ness of the queues means that your stages can branch with very little overhead and without redoing any previous work. Want to add a new data store later? Just write a Conveyor client that starts at the beginning of the queue, or initialize it from a snapshot (making sure you know where in the stream of data that snapshot came from), and let it catch up.
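Here is a rough sketch of that catch-up idea in Python. Again, this is not Conveyor’s client API: the log is a plain list, the “data stores” are dicts, and the `replay` and `make_writer` helpers, the entry format and the snapshot position are all assumptions made up for the example.

```python
def replay(log, apply, start=0):
    """Apply every log entry from `start` onward; return where to resume."""
    position = start
    for entry in log[start:]:
        apply(entry)
        position += 1
    return position


def make_writer(store):
    """Build a function that applies one change entry to a dict-backed store."""
    def apply(entry):
        operation, key, value = entry
        if operation == "set":
            store[key] = value
        elif operation == "delete":
            store.pop(key, None)
    return apply


# A hypothetical stream of changes, oldest first.
change_log = [
    ("set", "a", 1),
    ("set", "b", 2),
    ("delete", "a", None),
    ("set", "c", 3),
]

# A store that has been consuming since the beginning of the stream.
full_store = {}
full_position = replay(change_log, make_writer(full_store))

# A store added later, seeded from a snapshot taken after the first
# two entries, so it only needs to replay from position 2.
snapshot = {"a": 1, "b": 2}
snapshot_position = 2
late_store = dict(snapshot)
late_position = replay(change_log, make_writer(late_store), start=snapshot_position)
```

Both stores end up with the same contents, which only works because the snapshot’s position in the stream was recorded; replaying from the wrong offset would either skip changes or apply them twice.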

Anyway, Conveyor is still a rough work in progress. It’s very alpha and not many people are using it yet (read: there are probably undiscovered bugs).

If you’d like to try it, it’s as simple as sudo gem install conveyor (if you use rubygems), or you can browse over to the rubyforge page and download a tarball. Conveyor depends on Thin, daemons and json.

Then to run it you just do conveyor <data dir> where <data dir> is the directory where you want conveyor to store its data.

Update: I forgot to mention that there’s also a mailing list and IRC channel.

Friday Five: Round 2

Friday, February 8th, 2008

Round 2 of digital mixtaping. Enjoy.

Here it is: Round 2.

Track listing:

  1. “This is not a Love Song”, Nouvelle Vague - s/t
  2. “I Never Wanted You”, Headphones - s/t
  3. “Baby, We’ll Be Fine”, The National - Alligator
  4. “Science vs. Romance”, Rilo Kiley - Take Offs & Landings
  5. “Parentheses”, The Blow - Paper Television

inursite…

Saturday, February 2nd, 2008

….validatin’ ur HTMLz

I’ve been working on a side project for a while, called inursite.com. The basic idea is that you sign up, give it a few URLs you care about and it’ll check them about every 24 hours to see if the HTML is valid. You can then get a feed of the results. A simple idea, so simple I’m surprised no one’s done it before.

For now, the site is somewhat limited: you can only have 5 URLs to be checked, and it will only check them on schedule (you can’t force it to update), but I plan on lifting some of these restrictions and adding new types of checks in the future, and there’ll probably be a premium subscription version, too. I’m releasing this now because I want to get more feedback on the ideas and design.

So, go check it out, then tell me what you think.

Risk

Tuesday, January 29th, 2008

I don’t usually write personal things here, but I felt I needed to make an exception for this.

A while back Matt suggested I “write down the things that are most important to [me] going forward”.

I tried to think of concrete things that mattered like “I want to work on projects like X” or “I want to change the world to make it more like Y”, but to be honest I don’t really know what X and Y are. At least, X and Y seem to change from week to week.

There’s the problem though: I have a million different things I’d like to see happen, but I haven’t been making them happen. Instead, I’ve just been wishing they would fall into place without any risk-taking on my part. The worst part is that I’ll sometimes be disappointed when things don’t go my way, even though I didn’t do anything to make them happen. I forget that the world doesn’t owe me anything.

After obtaining a degree, no matter how small, of success, it’s easy to forget what it was that got you there. For me, I forgot how many risks I had to take to get where I am.

As Mark Twain said:

Don’t go around saying the world owes you a living. The world owes you nothing. It was here first.

OK, enough of the personal. Watch out for a few announcements of the technology type here soon.

Cleaning Out

Wednesday, January 2nd, 2008

Over the course of the last year, I’ve had many ideas for blog posts that, for whatever reason, haven’t been written. Inspired by Tim Bray [1] [2], I’m just gonna dump them all here in one post. I figure a short quip is better than silence. If anyone seems to take interest in them, maybe I’ll write up more thoughts on the subject.

Scalability

Scalability isn’t about growing, it isn’t about getting bigger, it’s about working at various sizes. It’s about making things constant despite changes in N. It’s about making N less relevant.

Redo ‘Friends and Neighbors on the Web’ using XFN

An interesting paper that should be updated to use XFN-based graphs.

Confounding Things and People

People complain that XFN confounds people and URLs. However, people do this in the real world, too: witness vanity license plates. Sometimes the plates refer to the car, sometimes the person, sometimes both, sometimes neither.

Parsing Microformats

I gave a presentation on parsing microformats, but never blogged about it or shared it with the microformats community online. I think there’s some good material in there that I should develop more.