Tags and Term extraction

Sunday, April 10th, 2005

Recently, Yahoo released a new web service that lets you send Y! a bit of text and get back the most significant terms in it. It seems pretty obvious that they’re using something along the lines of TF-IDF (term frequency-inverse document frequency), which means they’re returning the most ‘statistically significant’ terms.
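To make that guess concrete, here is a minimal sketch of what TF-IDF-style term extraction could look like in Python. This is not Yahoo’s actual implementation, which I have not seen. The function names `tokenize` and `top_terms`, the smoothed IDF formula, and the toy background corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of background documents (strings) used to estimate
    how common each term is. Which corpus to use is an assumption; a real
    service would presumably use a very large one.
    """
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)

    # Document frequency: how many corpus documents contain each term.
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))

    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        tf = count / len(doc_tokens)
        # Smoothed IDF (an assumed variant) so unseen terms never divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    background = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the bird sat on the wire",
    ]
    print(top_terms("the cat chased the cat toy", background, k=3))
    # prints ['cat', 'chased', 'toy']
```

Note that “the” appears as often as “cat” in the sample text but ranks last, because it shows up in every background document. The only inputs are the text itself and a background corpus, so anyone holding the same text can run this kind of extraction.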

Jonas Luster, international man of mystery and Wordpress Inc* employee #1, has created a WordPress plugin that uses the term extraction service to add Technorati tags to his posts. It’s a really cool feature, but misguided.

First of all, I think it is functionally unnecessary. The plugin isn’t adding anything in terms of value or content. Why? Because any consumer of the content could do the exact same term extraction. In other words, these tags are noise, which leads me to my second point.

Secondly, even Jonas himself worries that this method could dilute or even pollute the Technorati tag ecosystem. I think it’s obvious from the tags the plugin is generating on his blog that the extracted terms are not nearly as useful as his own tags. For example, is this post really either beautiful or about beautiful? Surprisingly, Dave Sifry, CEO of Technorati, is excited about the new plugin.

A core problem with this approach to classifying text is that the method is text-only. It seems to me that this is the same problem that pre-Google search engines had: they didn’t consider any data outside the text. Google came along and used the link graph as a signal for relevance ranking. I like tagging because it is more ‘data outside the text’ that can be exploited.

Third, I thought the whole idea of tags was to have human-generated metadata. Am I wrong? Seriously, collaborative tagging (a la del.icio.us and Flickr) is a revolution in community-created metadata. Likewise, Technorati tagging is a significant improvement in author-created metadata.

This brings me back to my first point: there’s nothing meta about machine-generated keywords (especially if others can use a similar generator), because it’s information derived directly from the data.

Jonas admits that this is really just an experiment:

Indeed, Yahoo! Terms are not how we see the world, they’re how Yahoo!, a machine, sees it. Let’s leave them in, for a while, and see what pans out :)

Jonas: I want to know how you see the world. I’m tired of hearing how Yahoo and company view the world, and if I care about how they view it, I’ll ask them. Oh, and continue the experiment, but with all due respect, I hope it fails. :-)

* Or is it Wordpress Foundation now?