Tags and Term extraction

Recently, Yahoo released a new web service feature that lets you send Y! a bit of text and get back the most significant terms in that text. It seems pretty obvious that they’re using something along the lines of TF-IDF, which means they’re returning the most ‘statistically significant’ terms.
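To make the TF-IDF idea concrete, here is a minimal sketch in Python. It is only a guess at the general technique, not Yahoo’s actual algorithm (which isn’t public as far as I know). The function names `tokenize` and `top_terms` and the tiny background corpus are made up for illustration. A term scores high when it is frequent in the document but rare in the background corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(document, corpus, k=3):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of background documents (plain strings).
    This is an illustrative sketch, not any service's real implementation.
    """
    background = [set(tokenize(d)) for d in corpus]
    n_docs = len(background) + 1  # background documents plus this one

    counts = Counter(tokenize(document))
    total = sum(counts.values())

    scores = {}
    for term, count in counts.items():
        # Document frequency: this document plus every background doc with the term.
        df = 1 + sum(1 for doc in background if term in doc)
        tf = count / total
        idf = math.log(n_docs / df)
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that the only inputs are the text itself and a background corpus, so anyone holding the same text can compute the same terms.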

Jonas Luster, international man of mystery and WordPress Inc* employee #1, has created a WordPress plugin which uses the term extraction service to add Technorati tags to his posts. It’s a really cool feature, but misguided.

First of all, I think it is functionally unnecessary. The plugin isn’t adding anything in terms of value or content. Why? Because any consumer of the content could do the exact same term extraction. In other words, these tags are noise, which leads me to my second point.

Secondly, even Jonas himself worries that this method could dilute or even pollute the Technorati tag ecosystem. I think it’s obvious from the tags the plugin is generating on his blog that the extracted terms are not nearly as useful as his own tags. For example, is this post really either beautiful or about beautiful? Surprisingly, Dave Sifry, CEO of Technorati, is excited about the new plugin.

A core problem with this approach to classifying text is that the method is text-only. It seems to me that this is the same problem that pre-Google search engines had- they didn’t consider any data outside the text. Google came along and used link graphing as a method for relevance ranking. I like tagging because it is more ‘data outside the text’ that can be exploited.

Third, I thought the whole idea of tags was to have human-generated metadata. Am I wrong? Seriously, collaborative tagging (a la del.icio.us and Flickr) is a revolution in community-created metadata. Likewise, Technorati tagging is a significant improvement in author-created metadata.

This brings me back to my first point- there’s nothing meta about machine-generated keywords (especially if others can use a similar generator) because it’s information derived directly from the data.

Jonas admits that this is really just an experiment:

Indeed, Yahoo! Terms are not how we see the world, they’re how Yahoo!, a machine, sees it. Let’s leave them in, for a while, and see what pans out :)

Jonas: I want to know how you see the world. I’m tired of hearing how Yahoo and company view the world, and if I care about how they view it, I’ll ask them. Oh, and continue the experiment, but with all due respect, I hope it fails. :-)

* Or is it WordPress Foundation now?

6 Responses to “Tags and Term extraction”

  1. jluster.org’s webvergnügen » Maybe a solution… Says:

    [...] » Maybe a solution… { 10 Apr 2005 01:48 pm } Ryan King has some valid points. We disagree on the purpose of tagging, [...]

  2. Denis de Bernardy Says:

    I think there definitely is something to do with the plugin. But I would reprocess the results a bit rather than using them raw.

    I see one use, too: Given a properly extracted keyword set, that I hope I’ll get from my keyword autoextract project, there would be room to create an amazon autoextract plugin, for instance.

  3. ryan Says:

    Thanks for stopping by…

    I agree that the plugin is useful and interesting. The point I was trying to make is that the data that term extraction gives us is different from tags and therefore should not be mixed.

    Note, Jonas has updated his solution based on my criticisms and I would call it progress.

  4. the laboratory » Blog Archive » Tags vs. Yahoo Term Extraction Says:

    [...] plugin lets you tag your posts with terms extracted via the new service. I found this via Ryan King who has some pretty good points to make why this isn’t [...]

  5. Anton’s Stuff » Blog Archive » Yahoo Terms Extraction Says:

    [...] hed plugin and see about using it on my domains. Another article about the Y! service at: http://theryanking.com/blog/archives/2005/04/10/tags-and-term-extraction/ [...]

  6. the ryan king » Taking the Folk out of Folksonomy Says:

    [...] eir algorithms. Auto-tagging is not folkonomy. Jonas and I have been over this, remember? This post pretty much sums up my arguments against auto-tagging, though [...]