I’ve recently become quite annoyed by the current offerings I’ve found for web server log analysis:
- Urchin, which my webhost uses, generally sucks, mostly because it does a really poor job of interpreting user agent strings.
- AWStats is really difficult to install, at least if you don’t have root on the machine (as is my situation on a shared host).
- Analog seems to suck, too: there’s a Mac port, but it’s quite ugly and not particularly easy to use.
- I’m unaware of any other good options.
For what it’s worth, I actually kinda like AWStats, meaning that it pretty much worked, unlike my experience with Urchin. So, after playing around with all of these, here’s what I’ve decided I want in a web server logfile analysis package:
- The ability to categorize requests based on specific criteria. I’m thinking about the user agent string, but I’m sure I’ll think of other things later that I want to be able to classify.
- The ability to run ad-hoc queries against the data.
- The ability to create graphs of ad-hoc queries.
Are there any open-source projects out there that qualify?
As it is, I’m working on my own implementation, which takes logfiles, parses them, and sticks the results in a MySQL database. In the process, the user agent string is analyzed to classify the agent into different categories. What do I mean by ‘categories’? Well, what I really want is the ability to create a hierarchy like this:
- Browsers
- Aggregators
- Robots
I offer these just to give you an idea of what kind of data I want to have. My first thought was to create a hierarchical taxonomy like this. However, after thinking about it more, it may be more flexible to have a flat taxonomy (like tags), which can then be bundled into groups (del.icio.us bundles are definitely my inspiration on this one). For example, I could mark something with these tags:
MSIE60, WinXP
or these
MSIE55, Win2k
Then I could create bundles like this:
IE->{MSIE60, MSIE55}
Windows->{Win2k, WinXP}
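To make the bundle idea concrete, here is a rough Python sketch of how tags could be rolled up into bundles. The names `BUNDLES` and `bundles_for` are made up for illustration; this is just one way it might look, not what my implementation actually does.

```python
# Hypothetical bundle definitions, mirroring the example above.
BUNDLES = {
    "IE": {"MSIE60", "MSIE55"},
    "Windows": {"Win2k", "WinXP"},
}


def bundles_for(tags):
    """Return the sorted names of every bundle containing at least one of the tags."""
    tag_set = set(tags)
    return sorted(name for name, members in BUNDLES.items() if members & tag_set)


if __name__ == "__main__":
    print(bundles_for(["MSIE60", "WinXP"]))
```

Since a bundle is just a named set of tags, adding a new grouping means adding one entry to the mapping, and the tags themselves never have to change.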
Ideally this would be a rule-based system for classifying user agents, which would mean that adding a new agent would be as simple as adding a new rule for it (rather than writing or rewriting code for it). The goal of all of this is to have an extensible system for converting user agent strings into a format that is more easily processed with SQL statements.
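As a minimal sketch of what "rule-based" could mean here, assume each rule is just a regular expression paired with a tag, and a user agent gets every tag whose pattern matches. The `RULES` list and the `classify` function are hypothetical names, and the patterns are only illustrative; a real rule set would need to be much larger.

```python
import re

# Each rule is (pattern, tag). Supporting a new agent means adding a row here,
# not writing new code. These patterns are illustrative assumptions only.
RULES = [
    (r"MSIE 6\.0", "MSIE60"),
    (r"MSIE 5\.5", "MSIE55"),
    (r"Windows NT 5\.1", "WinXP"),
    (r"Windows NT 5\.0", "Win2k"),
]

# Compile once so that classifying many log lines stays cheap.
COMPILED_RULES = [(re.compile(pattern), tag) for pattern, tag in RULES]


def classify(user_agent):
    """Return the tags of all rules whose pattern occurs in the user agent string."""
    return [tag for pattern, tag in COMPILED_RULES if pattern.search(user_agent)]


if __name__ == "__main__":
    print(classify("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

The resulting tags could then be stored in their own table alongside each request, so that a query only has to match on tag names instead of picking apart the raw user agent string.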
I’m putting this out there because I want some feedback on it. Any thoughts?