Web server log analysis

I’ve recently become quite annoyed with the web server log analysis packages I’ve found:

  • Urchin, which [my webhost] uses, generally sucks, mostly because it does a really poor job of interpreting user agent strings.
  • AWStats is really difficult to install, at least if you don’t have root on a machine (as is my situation using a shared host).
  • Analog seems to suck, too: there’s a Mac port, but it’s quite ugly and not particularly easy to use.
  • I’m unaware of any other good options.

For what it’s worth, I actually kinda like AWStats, meaning that it pretty much worked, unlike my experience with Urchin. So, after playing around with all of these, here’s what I’ve decided I want in a web server logfile analysis package:

  • The ability to categorize requests based on specific criteria. I’m thinking about the user agent string, but I’m sure I’ll think of other things later that I want to be able to classify.
  • The ability to run ad-hoc queries against the data.
  • The ability to create graphs of ad-hoc queries.

Are there any open-source projects out there that qualify?

As it is, I’m working on my own implementation, which takes logfiles, parses them and sticks them in a MySQL database. In the process, the user agent string is analyzed to classify the agent into different categories. What do I mean by ‘categories’? Well, what I really want is the ability to create a hierarchy like this:

  • Browser
    • MSIE
      • 5.5
      • 6.0
    • Webkit
      • Safari
    • Gecko
      • Firebird
        • 1.0
      • Camino
  • Aggregators
    • NetNewsWire
    • Shrook
  • Robots
    • Search Engines
      • Googlebot
    • Blog bots
      • Technibot

I offer these just to give you an idea of what kind of data I want to have. My first thought was to create a hierarchical taxonomy like this. However, after thinking about it more, it may be more flexible to have a flat taxonomy (like tags), which can then be bundled into groups (del.icio.us bundles are definitely my inspiration on this one). For example, I could mark something with these tags:

MSIE60, WinXP

or these

MSIE55, Win2k

Then I could create bundles like this:

IE->{MSIE60, MSIE55}
Windows->{Win2k, WinXP}
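
A minimal sketch of how flat tags and bundles could fit together, in Python purely for illustration. The `BUNDLES` dict and the `bundles_for` helper are hypothetical names, not part of any existing package:

```python
# Hypothetical sketch: flat tags grouped into named bundles.
# The tag and bundle names come from the examples above.
BUNDLES = {
    "IE": {"MSIE60", "MSIE55"},
    "Windows": {"Win2k", "WinXP"},
}


def bundles_for(tags):
    """Return the sorted names of every bundle containing at least one given tag."""
    tags = set(tags)
    return sorted(name for name, members in BUNDLES.items() if members & tags)


print(bundles_for(["MSIE60", "WinXP"]))
print(bundles_for(["MSIE55", "Win2k"]))
```

Since a bundle is just a named set of tags, regrouping the data later means editing the dict and never touching the tagged records.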

Ideally this would use rule-based classification of user agents, which would mean that adding a new agent would be as simple as adding a new rule for it (rather than writing or rewriting code for it). The goal of all of this is to have an extensible system for converting user agent strings into a format which is more easily processable with SQL statements.
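
One way such rules might look, as a rough sketch rather than a finished design. It assumes each rule is a regular expression paired with a tag, and it uses Python's built-in `sqlite3` in place of MySQL so the example runs on its own. The `RULES` list, the `agent_tags` table, and the `classify` and `store` helpers are all made up for illustration:

```python
import re
import sqlite3

# Hypothetical rules: (regex, tag). Supporting a new agent means adding a row.
RULES = [
    (r"MSIE 6\.0", "MSIE60"),
    (r"MSIE 5\.5", "MSIE55"),
    (r"Windows NT 5\.1", "WinXP"),
    (r"Windows NT 5\.0", "Win2k"),
    (r"Googlebot", "Googlebot"),
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in RULES]


def classify(user_agent):
    """Return the tags of every rule that matches the user agent string."""
    return [tag for pattern, tag in COMPILED if pattern.search(user_agent)]


def store(conn, user_agent):
    """Insert one (agent, tag) row per matching tag."""
    conn.executemany(
        "INSERT INTO agent_tags (agent, tag) VALUES (?, ?)",
        [(user_agent, tag) for tag in classify(user_agent)],
    )


# In-memory SQLite stands in for the MySQL database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_tags (agent TEXT, tag TEXT)")
store(conn, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
store(conn, "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)")

# Ad-hoc query: how many distinct agents carry each tag?
rows = conn.execute(
    "SELECT tag, COUNT(DISTINCT agent) FROM agent_tags GROUP BY tag ORDER BY tag"
).fetchall()
print(rows)
```

With one row per agent and tag, questions like "how many hits came from IE on Windows?" become plain joins and GROUP BY queries.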

I’m putting this out there, because I want some feedback on this. Any thoughts?

4 Responses to “Web server log analysis”

  1. James Says:

    That sounds like it could be very useful, particularly if the packages were simple to define. I use webalizer right now, but there are various limitations of it that I’m getting frustrated with.

    My first stop when checking stats is usually the referrer logs. It’d be good to have an easier method of removing referrer spammers, which a database-backed stats system would definitely help with, and I’d love to have an easier way of grouping referrers (domain, partial URL).

    If you want a hand testing, let me know.

  2. ryan Says:

    >That sounds like it could be very useful, particularly if the packages were
    > simple to define. I use webalizer right now, but there are various limitations
    > of it that I’m getting frustrated with.

    You feel my pain, then.

    >My first stop when checking stats is usually the referrer logs. It’d be good to
    >have an easier method of removing referrer spammers, which a database-
    > backed stats system would definitely help with, and I’d love to have an
    >easier way of grouping referrers (domain, partial URL).

    Yeah. I also plan on trying to run some heuristics while processing the logs. (At this point I’m only planning on dealing with logfiles in batch mode from the CLI.) Here are the heuristics I think would be useful:

    * keyword blacklist
    * domain blacklist
    * going and checking the page to see if it actually links to me (this might be sufficient in and of itself, but maybe not).
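
    A rough sketch of the first two heuristics, assuming simple in-memory blacklists. The blacklist entries and the `is_spam_referrer` name are placeholders, and the third check (fetching the page) is left out because it needs network access:

```python
from urllib.parse import urlparse

# Hypothetical blacklists; real ones would be loaded from a file or a table.
KEYWORD_BLACKLIST = {"casino", "poker"}
DOMAIN_BLACKLIST = {"spam.example"}


def is_spam_referrer(referrer):
    """Return True if the referrer URL hits the domain or keyword blacklist."""
    host = (urlparse(referrer).hostname or "").lower()
    # Match the blacklisted domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in DOMAIN_BLACKLIST):
        return True
    lowered = referrer.lower()
    return any(word in lowered for word in KEYWORD_BLACKLIST)


print(is_spam_referrer("http://www.spam.example/page"))
print(is_spam_referrer("http://example.org/post/42"))
```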

    > If you want a hand testing, let me know.

    Will do.

  3. Pete Prodoehl Says:

    You know, I just posted about this very issue today. (See Web Site Reporting.)

    As far as Analog, are you using the command line version for the Mac? (Meaning not the version you stick in your /Applications folder, but the one you run via the terminal.) You can really get complex in configuring Analog, and I’d guess it could do what you want if you were willing to keep tweaking and use the mailing list/newsgroups for help.

  4. ryan Says:

    I was trying the /Applications version and it thoroughly sucked.

    I have a feeling that I could get one of the stats packages to work if I worked really hard at it, but I think they all lack extensibility.