Antilog

Antilog is a Python module for reading and querying Apache log files.

The current version is not very sophisticated (yet), but it can grow over time. One possible addition would be reading a series of log files and presenting the data much as it does now, but for multiple days. I'm not sure how efficient that would be with memory...
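On the memory question, here is a rough sketch of one way several log files could be processed without holding every record at once. It is not part of antilog: the names iter_lines and count_urls are made up for illustration, and it assumes the Apache common or combined log format, where the request is the first double-quoted field.

```python
from collections import Counter


def iter_lines(paths):
    """Yield lines from each log file in turn, one line at a time."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                yield line.rstrip("\n")


def count_urls(lines):
    """Count requested URLs (hypothetical helper, not antilog's API)."""
    counts = Counter()
    for line in lines:
        # The request is the first quoted field: "GET /path HTTP/1.1"
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a line we understand; skip it
        request = parts[1].split()
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts
```

With this approach, memory grows with the number of distinct URLs rather than with the number of records, at the cost of not keeping the individual entries around for later queries.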

Usage is simple. Here's a quick example:

Create an instance of RefLogReader:

>>> import antilog
>>> reflog = antilog.RefLogReader('access.log')

How many records did it read?

>>> len(reflog.data)
8033

What are the top 5 files requested?

>>> d = reflog.get_top_n('url', 5)
>>> reflog.pprint(d)
1281  /weblog/totm.rss
 854  /snakeheader.jpg
 620  /images/sponsorme.png
 561  /weblog/new.css
 518  /weblog/valid-rss-bbulger.png

(Two important methods here: get_top_n, which gets the top N for a given field, and pprint, which displays results in a nice format.)
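For readers curious how a top-N query and the aligned display might work, here is a minimal sketch. It is not antilog's actual code: the Entry record, top_n and format_counts are hypothetical stand-ins, and the only things taken from this document are that entries expose fields as attributes and that results are shown as a right-aligned count followed by the value.

```python
from collections import Counter, namedtuple

# Hypothetical record type; antilog's real entry class may differ.
Entry = namedtuple("Entry", "url referrer return_code")


def top_n(entries, field, n):
    """Return the n most common values of `field` as (value, count) pairs."""
    counts = Counter(getattr(e, field) for e in entries)
    return counts.most_common(n)


def format_counts(pairs):
    """Format (value, count) pairs with right-aligned counts."""
    if not pairs:
        return ""
    width = max(len(str(count)) for _, count in pairs)
    return "\n".join(f"{count:>{width}}  {value}" for value, count in pairs)


if __name__ == "__main__":
    sample = [Entry("/a", "-", "200")] * 3 + [Entry("/b", "-", "200")] * 12
    print(format_counts(top_n(sample, "url", 5)))
```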

In much the same way, we can get the top 5 referrers:

>>> d = reflog.get_top_n('referrer', 5)
>>> reflog.pprint(d)
554  http://www.zephyrfalcon.org/weblog/
542  http://zephyrfalcon.org/weblog/arch_d7_2003_11_22.html
374  http://zephyrfalcon.org/weblog/arch_d7_2003_11_29.html
359  http://zephyrfalcon.org/weblog/
185  http://zephyrfalcon.org/weblog/index.html

However, this includes "referrals" from my own site... probably not what I want to see. To fix this, I can pass a function that filters unwanted URLs:

>>> def isnotlocal(url):
...     if url.startswith("http://"):
...         url = url[7:]
...     return not (url.startswith("zephyrfalcon.org")
...                 or url.startswith("www.zephyrfalcon.org"))
...
>>> d = reflog.get_top_n('referrer', 5, filterfunc=isnotlocal)
>>> reflog.pprint(d)
124  http://www.pythonware.com/daily/
 30  http://angra3594.fc2web.com/index.html
 29  http://www.cafeconleche.org/
 20  http://www.ibiblio.org/xml/
 17  http://www.google.com/search?q=torrents&hl=en&lr=&ie=UTF-8...

Much better. Also note the get_unique_values method, which is the basis for get_top_n. It returns a list of (value, frequency) tuples for a given field. For example, return codes:

>>> reflog.get_unique_values('return_code')
[('200', 6134), ('206', 13), ('301', 5), ('304', 1706), ('404', 167),
('403', 1), ('401', 7)]

6134 requests were met with response code 200. 167 yielded a 404, etc. Note that this information can be used to create our own result set to pass to get_top_n:

>>> invalid = [e for e in reflog.data if e.return_code.startswith('4')]
>>> len(invalid)
175

We now have a list of invalid requests. Let's pass it to get_top_n to inspect it:

>>> d = reflog.get_top_n('url', 5, entries=invalid)
>>> reflog.pprint(d)
144  /favicon.ico
  6  /stats/
  3  /weblog/arch_d7_2003_11_15.htm
  3  /download/)
  2  /weblog/arch_d7_2003_11_22.htm

Most of the invalid requests were for /favicon.ico. (I think I've fixed that problem since. :-)
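The pieces used in this walkthrough (unique values, a filter function, and a custom list of entries) can be layered roughly as follows. This is only a sketch under assumptions, not antilog's implementation: Entry, unique_values and top_n are hypothetical names, and the tie-breaking rule is my own choice to make the output deterministic.

```python
from collections import Counter, namedtuple

# Hypothetical record type; antilog's real entry class may differ.
Entry = namedtuple("Entry", "url referrer return_code")


def unique_values(entries, field):
    """Return a list of (value, frequency) tuples for `field`."""
    return list(Counter(getattr(e, field) for e in entries).items())


def top_n(entries, field, n, filterfunc=None):
    """Top n (value, frequency) pairs, optionally keeping only the
    values for which filterfunc(value) is true."""
    pairs = unique_values(entries, field)
    if filterfunc is not None:
        pairs = [(v, c) for v, c in pairs if filterfunc(v)]
    # Highest frequency first; ties are broken alphabetically by value.
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs[:n]


if __name__ == "__main__":
    log = [Entry("/favicon.ico", "-", "404")] * 3 + [Entry("/", "-", "200")]
    # Build a custom result set first, then query it.
    invalid = [e for e in log if e.return_code.startswith("4")]
    print(top_n(invalid, "url", 5))
```

Because the query function takes the entry list as an argument, any pre-filtered subset can be inspected the same way as the full log.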

Despite its limitations, I find antilog quite useful (as far as inspecting a daily log file goes). Suggestions for more useful methods are always welcome.