
Parsing httpd access logs

Introduction

This post is inspired by a reddit post I responded to on the r/sysadmin subreddit that has since been deleted. A junior administrator was asking for help writing an awk script to parse httpd access.log files, invoked as:

script.awk access.log access.log.1 # etc..

The script needed to output the top 10 ip addresses and sites by traffic in KB, as well as the top 10 by number of connections. I tossed the reddit user a quick perl script, as I'm not as familiar with awk, then decided I might as well write it in the handful of languages I tend to end up using daily.

Each of the following sections walks through a script that produces output like this:

Top 10 traffic (KB)
-------------------
1	3175	192.168.2.16
2	671	192.168.2.189

Top 10 connections
------------------
1	4	192.168.2.16
2	1	192.168.2.189

Top 10 website traffic (KB)
---------------------------
1	1606	http://safebrowsing.clients.google.com/safebrowsing/downloads?
2	879	http://62.67.184.68/eset_eval/update.ver
3	690	http://tracker.tfile.me/announce.php?
4	671	http://www.viewnetcam.com/registration/bin/ip6_update.php

Top 10 website connections
--------------------------
1	2	http://safebrowsing.clients.google.com/safebrowsing/downloads?
2	1	http://tracker.tfile.me/announce.php?
3	1	http://www.viewnetcam.com/registration/bin/ip6_update.php
4	1	http://62.67.184.68/eset_eval/update.ver

The file being parsed is in the following format:

1321331001.341 690 192.168.2.16 TCP_MISS/200 1522 GET http://tracker.tfile.me/announce.php? - DIRECT/89.188.122.214 text/plain
1321331075.503 800 192.168.2.16 TCP_MISS/200 1502 POST http://safebrowsing.clients.google.com/safebrowsing/downloads? - DIRECT/72.14.204.100 application/vnd.google.safebrowsing-update
1321332020.993 671 192.168.2.189 TCP_MISS/200 331 POST http://www.viewnetcam.com/registration/bin/ip6_update.php - DIRECT/204.236.228.62 text/html
1321332047.331 879 192.168.2.16 TCP_MISS/200 3974 GET http://62.67.184.68/eset_eval/update.ver - DIRECT/62.67.184.68 application/octet-stream
1321332812.684 806 192.168.2.16 TCP_MISS/200 1645 POST http://safebrowsing.clients.google.com/safebrowsing/downloads? - DIRECT/72.14.204.102 application/vnd.google.safebrowsing-update

The second column is the size of the connection in KB, the third column is the source ip, and the seventh column is the requested website. Please note that the code in this article may not work with your access logs, as the format tends to differ from system to system; this is simply the example format I will be parsing against.
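Before writing anything bigger, it's easy to sanity-check those column positions with a quick throwaway one-liner; for example, with awk (the first language up):

awk '{ print $2, $3, $7 }' access.log
# 690 192.168.2.16 http://tracker.tfile.me/announce.php?
# 800 192.168.2.16 http://safebrowsing.clients.google.com/safebrowsing/downloads?
# ...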

Parsing httpd access logs with awk

Writing this script with awk wasn't too difficult, as this is pretty much what awk was created for. The catch is the reverse sort required after summarizing the results. Only gawk (GNU awk) has built-in support for sorting, and since I wasn't interested in writing my own sort, the script I wrote requires gawk; other awk implementations would force you to roll your own sort. Also worth noting: awk arrays are associative arrays, so they behave essentially like hash tables. See the code below:

#!/usr/bin/gawk -f
# Requires:
#     GNU awk (gawk)
# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.awk access.log access.log.1
#    parse-access-logs.awk access.log.*

function print_top_10(results,    sorted, n, count, i, j){
    # prints top 10 results
    # Args:
    #   results: associative array of results
    # (the extra parameters above are awk-style function-local variables)

    count = 1
    n = asort(results, sorted, "@val_num_desc")
    for (i = 1; i <= n; i++){
        if ( count > 10 ){
            break
        }
        for (j in results){
            if ( results[j] == sorted[i] ){
                print count "\t" results[j] "\t" j
                delete results[j]
                count++
                break
            }
        }
    }
    print ""
}

!/^$/{
    top_10_traffic[$3]            += $2;
    top_10_ip_connected[$3]       += 1;
    top_10_websites_traffic[$7]   += $2;
    top_10_websites_connected[$7] += 1;
}

END{
    print "Top 10 traffic (KB)"
    print "-------------------"
    print_top_10(top_10_traffic)
    print "Top 10 connections"
    print "------------------"
    print_top_10(top_10_ip_connected)
    print "Top 10 website traffic (KB)"
    print "---------------------------"
    print_top_10(top_10_websites_traffic)
    print "Top 10 website connections"
    print "--------------------------"
    print_top_10(top_10_websites_connected)
}

First, the !/^$/ pattern on the main rule skips blank lines before any processing happens. For each remaining line, we update our associative arrays as required: for traffic, we add the size read from the access.log line to the total stored under that key; for connections, we simply increment the stored value by 1. In the END block we print our headers and call the print_top_10 function, handing it each associative array of stored results to sort and print the top 10.

The print_top_10 function relies on GNU awk's asort extension to sort the values of the associative array it receives into a new, integer-indexed array. asort returns the number of elements, and we walk the sorted array by index since plain for-in traversal order is undefined. For each sorted value we then search for a matching key in the original associative array, a simple nested loop / brute force technique. After finding a match, we print it, delete it from the original associative array, and break back to the outer loop to continue with the next sorted value. The delete is necessary because duplicate values may exist in the original array; without it, we would reprint the previous line instead of moving on to the next key holding the same value. Lastly, the outer loop stops once ten results have been printed. Repeat for each associative array.
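As an aside, newer gawk (4.0+) can skip the brute-force scan entirely: setting PROCINFO["sorted_in"] controls the order of for-in traversal, so the function collapses to a single loop. A sketch of that alternative:

function print_top_10(results,    count, k){
    # gawk 4.0+: make for-in visit entries by value, descending
    PROCINFO["sorted_in"] = "@val_num_desc"
    count = 1
    for (k in results){
        print count "\t" results[k] "\t" k
        if (count++ >= 10)
            break
    }
    print ""
}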

Parsing httpd access logs with perl

The perl version of this script was relatively simple, as expected. It's been a while since I sorted a hash in perl so I had to quickly look it up; luckily, it's relatively expressive and straightforward. The code is very similar to the gawk version, except that perl has a more robust sort builtin than gawk, as well as more powerful support for hashes and arrays. Nevertheless, the code looks very similar:

#!/usr/bin/env perl
use warnings;
use strict;
use v5.10;

# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.pl access.log access.log.1
#    parse-access-logs.pl access.log.*

sub print_top_10{
    # prints top 10 results
    # Args:
    #    hashref: hashref of counted/summed connections/traffic
    
    my $hr = shift;
    my @sorted_keys = sort {$hr->{$b} <=> $hr->{$a} } keys %{ $hr } ;
    my $count = 0;
    for my $k (@sorted_keys[0..9]){
        last unless defined $k;  # the slice pads short arrays with undef
        say join("\t", $count+=1, $hr->{$k}, $k);
    }
    say '';
}

my %top_10_traffic;
my %top_10_ip_connected;
my %top_10_websites_traffic;
my %top_10_websites_connected;

while (<>){
    chomp;
    next if /^$/;
    my @F = split;
    $top_10_traffic{$F[2]}            += $F[1];
    $top_10_ip_connected{$F[2]}       += 1;
    $top_10_websites_traffic{$F[6]}   += $F[1];
    $top_10_websites_connected{$F[6]} += 1;
};

say 'Top 10 traffic (KB)';
say '-------------------';
print_top_10 \%top_10_traffic;
say 'Top 10 connections';
say '------------------';
print_top_10 \%top_10_ip_connected;
say 'Top 10 website traffic (KB)';
say '---------------------------';
print_top_10 \%top_10_websites_traffic;
say 'Top 10 website connections';
say '--------------------------';
print_top_10 \%top_10_websites_connected;

With strict enabled, we need to pre-declare our variables. Unlike awk, we have to explicitly filter blank lines with next if /^$/ and manually split each line with my @F = split. By default, perl's split breaks a string on runs of whitespace, much like the /\s+/ regular expression (leading whitespace is also ignored). The while loop creates new keys in our hashes and increments their values as required. The print_top_10 function does the same job as the awk version but works a bit differently: we sort the keys of the hashref passed to the function by comparing their associated values, $hr->{$b} <=> $hr->{$a}, and store them in the @sorted_keys array. The slice @sorted_keys[0..9] pads short arrays with undef (demonstrated after the second listing below), so we break out of the loop (last) as soon as the key ($k) is undefined. We then print the results, joined with the \t character. The script can look more awk-like with strict and warnings disabled. I don't recommend this, since perl code gets quite buggy without strict, but here's a shorter version that skips pre-declaration with my and all the safe stuff we don't need ;).

#!/usr/bin/env perl
use v5.10;

# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.pl access.log access.log.1
#    parse-access-logs.pl access.log.*

sub print_top_10{
    # prints top 10 results
    # Args:
    #    hashref: hashref of counted/summed connections/traffic
    
    my $hr = shift;
    my @sorted_keys = sort {$hr->{$b} <=> $hr->{$a} } keys %{ $hr } ;
    my $count = 0;
    for my $k (@sorted_keys[0..9]){
        last unless defined $k;  # the slice pads short arrays with undef
        say join("\t", $count+=1, $hr->{$k}, $k);
    }
    say '';
}

while (<>){
    chomp;
    next if /^$/;
    my @F = split;
    $top_10_traffic{$F[2]}            += $F[1];
    $top_10_ip_connected{$F[2]}       += 1;
    $top_10_websites_traffic{$F[6]}   += $F[1];
    $top_10_websites_connected{$F[6]} += 1;
};

say 'Top 10 traffic (KB)';
say '-------------------';
print_top_10 \%top_10_traffic;
say 'Top 10 connections';
say '------------------';
print_top_10 \%top_10_ip_connected;
say 'Top 10 website traffic (KB)';
say '---------------------------';
print_top_10 \%top_10_websites_traffic;
say 'Top 10 website connections';
say '--------------------------';
print_top_10 \%top_10_websites_connected;
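
One subtlety shared by both versions: the slice @sorted_keys[0..9] happily runs past the end of a shorter array and pads the result with undef, which is why the loop checks defined before printing. A tiny standalone demonstration:

#!/usr/bin/env perl
use warnings;
use strict;
use v5.10;

my @keys = ('192.168.2.16', '192.168.2.189');
my @top  = @keys[0..9];              # slicing past the end pads with undef
say scalar @top;                     # 10
say scalar grep { defined } @top;    # 2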

Parsing httpd access logs with ruby

I don't typically use ruby except on the command line to parse YAML and JSON; it's rare that I pull it out of my toolkit since perl5 usually fits my needs and is installed on virtually every unix and unix-like operating system. Since ruby is so similar to perl, I figured I might as well quickly slap together a ruby version. It looks similar to the perl version except that we can't increment our hash values until they've been set. There are a few ways to handle this, but I figured I'd just catch the exceptions (a cleaner alternative with Hash.new(0) appears at the end of this section). See the code below:

#!/usr/bin/env ruby
# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.rb access.log access.log.1
#    parse-access-logs.rb access.log.*

def print_top_10(results)
    # prints top 10 results
    # Args:
    #    results: hash of counted/summed connections/traffic
    sorted = results.sort_by{|k, v| v}.reverse
    count = 0    
    sorted.each do |k, v|
        break if count == 10  # ruby uses break, not perl's last
        puts "#{count+=1}\t#{v}\t#{k}"
    end
    puts
end
    
top_10_traffic = Hash.new
top_10_ip_connected = Hash.new
top_10_websites_traffic = Hash.new
top_10_websites_connected = Hash.new
ARGF.readlines.each do |line|
    next if line =~ /^$/
    f = line.split()
    begin
        top_10_traffic[f[2]] += f[1].to_i
    rescue NoMethodError
        top_10_traffic[f[2]] = f[1].to_i
    end
    begin
        top_10_ip_connected[f[2]] += 1
    rescue NoMethodError
        top_10_ip_connected[f[2]] = 1
    end
    begin
        top_10_websites_traffic[f[6]] += f[1].to_i
    rescue NoMethodError
        top_10_websites_traffic[f[6]] = f[1].to_i
    end
    begin
        top_10_websites_connected[f[6]] += 1
    rescue NoMethodError
        top_10_websites_connected[f[6]] = 1
    end
end

puts 'Top 10 traffic (KB)'
puts '-------------------'
print_top_10 top_10_traffic
puts 'Top 10 connections'
puts '------------------'
print_top_10 top_10_ip_connected
puts 'Top 10 website traffic (KB)'
puts '---------------------------'
print_top_10 top_10_websites_traffic
puts 'Top 10 website connections'
puts '--------------------------'
print_top_10 top_10_websites_connected

Some things to note here. Since we're parsing a file, the lines are read in as strings; ruby will not auto-cast the strings to integers, so without a manual cast via .to_i the += would concatenate strings rather than sum numbers. The split line is stored in the variable f instead of F, since ruby treats capitalized names as constants and would complain. Additionally, ruby can sort a hash by its values with results.sort_by{|k, v| v}.reverse; note that this returns an array of [key, value] pairs rather than a new hash. This is really cool but also a little weird :).
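For what it's worth, ruby also has a cleaner escape from the exception handling: Hash.new(0) creates a hash whose missing keys read as 0, so the increments work on first touch. A trimmed sketch of the counting loop using that approach (the printing code stays the same):

#!/usr/bin/env ruby
# counting with default-valued hashes instead of rescue blocks
top_10_traffic            = Hash.new(0)  # missing keys default to 0
top_10_ip_connected       = Hash.new(0)
top_10_websites_traffic   = Hash.new(0)
top_10_websites_connected = Hash.new(0)

ARGF.each_line do |line|
  next if line.strip.empty?
  f = line.split
  top_10_traffic[f[2]]            += f[1].to_i
  top_10_ip_connected[f[2]]       += 1
  top_10_websites_traffic[f[6]]   += f[1].to_i
  top_10_websites_connected[f[6]] += 1
end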

Parsing httpd access logs with bash

The shell script (bash) version was extremely challenging. I didn't want to use bash's associative arrays (added in bash 4) as they aren't very portable. Instead, we use the concept of a db variable: a newline-delimited blob of "value key" records that we search and rewrite on every update. This was really tricky given the need for subshells and variable exports…

#!/bin/bash
# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.sh access.log access.log.1
#    parse-access-logs.sh access.log.*

mod_db () {
    # modify space delimited db
    # Args:
    #     stream db: newline delimited db records in
    #                "value key" format, e.g.
    #                432 192.168.0.1
    #     str  k: key to insert or update
    #     int  v: value to add for that key
    # Returns:
    #     stream db: updated db stream
    db="$1"
    k="$2"
    v="$3"

    found=0
    while read record
    do
        record_k=$(echo "${record}" | cut -d" " -f2)
        record_v=$(echo "${record}" | cut -d" " -f1)
        [ "${record_k}" == "${k}" ] \
            && found=1 \
            && break
    done <<EOF
${db}
EOF

    regex=$(echo "${k}" | sed 's/\./\\./g') # escape dots so grep treats them literally
    [ ${found} -eq 1 ] \
        && sum_v=$(( $record_v + $v )) \
        && export db=$(echo "${db}" | grep -v " ${regex}$") \
        && export db=$(echo "${db}"; echo ""; echo "${sum_v} ${k}"; echo "")

    [ ${found} -eq 0 ] \
        && export db=$(echo "${db}"; echo ""; echo "${v} ${k}"; echo "")

    echo "${db}" | grep -v '^$'
}

while read line
do
    echo "${line}" | grep '^$' && continue    
    top_10_traffic=$(mod_db "${top_10_traffic}" \
                   $(echo "${line}" | cut -d" " -f3) \
                   $(echo "${line}" | cut -d" " -f2) \
    )
    top_10_ip_connected=$(mod_db "${top_10_ip_connected}" \
                        $(echo "${line}" | cut -d" " -f3) \
                        1 \
    )
    top_10_websites_traffic=$(mod_db "${top_10_websites_traffic}" \
                            $(echo "${line}" | cut -d" " -f7) \
                            $(echo "${line}" | cut -d" " -f2) \
    )
    top_10_websites_connected=$(mod_db "${top_10_websites_connected}" \
                              $(echo "${line}" | cut -d" " -f7) \
                              1 \
    )
done < "${1:-/dev/stdin}"

echo 'Top 10 traffic (KB)'
echo '-------------------'
echo "${top_10_traffic}" \
    | sort -n -r | nl | head -n 10 | sed 's/^ *//' | sed 's/ /\t/g'
echo ""
echo 'Top 10 connections'
echo '------------------'
echo "${top_10_ip_connected}" \
    | sort -n -r | nl | head -n 10 | sed 's/^ *//' | sed 's/ /\t/g'
echo ""
echo 'Top 10 website traffic (KB)'
echo '---------------------------'
echo "${top_10_websites_traffic}" \
    | sort -n -r | nl | head -n 10 | sed 's/^ *//' | sed 's/ /\t/g'
echo ""
echo 'Top 10 website connections'
echo '--------------------------'
echo "${top_10_websites_connected}" \
    | sort -n -r | nl | head -n 10 | sed 's/^ *//' | sed 's/ /\t/g'
echo ""

I won't go into detail on how this one works since it's slow and awful. It's ksh93-compatible though!
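As a footnote, the connection counts on their own never needed any of this db machinery; a standard coreutils pipeline gets there directly (it's the running traffic sums that force the gymnastics above):

# top 10 connections by source ip, no state required
cut -d' ' -f3 access.log* | grep -v '^$' | sort | uniq -c \
    | sort -rn | head -n 10 | nl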

Parsing httpd access logs with python

Lastly, as this is blog.parseltongue.io, here is the python version. While I would typically jump to perl for something like this, this wasn’t very hard to achieve with python. This code is python2.7 and python 3.x compatible. See the code below:

#!/usr/bin/env python3
# Description:
#     Gets top 10 ip and website traffic results
# Examples:
#    parse-access-logs.py access.log access.log.1
#    parse-access-logs.py access.log.*
import fileinput
from re import match

def print_top_10(results):
    # prints top 10 results
    # Args:
    #    results: dictionary of results

    count = 1
    for l in sorted([(v,k) for (k,v) in results.items()], reverse=True):
        if count > 10: break
        print("{0}\t{1}\t{2}".format(count,l[0],l[1]))
        count+=1
    print("")

if __name__ == '__main__':
    top_10_traffic = dict()
    top_10_ip_connected = dict()
    top_10_websites_traffic = dict()
    top_10_websites_connected = dict()

    for line in fileinput.input():
        if match('^$',line): continue
        f = line.split()
        try:
            top_10_traffic[f[2]] += int(f[1])
        except KeyError:
            top_10_traffic[f[2]] = int(f[1])
    
        try:
            top_10_ip_connected[f[2]] += 1
        except KeyError:
            top_10_ip_connected[f[2]] = 1
    
        try:
            top_10_websites_traffic[f[6]] += int(f[1])
        except KeyError:
            top_10_websites_traffic[f[6]] = int(f[1])
    
        try:
            top_10_websites_connected[f[6]] += 1
        except KeyError:
            top_10_websites_connected[f[6]] = 1
    
    print('Top 10 traffic (KB)')
    print('-------------------')
    print_top_10(top_10_traffic)
    print('Top 10 connections')
    print('------------------')
    print_top_10(top_10_ip_connected)
    print('Top 10 website traffic (KB)')
    print('---------------------------')
    print_top_10(top_10_websites_traffic)
    print('Top 10 website connections')
    print('--------------------------')
    print_top_10(top_10_websites_connected)

We import fileinput in order to get the behaviour of the diamond operator <> in perl or ARGF in ruby: it allows one to pass files or streams to the script. We skip blank lines by checking whether the line matches the ^$ regex, as in all the other versions of this script. A regex isn't strictly required in python or ruby; since each line still carries its trailing newline, a check like if not line.strip() would also work, but I tend to reach for ^$ out of habit… As with the ruby version, I lazily rely on try and except to initialize and increment the dictionary values. As with the other versions, we call the print_top_10 function and pass it our dictionaries. The function reverse-sorts the dictionary passed to it using python's sorted function to create a new list (it could have been a tuple) of (value, key) tuples, then prints the top 10 results, breaking out of the loop once ten have been printed.
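
Finally, the try/except dance isn't strictly necessary in python either. collections.defaultdict(int) returns 0 for missing keys, so the increments work on first touch; a trimmed sketch of the counting loop under that approach (the printing code stays the same):

import fileinput
from collections import defaultdict

# missing keys read as 0, so no try/except is needed
top_10_traffic = defaultdict(int)
top_10_ip_connected = defaultdict(int)
top_10_websites_traffic = defaultdict(int)
top_10_websites_connected = defaultdict(int)

for line in fileinput.input():
    if not line.strip():
        continue
    f = line.split()
    top_10_traffic[f[2]] += int(f[1])
    top_10_ip_connected[f[2]] += 1
    top_10_websites_traffic[f[6]] += int(f[1])
    top_10_websites_connected[f[6]] += 1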