Internet proxy log analysis preprocessing

Proxy logs need a bit of work done to them before you can start analysing the content. This is of course assuming you don’t have a fancy product to do all this work for you ;). First, you need to work out a regular expression that matches a single line of the proxy log, so that each line can be parsed into a nicer format such as CSV. A lot of the resulting CSV columns can probably be removed; the most useful columns are URL, date & time, user agent string (to work out what browser the user was using, for example) and request status code (to work out if the user was able to access the content or if it was blocked, unavailable etc.).
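As a rough sketch of that first step, here is one way the parsing could look. Every proxy has its own log layout, so the line format below (date, time, client IP, status code, method, URL, quoted user agent) is purely an assumption for illustration, as are the names LOG_LINE, KEEP and log_to_csv. You would need to adapt the regular expression to whatever your proxy actually writes.

```python
import csv
import io
import re

# Assumed (hypothetical) line layout:
# 2012-03-01 09:15:02 10.0.0.5 200 GET http://www.example.org/ "Mozilla/5.0"
LOG_LINE = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<client_ip>\S+) (?P<status>\d{3}) (?P<method>[A-Z]+) '
    r'(?P<url>\S+) "(?P<user_agent>[^"]*)"$'
)

# Only the columns worth keeping for analysis
KEEP = ['date', 'time', 'url', 'user_agent', 'status']

def log_to_csv(lines, out):
    """Write the matching lines to out as CSV; return how many lines were skipped."""
    writer = csv.DictWriter(out, fieldnames=KEEP, extrasaction='ignore')
    writer.writeheader()
    skipped = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            # line does not fit the assumed layout, count it rather than guess
            skipped += 1
            continue
        writer.writerow(match.groupdict())
    return skipped

sample = [
    '2012-03-01 09:15:02 10.0.0.5 200 GET '
    'http://www.example.org/index.html "Mozilla/5.0 (Windows NT 6.1)"',
    'this line is not in the expected format',
]
buf = io.StringIO()
skipped = log_to_csv(sample, buf)
print(buf.getvalue())
print('skipped lines:', skipped)
```

Counting the lines that fail to match is worth doing, as a high number usually means the regular expression does not quite fit the log format yet.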

Whilst internet history from browsers only contains the websites the user visited, proxy logs contain every single request made by the user’s IP address. This includes requests for each image on a page, the CSS files, JavaScript files and all the advertisements on that page. Analysing proxy log data can be very tricky, as a single user request to a URL can generate tens of requests on the proxy (possibly hundreds if the adverts or other content are dynamically changing), which can skew the results to show abnormally high internet usage. To accurately analyse proxy logs, you will need to separate the wheat from the chaff, and then group the remaining requests by domain name.

I have done a fair few proxy log analysis cases and I always try and filter out request URLs ending in css, js, jpg, png etc. and then remove URLs that belong to known advertisement websites. This is easily done as there are plenty of browser plugins which remove advert content, such as AdBlock. These plugins rely on lists of adverts to block, such as this one: https://easylist-downloads.adblockplus.org/easylist.txt. Using this list, and also removing non-HTML traffic, should remove most of the cruft in the proxy logs.
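A minimal sketch of that filtering is below. The extension list is a partial guess on my part, and the rule parser only understands the simplest EasyList rule form (||host^ with nothing after it), so it ignores the many other rule types in the real list. The function names and the sample rules and URLs are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Extensions treated as page furniture rather than page views (assumed, partial list)
STATIC_EXTENSIONS = ('.css', '.js', '.jpg', '.jpeg', '.png', '.gif', '.ico')

# Matches only plain domain rules of the form ||host^
DOMAIN_RULE = re.compile(r'^\|\|([a-z0-9.-]+)\^$')

def load_ad_hosts(lines):
    """Collect hosts from plain ||host^ rules; every other rule type is ignored."""
    hosts = set()
    for line in lines:
        match = DOMAIN_RULE.match(line.strip().lower())
        if match:
            hosts.add(match.group(1))
    return hosts

def is_chaff(url, ad_hosts):
    """True if the URL looks like a static asset or belongs to a listed ad host."""
    parsed = urlparse(url)
    if parsed.path.lower().endswith(STATIC_EXTENSIONS):
        return True
    host = parsed.hostname or ''
    # a match is the listed host itself or any subdomain of it
    return any(host == ad or host.endswith('.' + ad) for ad in ad_hosts)

# Illustrative rules only, not real EasyList entries
rules = ['! a comment line', '||ads.example.com^', '##.banner']
ad_hosts = load_ad_hosts(rules)

urls = [
    'http://www.example.org/index.html',
    'http://www.example.org/style.css',
    'http://ads.example.com/banner?id=1',
    'http://www.example.org/logo.PNG',
]
kept = [u for u in urls if not is_chaff(u, ad_hosts)]
print(kept)
```

The extension check is done on the path rather than the whole URL, so a query string such as style.css?v=2 does not hide the file type.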

Some proxy logs will provide the hostname of the request. Although useful, you really want the domain name, not the hostname. For example, if you want to see how much time is spent on Google, the hostname column will contain maps.google.com, news.google.com, mail.google.com etc. Ideally you want the domain name, which is just google.com. Writing a script to extract domain names is not that easy. You want the word that appears before the top level domain (e.g. com, gov, org, net) or country code top level domain (e.g. uk, ch, be). Some hostnames end in both, such as org.uk, in which case you want the word before the pair.

I’ve written a script below to extract the domain name – I believe it works for most cases, I can’t think of a weird URL that wouldn’t work – please comment if you can think of one that breaks this! The script makes use of Python’s urlparse, and the Wikipedia list of top level domain names, which recently had the addition of ‘xxx’ for adult content websites.

# Python 3 location of urlparse; in Python 2 it lived in the urlparse module
from urllib.parse import urlparse

# generic top level domains, taken from the Wikipedia list
TOPLEVEL = ['aero', 'arpa', 'asia', 'biz', 'cat', 'com', 'coop',
            'edu', 'gov', 'info', 'int', 'jobs', 'mil', 'mobi',
            'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel',
            'xxx']

def get_domain_name(url):
    parsed = urlparse(url)
    hostname = parsed.hostname

    if hostname is None:
        # no hostname found (e.g. the URL has no scheme), return it unchanged
        return url

    host_parts = hostname.split('.')
    domain_ending = domain_ending1 = ""

    # the len(host_parts) > 1 check stops a single label host such as
    # http://uk/ from leaving an empty list behind
    if len(host_parts) > 1 and \
       (host_parts[-1] in TOPLEVEL or len(host_parts[-1]) == 2):
        # ends with toplevel domain name or a 2 letter country code
        domain_ending = "." + host_parts[-1]
        host_parts = host_parts[0:-1]
        if len(host_parts) > 1 and \
           (len(host_parts[-1]) == 2 or host_parts[-1] in TOPLEVEL):
            # second to last part is in toplevel (e.g. org.uk) or 2 letter
            # code (e.g. co.uk) -- only if url has 2 or more parts left
            domain_ending1 = "." + host_parts[-1]
            host_parts = host_parts[0:-1]

    domain = host_parts[-1]
    return domain + domain_ending1 + domain_ending

url = 'http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains'
print(get_domain_name(url))  # wikipedia.org
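
Once each request has a domain name, grouping is just counting. The sketch below is one possible way to do it. To keep it runnable on its own it uses a stand-in called naive_domain (my own placeholder, which simply keeps the last two parts of the hostname and so gets co.uk style endings wrong); in practice you would pass get_domain_name in as the domain_of argument instead.

```python
from collections import Counter
from urllib.parse import urlparse

def naive_domain(url):
    """Stand-in only: last two labels of the hostname. Wrong for co.uk style endings."""
    host = urlparse(url).hostname or url
    return '.'.join(host.split('.')[-2:])

def requests_per_domain(urls, domain_of=naive_domain):
    """Count requests per domain, using whichever domain function is supplied."""
    return Counter(domain_of(u) for u in urls)

urls = [
    'http://maps.google.com/',
    'http://mail.google.com/inbox',
    'http://en.wikipedia.org/wiki/Proxy_server',
]
counts = requests_per_domain(urls)
for domain, hits in counts.most_common():
    print(domain, hits)
```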

Published by lowmanio
