Proxy logs need a bit of work before you can start analysing their content. This is of course assuming you don’t have a fancy product to do all this work for you ;). First, you need to work out the regular expression that describes a line of the proxy log, so you can parse it into a nicer format such as CSV. A lot of the CSV columns can probably be removed; the most useful ones are the URL, the date and time, the user agent string (to work out, for example, which browser the user was using) and the request status code (to work out whether the user was able to access the content, or whether it was blocked, unavailable etc.).
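As a minimal sketch of that first step, here is one way it might look in Python. The field layout assumed below is the Squid-style “native” access.log format; the regular expression, sample line and column choice are illustrative, and you would adjust them to whatever format your proxy actually writes.

```python
import re

# Assumed field layout (Squid native format, illustrative only):
#   timestamp elapsed client action/status bytes method URL ...
LINE_RE = re.compile(
    r'(?P<ts>\d+\.\d+)\s+(?P<elapsed>\d+)\s+(?P<client>\S+)\s+'
    r'(?P<status>\S+)\s+(?P<size>\d+)\s+(?P<method>\S+)\s+(?P<url>\S+)'
)

def to_csv_rows(lines):
    """Keep only the columns that matter for analysis: time, status, URL."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            yield [match.group('ts'), match.group('status'), match.group('url')]

sample = ('1305283449.123    456 10.0.0.5 TCP_MISS/200 1024 GET '
          'http://example.com/index.html - DIRECT/93.184.216.34 text/html')
rows = list(to_csv_rows([sample]))
```

Each row can then be handed to the csv module for writing out; lines that don’t match the expression are simply dropped, which is usually what you want for malformed entries.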
I have done a fair few proxy log analysis cases, and I always try to filter out request URLs ending in css, js, jpg, png etc., and then remove URLs belonging to known advertisement websites. This is easy to do because there are plenty of browser plugins that block advert content, such as AdBlock, and these plugins rely on published lists of adverts to block, such as this one: https://easylist-downloads.adblockplus.org/easylist.txt. Using such a list, and also removing non-HTML traffic, should get rid of most of the cruft in the proxy logs.
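A filter along those lines might look like the sketch below. The advert hostnames here are hypothetical stand-ins; in a real run you would load them by parsing the downloaded easylist.txt rather than hard-coding them.

```python
from urllib.parse import urlparse

# Extensions for static content that rarely matters in an investigation.
STATIC_EXTS = ('.css', '.js', '.jpg', '.jpeg', '.png', '.gif', '.ico')

# Hypothetical stand-in for hostnames parsed out of an EasyList-style
# blocklist; a real run would build this set from easylist.txt instead.
AD_HOSTS = {'doubleclick.net', 'ads.example.net'}

def is_noise(url):
    """True if the request is static content or hits a known advert host."""
    parsed = urlparse(url)
    if parsed.path.lower().endswith(STATIC_EXTS):
        return True
    host = (parsed.hostname or '').lower()
    return any(host == ad or host.endswith('.' + ad) for ad in AD_HOSTS)
```

Matching on the hostname suffix (rather than the whole URL) means subdomains of a blocked advert domain are caught as well.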
Some proxy logs will provide the hostname of the request. Although useful, you really want the domain name, not the hostname. For example, if you want to see how much time is spent on Google, the hostname column will contain maps.google.com, news.google.com, mail.google.com etc., whereas what you ideally want is the domain name, which is just google.com. Writing a script to extract domain names is not as easy as it sounds. You want the label that appears before the top level domain (e.g. com, gov, org, net) or country code top level domain (e.g. uk, ch, be) – and some hostnames, such as bbc.co.uk, have both.
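To see why this is awkward, here is the obvious approach (the URLs are just illustrative): take the last two labels of the hostname. It works for .com but falls over on country-code domains.

```python
from urllib.parse import urlparse

# Naive attempt, for illustration only: keep the last two hostname labels.
def naive_domain(url):
    host = urlparse(url).hostname
    return '.'.join(host.split('.')[-2:])

ok = naive_domain('http://maps.google.com/')   # 'google.com' -- correct
bad = naive_domain('http://news.bbc.co.uk/')   # 'co.uk' -- wrong, lost 'bbc'
```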
I’ve written a script below to extract the domain name – I believe it works for most cases; I can’t think of a weird URL that wouldn’t work – please comment if you can think of one that breaks it! The script makes use of Python’s urlparse and the Wikipedia list of top level domain names, which recently gained ‘xxx’ for adult content websites.
```python
from urllib.parse import urlparse  # urlparse.urlparse in Python 2

def get_domain_name(url):
    parsed = urlparse(url)
    hostname = parsed.hostname
    if hostname is None:
        return url
    toplevel = ['aero', 'arpa', 'asia', 'biz', 'cat', 'com', 'coop', 'edu',
                'gov', 'info', 'int', 'jobs', 'mil', 'mobi', 'museum', 'name',
                'net', 'org', 'pro', 'tel', 'travel', 'xxx']
    host_parts = hostname.split('.')
    domain_ending = domain_ending1 = ""
    if host_parts[-1] in toplevel or len(host_parts[-1]) == 2:
        # ends with a top level domain name or a 2 letter country code
        domain_ending = "." + host_parts[-1]
        host_parts = host_parts[0:-1]
        if len(host_parts) > 1 and \
                (len(host_parts[-1]) == 2 or host_parts[-1] in toplevel):
            # second to last part is in toplevel (e.g. org.uk) or a 2 letter
            # code (e.g. co.uk) -- only if the url has 2 or more parts left
            domain_ending1 = "." + host_parts[-1]
            host_parts = host_parts[0:-1]
    if not host_parts:
        # hostname was nothing but a top level domain (e.g. just 'uk')
        return hostname
    domain = host_parts[-1]
    return domain + domain_ending1 + domain_ending

url = 'http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains'
print(get_domain_name(url))
```