Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications

Web Log Data Preparation for Data Mining

Introduction

Before data mining algorithms can be applied, web log data very often needs to be preprocessed. This section shows how some typical preprocessing tasks can be done with a script.

Filtering, Extraction and Other Operations

A web log line has a format like this: 64.111.11.11 - - [31/Oct/2004:21:45:03 -0800] "GET /cgi-bin/log/source/vs/vs_main.cgi HTTP/1.1" 200 1924 "http://www.sitename.com/cgi-bin/ai/osp.cgi" "Mozilla/4.7 [en](Exabot.com)"
Each line carries a lot of data, and often we need to filter out unwanted lines, such as robot visits. We are usually interested in data such as visited pages, referers (the page the user came from) and a few other fields. For example, in some situations we need to extract only the referers that lead to a particular group of pages for each product (service or topic) category.
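The filtering and extraction described above can be sketched as follows. This is a minimal Python illustration, not the author's script: the regular expression assumes the combined log format shown in the sample line, and the names `parse_line`, `is_robot`, `referers_for` and the `ROBOT_MARKERS` list are hypothetical and would need adjusting for real logs.

```python
import re

# Assumed layout: ip, identity, user, [time], "request", status, size,
# "referer", "user agent" (matches the sample line in the text).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings that mark a robot's user agent; extend as needed.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_robot(fields):
    """Guess whether the hit came from a robot by looking at the user agent."""
    agent = fields["agent"].lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def referers_for(lines, url_prefix):
    """Yield referers of non-robot hits whose URL starts with url_prefix."""
    for line in lines:
        fields = parse_line(line)
        if fields is None or is_robot(fields):
            continue
        if fields["url"].startswith(url_prefix) and fields["referer"] not in ("", "-"):
            yield fields["referer"]
```

Note that the sample line itself would be dropped by this filter, because its user agent contains "Exabot". Matching on substrings is a crude heuristic; a maintained list of robot agents would likely work better in practice.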
Also, when we count pages we might want to count two different links as the same page. For example, in some situations it may be useful to count page.aspx?product=A and page.aspx?product=B simply as page.aspx?product_group=1. This requires modifying the links. To implement all of the above we need a simple script that goes through the log line by line and splits each line into fields. The following section shows one more preprocessing task and describes a script that can be used for it or adjusted to your particular needs.
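The link modification mentioned above might look like the sketch below. It is written in Python for illustration only; the `PRODUCT_GROUPS` table and the `normalize_url` function are hypothetical, and the rule (replace the `product` parameter by its group) is just one possible way to merge links.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from product code to product group.
PRODUCT_GROUPS = {"A": "1", "B": "1"}


def normalize_url(url):
    """Rewrite page?product=X to page?product_group=N when X has a known group.

    URLs without a known product are returned unchanged. Any other query
    parameters are dropped in the rewritten form, which is an assumption
    made to keep the sketch short.
    """
    parts = urlsplit(url)
    product = parse_qs(parts.query).get("product", [None])[0]
    group = PRODUCT_GROUPS.get(product)
    if group is None:
        return url
    return f"{parts.path}?product_group={group}"


def count_pages(urls):
    """Count hits per page after normalization."""
    return Counter(normalize_url(url) for url in urls)
```

With this mapping, hits on page.aspx?product=A and page.aspx?product=B are both counted under page.aspx?product_group=1.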

Converting Web Log for Path Analysis

If we want to analyze user navigation patterns, we need the data in a format where each line is a sequence of clicked pages and represents a single user session (visit). In the web log, however, each line is one click or hit, so the web log file has to be converted. This is a little more complicated than the previous tasks. A simple script (see References) can help to accomplish this. The script can also be easily modified to perform any of the previous tasks, such as extracting all the needed information about referers.
Just a few words about this script. It is written in Perl, which makes it easy to split a line into tokens using regular expressions. I use a hash to store, for each IP address, the pages of the current path and the time of the last page visited. If a new page for a given IP is visited more than 30 minutes after the previous one, it counts as a new visit: the previous path is written to the output file and a new path is started for this IP, containing only the new page. If less than 30 minutes have passed, the new page is simply appended to the path for the same IP. In both cases the stored time is updated. Pages in a path are separated by spaces. To calculate the time difference I use the Time::Local module. This module comes with the default Perl installation, so nothing extra needs to be installed. Its timegm function converts a time to epoch seconds, after which the difference is easy to calculate.
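The logic just described can be sketched as below. This is a Python stand-in for illustration, not the Perl script itself: `calendar.timegm` plays the role of Perl's `timegm`, and the names `to_epoch`, `sessions` and `SESSION_TIMEOUT` are hypothetical. The sketch assumes the log lines are in chronological order and, as an additional assumption, writes out the paths still open when the input ends.

```python
import calendar
import re
import time

SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds

# Captures the IP, the timestamp inside [...] and the requested page.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)')


def to_epoch(stamp):
    """Convert a stamp like '31/Oct/2004:21:45:03 -0800' to epoch seconds."""
    clock, offset = stamp.split()
    # Treat the clock time as if it were UTC, then correct for the zone offset.
    as_utc = calendar.timegm(time.strptime(clock, "%d/%b/%Y:%H:%M:%S"))
    sign = -1 if offset[0] == "-" else 1
    shift = sign * (int(offset[1:3]) * 3600 + int(offset[3:5]) * 60)
    return as_utc - shift


def sessions(lines):
    """Yield one space-separated path of pages per visit."""
    last_seen = {}  # ip -> (time of last hit, pages of the current path)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, stamp, page = match.groups()
        now = to_epoch(stamp)
        prev_time, path = last_seen.get(ip, (now, []))
        if now - prev_time > SESSION_TIMEOUT:
            # Gap longer than 30 minutes: output the old path, start a new one.
            yield " ".join(path)
            path = []
        path.append(page)
        last_seen[ip] = (now, path)
    # Assumption: paths still open at the end of the log are also written out.
    for _, path in last_seen.values():
        yield " ".join(path)
```

Each yielded string is one visit, for example `for path in sessions(open("access.log")): print(path)`, where access.log is a placeholder file name.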

Conclusion

This script makes it easy to prepare the data needed for a specific data mining task.


References
1. Source code for the Perl script to convert a web log
2. Perl Cookbook, Chapter 3: Dates and Times; Chapter 20: Web Automation