Applied Math & Computer Science Lab
Web Log Data Analysis

Introduction

Web path analysis is an excellent way to learn how visitors use your site and which paths through it are most common. Data mining methods can go well beyond this by generating interesting rules. Such rules can predict which users are likely to click a particular next link or make a purchase. This article shows how data mining can be applied and implemented for path analysis.

Calculating Frequent Paths

Let us assume you have already converted the web log into a data file. Each line of this file is one visit and represents the sequence of pages visited. For example, the following lines
3 5 7
5 7 8 5
2 3 5 7 9
3 5 7 9
4 3 5 7 9
mean that the first visit had 3 clicks: pages 3, 5, 7. I am using numbers to represent page names, but they could be in text format too. A very interesting question in web analytics is which path is the most frequent. Calculating the frequency of a path is simply a matter of counting its occurrences.
To get the frequency of a particular path, we just iterate through the file and count its occurrences, then divide that count by the total number of paths (visits) in the file.
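As a rough illustration of this counting step, here is a minimal Python sketch. It is not the author's Perl script; the function names `read_visits`, `contains_path` and `path_frequency` are hypothetical, and an in-memory string stands in for the converted log file. It assumes a visit counts once for a path, however many times the path occurs inside it.

```python
import io


def read_visits(lines):
    """Parse one visit per line; pages are whitespace-separated tokens."""
    return [line.split() for line in lines if line.strip()]


def contains_path(visit, path):
    """Return True if `path` occurs in `visit` as a run of consecutive pages."""
    n = len(path)
    return any(visit[i:i + n] == path for i in range(len(visit) - n + 1))


def path_frequency(visits, path):
    """Fraction of visits that contain `path` as a contiguous subpath."""
    hits = sum(1 for visit in visits if contains_path(visit, path))
    return hits / len(visits)


# Stand-in for the converted web log file, using the sample lines from the text.
log = io.StringIO("3 5 7\n5 7 8 5\n2 3 5 7 9\n3 5 7 9\n4 3 5 7 9\n")
visits = read_visits(log)

print(path_frequency(visits, ["3", "5", "7"]))  # 4 of 5 visits -> 0.8
```

Pages are kept as strings here, so the same sketch would work if the log used page names instead of numbers.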
To get the frequencies of all visited paths, we do the same for each path. Each line may contain several subpaths. For example, the first line, 3 5 7, contains two subpaths of length 2 pages (3 5 and 5 7) and one path of length 3 pages (3 5 7).
The algorithm and Perl source code for calculating all frequencies can be viewed at [4], [5]. The main loop iterates through each line of the web log file, creating all possible subpaths from length 1 up to the maximum length, which is the number of pages clicked in that visit. The variable start_pos holds the position where a path starts. Thus, for each possible path length we create all possible subpaths, moving the start position along by one each time as far as we can. Paths of length 1 can obviously be skipped.
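The loop described above might look roughly like the following Python sketch. It is an illustration under assumptions, not the Perl script referenced at [4]: the function name `count_subpaths` and the `min_len` parameter are hypothetical, and only `start_pos` follows the naming in the text.

```python
from collections import Counter


def count_subpaths(visits, min_len=2):
    """Count every contiguous subpath that has at least `min_len` pages."""
    count = Counter()
    for visit in visits:
        # Try every path length from min_len up to the full visit.
        for length in range(min_len, len(visit) + 1):
            # start_pos slides along the visit one page at a time.
            for start_pos in range(len(visit) - length + 1):
                count[tuple(visit[start_pos:start_pos + length])] += 1
    return count


# The five example visits from the text.
visits = [
    [3, 5, 7],
    [5, 7, 8, 5],
    [2, 3, 5, 7, 9],
    [3, 5, 7, 9],
    [4, 3, 5, 7, 9],
]

count = count_subpaths(visits)
print(count[(5, 7)])                  # 5
print(count[(3, 5, 7)])               # 4
print(count[(3, 5, 7)] / len(visits)) # frequency of path 3 5 7 -> 0.8
```

Note that this sketch counts occurrences, so a path repeated within a single visit would be counted more than once. In the sample data no path of two or more pages repeats within a visit, so the counts equal the number of visits containing each path.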

Modifications For Analysis of User Navigation Patterns

This algorithm can be modified to fit your particular needs.
Examples of applications:

1. What are the most common paths? This is a typical problem in path analysis.
2. What are the least common paths? These can also be very useful for improving navigation. Here we look for paths with a frequency below some threshold.
3. Which paths lead to certain exit points, such as a purchase link or an affiliate link?
4. How different are the paths that start at different referrers or other entry points?
5. How do frequent paths compare for different types of users? Is there any difference? This problem introduces one more dimension, the user type, so each path should be associated with a user type.
6. If the web log is too big and the script takes a long time to run, the algorithm can be modified to update the counts incrementally.
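A few of these modifications (items 1, 2, 3 and 6) can be sketched in Python on top of a table of path counts. This is only one possible arrangement; the helper names `subpaths`, `add_visits`, `rare_paths` and `paths_ending_at`, and the threshold value, are hypothetical.

```python
from collections import Counter


def subpaths(visit, min_len=2):
    """Yield every contiguous subpath of `visit` with at least `min_len` pages."""
    for length in range(min_len, len(visit) + 1):
        for start_pos in range(len(visit) - length + 1):
            yield tuple(visit[start_pos:start_pos + length])


def add_visits(count, visits):
    """Item 6: fold new visits into an existing table instead of recounting."""
    for visit in visits:
        count.update(subpaths(visit))
    return count


def rare_paths(count, total_visits, max_support):
    """Item 2: paths whose frequency is below `max_support`."""
    return {p: n for p, n in count.items() if n / total_visits < max_support}


def paths_ending_at(count, exit_page):
    """Item 3: counted paths whose last page is `exit_page`."""
    return {p: n for p, n in count.items() if p[-1] == exit_page}


# The five example visits from the text.
visits = [
    [3, 5, 7],
    [5, 7, 8, 5],
    [2, 3, 5, 7, 9],
    [3, 5, 7, 9],
    [4, 3, 5, 7, 9],
]

count = add_visits(Counter(), visits[:3])  # first batch of the log
count = add_visits(count, visits[3:])      # later batch, added incrementally

print(count.most_common(1))                     # item 1 -> [((5, 7), 5)]
print(sorted(rare_paths(count, len(visits), 0.4)))
print(paths_ending_at(count, 9)[(3, 5, 7, 9)])  # 3
```

Because the table only stores counts, adding a new batch of visits gives the same result as recounting the whole log, as long as the total number of visits is tracked alongside it.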

Support & Confidence

For problems such as 3, 4 and 5, data mining makes it possible to discover association rules of the form "if X then Y". For example, one very useful rule could be: if the user visited pages 3, 5, 7, then the next visited page will be 9.
Association rules are evaluated using two measures: support and confidence. Earlier we considered the frequency of a path as the percentage of paths in the web log in which this path is a subpath. In data mining this number is also called support or coverage [1], [2]. A frequent path can then be defined as a path whose support is no less than a minimum support threshold.
Confidence is also called accuracy. It is defined as the number of paths that contain both X and Y divided by the number of paths that contain X.
For example, take the rule: if the user visited 3, 5, 7, then the next click will be page 9. Its confidence is the number of paths that contain 3, 5, 7, 9 divided by the number of paths that contain 3, 5, 7.
In our case the confidence of this rule (3,5,7 => 9) is 3/4, or 0.75.
The frequency (support) of path 3,5,7,9 is 3/5, or 0.6.
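These two numbers can be checked directly against the sample visits. The Python sketch below scans the visits for a single rule. The function names `support` and `confidence` are hypothetical, and Y is assumed to be the single page clicked right after X.

```python
def contains_path(visit, path):
    """Return True if `path` occurs in `visit` as a run of consecutive pages."""
    n = len(path)
    return any(visit[i:i + n] == path for i in range(len(visit) - n + 1))


def support(visits, path):
    """Fraction of visits that contain `path`."""
    return sum(contains_path(v, path) for v in visits) / len(visits)


def confidence(visits, x, y):
    """Confidence of the rule x => y, where y is the page clicked right after x."""
    with_x = sum(contains_path(v, x) for v in visits)
    with_xy = sum(contains_path(v, x + [y]) for v in visits)
    return with_xy / with_x if with_x else 0.0


# The five example visits from the text.
visits = [
    [3, 5, 7],
    [5, 7, 8, 5],
    [2, 3, 5, 7, 9],
    [3, 5, 7, 9],
    [4, 3, 5, 7, 9],
]

print(confidence(visits, [3, 5, 7], 9))  # 3 of 4 -> 0.75
print(support(visits, [3, 5, 7, 9]))     # 3 of 5 -> 0.6
```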
Paths whose support and confidence exceed some thresholds can lead to interesting rules.
The algorithm can easily be modified to compute confidence and verify association rules. Our main data structure is a hash called count, which has the path as the key and the number of occurrences of that path as the value.
Thus, to calculate confidence for rules about exit point A, we need to add
for each path xyzA that ends with A
    confidence{xyzA} = count{xyzA} / count{xyz}
next
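In Python, the same idea could be sketched as follows, with a `Counter` playing the role of the count hash. The names `count_subpaths` and `exit_rules` and the threshold parameters are hypothetical. Unlike the frequency calculation, this sketch also counts single pages, because a two-page rule such as 7 => 9 needs the count of its one-page prefix.

```python
from collections import Counter


def count_subpaths(visits):
    """Count every contiguous subpath, including single pages."""
    count = Counter()
    for visit in visits:
        for length in range(1, len(visit) + 1):
            for start_pos in range(len(visit) - length + 1):
                count[tuple(visit[start_pos:start_pos + length])] += 1
    return count


def exit_rules(count, exit_page, total_visits, min_support=0.0, min_confidence=0.0):
    """Rules `prefix => exit_page` mapped to (support, confidence)."""
    rules = {}
    for path, n in count.items():
        if len(path) < 2 or path[-1] != exit_page:
            continue
        prefix = path[:-1]  # the "xyz" part of "xyzA"
        support = n / total_visits
        confidence = n / count[prefix]
        if support >= min_support and confidence >= min_confidence:
            rules[prefix] = (support, confidence)
    return rules


# The five example visits from the text.
visits = [
    [3, 5, 7],
    [5, 7, 8, 5],
    [2, 3, 5, 7, 9],
    [3, 5, 7, 9],
    [4, 3, 5, 7, 9],
]

count = count_subpaths(visits)
rules = exit_rules(count, 9, len(visits), min_support=0.5, min_confidence=0.7)
print(rules)  # {(3, 5, 7): (0.6, 0.75)}
```

With both thresholds set to zero, every path ending at page 9 yields a rule, including rare ones such as 2,3,5,7 => 9, which has confidence 1.0 but support of only 0.2. The thresholds are what separate such one-off paths from rules worth acting on.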

Other Approaches

Another way to discover association rules is to use the Weka software [3]. It is free to download and has a GUI, so no coding is needed. The source code is also available, so it can be incorporated into your own program, although that requires Java programming. Another algorithm is proposed in [6]; it uses a hypertext probabilistic grammar.
If you like Perl or use Perl scripting for web development, you can use the Perl module Data::Mining::AssociationRules from CPAN [8]. A description of this module and instructions for using it are available at [7].

Conclusion

Thus we have seen how data mining can be applied to web path analysis. It can uncover new knowledge that cannot be obtained by simply counting visited pages.

References
1. Han, Jiawei & Kamber, Micheline (2006) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
2. Witten, Ian H. & Frank, Eibe (2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.
3. Weka 3: Data Mining Software in Java
4. Source code for calculating frequency script
5. Algorithm for calculating frequency script
6. Borges, José & Levene, Mark (2000) Data Mining of User Navigation Patterns. Lecture Notes in Artificial Intelligence, 1836.
7. Data-Mining-AssociationRules-0.10
8. CPAN