Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications
Detecting New Robots

Robot visits for crawling are generally a good thing. For data analysis, however, a large number of robot visits can make the data invalid. Anyone using raw log data for analysis should exclude robot visits. A list of IP addresses of search engine spiders can be found at http://www.iplists.com. The list is large and covers most robots, but some will still be missing from it.
One way to detect a new robot is to look at the number of pageviews for each IP address. In many cases, if a robot visits the website regularly and crawls most of the pages, this number will be much higher than the number of pageviews per IP from a human visitor.
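As a rough illustration of this idea, here is a minimal Python sketch that flags IPs whose pageview count is far above the typical count. The function name `flag_suspected_robots`, the `ratio` and `min_views` parameters, and the sample counts are all assumptions made for the example; suitable thresholds depend on the site and should be checked by hand.

```python
from statistics import median


def flag_suspected_robots(pageviews_per_ip, ratio=10, min_views=50):
    """Return IPs whose pageview count is far above the median count.

    ratio and min_views are hypothetical tuning knobs, not values
    taken from the article.
    """
    if not pageviews_per_ip:
        return []
    typical = median(pageviews_per_ip.values())
    # An IP is suspicious only if it is both large in absolute terms
    # and large relative to the typical visitor.
    threshold = max(min_views, ratio * typical)
    suspects = [ip for ip, n in pageviews_per_ip.items() if n >= threshold]
    # Highest pageview counts first.
    return sorted(suspects, key=lambda ip: pageviews_per_ip[ip], reverse=True)


# Made-up counts for illustration only.
counts = {
    "203.0.113.5": 3,
    "203.0.113.9": 1,
    "198.51.100.7": 420,
    "192.0.2.44": 6,
}
suspects = flag_suspected_robots(counts)
print(suspects)
```

A flagged IP is only a candidate: a high count can also come from a proxy or a shared office address, so it is worth looking at the pages requested before adding the IP to the exclusion list.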
So I created a Perl script that iterates through the log, excludes already known robots and your own IPs listed in a special text file, and counts the number of visits per IP. It also counts the number of views per page. The weblog format is not server defined. Having a summary of pageviews per IP makes it possible to detect and then exclude new robots, which makes the data analysis more useful and accurate.
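The counting step can be sketched as follows. This is written in Python, not the author's Perl, and it is not the original script. Because the weblog format is not described here, the sketch assumes a hypothetical tab-separated layout of `ip`, `timestamp`, `page`, and an exclusion file with one IP per line; the names `load_excluded_ips` and `summarize_log` are also made up for the example.

```python
from collections import Counter


def load_excluded_ips(lines):
    """Build the exclusion set: one IP per line, blanks and '#' comments skipped."""
    cleaned = (line.strip() for line in lines)
    return {ip for ip in cleaned if ip and not ip.startswith("#")}


def summarize_log(log_lines, excluded_ips):
    """Count pageviews per IP and per page, skipping excluded IPs.

    Assumed (hypothetical) line format: ip<TAB>timestamp<TAB>page
    """
    views_per_ip = Counter()
    views_per_page = Counter()
    for line in log_lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed format
        ip, page = fields[0], fields[2]
        if ip in excluded_ips:
            continue  # known robot or own IP
        views_per_ip[ip] += 1
        views_per_page[page] += 1
    return views_per_ip, views_per_page


# Made-up sample data; in practice both would be read from files.
excluded = load_excluded_ips(["# known robots and own IPs", "192.0.2.10", ""])
log = [
    "192.0.2.10\t2008-01-01 10:00:00\t/index.html",
    "203.0.113.5\t2008-01-01 10:00:05\t/index.html",
    "203.0.113.5\t2008-01-01 10:00:09\t/about.html",
    "malformed line",
]
views_per_ip, views_per_page = summarize_log(log, excluded)
for ip, n in views_per_ip.most_common():
    print(ip, n)
```

Both functions accept any iterable of lines, so an open file handle can be passed in directly, and the per-IP summary comes out sorted with the heaviest visitors first.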

Comments, Suggestions: AMCS Forum

Prev: Web Log Data Preparation for Data Mining

References
1. Source code for the Perl script