MJB Data : Spiders, crawlers and bots


MJB's spiders, robots and crawlers database

A vast amount of content on the Internet is retrieved automatically. When a search engine spiders a site's content, this is generally welcomed and encouraged, if it is considered at all. However, the crawling methods used to gather content for search indexes can also be used by spammers to harvest email addresses or to generate fake website traffic, which is less desirable. The MJB Data spiders database contains information on some of the good, the bad and the unknown that have visited the MJB Data network.

Who cares?

'User Agent' strings can be faked quite easily, and you would have to go to a lot of trouble to stop someone from spidering your site if they send the User Agent of a popular web browser. You can block IP addresses, as I have done in special cases, but experience shows that determined spammers switch hosts and User Agents frequently. However, lazy, stupid or novice content snatchers are often easy to spot by their User Agent, and it can be worth blocking the best known of these.
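As a rough illustration of that kind of blocking, here is a minimal Python sketch. The agent substrings and the IP range are made-up placeholders (the range is one reserved for documentation), and `is_blocked` is a hypothetical helper rather than anything MJB Data actually runs.

```python
import ipaddress

# Hypothetical ban lists: replace with agents and networks seen in your own logs.
BLOCKED_AGENT_SUBSTRINGS = ("exampleharvester", "samplesnatcher")
BLOCKED_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)


def is_blocked(user_agent: str, ip: str) -> bool:
    """Return True if the request matches a banned User Agent or IP range."""
    agent = (user_agent or "").lower()
    # Case-insensitive substring match catches version variants of one agent.
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

A check like this only stops the careless: anything sending a browser's User Agent from a fresh host passes straight through, which is the limitation described above.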

Why track them?

Most people who have a website seem to want free traffic from search engines. Crawler-based search engines that build their own index of web pages, such as Google, MSN, Teoma/Ask, and others, do so using programs known as spiders or robots, which crawl around the Internet looking for new sites while keeping up to date with existing ones.

For a webpage to be found it needs to be included in the search engines' indexes, and to be included it needs to be spidered or crawled. Log files record the page requests that search engines make. However, most log file analysis reports won't show which pages each user agent has requested.
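If your reporting tool doesn't give that breakdown, it can be pulled from the raw log directly. The sketch below assumes the log uses the common "combined" access log layout (IP, timestamp, request line, status, size, referrer, User Agent); the pattern and the `pages_by_agent` helper are illustrative and would need adjusting for other formats.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: combined access log, one request per line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def pages_by_agent(lines):
    """Count requests per URI, grouped by User Agent string."""
    report = defaultdict(Counter)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        report[match.group("agent")][match.group("uri")] += 1
    return report
```

Calling `pages_by_agent(open("access.log"))` (the file name is an assumption) returns a mapping from each User Agent to the pages it requested and how often.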

And the answer is...?

It really depends on who you are and what you want to know or achieve. If you're the administrator type you'll probably want to keep an up-to-date list of banned User Agents and IP addresses for use at the server-software level. If you're more interested in marketing and search engines you might like to do something along the lines of what I do: filter robotic page requests and insert the User Agent, IP address, date, time, requested URI, and anything else you fancy into a database table. After that it's pretty straightforward to produce reports showing spider visits per page, by day and by User Agent.
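The database approach just described might look something like the following Python sketch using SQLite. The table name, column names and the crude `looks_robotic` keyword test are all assumptions made for illustration; they are not the schema or filter actually used on the MJB Data network.

```python
import sqlite3

# Crude heuristic (an assumption): treat agents naming themselves as robots.
ROBOT_MARKERS = ("bot", "crawl", "spider")


def looks_robotic(user_agent):
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def open_database(path=":memory:"):
    """Open the database and create the hypothetical visits table if needed."""
    connection = sqlite3.connect(path)
    connection.execute(
        """CREATE TABLE IF NOT EXISTS spider_visits (
               user_agent TEXT NOT NULL,
               ip         TEXT NOT NULL,
               visit_date TEXT NOT NULL,
               visit_time TEXT NOT NULL,
               uri        TEXT NOT NULL)"""
    )
    return connection


def record_request(connection, user_agent, ip, visit_date, visit_time, uri):
    """Store the request if it looks robotic; return whether it was stored."""
    if not looks_robotic(user_agent):
        return False
    connection.execute(
        "INSERT INTO spider_visits VALUES (?, ?, ?, ?, ?)",
        (user_agent, ip, visit_date, visit_time, uri),
    )
    connection.commit()
    return True


def visits_report(connection):
    """Spider visits per page, by day and by User Agent."""
    return connection.execute(
        """SELECT uri, visit_date, user_agent, COUNT(*)
           FROM spider_visits
           GROUP BY uri, visit_date, user_agent
           ORDER BY uri, visit_date, user_agent"""
    ).fetchall()
```

Storing the date as an ISO-style string such as `2005-10-10` keeps the per-day grouping a plain `GROUP BY`, and extra columns (referrer, status code) can be added to the table in the same way.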

Put simply:

Monitoring spiders can show you areas of your websites that are being missed by the major search engines, and can also help protect your content.
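Finding the missed areas can be as simple as comparing the pages you know exist with the pages spiders have requested. A minimal sketch, assuming you already have both lists to hand (the `uncrawled_pages` name is hypothetical):

```python
def uncrawled_pages(site_pages, crawled_uris):
    """Return the known pages that no spider has requested, sorted."""
    return sorted(set(site_pages) - set(crawled_uris))
```

Here `site_pages` could come from a sitemap or a directory listing, and `crawled_uris` from the logged spider requests.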

