Don't count visits from known bots

For websites with low traffic, bots can account for a large share of all visits. After I created an AdWords campaign, I noticed that I consistently got 3 more visitors a day. They turned out to be a Google bot.

I have two suggestions:

[ol]
[li] By default filter out all visits from known bots by user agent. The list doesn’t have to be complete, but just filtering Google, Bing, Yahoo, etc. would make a big difference. Or:
[/li][li] Tag visits from bots and give users the option to show/hide these visits in reports. I’d say the default should be to hide the bot visits.
[/li][/ol]

One may argue this is no new functionality and users can filter the bots already. However, I think that for most users, figuring out how to filter visits from bots is difficult and a lot of work. Moreover, having every user maintain such a list is like having everyone who uses e-mail maintain their own spam filter.

By the way, 3 visits/day may not look like a lot, but it led me to believe that a significant number of visitors were still coming to my site on Win XP with an old IE version. Even on sites with 10x the traffic of my site, some stats can be completely distorted by bots.

By default we try to avoid counting Google bots. Can you tell me the IP address and the user agent of the Google bot visit that got tracked in Piwik?

I’ll send it by PM. If I’m wrong about it being a bot, posting it here would publish someone’s personal data.

I created a ticket: visits from Google adwords bot should be excluded · Issue #4441 · matomo-org/matomo · GitHub

and fixed it. It would be great if you could test the fix and confirm in the ticket whether it works as expected.

Ok, I’ll remove the IP address from the block list and see if the bot shows up in my logs in the next couple of days.

I just installed Piwik 2.0.3 and the issue still persists. I haven’t checked the complete log, but visits from Google to pages that are part of an AdWords campaign stand out as still being counted.

Worse, after adding 66.249.*.* to the block filter list under Settings>Websites>Global list of Excluded IPs and forcing Piwik to reprocess the reports by dropping the archive tables (as described here), the visits still show up.

Reprocessing the reports won’t remove the bots from visits that have already been recorded.

Can you post a screenshot of the visitor log showing all visits from the Google bot that are still not hidden after the upgrade to 2.0.3?

There you go.

Thanks for that! I would have one last request… could you look into your “access.log” file on your server, for these IP addresses, and let me know the full list of user agents?

In the code I checked, we exclude all user agents containing googlebot, AdsBot-Google, etc., but maybe Google is now using a new user agent?
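The substring check described above can be sketched roughly as follows. This is only an illustration in Python, not Piwik’s actual code: the function name `is_known_bot` and the `BOT_MARKERS` tuple are hypothetical, and the list contains just the two markers named in this thread.

```python
# Hypothetical marker list: only the two names mentioned in this thread.
# A real list would be longer (the "etc." above is not filled in here).
BOT_MARKERS = ("googlebot", "adsbot-google")


def is_known_bot(user_agent: str) -> bool:
    """Return True if the user agent contains a known bot marker.

    The comparison is case-insensitive, so "AdsBot-Google" and
    "adsbot-google" are treated the same.
    """
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)
```

A check like this can only catch bots that announce themselves. A request that sends an ordinary browser user agent passes straight through.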

Sorry for the late reply. Where do I find that log file?

Ask your sysadmin or host, as it depends on your config. Sometimes in /var/log/apache2/access.log
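If you can read the file, a small script can pull out the user agents for a given IP prefix. The sketch below is an assumption-laden example in Python: it assumes the common Apache “combined” log format, and the names `LINE_RE` and `user_agents_by_ip` are made up for illustration. Lines that don’t match the pattern are silently skipped.

```python
import re
from collections import defaultdict

# Assumed layout (Apache "combined" format):
# ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def user_agents_by_ip(lines, ip_prefix):
    """Map each IP starting with ip_prefix to the set of user agents it sent."""
    agents = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("ip").startswith(ip_prefix):
            agents[match.group("ip")].add(match.group("ua"))
    return dict(agents)
```

To use it, open the log file and pass the file object as `lines`, for example `user_agents_by_ip(open(path), "66.249.")`, where `path` is wherever your host keeps the access log.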

I’m sorry, I can’t access the file. I’m on shared hosting and found a directory path matching your description (the path is inside a .vs folder), but I don’t have rights to download the file.

We had an unusually high number of Internet Explorer 8 users, so I had to dig into the issue. I’m seeing exactly the same behavior and can provide you with some details from the webserver log files.


...
66.249.85.36 - - [25/May/2014:08:39:48 +0200] "GET / HTTP/1.1" 200 3600 "http://..." "Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; C5170 Build/IML77) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
66.249.85.127 - - [25/May/2014:08:39:49 +0200] "GET / HTTP/1.1" 200 37486 "http://..." "Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; C5170 Build/IML77) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
66.249.83.217 - - [25/May/2014:08:39:50 +0200] "GET / HTTP/1.1" 200 762 "http://..." "Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; C5170 Build/IML77) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
66.249.85.36 - - [30/May/2014:01:25:20 +0200] "GET / HTTP/1.1" 200 5430 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0))
...

These requests follow Google AdWords campaign URLs, but they do not use the corresponding bot user agent: none of them contains the “AdsBot-Google” string mentioned at [url=https://support.google.com/adwords/answer/2404197]Google help[/url]. So bots from the Google IP range are requesting the AdWords URLs to check for quality etc., but are not announcing themselves as the AdsBot. Maybe they do this to get around browser sniffing that would show the AdsBot different content…

But I also have accesses from the “AdsBot-Google” user-agent in my log files, like:


66.249.91.71 - - [26/May/2014:04:22:29 +0200] "GET / HTTP/1.1" 200 3574 "-" "AdsBot-Google (+http://www.google.com/adsbot.html)"
66.249.91.87 - - [26/May/2014:04:22:30 +0200] "GET /" 200 3594 "-" "AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari"

I don’t see any way to filter these requests by user agent. Currently I see the following IP addresses in my log file:


66.249.85.36: google-proxy-66-249-85-36.google.com.
66.249.85.127: google-proxy-66-249-85-127.google.com.
66.249.83.217: google-proxy-66-249-83-217.google.com.
66.249.85.36: google-proxy-66-249-85-36.google.com.
66.249.81.75: google-proxy-66-249-81-75.google.com.
66.249.83.106: google-proxy-66-249-83-106.google.com.
66.249.85.36: google-proxy-66-249-85-36.google.com.
66.249.85.127: google-proxy-66-249-85-127.google.com.
66.249.83.217: google-proxy-66-249-83-217.google.com.
66.249.85.36: google-proxy-66-249-85-36.google.com.
66.249.91.71: rate-limited-proxy-66-249-91-71.google.com.
66.249.91.87: rate-limited-proxy-66-249-91-87.google.com.
66.249.91.150: rate-limited-proxy-66-249-91-150.google.com.
66.249.91.134: rate-limited-proxy-66-249-91-134.google.com.
66.249.91.118: rate-limited-proxy-66-249-91-118.google.com.
66.249.91.103: rate-limited-proxy-66-249-91-103.google.com.
66.249.92.63: rate-limited-proxy-66-249-92-63.google.com.
66.249.92.196: rate-limited-proxy-66-249-92-196.google.com.
66.249.92.183: rate-limited-proxy-66-249-92-183.google.com.
66.249.92.89: rate-limited-proxy-66-249-92-89.google.com.
66.249.92.76: rate-limited-proxy-66-249-92-76.google.com.

I added 66.249.*.* to the installation-wide list of blocked IP addresses. But I think it would be better to have a general solution in place, so that not everyone needs to dig into this.

Edit: There also seems to be another IP range at 64.233.172.* accessing the site, as mentioned above…
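For anyone who wants to check addresses against these two ranges outside of Piwik, here is a minimal Python sketch using the standard `ipaddress` module. The CIDR blocks are my own translation of the wildcard ranges in this post (66.249.x.x as a /16 and 64.233.172.x as a /24), and the names `BOT_NETWORKS` and `is_bot_ip` are hypothetical.

```python
import ipaddress

# Assumption: the wildcard ranges from this post written as CIDR blocks.
BOT_NETWORKS = [
    ipaddress.ip_network("66.249.0.0/16"),    # 66.249.x.x
    ipaddress.ip_network("64.233.172.0/24"),  # 64.233.172.x
]


def is_bot_ip(ip: str) -> bool:
    """Return True if the address falls inside one of the listed ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BOT_NETWORKS)
```

Blocking the whole 66.249 range is a blunt tool, since anything else arriving from those addresses is dropped as well.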

@nlsrchtr Thanks for the report! I’ve added these IP ranges to the known “Non human bots IP ranges” + added some tests, see: adding new google IPs as known bot ips + adding unit tests · matomo-org/matomo@62d9ccc · GitHub