How to Remove Referral Spam from Google Analytics Reports

My bugbear of the week: referral spam! How much of a pain in the a**e has it become over the past few months?! It’s now got so bad that I had no choice but to seek a solution.

I have purposefully avoided researching and implementing methods of identifying/handling referral spam in the hope that Google or another super-power would find some magic solution to the increasing levels of referral spam. Unfortunately that does not appear to be the case. Google have recently acknowledged referral spam as a problem but there is seemingly no solution on the horizon, leaving webmasters like me to go it alone.

What makes it even more frustrating is that the levels of referral spam appear to be growing! The Google Analytics help forum is now inundated with posts from frustrated webmasters around the globe asking how to resolve it. Referral spam is quickly becoming more and more of a frustration for marketeers as their reports are skewed by rogue, irrelevant and fake data just like this one from the infamous Mr Vitaly Popov:

Vitaly Rules Google Referral Spam

Unfortunately there is no way of stopping spam at its source. Web spam is web spam and will always be a problem no matter what guise it takes. Luckily there are, however, preventative measures which you can adopt to minimise (and hopefully eradicate!) referral spam in your reporting.

In this post I’ll cover a few of the methods which I have recently adopted to remove referral spam in Google Analytics. These include blocking:

  1. Known bots and spiders through Google Analytics’ ‘View Settings’
  2. Referrers via .htaccess
  3. Ghost referrals via hostname filtering
  4. Dodgy crawlers and fake referrals via campaign source filtering

If you’re using anything other than Google Analytics I can’t help at the moment I’m afraid, sorry.

Prerequisites

For the most part we’ll be working with ‘Filters’ within your Google Analytics ‘View’. Before we dive straight in it’s important to ensure:

  1. You always have one ‘View’ in place which has no filters whatsoever, allowing you to see all data, regardless of whether this is your main view.

    This will act as a backup view allowing you to see all unfiltered data to your website. It’ll also allow you to cross-reference and double-check data if required.

  2. Don’t apply the following filters to your main ‘View’. Instead either create a new, blank ‘View’ or clone your current ‘View’ and create filters on this.

    To clone your view go to Admin -> View Settings -> Copy View

    Clone a View in Google Analytics

Once you’ve covered the prerequisites you’re ready to proceed into the following preventative measures.

1. Blocking Known Bots and Spiders

In July 2014 Google Analytics announced a new feature to automatically exclude traffic from known bots and spiders. This optional configuration allows webmasters to automatically rid their reports of spurious traffic, thus providing higher data accuracy for “real” traffic.

To do this Google matches the User Agents of incoming hits against the Interactive Advertising Bureau (IAB) International Spiders and Bots List and automatically excludes those known to the IAB.

To enable known bot and spider filtering go to your ‘View Settings’ and simply check the “Exclude all hits from known bots and spiders” box and Google will do the hard work for you.

Exclude Known Bots and Spiders in Google Analytics

The only downside to this approach is that you’ll never actually know which bots/spiders hit your website and were automatically excluded by Google. I don’t suppose you’ll ever need to know but the data geek inside me always likes to have the full picture. Again this is where the unfiltered ‘View’ recommended in the prerequisites may come in handy.

2. Block Referrers via .htaccess

Through your .htaccess file (assuming you’re running on Apache) you can block referrer spam before it has a chance to trigger the Google Analytics tracking code and register as a web referral.

There are two ways to do this depending on your preference. Some people simply prefer to block the referral whereas others prefer to direct the traffic back to the source in a process known as deflecting. I’ll show both methods and you can decide.

Blocking Referrers
This method will simply stop visits sent by the referring domain from ever reaching your website. I’ve used two of the most irritating referral sources as an example, semalt and buttons-for-website.


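## Blocking referrals from semalt.com and buttons-for-website.com
RewriteEngine On
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
# No [OR] on the final condition, otherwise the rule is not limited to these referrers
RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC]
# Respond with 403 Forbidden
RewriteRule .* - [F]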
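
If your block list grows, writing the conditions by hand gets error-prone (a stray [OR] on the last condition is an easy mistake). Here is a minimal Python sketch that generates the rules from a list of domains. The function name `build_block_rules` is my own invention for illustration and not part of Apache or any library; it also assumes the domains contain only letters, digits, hyphens and dots.

```python
def build_block_rules(domains):
    """Return .htaccess lines that answer 403 to requests referred by `domains`.

    Hypothetical helper for illustration only.
    """
    if not domains:
        return []
    lines = ["RewriteEngine On"]
    for index, domain in enumerate(domains):
        # Escape dots so they match a literal "." rather than any character
        pattern = domain.replace(".", r"\.")
        # Chain conditions with OR, except on the final condition
        flags = "[NC,OR]" if index < len(domains) - 1 else "[NC]"
        lines.append(f"RewriteCond %{{HTTP_REFERER}} {pattern} {flags}")
    lines.append("RewriteRule .* - [F]")
    return lines


rules = build_block_rules(["semalt.com", "buttons-for-website.com"])
print("\n".join(rules))
```

Paste the printed lines into your .htaccess file and re-run the script whenever you add a domain to the list.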

Deflecting Referrers
This particular example is courtesy of Avi Wilensky of Promediacorp.

First you need to create a Deflector map (in this case named “deflector.map”) which is used to map the referring domain back to itself, for example:


##
## deflector.map
##
## referrer --> redirect target
http://semalt.com http://semalt.com
http://seoanalyses.com http://seoanalyses.com
http://buttons-for-website.com http://buttons-for-website.com

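You’ll then need to add the following rules to trigger the mapping whenever a referrer matches. Note that Apache only accepts the RewriteMap directive in the main server or virtual host configuration, not in .htaccess, so you’ll need access to that configuration for this method.

# RewriteMap must be declared in the server or virtual host config, not in .htaccess
RewriteMap deflector txt:/path/to/deflector.map
# Only act on requests that carry a Referer header
RewriteCond %{HTTP_REFERER} !=""
# Match when the referrer has an entry in deflector.map
RewriteCond ${deflector:%{HTTP_REFERER}|NOT-FOUND} !=NOT-FOUND
# Redirect the request to the target listed in the map
RewriteRule ^ ${deflector:%{HTTP_REFERER}} [R,L]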

I’m yet to try this approach for myself but it seems like a sound solution. Whichever method you adopt it’ll stop the referral from ever reaching your website.

3. Restricting Ghost Referrals

At this point you’re probably wondering what “ghost referrals” are. The term is associated with referrals which never actually visit your website. By exploiting vulnerabilities in Google Analytics’ tracking code referrers can trigger fake pageviews without ever having visited your website – annoying hey! So how on earth do you prevent that?!

Since the referrer triggers the pageview externally, therefore not actually visiting the website, the aforementioned method of .htaccess blocking or deflecting will not work. The only option is to use ‘Filters’ within Google Analytics to exclude specific hosts from your performance data.

One of the easiest methods, and the one which requires the least effort to maintain, is exclusion via hostname. “Why hostname?” I hear you ask, especially when we’re looking at referrals.

It’s complicated, but to cut a long story short ghost referrals will select tracking IDs at random and attempt to trigger the pageview. These referrals don’t specifically target you. Instead it’s likely to be an automated spam-bot churning through random tracking IDs. As a result they don’t actually know the correct hostname (i.e. chrisains.com) associated with the tracking ID so they’ll send a fake hostname. Because it’s fake, it’s often easy to spot in the Google Analytics Hostname report.

By analysing hostnames and applying the necessary filtering we can exclude any sessions or pageviews which do not belong to a host associated with your tracking ID.

I’d definitely advise a level of caution otherwise you may inadvertently exclude valid traffic!

Here’s what you need to do:

  1. Within Google Analytics go to a ‘View’ which contains historical web traffic data
  2. Set the date range to a relatively long period, say the past 18-24 months if possible, to ensure you gather as much hostname data as possible
  3. Go to the Audience -> Technology -> Network -> Hostname report
  4. Identify those hostnames which you are associated with and those which you are not.

    Valid hostnames should be the ones which you have associated your tracking code with. These are likely to be your domain (obviously!), your sub-domains, any third party tools which you have provided access to and so forth. Be careful: you’re likely to see well-known hosts such as Google.com or Apple.com. Unless you genuinely run pages there, these are spam!

    Hostname Report in Google Analytics

    As you can see I’ve got some rogue hostnames listed such as:

    • co.lumb.co
    • 4webmasters.org
    • forum.topic2961997.darodar.com
    • google.com
    • message2961997.cenokos.ru

    I haven’t got pages on any of these domains therefore traffic from those hostnames is definitely spam.

    There are also a lot of visits with hostname “(not set)”. There are a couple of reasons why this may occur. It may be legitimate event-based goal completions which are not associated with a pageview and therefore have no hostname value. Or, more likely, it’s spam.

    When “(not set)” hostnames occur you’ll need to investigate why. You can use a range of secondary dimensions such as ‘Source / Medium’ or ‘Full Referral’ within the Google Analytics reports to drill out more information.

    Hostname Source / Medium Report

    You’ll need to have a dig around in the data and see what you can find. Some people recommend blocking all “(not set)” hostnames seeing as it’s most often spam, which is true, but personally I’m a little skeptical about doing this just in case.

  5. Once you’ve identified which hostnames are yours, create a list of them using the pipe separator, for example:

    chrisains.com|translate.googleusercontent.com|webcache.googleusercontent.com

    This will be used as the Filter Pattern to include only your hostnames within your reports, therefore excluding all ghost referrals. Please note there is no trailing pipe on the filter pattern.

    Notice I have included translate.googleusercontent.com for those foreign visitors translating my web pages and webcache.googleusercontent.com for those visitors viewing cached versions. These are legitimate hosts whereas something like darodar.com is a well known source of spam!

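  6. Next, in Google Analytics go to ‘Admin’, select the view to apply the filter to, then go to ‘Filters’ and add a new filter. Set it up as a custom ‘Include’ filter on the ‘Hostname’ field, with your list of hostnames as the Filter Pattern: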

    Hostname Filter Configuration

  7. Hit the ‘Save’ button and you’re good to go.

TIP: Always use your unfiltered view to ensure you’re not blocking important traffic! It’d be very easy to inadvertently block legitimate traffic by accidentally excluding a hostname related to your tracking ID. So please ensure you approach this technique with caution.

Please also note that the filter will need updating whenever you enter your tracking ID into a new web service.

Finally please note the filter will not back-date. It will only filter traffic recorded after the filter was configured. If you need to apply the logic to historic data follow this guide from AnalyticsEdge.com using advanced segments.
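
Before saving the filter it’s worth checking your pattern against the hostnames you saw in the report. The following is a rough Python sketch of that check, not Google’s implementation: it assumes Google Analytics treats the Filter Pattern as an unanchored, case-insensitive regular expression, and the helper names `build_hostname_pattern` and `would_be_kept` are made up for this example.

```python
import re


def build_hostname_pattern(hostnames):
    """Join hostnames into one pipe-separated pattern with literal dots."""
    return "|".join(host.replace(".", r"\.") for host in hostnames)


def would_be_kept(pattern, hostname):
    # Assumption: GA applies the pattern as an unanchored, case-insensitive regex
    return re.search(pattern, hostname, re.IGNORECASE) is not None


my_hosts = [
    "chrisains.com",
    "translate.googleusercontent.com",
    "webcache.googleusercontent.com",
]
pattern = build_hostname_pattern(my_hosts)

# Hostnames taken from the Hostname report discussed above
report_hostnames = [
    "chrisains.com",
    "co.lumb.co",
    "4webmasters.org",
    "forum.topic2961997.darodar.com",
    "google.com",
    "(not set)",
]

for hostname in report_hostnames:
    verdict = "kept" if would_be_kept(pattern, hostname) else "dropped"
    print(f"{hostname}: {verdict}")
```

Escaping the dots is optional, but it stops a pattern such as chrisains.com from also matching an unrelated hostname like chrisainsXcom. If a hostname you own shows up as “dropped”, add it to the list before you save the filter.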

4. Blocking Dodgy Crawlers & Fake Referrals

Using Google Analytics filters you can also exclude dodgy/malicious web crawlers and block sources of fake referrals.

To do this we’ll be filtering on Campaign Source matching on Domain (as opposed to Referral through which you’d need to match the full referral path).

There is a big blacklist of spam referrers on GitHub provided by Piwik. It currently contains 271 spam domains and is growing all the time.

Unfortunately Google Analytics limits the Filter Pattern to 255 characters (a massive pain in the ar*e!) which means for an extensive list you’ll need to create multiple filters.

For the purpose of this post I’ll show you how to configure a single filter using the following referrers:

semalt|anticrawler|best-seo-offer|best-seo-solution|buttons-for-website|buttons-for-your-website|7makemoneyonline|-musicas*-gratis|kambasoft|savetubevideo|ranksonic|medispainstitute|offers.bycontext|100dollars-seo|sitevaluation|dailyrank|darodar

You can then repeat this approach to create additional filters as required.
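
Splitting a long blacklist by hand is tedious, so here is a minimal Python sketch that packs a list of terms into as few pipe-separated patterns as possible while staying within the 255 character limit. The function name `split_into_patterns` is hypothetical, and the demo uses a tiny limit of 25 characters purely so the split is easy to see.

```python
MAX_PATTERN_LENGTH = 255  # Filter Pattern limit mentioned above


def split_into_patterns(terms, max_length=MAX_PATTERN_LENGTH):
    """Greedily pack terms into pipe-separated patterns of at most max_length characters."""
    patterns = []
    current = ""
    for term in terms:
        if len(term) > max_length:
            raise ValueError(f"term is longer than the limit: {term}")
        candidate = term if not current else current + "|" + term
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The current pattern is full, so start a new one with this term
            patterns.append(current)
            current = term
    if current:
        patterns.append(current)
    return patterns


# Small limit used only to demonstrate the split
demo = split_into_patterns(
    ["semalt", "anticrawler", "best-seo-offer", "darodar"], max_length=25
)
for number, pattern in enumerate(demo, start=1):
    print(f"Filter {number}: {pattern}")
```

Each pattern it returns becomes the Filter Pattern of its own filter, and none of them ends with a trailing pipe.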

The process is very similar to before. Go to ‘Admin’, select the view to apply the filter to, then go to ‘Filters’. This time you’ll be configuring a custom ‘Exclude’ filter on the ‘Campaign Source’ field, with the list above as the Filter Pattern:

Campaign Source Filter Configuration

Note that this time we’re using the filter to exclude the rogue domains.
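
To get a feel for what the exclude filter will catch, you can try the pattern against a few campaign sources yourself. This is only a sketch in Python under the assumption that Google Analytics applies the Filter Pattern as an unanchored, case-insensitive regular expression; `is_excluded` is a made-up helper name and the sample sources are just examples.

```python
import re

# Filter Pattern copied from the example earlier in this section
SPAM_PATTERN = (
    "semalt|anticrawler|best-seo-offer|best-seo-solution|buttons-for-website|"
    "buttons-for-your-website|7makemoneyonline|-musicas*-gratis|kambasoft|"
    "savetubevideo|ranksonic|medispainstitute|offers.bycontext|100dollars-seo|"
    "sitevaluation|dailyrank|darodar"
)


def is_excluded(campaign_source, pattern=SPAM_PATTERN):
    # Assumption: GA applies the pattern as an unanchored, case-insensitive regex
    return re.search(pattern, campaign_source, re.IGNORECASE) is not None


sample_sources = [
    "google",
    "semalt.com",
    "buttons-for-website.com",
    "forum.topic2961997.darodar.com",
]

for source in sample_sources:
    print(f"{source}: {'excluded' if is_excluded(source) else 'kept'}")
```

Because the match is partial, a short term such as semalt also catches any sub-domain of that site, which is why the list uses bare names rather than full domains.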

Are there any other methods for identifying & reducing referral spam?

There are. Tom Capper of Distilled wrote an excellent post on using screen resolution to identify referral spam.

Resolution Not Set

In essence this approach consists of reviewing GA data to identify instances where the visitor’s screen resolution is “(not set)” – meaning they don’t have a screen. The assumption of course being if you don’t have a screen resolution you must be spam!

The other methods discussed in this post should adequately cover any referral spam identified through the “(not set)” resolution technique, but there’s no harm in running both. Speaking of which…
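
If you export your report data, the screen resolution check is easy to script. Here is a small Python sketch of the idea. The row layout (dictionaries with `source` and `screen_resolution` keys) and the sample rows are invented for illustration and won’t match a real export without some adjustment.

```python
def suspicious_sources(rows):
    """Return the sorted sources that have a session with no screen resolution."""
    return sorted(
        {row["source"] for row in rows if row["screen_resolution"] == "(not set)"}
    )


# Hypothetical rows standing in for an exported report
rows = [
    {"source": "google", "screen_resolution": "1920x1080"},
    {"source": "semalt.com", "screen_resolution": "(not set)"},
    {"source": "darodar.com", "screen_resolution": "(not set)"},
    {"source": "semalt.com", "screen_resolution": "(not set)"},
]

print(suspicious_sources(rows))
```

Treat the result as a list of candidates to investigate rather than a definitive spam list.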

Should you use all of these approaches?

Ideally I would say yes. All four approaches which I have covered complement each other and can be used in combination to effectively limit referral spam. You could also use Tom Capper’s screen resolution filter as an additional mechanism to reduce spam.

My only advice would be to remain cautious. Be sure to review your unfiltered ‘View’ regularly to ensure data accuracy and ensure that you’re not accidentally blocking real traffic.

Is it an on-going process?

For the time being yes – very much so. If you’re thinking you can follow these methods then walk away you’re very much mistaken.

As I mentioned at the start web spam isn’t going away. It will evolve and spammers will find different methods of referral spam, even if only through the use of differing hostnames. With this in mind it’s essential that you keep your .htaccess restrictions, hostname inclusion filters, and referral exclusion filters up-to-date at all times.

I hope this helps. Feel free to let me know how you get on with a comment below!

One thought on “How to Remove Referral Spam from Google Analytics Reports”

  • Hey Chris,
    I was tired of that spam data in my Google Analytics report. Even I have tried to filtered but I couldn’t do this. But After a through reading this post I have implemented the things that you discussed above. Now I am sure I can get rid of the spamming traffic in my monthly reports. Thanks for such a great and in-depth post on referral traffic. Let see how it helps .:)
