clean up awstats referral spam

The 'Links from an external page (other web sites except search engines)' section on the few awstats installations I get to look at has been chaotic for the past few months. Tons of referral spam drown out any useful information in that section.

AWStats comes with a simple mechanism to keep referral spam out of its reports, based on a blacklist. To enable referral spam blocking, set the SkipReferrersBlackList directive to the path of a file containing the spam domains you want to exclude.
However, this change only affects new updates; spam referrers that were already recorded stay in the existing logs.
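
In the AWStats config file the directive takes the form SkipReferrersBlackList="/path/to/blacklist.txt". As a quick check that it is actually set, here is a minimal shell sketch. The function name show_blacklist_setting and the config path in the example are assumptions, so adjust them to your installation.

```shell
# Print the active (uncommented) SkipReferrersBlackList line of a config file.
# show_blacklist_setting is a hypothetical helper, not part of AWStats.
show_blacklist_setting() {
    grep -E '^[[:space:]]*SkipReferrersBlackList[[:space:]]*=' "$1"
}

# Example (hypothetical path):
# show_blacklist_setting /etc/awstats/awstats.example.com.conf
```

If nothing is printed, the directive is missing or still commented out.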

I wrote a little script that removes referral spam from existing awstats logs using the same blacklist.txt file used by the `SkipReferrersBlackList`.

Usage instructions for the clean_awstats_rspam.pl program:

Get the clean_awstats_rspam.pl script
# wget kod.ipduh.com/lib/clean_awstats_rspam.pl


Set $blacklist to the path of your blacklist.txt.
# vim clean_awstats_rspam.pl
You may change $bl_subdomains to 0 if you don't want to get rid of sub-domains of blacklisted domains. To print the count of referral spam links removed at the bottom of the awstatsMMYYYY.example.com.txt file, set $rlog to 1.

The awstats log files are usually in the awstats data directory and have names like awstats062013.example.com.txt. The clean_awstats_rspam.pl script reads a log file and prints a `clean` log to standard output. Redirect the output to an intermediate log file, copy the intermediate log file over the original, and then refresh your awstats page.
# cp clean_awstats_rspam.pl /var/www/sites/example.com/awstats/data
# cd /var/www/sites/example.com/awstats/data
# chmod 700 clean_awstats_rspam.pl
# ./clean_awstats_rspam.pl awstats062013.example.com.txt > clean_awstats062013.example.com.txt
# cp clean_awstats062013.example.com.txt awstats062013.example.com.txt
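
The final cp above overwrites the original data file, so it may be worth keeping a backup first. A minimal sketch of that step, assuming a POSIX shell; swap_in_clean is a hypothetical helper name, not something that ships with awstats or with the script.

```shell
# Replace an AWStats data file with its cleaned copy, keeping a backup.
# swap_in_clean is a hypothetical helper name, not part of AWStats.
swap_in_clean() {
    orig=$1
    clean=$2
    # Refuse to overwrite the original with a missing or empty file.
    [ -s "$clean" ] || { echo "swap_in_clean: $clean is empty or missing" >&2; return 1; }
    # Keep the untouched original next to it as <name>.bak.
    cp -p "$orig" "$orig.bak" || return 1
    cp "$clean" "$orig"
}

# Example:
# swap_in_clean awstats062013.example.com.txt clean_awstats062013.example.com.txt
```

If the refreshed awstats page looks wrong, copying the .bak file back restores the previous state.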


The clean_awstats_rspam.pl script:

#!/usr/bin/perl
#g0 2013
#clean_awstats_rspam.pl
#clean up awstats referral spam
#http://alog.ipduh.com/2013/06/clean-up-awstats-referral-spam.html
# usage: 
# ./clean_awstats_rspam.pl awstats062013.example.com.txt > clean_awstats062013.example.com.txt
# cp clean_awstats062013.example.com.txt awstats062013.example.com.txt

use strict;

#set to 1 if you want to get rid of subdomains of blacklisted domains
my $bl_subdomains=1;
#location of your blacklist.txt file
my $blacklist="/var/www/sites/example.com/awstats/lib/blacklist.txt";
#set to 1 if you want to log the count of spam referral links removed to the awstatsMMYYYY.example.com.txt file
my $rlog=0;

my %spamdoms=();
my $logtxt=$ARGV[0];
my $foundspam=0; 
my $crap=0;
my $me="clean_awstats_rspam";

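open FH, '<', $blacklist or die "$me:I could not open $blacklist ($!)";
while (<FH>)
{
 chomp;
 #trim whitespace first so that indented comments are caught too
 s/^\s+//;
 s/\s+$//;
 #skip comments and empty lines; an empty entry would match every referrer
 next if /^#/ || $_ eq '';
 $spamdoms{$_}=1;
}
close FH;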

my @spamdoms=keys %spamdoms;
%spamdoms=();

open FH, '<', $logtxt or die "$me:I could not open $logtxt ($!)";
while (<FH>)
{
 #pass comments and non-referrer lines through untouched
 if(/^#/) { print $_ ; next; }
 unless(/^http/) { print $_ ; next; }
 $foundspam=0;
 for my $spamdom (@spamdoms)
 {
  #blacklist entries are interpolated as regular expressions
  #match both http and https referrers
  if( /^https?:\/\/$spamdom/ || ( $bl_subdomains && /^https?:\/\/[a-z0-9A-Z\-\.]*\.$spamdom/ ) )
  {
   $crap++;
   $foundspam=1;
   last;
  }
 }
 #drop the line if it matched a blacklisted domain
 print $_ unless $foundspam;
}
close FH;

print "$me:removed $crap referrals\n" if($rlog);
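
Before replacing a log it can help to see which referrer lines the blacklist would catch. The following is a rough dry-run sketch in shell, not the script itself: preview_spam is a hypothetical helper that approximates the sub-domain matching above with grep -E, only looks at http:// referrers, and treats each blacklist entry as a regular expression, so its output may differ slightly from what the Perl script removes.

```shell
# Dry run: list referrer lines in an AWStats data file that match a blacklist.
# preview_spam is a hypothetical helper; it approximates the Perl script's
# sub-domain matching with grep -E and treats each entry as a regular expression.
preview_spam() {
    blacklist=$1
    datafile=$2
    # Trim whitespace, then drop comments and blank lines from the blacklist.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^#/d' -e '/^$/d' "$blacklist" |
    while IFS= read -r dom; do
        # Match the domain itself or any sub-domain of it.
        grep -E "^http://([a-zA-Z0-9.-]*\.)?$dom" "$datafile"
    done | sort -u
}

# Example:
# preview_spam /var/www/sites/example.com/awstats/lib/blacklist.txt awstats062013.example.com.txt
```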




     



