Cover V12, I02

feb2003.tar

Filtering/Replacing Advertisements with Squid and Apache

Dustin Anders

The proliferation of banner and pop-up ads can be quite annoying when surfing the Internet. In this article, I will describe a quick and easy way to block banner ads via Squid and Apache.

Prerequisites

The following applications are necessary for filtering ads in this manner:

Perl 5 or greater -- available at http://www.perl.com

Apache 1.3.26 -- available at http://www.apache.org

Squid 2.5 -- available at http://www.squid-cache.org

Installing these packages is not covered in this article; installation documents are available on each site to help you through the process. You will also need to download adfilter.pl and the hosts and filters files from http://www.unixfun.com/sysadmin/adfilter/ or from the Sys Admin Web site: http://www.sysadminmag.com.

Design Notes

In this configuration, Squid is installed on a Linux machine (running Red Hat 7.3) that serves as a gateway for a Windows 2000 machine. Squid has been configured to bind to port 8010, so in the workstation's browser configuration, specify the Squid machine and port 8010 as your HTTP proxy. See Figure 1 for the proxy settings configuration. You could also run Squid on the workstation itself and point the browser at the local address (127.0.0.1) on port 8010.

The process used for filtering Web ads is as follows:

1. Web requests sent to the Squid Web proxy are redirected to the Perl script adfilter.pl (described in this article).

2. If the requested URL is a .jpg, .bmp, or .png image file, the URL passes through the following checks:

  • A regular expression filter, which searches for directory paths that might indicate a banner ad, such as //.*/ads/.*
  • A hosts file lookup using a special hosts file that lists known Internet ad servers (available on the Web from a company called Smartin Designs).
3. If the URL is identified as an ad by either of the checks, the URL is redirected to a local Apache Web server, where a replacement graphic is substituted for the ad.
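
The three steps above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl script from Listing 1; the sample filter, the sample host, and the replacement URL are assumptions chosen to mirror the examples in this article.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample data; the real entries come from the filters and hosts files.
IMAGE_EXTENSIONS = (".jpg", ".bmp", ".png")
FILTERS = [re.compile(r"//.*/ads/.*")]
AD_HOSTS = {"123banners.com"}
REPLACEMENT = "http://127.0.0.1/"  # assumed URL of the local Apache server


def rewrite(url):
    """Return the replacement URL for a suspected ad image, else the URL unchanged."""
    parts = urlsplit(url)
    # Step 2: only requests for image files are checked at all.
    if not parts.path.lower().endswith(IMAGE_EXTENSIONS):
        return url
    # First check: regular expression filters applied to the whole URL.
    if any(pattern.search(url) for pattern in FILTERS):
        return REPLACEMENT
    # Second check: is the host name a known ad server?
    if (parts.hostname or "") in AD_HOSTS:
        return REPLACEMENT
    # Neither check matched, so the request passes through untouched.
    return url
```

Note that the extension test comes first, so a non-image request never reaches the more expensive checks.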

Squid Configuration

To begin, you should have Squid and Apache installed on your server. We will make a few additions to the Squid configuration file, squid.conf, which should be located in the conf directory of your Squid installation directory.

We will make use of Squid's redirect feature, which allows us to pass a few parameters to an application of our choosing. Below are the parameters that are passed:

URL -- The requested Uniform Resource Locator.

IP ADDRESS / FQDN -- The IP address or fully qualified domain name of the client that requested the page.

IDENT -- The identity of the user running the Web browser.

METHOD -- The request method, which is GET, POST, or HEAD.
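
Squid writes these four values to the redirect program as one whitespace-separated line per request, with the client address and name joined by a slash. The following sketch, in Python rather than the Perl of Listing 1, shows one way to split such a line; the class and function names are hypothetical, and the sample line is invented for illustration.

```python
from typing import NamedTuple


class RedirectRequest(NamedTuple):
    url: str
    client_ip: str
    client_fqdn: str
    ident: str
    method: str


def parse_request(line):
    """Split one redirector input line into its fields."""
    url, client, ident, method = line.split()[:4]
    # The client field has the form "ip/fqdn"; "-" stands for an unknown value.
    client_ip, _, client_fqdn = client.partition("/")
    return RedirectRequest(url, client_ip, client_fqdn, ident, method)


# Invented sample line in the format described above.
sample = "http://www.example.com/ads/top.jpg 192.168.1.20/- - GET\n"
request = parse_request(sample)
print(request.url, request.method)
```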

You can also specify how many child processes Squid is to fork for the redirect program. In this example, we redirect all URLs to a Perl script called adfilter.pl (see Listing 1). Below are the lines that need to be added to your squid.conf:

redirect_program /usr/local/squid/adfilter/adfilter.pl
redirect_children 10
Depending on how many machines use your gateway for Web surfing, and how heavily, you may need to increase the number of children. Those are the only changes that need to be made to squid.conf.

The adfilter.pl script requires Perl 5 or greater to be installed on the system. As discussed above, Squid passes several values to the redirect program; the script ignores all of them except the URL.

The script reads a list of regular expressions from a filters file and a list of hosts from a hosts file. It compares the requested URL to the filters file first. The comparison is done only for image files with the following extensions: .jpg, .bmp, and .png. You can check more extensions by changing the URL comparison statement.

If the filters comparison returns no match, the script compares the URL to the hosts file. If it finds a match in either file, it prints the replacement address defined on the same line as the matched entry. In this configuration, everything is simply redirected to 127.0.0.1. Apache will provide a bit of redirect magic of its own to allow us to insert an image for any request.
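
The read-and-reply cycle of a redirect program can be sketched as follows. This is an illustration in Python, not the Perl script from Listing 1. It assumes the Squid 2.5 convention that the program answers each request line with either a new URL or a blank line meaning "no change"; the demo_rewrite function is a hypothetical stand-in for the filters and hosts checks.

```python
import io


def serve(infile, outfile, rewrite):
    """Answer each request line with a rewritten URL, or a blank line for no change."""
    for line in infile:
        fields = line.split()
        url = fields[0] if fields else ""
        new_url = rewrite(url) if url else url
        # A blank reply tells Squid to fetch the original URL.
        outfile.write((new_url if new_url != url else "") + "\n")
        # Squid waits for each answer, so replies must not sit in a buffer.
        outfile.flush()


def demo_rewrite(url):
    # Hypothetical stand-in for the real checks.
    return "http://127.0.0.1/" if "/ads/" in url else url


# Demonstration with in-memory streams; a real redirect program would call
# serve(sys.stdin, sys.stdout, demo_rewrite) instead.
requests = io.StringIO(
    "http://www.example.com/ads/top.jpg 192.168.1.20/- - GET\n"
    "http://www.example.com/index.html 192.168.1.20/- - GET\n"
)
replies = io.StringIO()
serve(requests, replies, demo_rewrite)
print(replies.getvalue(), end="")
```

Flushing after every reply matters because each child process stays running and handles many requests over its lifetime.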

The filters file contains a list of regular expressions to match the URL against. The following regular expression is a sample from the file:

127.0.0.1 //.*/ads/.*
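
A sketch of how entries in this format might be loaded and applied is shown here, in Python rather than the Perl of Listing 1. The function names are hypothetical, and skipping blank lines and lines beginning with # is an assumption, not something the filters file is known to require.

```python
import re


def load_filters(lines):
    """Parse 'replacement pattern' lines into (replacement, compiled regex) pairs."""
    filters = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        replacement, pattern = line.split(None, 1)
        filters.append((replacement, re.compile(pattern)))
    return filters


def match_filters(url, filters):
    """Return the replacement of the first matching filter, or None."""
    for replacement, pattern in filters:
        if pattern.search(url):
            return replacement
    return None


filters = load_filters(["127.0.0.1 //.*/ads/.*\n", "\n", "# comment\n"])
print(match_filters("http://www.example.com/ads/top.jpg", filters))
```

Compiling each expression once at startup avoids recompiling it for every request.
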
The hosts file contains a list of hosts that have been determined to be ad related by a company called Smartin Designs. The hosts file is too long to include here, but can be downloaded from:

http://www.unixfun.com/sysadmin/adfilter/
or from the Sys Admin Web site.

A sample hosts entry is:

127.0.0.1 123banners.com
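
Because the hosts file is large, it helps to load it into a table keyed by host name so that each request costs one lookup rather than a scan of every entry. The sketch below is in Python, not the Perl of Listing 1; the function names are hypothetical, and the handling of # comments follows the usual hosts file convention rather than anything confirmed for this particular file.

```python
from urllib.parse import urlsplit


def load_hosts(lines):
    """Build a {host name: replacement address} table from hosts-file lines."""
    hosts = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or malformed entry
        address = fields[0]
        for name in fields[1:]:
            hosts[name.lower()] = address
    return hosts


def lookup(url, hosts):
    """Return the replacement address for the URL's host, or None."""
    host = urlsplit(url).hostname
    return hosts.get(host) if host else None


hosts = load_hosts(["# sample header\n", "127.0.0.1 123banners.com\n"])
print(lookup("http://123banners.com/img/a.jpg", hosts))
```
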
Apache Configuration

The Apache configuration is relatively simple. Here, we use Apache only to answer every request with an image file on the server. We could have changed the hosts/filters files to redirect requests to a fully qualified domain name and file. However, since we are using a regularly updated hosts file that redirects requests to 127.0.0.1, Apache must handle those requests.

The Apache configuration consists of the following line that performs the redirect:

AliasMatch ^(.*) /var/www/htdocs/1.png

AliasMatch allows us to specify a regular expression to compare to the URL. In this case, the expression matches every request to our server, so every URL that hits the Apache server is answered with a 5-byte image file called 1.png. This image contains my initials.

Final Product

After you have downloaded the filters file and the hosts file from http://www.unixfun.com/sysadmin/adfilter/, place them in a subdirectory called adfilter under the Squid installation directory. If Squid is not installed in /usr/local/squid, update the filter_dir variable at the top of adfilter.pl to reflect the proper location. With the files in place and the changes made to squid.conf and httpd.conf, restart Squid and Apache. After restarting Squid, you will notice that several additional processes were started, one for each redirect child. See Figure 2.

Assuming you created your own banner replacement image that was referenced in the changes to the httpd.conf, you should see that image when you go to a Web site, such as http://www.espn.com. See Figure 3 for an example of the output.

Updates

As one may expect, new ad servers are popping up daily. It makes sense to keep the hosts/filters files updated with the latest regular expressions/entries. The Smartin Designs site at:

http://www.smartin-designs.com/downloads.htm
provides a page where you can download the latest hosts file for ad sites. The site also provides a notification service when new files are released. If you would like to automate the update process, you can make use of Lynx by placing the following command in your system crontab:

/usr/bin/lynx -dump http://www.smartin-designs.com/downloads/full_127001.txt > /<ADFILTER DIR>/hosts
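
One caveat with the command above: the shell empties the hosts file before the download starts, so a failed download leaves the filter with no hosts at all. A possible safeguard, sketched here in Python under the assumption that the downloaded text is already in hand, is to check the data and then swap it into place in one step. The install_hosts function and its validity test are hypothetical.

```python
import os
import tempfile


def install_hosts(text, dest, min_entries=1):
    """Replace dest with the downloaded hosts data only if the data looks valid."""
    entries = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Reject empty downloads and anything that is not "address name" lines.
    if len(entries) < min_entries or any(len(e.split()) < 2 for e in entries):
        return False  # the existing file is left untouched
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by its owner only
    os.replace(tmp_path, dest)  # rename within one directory replaces the file in one step
    return True
```

The redirect children read the hosts file when they start, so Squid still has to be restarted or reconfigured for new entries to take effect.
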
Unfortunately, I have not found a site that provides a continually updated filter list for ads. Quite a few sites publish lists of patterns, but you will most likely need to build your own filter list.

Considerations

Currently, three types of images are checked to determine whether they are banner ads. This was done primarily because the hosts database is so large; checking every request slows browsing significantly.

You may want to define your own set of filters rather than using the hosts file, which currently contains 13,000 entries. Most of these entries can be eliminated and covered by a few simple regular expressions. For example, the hosts file contains a lot of ad1.server.com, ad2.server.com ... ad255.server.com. You could eliminate 254 entries with one regular expression. Or, you could cover every ad server of a certain type by simply specifying //ads.* in the filters file.
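
To make the consolidation concrete, the following Python sketch checks one candidate expression against the numbered host names mentioned above. The server.com names are the placeholders from the text, and the exact expression is my own assumption about how such a filter might be written.

```python
import re

# One pattern in place of ad1.server.com through ad255.server.com.
NUMBERED_AD_HOSTS = re.compile(r"//ad\d+\.server\.com/")
# The broader pattern suggested in the text.
ADS_PREFIX = re.compile(r"//ads.*")

numbered = [f"http://ad{n}.server.com/banner.jpg" for n in range(1, 256)]
covered = sum(1 for url in numbered if NUMBERED_AD_HOSTS.search(url))
print(covered)  # 255

# The narrow pattern leaves other hosts in the same domain alone.
print(bool(NUMBERED_AD_HOSTS.search("http://admin.server.com/logo.jpg")))  # False

# The broad pattern catches any host name that begins with "ads",
# including names that have nothing to do with advertising.
print(bool(ADS_PREFIX.search("http://ads.example.com/top.png")))   # True
print(bool(ADS_PREFIX.search("http://adsl.example.net/logo.png")))  # True
```

The last line shows the trade-off: the shorter the expression, the more hosts it covers, wanted or not.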

Another important note is that the potential exists for sites to be blocked inadvertently. For example, if ESPN's main title bar has an address of http://www.espn.com/banner/title.jpg, then the regular expression, //.*/banner/.*, will block the request. Entries in the hosts file will also cause some non-ad sites to be blocked.
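
One way to catch such cases early is to try a candidate expression against a handful of URLs you know to be legitimate before adding it to the filters file. The helper below is a hypothetical Python sketch, not part of adfilter.pl, and the second URL is invented for the example.

```python
import re


def false_positives(pattern, known_good_urls):
    """Return the known-good URLs that the candidate filter would block."""
    regex = re.compile(pattern)
    return [url for url in known_good_urls if regex.search(url)]


known_good = [
    "http://www.espn.com/banner/title.jpg",
    "http://www.espn.com/images/logo.jpg",
]
blocked = false_positives(r"//.*/banner/.*", known_good)
print(blocked)
```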

Dustin Anders, CISSP, is a Senior Network Security Consultant for a large security company. He has more than six years of experience in network and systems security. He spends the majority of his free time writing Perl and PHP scripts and playing with his son, Ian.