Compressing Web Content with mod_gzip and mod_deflate

Stephen Pierzchala

Cost reduction is a key component of every IT budget, and the cost of bandwidth is one item that comes under regular scrutiny. Content compression is one way to conserve bandwidth. With that in mind, this article covers some of the compression modules available for Apache: mod_gzip for Apache 1.3.x and 2.0.x, and mod_deflate for Apache 2.0.x.

Content Compression Basics

Most compression algorithms, when applied against a plain text file, can reduce its size by 70% or more, depending on the content in the file. (See the sidebar for an overview of compression levels.) When using compression algorithms, there is little difference between standard and maximum compression levels, especially when you consider the extra CPU time necessary to process these extra compression passes. This is particularly important when dynamically compressing Web content. Most software content compression techniques use a compression level of 6 (out of 9 possible levels) to conserve CPU cycles. The file size difference between level 6 and level 9 is usually so minimal that it's not worth the extra time.
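As a quick illustration, the standard gzip command-line utility shows how small the gap between the default level (6) and the maximum level (9) usually is -- a sketch, assuming a local file named page.html:

# compress at the default level (6) and at the maximum level (9)
gzip -6 -c page.html > page-6.gz
gzip -9 -c page.html > page-9.gz
# compare the resulting sizes
ls -l page.html page-6.gz page-9.gz
On typical HTML, the two compressed sizes will be nearly identical, while the level 9 pass consumes noticeably more CPU time.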

Content Compression of Web Content

Files identified by text/.* MIME types can be compressed before being placed on the wire, reducing the number of bytes transferred and improving performance at the same time. Testing has also shown that Microsoft Office, StarOffice/OpenOffice, and PostScript files can be GZIP-encoded for transport by the compression modules.

Some important MIME types that should not be GZIP-encoded are: application/x-javascript (external JavaScript files); application/pdf (PDF files); and image/.* (all image files). The caveat against JavaScript files is mainly due to bugs in browser software; these files are really text files, and compressing them for transport would otherwise improve overall performance. PDF and image files are already compressed, and attempting to compress them again will simply make them larger, as well as lead to potential rendering issues in browsers.

Before sending a compressed file to a client, the server must ensure that the client receiving the data understands the compressed format and can render it correctly. Browsers that understand compressed content send a variation of the following client request headers:

Accept-Encoding: gzip
Accept-Encoding: gzip, deflate
All current major browsers include some variation of this header with every request they send. (The exceptions to this rule are all versions of Microsoft Internet Explorer when the HTTP/1.1 settings are turned off; this seems to result from the erroneous belief that only HTTP/1.1 clients will request GZIP-encoded content.) For more information, see:

http://www.microsoft.com/technet/prodtechnol/iis/maintain/featusability/httpcomp.asp
If the server sees the header and chooses to provide compressed content, it will respond with the following pair of server response headers:

Content-Encoding: gzip
Content-Type: [insert appropriate MIME type]
These headers tell the receiving browser to first decompress the content, and then parse it normally or pass it to the appropriate helper application.
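A quick way to watch this negotiation in action -- a sketch, assuming curl is installed and a server is running on localhost -- is to request the same page with and without the Accept-Encoding header and compare the response headers:

# request without advertising compression support
curl -s -D - -o /dev/null http://localhost/index.html
# request advertising GZIP support; the response should now
# include a Content-Encoding: gzip header
curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" http://localhost/index.html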

The file size advantages of compressing content can easily be seen in a couple of examples -- one an HTML file (Table 1), the other a PostScript file (Table 2). Performance improvements will be discussed later in the article.

Configuring mod_gzip

The mod_gzip module is available for both Apache/1.3.x and Apache/2.0.x, and can be compiled into Apache as a DSO or as a static module. The Apache/1.3.x version is located at:

http://sourceforge.net/projects/mod-gzip/
and the Apache/2.0.x version is located at:

http://www.gknw.de/development/apache/httpd-2.0/unix/modules/
The compilation for a DSO is simple -- from the uncompressed source directory, perform the following steps as root:

make APXS=/path/to/apxs
make install APXS=/path/to/apxs
/path/to/apachectl graceful
The mod_gzip module must be loaded last in the module list because Apache/1.3.x processes content in module order, and compression is the final step performed before data is written to the wire. (Note that mod_gzip installs itself in the httpd.conf file, but it is commented out.)
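For reference, the lines added by the installer look similar to the following (module paths vary by installation, so treat these as a sketch):

# LoadModule gzip_module libexec/mod_gzip.so
# AddModule mod_gzip.c
Uncomment both lines and restart Apache to activate the module.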

A very basic configuration for mod_gzip in the httpd.conf would include:

mod_gzip_item_include mime ^text/.*
mod_gzip_item_include mime ^application/postscript$
mod_gzip_item_include mime ^application/ms.*$
mod_gzip_item_include mime ^application/vnd.*$
mod_gzip_item_exclude mime ^application/x-javascript$
mod_gzip_item_exclude mime ^image/.*$
mod_gzip_item_exclude file \.(?:exe|t?gz|zip|bz2|sit|rar)$
This allows Microsoft Office and PostScript files to be GZIP-encoded, while not compressing PDF files. PDF files should not be GZIP-encoded because they are compressed in their native format, and compressing them leads to problems when displaying the files in Adobe Acrobat Reader. Paranoid systems administrators may want to exclude PDF files explicitly:

mod_gzip_item_exclude mime ^application/pdf$
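Remember that mod_gzip also has to be switched on; its master directive can sit alongside the include and exclude rules:

mod_gzip_on Yes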
Configuring mod_deflate

The mod_deflate module for Apache/2.0.x is included with the source for this server. This makes compiling it into the server very simple:

./configure --enable-modules=all --enable-mods-shared=all --enable-deflate
make
make install
With mod_deflate for Apache/2.0.x, the GZIP-encoding of documents can be enabled in one of two ways: explicit exclusion of files by extension, or explicit inclusion of files by MIME type. Both methods are specified in the httpd.conf file.

Explicit exclusion:

SetOutputFilter DEFLATE
DeflateFilterNote ratio
SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
SetEnvIfNoCase Request_URI \.(?:exe|t?gz|zip|bz2|sit|rar)$ no-gzip dont-vary
SetEnvIfNoCase Request_URI \.pdf$ no-gzip dont-vary
Explicit inclusion:

DeflateFilterNote ratio
AddOutputFilterByType DEFLATE text/*
AddOutputFilterByType DEFLATE application/ms* application/vnd* application/postscript
In the explicit exclusion method, note the same exclusions as in the mod_gzip configuration, namely images and PDF files.
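If compression should apply only to part of the URL space, the exclusion directives can be scoped with a standard Apache container -- a sketch using the same rules as above:

<Location />
    SetOutputFilter DEFLATE
    SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
    SetEnvIfNoCase Request_URI \.(?:exe|t?gz|zip|bz2|sit|rar)$ no-gzip dont-vary
    SetEnvIfNoCase Request_URI \.pdf$ no-gzip dont-vary
</Location>
Narrowing the path (for example, <Location /docs>) limits compression to that portion of the site.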

Compressing Dynamic Content

If your site uses dynamic content (XSSI, CGI, etc.), there is no need to do anything special to compress the output from these modules. Because mod_gzip and mod_deflate process all outgoing content before it is placed on the wire, all content from Apache that matches either the MIME-types or the file extensions mapped in the configuration directives will be compressed.

The output from PHP, the most popular dynamic scripting language for Apache, can also be compressed, in one of three ways:

  • Using the built-in output handler, ob_gzhandler
  • Using the built-in ZLIB compression
  • Using one of the Apache compression modules

Configuring PHP's built-in compression is simply a matter of compiling PHP with the --with-zlib configure flag and then editing the php.ini file.
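For example, a minimal PHP build with ZLIB support (all other configure options omitted):

./configure --with-zlib
make
make install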

Output buffer method:

output_buffering = On
output_handler = ob_gzhandler
zlib.output_compression = Off
ZLIB method:

output_buffering = Off
output_handler =
zlib.output_compression = On
The output buffer method produces marginally better compression, but either will work. The output buffer handler, ob_gzhandler, can also be applied on a script-by-script basis if you do not want to enable compression across your whole site.
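For the script-by-script approach, register the handler before the script sends any output -- a minimal sketch:

<?php
// Must be called before any output is sent; ob_gzhandler checks the
// browser's Accept-Encoding header before compressing.
ob_start("ob_gzhandler");
echo "<html><body>This page is compressed for capable browsers.</body></html>";
?>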

If you do not want to (or cannot) reconfigure PHP with ZLIB enabled, the Apache compression modules will compress the content generated by PHP. I have configured my server so that the Apache modules handle all of the compression, and so that all pages are compressed in a consistent manner, regardless of their origin.

Caching Compressed Content

Can compressed content be cached? The answer is an unequivocal yes. With mod_gzip and mod_deflate, Apache sends the "Vary" header, indicating to caches that this object differs from other requests for the same object based on certain criteria -- User-Agent, Character Set, etc. When a compressed object is received by a cache, it will note that the server returned a Vary: Accept-Encoding response, indicating that this response was generated based on the request containing the Accept-Encoding: gzip header.
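For example, a compressed response from either module carries headers similar to these (values illustrative):

HTTP/1.1 200 OK
Content-Type: text/html
Content-Encoding: gzip
Vary: Accept-Encoding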

This does lead to a situation where a cache can store two copies of the same document, one compressed and one uncompressed. This is a design feature of HTTP/1.1, and allows clients with and without the ability to receive compressed content to still benefit from the performance enhancements gained from local proxy caches.

Logging Compression Results

When comparing the logging methods of mod_gzip and mod_deflate, there really is no comparison. The mod_gzip logging is very robust and configurable, and is based on the Apache log format. This allows the mod_gzip logs to be configured in pretty much any way you want for analysis. The default log formats provided when the module is installed are shown below:

# LogFormat "%h %l %u %t \"%r\" %>s %b mod_gzip: \
  %{mod_gzip_compression_ratio}npct." common_with_mod_gzip_info1
LogFormat "%h %l %u %t \"%r\" %>s %b mod_gzip: %{mod_gzip_result}n \
  In:%{mod_gzip_input_size}n Out:%{mod_gzip_output_size}n \
  Ratio:%{mod_gzip_compression_ratio}npct." common_with_mod_gzip_info2
# LogFormat "%{mod_gzip_compression_ratio}npct." mod_gzip_info1
# LogFormat "%{mod_gzip_result}n In:%{mod_gzip_input_size}n \
  Out:%{mod_gzip_output_size}n Ratio:%{mod_gzip_compression_ratio}npct." mod_gzip_info2
The logging allows you to see the file's size prior to compression, the size after compression, and the compression ratio. After tweaking the log formats to meet your specific configuration, these can then be added to a logging system by specifying a CustomLog in the httpd.conf file:

CustomLog logs/gzip.log common_with_mod_gzip_info2
# CustomLog logs/gzip.log mod_gzip_info2
Logging in mod_deflate is limited to one configuration directive, DeflateFilterNote, and this is intended to be added to an access_log file. Be careful about doing this in your production logs, because it may cause some log analyzers to have issues when examining your files. It's best to start by logging compression ratios to a separate file:

DeflateFilterNote ratio

LogFormat '"%r" %b (%{ratio}n) "%{User-agent}i"' deflate
CustomLog logs/deflate_log deflate
Performance Improvement from Compression

How much improvement can you see with compression? The difference in measured download times on a very lightly loaded server indicates that the time to download the Base Page (the initial HTML file) improved by between 1.3 and 1.6 seconds across a very slow connection when compression was used. See Figure 1.

There is a slight penalty in the time it takes the server to respond to a client requesting a compressed page. Measurements show that the median response time averaged 0.23 seconds for the uncompressed page and 0.27 seconds for the compressed page. However, most Web server administrators should be willing to accept a 0.04-second increase in response time to achieve a 1.5-second improvement in file transfer time.

Web pages are not made up of HTML alone, so how do improved HTML (and CSS) download times affect overall performance? Figure 2 shows that overall download times for the test page were 1-1.5 seconds faster when the HTML files were compressed.

To further emphasize the value of compression, I ran a test on a Web server to see the average compression ratio when requesting a very large number of files. Furthermore, I wanted to determine the effect on server response time when requesting large numbers of compressed files simultaneously. There were 1952 HTML files in the test directory, and I checked the results using curl across my local LAN. (The files were the top-level HTML files from the Linux Documentation Project. They were installed on an Apache 1.3.27 server running mod_gzip. The minimum file size was 80 bytes, and the maximum was 99419 bytes.) See Table 3.

As expected, the "First Byte" download time was slightly higher with the compressed files than it was with the uncompressed files. But this difference was in milliseconds, and is hardly worth mentioning in terms of on-the-fly compression. It is unlikely that any user, especially dial-up users, would notice this difference in performance.

That the delivered data was transformed to 43% of the original file size should make any Web administrator sit up and take notice. The compression ratio for the test files ranged from no compression for files that were less than 300 bytes, to 15% of original file size for two of the Linux SCSI Programming HOWTOs. Compression ratios do not increase in a linear fashion when compared to file size; rather, compression depends heavily on the repetition of content within a file to gain its greatest successes. The SCSI Programming HOWTOs have a great deal of repeated characters, making them ideal candidates for extreme compression.

Smaller files also did not compress as well as larger files, for the same reason: fewer bytes mean a lower probability of repeated sequences, and therefore a lower compression ratio. See Table 4.

The data shows that compression works best on files larger than 5000 bytes; beyond that size, average compression gains are smaller, unless a file has a large number of repeated characters. Some people argue that compressing files below a certain size is a waste of CPU cycles. If you agree with these folks, the 5000-byte value should be a good starting point for a compression floor (mod_gzip makes this easy to set, as shown below). I, however, compress everything that comes off my servers because I consider myself an HTTP overclocker, trying to squeeze every last bit of download performance out of the network.
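For those who prefer the floor, mod_gzip provides a directive for it; a minimal sketch using the 5000-byte value:

mod_gzip_minimum_file_size 5000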

Conclusion

With a few simple commands and a little bit of configuration, an Apache Web server can be set up to deliver a large amount of content in a compressed format. The benefits are not limited to static pages; dynamic pages generated by PHP and other content generators can be compressed with the Apache compression modules as well. When added to other performance-tuning mechanisms and appropriate server-side caching rules, these modules can substantially reduce bandwidth usage at very low cost.

Stephen Pierzchala is Senior Diagnostic Analyst for Keynote Systems in San Mateo, California. His focus is on analyzing Web performance data and evangelizing on Web performance topics, such as Content Compression, Caching, and Persistent Connections.