Speeding up Your Perl Programs

Randal L. Schwartz

How fast does your Perl run? I hope that, most of the time, the answer is "fast enough". But sometimes it may not be. How do we speed up a Perl program? How do we know what's making it unacceptably slow?

The first rule of speed optimization is "don't". Certainly, don't write things that waste time, but first aim for clarity and maintainability. With today's processor speeds far exceeding the "supercomputers" of yesterday, sometimes that's "fast enough".

But how do we know whether something is "fast enough"? The first easy cut is just to check total run time. That's usually as simple as putting "time" in front of your program invocation, and looking at the result (which may vary depending on your operating system and choice of shell). For example, here's a run of one of my frequently invoked programs, preceding the command (get.mini) with time:

[localhost:~] merlyn% time get.mini
authors/01mailrc.txt.gz ... up to date
modules/02packages.details.txt.gz ... up to date
modules/03modlist.data.gz ... up to date
67.540u 3.290s 1:40.62 70.3%    0+0k 1585+566io 0pf+0w
In this output, 67.540u is user CPU time, 3.290s is system CPU time, 1:40.62 is the elapsed wall-clock time, and 70.3% is the ratio of CPU time to elapsed time. So, this tells me that this program took about 70 CPU seconds over a 100-second timespan, essentially doing nothing useful for me except verifying that my local mirror of the CPAN is up to date. (Generally, to get sensible numbers, I have to do this on an otherwise idle system.)

Hmm. I wonder where it's actually spending all of its time? I can get the next level of information by invoking the built-in Perl profiler on the script. Simply include -d:DProf as a Perl command switch, and Perl will automatically record where the program is spending its time in a file called tmon.out in the current directory:

[localhost:~] merlyn% perl -d:DProf `which get.mini`
authors/01mailrc.txt.gz ... up to date
modules/02packages.details.txt.gz ... up to date
modules/03modlist.data.gz ... up to date
[localhost:~] merlyn% ls -l tmon.out
-rw-r--r--    1 merlyn   staff    44991232 Sep  8 08:46 tmon.out
[localhost:~] merlyn%
(I have to use which here because perl won't look in my shell's search path for the script.) Execution under the profiler slows the program down a bit, but otherwise doesn't affect any functionality. I now need to summarize the raw data in tmon.out with dprofpp (also included with Perl):

[localhost:~] merlyn% dprofpp
Total Elapsed Time = 99.50810 Seconds
  User+System Time = 70.79810 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 14.9   10.60 12.739 545807   0.0000 0.0000  URI::_generic::authority
 14.8   10.54 12.002 461833   0.0000 0.0000  URI::_generic::path
 13.1   9.296 19.853 293900   0.0000 0.0001  URI::new
 12.8   9.108 39.082  83971   0.0001 0.0005  URI::_generic::abs
 9.82   6.953 12.677 461842   0.0000 0.0000  URI::_scheme
 8.16   5.778  6.024 293900   0.0000 0.0000  URI::_init
 7.58   5.368 91.054  41987   0.0001 0.0022  main::my_mirror
 6.39   4.522  4.521 293900   0.0000 0.0000  URI::implementor
 5.78   4.093 32.267  41984   0.0001 0.0008  URI::_generic::rel
 5.07   3.588  3.588 251910   0.0000 0.0000  URI::_generic::_check_path
 4.26   3.016  3.016  83976   0.0000 0.0000  File::Spec::Unix::canonpath
 3.92   2.775 15.390  83968   0.0000 0.0002  URI::http::canonical
 3.48   2.465 14.186 377874   0.0000 0.0000  URI::scheme
 3.09   2.185  9.973  83968   0.0000 0.0001  URI::_server::canonical
 2.25   1.596  3.419  41988   0.0000 0.0001  File::Spec::Unix::catdir
[localhost:~] merlyn%
Now, this is some interesting information that I can use. The run took about 70 CPU seconds, just as before, so the profiling didn't significantly alter the execution. The first routine listed, URI::_generic::authority, was called over half a million times, contributing about 10 CPU seconds to the total. (There are other switches to dprofpp, but the default output seems to be the most useful most of the time.)
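For example, dprofpp -l re-sorts the report by number of calls instead of by exclusive time, dprofpp -I reports each routine's time inclusive of the routines it calls, and dprofpp -O 30 shows the top 30 routines rather than the default 15.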

If we could make URI::_generic::authority twice as fast, we'd save about 5 seconds on the run. Making URI::new twice as fast would also save about 5 seconds, even though it's called only about half as often. Generally, the important thing to note here is the total CPU time for a routine, not the number of times it is invoked or the CPU time spent on an individual invocation. However, I do find it interesting that I had to create 293,900 URI objects; perhaps some redesign of my algorithm could avoid that (one hypothetical approach is sketched below). Good algorithm design is important, and is both a science and an art.

The other thing to note is that most of the routines that are burning CPU are in libraries, not in my code. So, without changing the code of a library, I need to call library routines more efficiently or less frequently if I want to speed this up.
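For example (purely a hypothetical sketch, not code that is actually in get.mini), if many of those 293,900 URI objects are being built from the same handful of strings, a small cache in front of URI->new would cut down on constructor calls, provided the rest of the program treats the cached objects as read-only:

use URI;

my %uri_cache;   # maps URI text to an already-constructed object

sub cached_uri {
  my $text = shift;
  # construct each distinct URI only once, then hand back the same object
  $uri_cache{$text} = URI->new($text) unless exists $uri_cache{$text};
  return $uri_cache{$text};
}
Whether that pays off depends on how often the same text shows up, which is exactly the kind of thing worth measuring rather than guessing about.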

We have the bigger picture to consider. I invoke this program maybe a dozen times a day. If I worked hard to reduce it from 70 CPU seconds to 35, I'd save 35 seconds a run, or roughly 7 CPU minutes a day. How long would I have to work to optimize it to that level? And how many times would I have to invoke the program before the value of the (usually otherwise idle) CPU time saved adds up to my billing rate for the hours spent optimizing?

This is why we usually don't worry about speed. CPUs are cheap. Programmer time is expensive.

But let's push it the other way for a minute. Suppose this program had to be invoked once a minute, continuously. At 70 CPU seconds per run, we've just run out of processor (not to mention that the elapsed wall-clock time per run was already longer than a minute). So we're forced to optimize the program, buy a bigger machine, or run the program on multiple machines (if possible).

Armed with the data from the profiling run, we might want to tackle the URI::new routine. But rather than tweaking the code a bit, and re-running the program overall, it's generally more effective to benchmark various versions of the code for given inputs, in isolation from the rest of the program.

For example, suppose we wanted to speed up the part of the routine that separates the URI's scheme from the rest of the string. Three possible strategies come to mind initially: a regular expression match, a split, or an index/substr combination, implemented roughly as follows:

sub re_match {
  my $str = "http://www.stonehenge.com/perltraining/";
  my ($scheme, $rest) = $str =~ /(.*?):(.*)/;
}
sub split_it {
  my $str = "http://www.stonehenge.com/perltraining/";
  my ($scheme, $rest) = split /:/, $str, 2;
}
sub index_substr {
  my $str = "http://www.stonehenge.com/perltraining/";
  my $pos = index($str, ":");
  my $scheme = substr($str, 0, $pos);   # everything before the colon
  my $rest = substr($str, $pos+1);      # everything after the colon
}
We now have three candidates. Which one is fastest? To determine that, we can use the Benchmark module that comes with Perl, as follows:

use Benchmark qw(cmpthese);
my $URI = "http://www.stonehenge.com/perltraining/";
sub re_match {
  my $str = $URI;
  my ($scheme, $rest) = $str =~ /(.*?):(.*)/;
}
sub split_it {
  my $str = $URI;
  my ($scheme, $rest) = split /:/, $str, 2;
}
sub index_substr {
  my $str = $URI;
  my $pos = index($str, ":");
  my $scheme = substr($str, 0, $pos);
  my $rest = substr($str, $pos+1);
}
cmpthese(-1, {
  re_match => \&re_match,
  split_it => \&split_it,
  index_substr => \&index_substr,
});
When run, this results in:

                 Rate re_match split_it index_substr
re_match     131022/s       --     -37%         -46%
split_it     208777/s      59%       --         -14%
index_substr 242811/s      85%      16%           --
The "-1" on the cmpthese call says "run each of these for roughly 1 CPU second". Note that index_substr seems to be the clear winner here, even though there are more things for me to type and get correct.

One thing to watch out for in benchmarking is optimizing for atypical data. For example, the URI I picked is probably typical, but what happens if the URL is longer?

my $URI = "http://www.stonehenge.com/merlyn/" .
  "Pictures/Trips/2003/03-06-PerlWhirlMacMania/" .
  "Day-0-Pearl-Harbor/?show=14";
Now the results show that index_substr pulls even further ahead of the regex match:

                 Rate re_match split_it index_substr
re_match     117108/s       --     -41%         -49%
split_it     199110/s      70%       --         -13%
index_substr 229682/s      96%      15%           --
Given those two extremes, I'd say that index_substr is probably the right tool for this particular task, all other things being equal.
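Rather than hand-editing the benchmark for every input I want to try, I could also wrap the comparison in a loop over a few representative URIs. Here's a rough sketch; the first two sample strings are the ones used above, while the ftp.example.com one is made up just to throw a different scheme into the mix:

use Benchmark qw(cmpthese);
# a few sample inputs; real ones would come from whatever the program
# actually sees in production
my @samples = (
  "http://www.stonehenge.com/perltraining/",
  "http://www.stonehenge.com/merlyn/" .
    "Pictures/Trips/2003/03-06-PerlWhirlMacMania/" .
    "Day-0-Pearl-Harbor/?show=14",
  "ftp://ftp.example.com/pub/CPAN/",
);
for my $URI (@samples) {
  print "== $URI ==\n";
  cmpthese(-1, {
    re_match => sub { my ($scheme, $rest) = $URI =~ /(.*?):(.*)/ },
    split_it => sub { my ($scheme, $rest) = split /:/, $URI, 2 },
    index_substr => sub {
      my $pos = index($URI, ":");
      my $scheme = substr($URI, 0, $pos);
      my $rest = substr($URI, $pos + 1);
    },
  });
}
If one approach wins across the whole spread of inputs, I can choose it with much more confidence.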

Armed with Perl's built-in profiler and the Benchmark module, I can usually fine-tune my routines and get the speed I want out of my Perl programs. If that's not enough, I could take those frequently called expensive routines and code them in a lower-level language instead. But usually that won't be needed. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl on Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.