Thursday, January 24, 2013

The grep Project

Background

J. Wolfgang Goerlich (@jwgoerlich) and a few others had a discussion on Twitter a while back about a post by Mike Haertel, in which Mike explained why GNU grep is so fast. Wolfgang covers the rest of the details, along with his tests of GNU grep for Windows vs. Microsoft PowerShell 3, on his blog.

Besides suggesting more test iterations to remove outliers, I became curious, based on his results, about what else goes into answering the fundamental question of "How can I quickly find the string I'm looking for?" So I decided to test on OS X, and to compare several versions of grep along with pash, an open source emulator for PowerShell written with Mono. I liked the idea of a project that works, as we always should, to bring science into information security.

Here are my preliminary results. I am working on the pash tests, but pash appears to be a languishing project at this point, and I needed to build a Mono development environment, teach myself some C#, and figure out a few things in the IDE. So check back for updates on the real question of pash vs. PowerShell.

What I can say, based on Wolfgang's results vs. mine, is that it doesn't much matter to me whether PowerShell is faster than GNU grep on Windows: on OS X, either grep smokes both of them.

Test Environment

I used the same test files that Wolfgang did, downloading them from the reference in his blog. His download included scripts to run the tests, but they were a Microsoft thing, so I wrote my own bash shell script instead. I'll reference that at the end of this entry; it's nothing fancy. Wolfgang used Diagnostics.Stopwatch; I substituted the shell's time command in mine.

Just as Wolfgang did, I ran each test for 7 iterations, dropped the min and max outliers, and averaged the remaining five. My system is an iMac 3.4 GHz i7 with 16 GB of RAM, running OS X 10.8.2. The filesystem is Mac OS Extended (Journaled) with full disk encryption via FileVault 2 (256-bit AES).
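For anyone who wants to do the drop-the-outliers arithmetic in the shell rather than by hand, here is a minimal sketch. The function name trimmed_mean is mine, not part of the test script below, and it assumes you have already collected the timings as plain numbers (milliseconds, say).

```shell
#!/bin/bash
# trimmed_mean (hypothetical helper): drop the single smallest and single
# largest value, then average whatever is left.
# Arguments: three or more numeric timings.
trimmed_mean() {
    if [ "$#" -lt 3 ]; then
        echo "trimmed_mean: need at least 3 values" >&2
        return 1
    fi
    # One value per line, sorted numerically, so the first line is the
    # minimum and the last line is the maximum.
    printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
        { v[NR] = $1 }
        END {
            sum = 0
            for (i = 2; i < NR; i++) sum += v[i]   # skip first and last
            printf "%.2f\n", sum / (NR - 2)
        }'
}
```

With seven made-up values, trimmed_mean 10 20 30 40 50 60 70 drops 10 and 70 and prints 40.00. Those numbers are purely illustrative, not measurements.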

BSD grep used (intrinsic to OS X) was grep (BSD grep) 2.5.1-FreeBSD. I also compiled GNU grep from source, version 2.14, compiled with gcc i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1.

Results

The following graph charts the same data points as Wolfgang's test; I ensured the x-axis and y-axis matched his.
[Figure: chart of results of BSD and GNU grep on OS X.]
Note that, just as in Wolfgang's test, my times are in milliseconds (ms). I went back to Twitter and asked him to confirm, because the speed differential between Windows 2008 R2 SP1 and OS X is huge in this case. It essentially means that GNU grep on OS X will get you the answer you seek in a large file 984 times faster than on Windows, and 173 times faster than PowerShell's Select-String.

In the interest of science, I will also note that GNU grep lagged on the first iteration of the larger file searches, creating an outlier: in the 100,000-line file, for example, the first run took 319 ms vs. an average of 19 ms. The subsequent six runs, however, were all within a millisecond of 19 ms.
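One plausible explanation for that first-run lag is a cold filesystem cache, though that is my guess and not something I verified. If that is the cause, an untimed warm-up run before the timed ones should make it disappear. Here is a rough sketch of that idea; elapsed_ms and timed_runs are hypothetical helper names, and the sketch leans on bash's time keyword with TIMEFORMAT to get wall-clock time.

```shell
#!/bin/bash
# elapsed_ms (hypothetical helper): print the wall-clock time of a command
# in whole milliseconds, using bash's built-in time keyword.
elapsed_ms() {
    local TIMEFORMAT='%3R'   # real time only, three decimal places
    local secs
    # The command's own output is discarded; only the timing is captured.
    secs=$( { time "$@" > /dev/null 2>&1; } 2>&1 ) || true
    # "0.019" (or "0,019" in some locales) becomes "0019", then 19.
    local digits=${secs//[.,]/}
    echo $((10#$digits))
}

# timed_runs (hypothetical helper): one untimed warm-up run, then N timed runs.
# Usage: timed_runs N command [args...]
timed_runs() {
    local runs=$1
    shift
    "$@" > /dev/null 2>&1 || true   # warm-up run, deliberately not timed
    local i
    for ((i = 0; i < runs; i++)); do
        elapsed_ms "$@"
    done
}
```

Something like timed_runs 7 /usr/local/bin/grep key somefile would then print seven timings, one per line. If the first of those is still an outlier, the cache theory is probably wrong.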

As soon as I get pash working, I will update with additional results for comparison.

Parting Thought

Of course, in science it's not enough to produce results and just let them lie there like wet noodles. It is useful to then derive meaning from them, form new hypotheses, and test further. One thought that occurred to me, to explain speed differentials between what are essentially near-identical computer systems, is the filesystem itself.

It further occurred to me that I had run my tests using the hard drive in the iMac. I wondered whether the result times would differ if, on the same system, I ran the tests from my Pegasus R4 RAID 5 cabinet, connected via Thunderbolt. My wager is that, since the file can be read faster, the results should be faster. But it could also be that the 16 GB of RAM has something to do with it, and my Mac isn't having to page as much. So further tests could monitor I/O, with iostat for example, while the grep is run.
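I have not run this yet, but the iostat idea could be sketched roughly as follows: start a monitor in the background, run the grep, then stop the monitor and look at the log. The function name with_monitor and the MONITOR_CMD variable are my own inventions for illustration. The default assumes the OS X form iostat -w 1 (one sample per second); on other systems the flags differ, so adjust accordingly.

```shell
#!/bin/bash
# MONITOR_CMD (hypothetical setting): the command that samples I/O.
# Assumption: OS X style "iostat -w 1"; change this for your system.
MONITOR_CMD=(iostat -w 1)

# with_monitor (hypothetical helper): log MONITOR_CMD output to a file
# while another command runs.
# Usage: with_monitor logfile command [args...]
with_monitor() {
    local log=$1
    shift
    "${MONITOR_CMD[@]}" > "$log" 2>&1 &   # start sampling in the background
    local pid=$!
    local status=0
    "$@" > /dev/null || status=$?         # run the command being measured
    kill "$pid" 2>/dev/null || true       # stop the monitor
    wait "$pid" 2>/dev/null || true
    return "$status"
}
```

For example, with_monitor iostat.log /usr/bin/grep key somefile would leave the samples in iostat.log. On a file that greps in 19 ms, a one-second sampling interval will catch very little, so this is more useful on much larger files or with the whole seven-run loop wrapped inside it.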

testGrep.sh

And here is my test script. If you feel it can be improved, by all means write your own. I will only say that part of my desire to work on this, besides just for the fun of it, and intellectual curiosity, is that it has been more than 5 years since I wrote a shell script; so I'm a little rusty, just sayin'.

#!/bin/bash
#
#  runs comparative tests of built-in BSD grep vs. GNU grep
#
#  24-jan-2013 / mboltz
#

# declare and set the number of times we run the test for SCIENCE!
iterations=7 # how many times we run
runcount=0   # start the count of runs at 0

# set some other variables so we don't have to delve further to alter tests
grepFile=$2    # the file we are using (from second arg)
grepString=key # the string we are looking for in the test file(s)

# print usage if either argument is missing
if [ -z "$1" ] || [ -z "$2" ]
then
   echo "Usage:  testGrep.sh [ bsd | gnu ] <filename>"
   exit 1
fi

# set which grep we're using; bail out on anything we don't recognize
case "$1" in
   gnu) grep=/usr/local/bin/grep ;;
   bsd) grep=/usr/bin/grep ;;
   *)   echo "Unknown grep '$1': expected bsd or gnu" >&2
        exit 1 ;;
esac

# run the tests
while [ "$runcount" -lt "$iterations" ]
do
   echo "$grep"
   # quote everything so filenames with spaces don't break the test
   time "$grep" "$grepString" "$grepFile" > /dev/null
   runcount=$((runcount + 1))
done

So please feel free to run your own versions, test and verify. Suggestions for additional interesting tests, and constructive discussion about the results are welcome.