Friday, June 01, 2007

Fun with sort(1)

Some unix utilities give it a bad name. A prime culprit is sed(1): just read the man(1) page and you'll understand why. I think sort(1) is pretty abstruse too. I've been using it to manipulate log files which monitor market data being pushed in and out of wombat.

The trouble with log files is that they are usually full of everything, which is fine if you have the time and patience to extract the information you require. But now that we're shoving hundreds of trades through the algo system, which generates hundreds of thousands of log messages, I can no longer use vi(1), the unix editor, to view them, as it runs out of space for the temp file. This means resorting to all sorts of sed/awk/grep nonsense to extract the info we need. The requirement for this embryonic scriptette was to group entries by their suffix in alphabetical order, then sort them numerically ascending within each suffix. Here's a script which does the job. The input data looks like this:

10:23:34.323 : 5 3, 2760.MAIN-EXC.Dx.LT {[1]=24 [2]=2 [3]=7360 [4]=36.730340 [5]=28.032340 [6]=2007-05-29 09:11:00Z}
10:23:34.541 : 6 3, 2760.MAIN-EC.DM.SS {[1]=24 [2]=1 [3]=7260 [4]=34.730000 [5]=28.000000 [6]=2007-05-29 10:23:34.095576000Z}

All 598344 lines of it. The first line of the script sorts on the suffix field ("LT" and "SS" above) and gives us the list of distinct suffixes that we need to process:

FIELDS="`sort -u -t '.' -k 5,5.2 MarketDataServer0.log | sed 's/.*\.\(.*\) {.*/\1/'`"
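To see what that key specification actually picks out, here is a small self-contained check run against just the two sample lines above. It is only a sketch: the scratch file and the SAMPLE and SUFFIXES variable names are mine, not part of the real script.

```shell
# Put the two sample lines into a scratch file (SAMPLE is an illustrative name).
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
10:23:34.323 : 5 3, 2760.MAIN-EXC.Dx.LT {[1]=24 [2]=2 [3]=7360 [4]=36.730340 [5]=28.032340 [6]=2007-05-29 09:11:00Z}
10:23:34.541 : 6 3, 2760.MAIN-EC.DM.SS {[1]=24 [2]=1 [3]=7260 [4]=34.730000 [5]=28.000000 [6]=2007-05-29 10:23:34.095576000Z}
EOF

# With '.' as the separator, field 5 begins with the two-letter suffix,
# so -k 5,5.2 limits the sort key to exactly those two characters.
# -u keeps one line per distinct key; sed then strips everything but the suffix.
SUFFIXES=$(sort -u -t '.' -k 5,5.2 "$SAMPLE" | sed 's/.*\.\(.*\) {.*/\1/')

echo $SUFFIXES   # prints: LT SS
rm -f "$SAMPLE"
```

The dots inside the braces (36.730340 and friends) don't get in the way, because the sed pattern has to find the " {" after the last dot it consumes, which pins it to the dot just before the suffix.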

Now we create an empty file called out:

> out

for CODE in $FIELDS
do
    sed -n '/.*\.'"$CODE"' {.*/p' MarketDataServer0.log | sort -n -t '>' -k 2,2 >> out
done

For each "code", we cut out the matching entries with sed, then pass them to our sort command, which uses the > as a field delimiter and sorts numerically on the second field. Ugly but necessary. No error handling or parameter passing yet, but this saves a whole lot of pain. Looks painful? Sure, but it's the sort of thing you just can't do on Windows (well, not without Cygwin anyway).
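As a rough sketch of where the parameter passing and error handling might go, the same steps could be wrapped in a shell function. The function name sort_by_suffix and its two arguments (log file, output file) are my own invention for illustration; the sed and sort commands are unchanged from the script above.

```shell
# sort_by_suffix LOGFILE OUTFILE
# Hypothetical wrapper: same sed/sort pipeline as above, with the file
# names passed in as arguments and a basic readability check.
sort_by_suffix() {
    log=$1
    out=$2

    if [ ! -r "$log" ]; then
        echo "cannot read log file: $log" >&2
        return 1
    fi

    # Start with an empty output file; give up if it can't be created.
    : > "$out" || return 1

    # One pass per distinct suffix, in alphabetical order.
    for code in $(sort -u -t '.' -k 5,5.2 "$log" | sed 's/.*\.\(.*\) {.*/\1/')
    do
        sed -n '/.*\.'"$code"' {.*/p' "$log" | sort -n -t '>' -k 2,2 >> "$out"
    done
}
```

It would be called as sort_by_suffix MarketDataServer0.log out. It still reads the log once per suffix, so it is no faster than the original, just a bit harder to misuse.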

1 comment:

Anonymous said...

Just a minor word - you mentioned Cygwin for Windows. I'm not a fan of Cygwin due to its size and the fact that it's difficult to use from batch files (or at least it used to be, last time I tried a couple of years ago).

Instead I use http://sourceforge.net/projects/unxutils which are native windows builds of most popular unix tools (sed, awk, etc). They don't need DLLs, they work fine in batch files, etc.

sford