Friday, June 01, 2007

Fun with sort(1)

There are some unix utilities which give unix a bad name - a prime culprit is sed(1) - just read the man(1) page and you'll understand why. I think sort(1) is pretty abstruse too - I've been using it to manipulate log files which record market data being pushed in and out of wombat.

The trouble with log files is that they are usually full of everything - which is fine if you have the time or patience to extract the information you require. However, now that we're shoving hundreds of trades through the algo system, generating hundreds of thousands of log messages, I can no longer use vi(1) - the unix editor - to view them, as it runs out of space for its temp file. This means resorting to all sorts of sed/awk/grep nonsense to extract the info we need. The criteria for this embryonic scriptette were to order entries alphabetically by suffix, then numerically ascending within each suffix. Here's a script which does the job. The input data looks like this:

10:23:34.323 : 5 3, 2760.MAIN-EXC.Dx.LT {[1]=24 [2]=2 [3]=7360 [4]=36.730340 [5]=28.032340 [6]=2007-05-29 09:11:00Z}
10:23:34.541 : 6 3, 2760.MAIN-EC.DM.SS {[1]=24 [2]=1 [3]=7260 [4]=34.730000 [5]=28.000000 [6]=2007-05-29 10:23:34.095576000Z}

All 598344 lines of it. The first command sorts on the suffix field - "LT" and "SS" in the lines above - and gives us the list of subsets we need to process:

FIELDS="`sort -u -t '.' -k 5,5.2 MarketDataServer0.log | sed 's/.*\.\(.*\) {.*/\1/'`"
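As a quick sanity check, the suffix extraction can be exercised on its own against the two sample lines (abbreviated braces content here - the full log lines work the same way, since the pattern keys off the last dot before " {"):

```shell
# Split on '.', dedupe on the first two chars of field 5 (the suffix),
# then strip everything but the suffix itself.
printf '%s\n' \
  '10:23:34.323 : 5 3, 2760.MAIN-EXC.Dx.LT {[1]=24}' \
  '10:23:34.541 : 6 3, 2760.MAIN-EC.DM.SS {[1]=24}' |
  sort -u -t '.' -k 5,5.2 | sed 's/.*\.\(.*\) {.*/\1/'
# prints the two suffixes, one per line: LT then SS
```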

Now we create an empty file called out:

> out

for CODE in $FIELDS
do
    sed -n '/.*\.'"$CODE"' {.*/p' MarketDataServer0.log | sort -n -t '>' -k 2,2 >> out
done

The loop cuts out the entries for each "code" and passes them to our sort command, which uses the > character as a field delimiter and sorts numerically on the second field - ugly but necessary. No error handling or parameter passing yet, but this saves a whole lot of pain. Looks painful? Sure, but it's the sort of thing you just can't do on windows (well, without Cygwin anyway).
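A sketch of how the two steps might be folded into one reusable script with the missing parameter checking (the function name and file paths are illustrative, not from the original):

```shell
#!/bin/sh
# Sketch: group log entries by two-letter suffix, then sort numerically
# within each group. Usage: sort_by_suffix LOGFILE OUTFILE
sort_by_suffix() {
    logfile="${1:?usage: sort_by_suffix logfile outfile}"
    outfile="${2:?usage: sort_by_suffix logfile outfile}"
    # Unique suffixes: first two chars of field 5 when split on '.'
    fields=$(sort -u -t '.' -k 5,5.2 "$logfile" | sed 's/.*\.\(.*\) {.*/\1/')
    : > "$outfile"   # truncate the output file
    for code in $fields
    do
        # Select only this suffix's lines, sort numerically on field 2
        sed -n '/.*\.'"$code"' {.*/p' "$logfile" |
            sort -n -t '>' -k 2,2 >> "$outfile"
    done
}
```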

Thursday, May 31, 2007

Red Pepper and Coconut Chutney

Outside of the American psyche, there is programmer food other than pizza fuelling software development - a lot of people in the Investment Banking community are addicted to curry, and there's a good choice of hardcore restaurants nearby, like one of my favourites, the Lahore Kebab House. So as well as software wisdom, we'll be posting the occasional recipe - here's a start. This chutney only keeps for about a week in a cold fridge but is wonderful - if you can get it, use Kashmiri chilli powder for a more mellow flavour...

Ingredients

Two large red peppers (Bell or Capsicum)
Two teaspoons of roasted cumin seeds, finely ground
Two cloves of garlic
Two tablespoons of desiccated coconut
Half a teaspoon of salt
Half a teaspoon of hot chilli powder
Two tablespoons of water

Method

Dry roast the cumin seeds in a frying pan, colouring them to your taste - the darker the
roast, the stronger the taste. Grind to a fine powder in a spice mill or coffee grinder.

Deseed the peppers and cut into small chunks suitable for a hand blender jug, then add
all the ingredients. Blend with the hand blender and decant into a sterilised jar, which should be refrigerated.

This is an excellent accompaniment to cold meats, dosas and any curry, particularly fish/shellfish.

OR49 Keynote Speech in the Knowledge Management Stream

I'm giving the keynote in the KM stream at this year's Operational Research Society Conference, OR49. It is based on a stream of research which started about eight years ago, after reading a paper on newsgroup cluster analysis called telltale. Here's the abstract:

    "It is proposed to summarise and statistically categorise multiple public and private information feeds to produce centroids directed by a combination of user constructed keywords and analysis of previously archived or disseminated knowledge. Social and physical networks will be extracted for temporal analysis and association projection. Comprehensive analysis of centroid relationships across sectors, categories and physical location will give a statistical event prediction capability and lead to the discovery of hidden relationships and associated events. End-users will construct a hierarchical keyword tree which will contain individual articles, summarisations, centroids or sets of related centroids. Users will also participate in a community of interest which they may form inter or intra-federation in order to disseminate emerging events or explicit knowledge. The system has applicability to financial market analysis, law enforcement and intelligence analysis."

This paper is the crystallisation of several themes and of our experience, combined into a system which we hope to make operational. Many of the components already exist, and over a series of articles I'll be discussing the philosophy behind the system. I'm joined on the enhyper blog by two experts in data visualisation whom you'll meet in due course; one is Dr Elie Naulleau from Semiophore. We'll propose how the system can be used for expert trading and algo trading and, on the flip side, for intelligence analysis.

Tuesday, May 29, 2007

Why Events are a Bad Idea (for high concurrency servers)

And Why Mixing Events and Threads is Even Worse

I've just been badly burned by mixing two paradigms - threading and events. All was working fine until the day before go-live, when the testers started to pour 400 stock baskets through the system rather than 40, which resulted in one of those issues that make your heart sink as a programmer. Between 5 and 15 stocks would go into pending, which meant there was a threading issue somewhere - and it was go-live that evening. Tracking it down proved difficult due to poor separation of duties between the threads, a consequence of the design having its origins as a single-threaded, serial set of calls which took data from an event generated on one side and dispatched a modified message to the receiver, and vice versa. In retrospect, the problem would have been solved by having a single queue between the two threads, following the asynchronous put/take connector pattern. This would have ensured complete separation and higher throughput.
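A toy illustration of the single-queue idea, using a named pipe as the put/take connector (all names here are illustrative; the real system would use a mutex-protected in-process queue, but the separation principle is the same):

```shell
#!/bin/sh
# One queue between the two sides: the producer only puts, the consumer
# only takes, so there is no shared mutable state to fight over.
QUEUE="/tmp/connector_demo.$$"   # illustrative FIFO path
mkfifo "$QUEUE"

# Producer "thread": put three messages on the queue, then close it.
( for msg in quote-1 quote-2 quote-3; do echo "$msg"; done > "$QUEUE" ) &

# Consumer "thread": take messages until the producer closes the queue.
RESULT=$(while read -r msg; do echo "processed $msg"; done < "$QUEUE")
echo "$RESULT"

wait
rm -f "$QUEUE"
```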

Mutex Spaghetti


As it happens, it was impossible in the time available to redesign the solution, and the simplest course of action was to go back, reluctantly, to a single-threaded implementation. Time to test was a major factor here - however, not before some time was spent playing the mutex game. I started mutexing at a very low granularity, which almost appeared to fix the issue (or break things completely) - I was down to one or two trades pending - which was not good enough.

Adding further mutexes, it quickly became apparent that it's very easy to get in a mess, either by blocking on an already-locked mutex or by adding so many mutexes that you end up in a mess anyway. After four hours of mutex soup, I made the decision to remediate the code back to single-threaded. Performance, at the moment, is not an issue.


So the moral of the story is: look for a design pattern which fits your code - understand it and think the design through. We seldom have time, but if you can, use UML to build a sequence diagram - then you'll see the call chain and understand the conflicts between threads.

A friend of mine related the story of Sun's attempts to make the solaris kernel mt-safe - this was a much harder task than they anticipated. Most of their effort was centred on protecting the key data structures rather than changing the programming paradigm so that users took responsibility for data allocation.

Another pointed out an article on slashdot this morning, "Is Parallel Programming Just Too Hard?", which raises some concerns which, as we can see from the above, seem to be valid. If you're going to write parallel threads, you have to spend time on the design - it takes three times the effort, and you really need to use patterns, example code and sequence diagrams. I hacked it and got away with it - to a point. If performance proves an issue, and you can bet it will at some stage soon, then it will be back to the drawing board - and this time, the design will come first.

Threading versus events is an interesting debate, which the paper "Why Events Are A Bad Idea (for high concurrency servers)" argues well; threading comprehensively wins the day as a paradigm over events. I've always been of the opinion that separation of duties via a thread is intuitively faster than event dispatch - this paper goes some way towards proving my intuition right.

Finally, there's a well-deserved mention for functional programming languages such as Erlang and Haskell, which hold much promise for multi-core programming, as outlined in these excellent slides: Data Parallel for Haskell.