Wednesday, August 29, 2007

Micro and Macro Performance Analytics - a Log with Two Tales

Micro and Macro Performance Analytics - a Log with Two Tales

Sometime when we measure things we end up introducing latency into the system. I recently came across a neat scheme where a timing message was “stamped” on every hop then rerouted to the source. The average round trip time was calculated by dividing the elapsed time in two. The overall times produced were adequate for what the guy required but there was an annoying latency spike which happened periodically and was a cause for concern.
The average trip time from source to receiver was 900 microseconds. Occasionally, a 9 millisecond time was reported which, if echoed in production, would have been unacceptable. To find the problem, all the messages above 1ms were tagged and grep’d out into a file. Using vi to analyse this log file, it was noted that the latency spikes had a definite periodicity of around 600ms.

Visual inspection is a great technique as the brain is good at recognising patterns. However, as we can now generated millions of log entries per second, this tactic has severe limitations in that it highlights micro trends, therefore It’s important to visualise the data too so that macro trends can be recognised. Luckily I did as there was one lurking in the data.
I took a minute’s worth of >1ms log file entries and manipulated them in good old excel to produce the graph below:

As you see from the graph, there's a an obvious 10 second event which is highly regular in it's form. On further investigation, it turned out to be a call to a summarisation routine which caused the spikes to cease momentarily. Now it's lucky this was found using a synthetic load - in production, it could be a lot more difficult to spot. On the plus side, it's nice to come across a well designed load simulation - this is seldom the case. People are prone to quote averages and pay little attention to the spikes - but in high frequency trading, it's the spike that will lose you big money.

So, be careful what you measure and be careful that by measuring, you don't introduce side effects which could so easily be hidden in production.

No comments: