Saturday, June 16, 2007

Mining Massive Data Sets for Security

Semiophore points me to the forthcoming two-week workshop on this topic, to be held in mid-September 2007 in Italy.

"It is the purpose of this workshop to review the various technologies available (data mining algorithms, social networks, crawling and indexing, text-mining, search engines, data streams) in the context of very large data sets."

I'd love to attend, as this is an area I think is crucial for High Frequency Finance. Whilst working on a high-performance trade order router for a tier 1, I did some research which I was allowed to present publicly at the Fiteclub, a forum which meets occasionally in London. I presented two papers of note. The first, Financial Data Infrastructure with HDF5, concentrated on high-performance data delivery and analysis; in it I proposed a machine, built from COTS components for around $25K, that could eat 20TB of data in 90 minutes. This was inspired by the seminal ACM article on disk technology, amusingly entitled "You don't know jack about disks".
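
To get a feel for what that figure implies, here's the back-of-envelope arithmetic (the per-drive throughput is my own rough assumption for 2007-era commodity drives):

  20 TB / 90 min ≈ 20,000,000 MB / 5,400 s ≈ 3,700 MB/s sustained
  3,700 MB/s ÷ ~80 MB/s per commodity drive ≈ 46 drives, striped

In other words, a few dozen commodity drives behind decent controllers gets you into that territory, which is why a COTS build in that price range is plausible.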

The second presentation, also at Fiteclub, was entitled Open Source Intelligence in Finance and applied the techniques of open source intelligence to finance. Here I built the case for news analysis applied to program trading.

The Rise and Rise of Software Patents

James Robertson points to a New York Times op-ed piece by Timothy Lee.

Microsoft has changed their position on software patents since 1991 – what was different then?

In 1991 most personal computer users were still using DOS, not Windows.

Windows was welcome because it provided a standard way for PC applications to take advantage of RAM above 640K. Applications like Lotus and WordPerfect required incompatible and unstable high-memory drivers (want to use a different app? Edit AUTOEXEC.BAT and reboot. Repeat). While Mac users were able to use Excel and Word without these limitations, the vast majority of PC users were unaware of the advantages of standardised graphical user interfaces, PostScript printing or built-in networking. Apple, however, saw Windows as a threat, and had already been litigating against Microsoft for three years by this time.

When Bill Gates stated his concerns about patents held by "some large company", he may have been talking about Apple. But that lawsuit was about copyrights, not patents.

Apple briefly had the tables turned on them when Xerox filed suit, also alleging copyright infringement, over the work Xerox had done on the Star.

So where were the patents?

Some people feel that Apple, Microsoft, and others plagiarized the GUI and other innovations from the Xerox Star, and believe that Xerox didn't properly protect its intellectual property. The truth is a bit more complicated. Many patent disclosures were in fact submitted for the innovations in the Star; however, at the time the 1975 Xerox Consent Decree, an FTC antitrust action, placed restrictions on what the company was able to patent.[4] In addition, when the Star disclosures were being prepared, the Xerox patent attorneys were busy with several other new technologies such as laser printing. Finally, patents on software, particularly those relating to user interfaces, were an untested legal area at that time.

The law is clearer now, notwithstanding differences between Europe and the US.

Apple wasn't able to patent these things because they didn't invent them.

Xerox wasn't able to patent them because of an antitrust ruling (and there may be some truth to the suggestion that the business was oblivious to the incredible work being done at PARC).

While I don't like to play the alternative history game (the real thing is confusing enough, thanks), the rise of Microsoft and Apple might have turned out differently had it not been for this ruling.

Investment banks have an interest in software patents for two reasons. First, they have a responsibility to their clients to invest in companies with legally identifiable market advantages. Patents are a big part of that, especially for technology companies. Right or wrong, if you have a technology company and you want someone to invest in your firm, patents can increase your market value.

Second, financial patents are becoming more widespread, and software is a big part of that. Investment banks are actively encouraging their (very large) development teams to work with their legal departments to protect their inventions. Again, right or wrong, this is partly defensive (a portfolio of patents lets a firm enter cross-licensing agreements with potential competitors) and partly opportunistic (patents can secure a monopoly on innovative business ideas, and even banks might find the revenue streams from IP licensing attractive).

While not everyone is happy about software patents (they increase the cost of doing business, and can act as barriers to new entrants into a market), they can also be useful. They can help to create and sustain advantages for people who are careful and clever enough to spot the opportunities.

Either way, ignoring them is not advisable.

Update: Donald Knuth on Software Patents

Friday, June 15, 2007

Functional Programming meets #crypto

We met up last night, as previously mentioned, for a Functional Programming Beer in the Evening event and had a thoroughly enjoyable evening. Dr Dominic Steinitz, Nigel Stuckey of Systemgarden, Steve Wart and I spent several hours talking about a variety of topics ranging from FP through to cryptography.

Nigel, Steve and I met at a tier 1 we all worked for, through an internal chat channel I ran on cryptography called #crypto, which was a hotbed of financial crypto interest. I had been following the great Robert Hettinga and his rants on bearer security and his promotion of FC in general. Through this I met and became friends with Ian Grigg, who runs the Financial Cryptography website, which is always worth a read as Ian challenges conventional wisdom and crypto practice from a more business-related angle.

Dominic is an active member of the Haskell community and hopefully will be joining the enhyper blog and contributing his expertise. Dominic wrote cryptographic functions (SHA-1, RSA, Blowfish, etc.) in Haskell and maintains the library on one of the Haskell webservers. I'm going to create a Subversion repository with public access for Dominic's work to allow the community to contribute to the Haskell crypto effort, so watch this space.

Nigel will also be contributing to the Enhyper blog - his expertise in platform management is world-class, gained through the development of Harvest at System Garden.

We talked in detail about the evolution of specialist languages in finance, in particular K and Smalltalk. Steve Wart is a Smalltalk programmer and outlined a fantastic-sounding system he worked on for a Canadian power company, so I hope he's going to expand on that here. Dominic and I discussed stream fusion at length and I think I understand it - however, I'm not going to steal his thunder as he promises to write an article on data parallelism. Nigel and Steve had a lively debate on Second Life and discussed the addition of models of ancient buildings as a mechanism for historical education. We finished the evening with sausage and sauerkraut at Kurz and Lang in Smithfield, which was reminiscent of Zurich nightlife.


HPC=Grid+Data+FPGA

I was kindly invited by Platform Computing to give a presentation at their European Grid Conference, PGC06, last October. I've just made this deck and others available on the Enhyper subversion share under decks. You can download the PowerPoint presentation here: HPC=Grid+Data+FPGA. The abstract for the pitch is below:

High Frequency Finance demands an infrastructure that can supply large quantities of computational resource, plus the delivery of multidimensional, arbitrary-precision data to applications at scalable rates, in a platform-independent manner.

Statistical analysis techniques for pricing and risk measurement also require significant algorithmic performance. This is being delivered by multi-core processors; however, the quest for low latency is driving the emergence of algorithms in silicon using Field Programmable Gate Array techniques.

The traditional approach to application and data engineering has been client/server glued together with delimited files and relational databases. These technologies are increasingly being surpassed by grid-enabled, in-memory applications combined with FPGA-based algorithms.

This was immediately after some friends at Celoxica had run a trial of BGM in one of the tier 1s - a trial that has since been emulated in academia by Wayne Luk et al, as outlined in the paper entitled Reconfigurable Acceleration for Monte Carlo based Financial Simulation. The problem with academics is that, to quote Dr Duncan Shaw, "they have 98% of the time but only 2% of the data, whereas it's the reverse for the practitioner". There are better ways of skinning this particular cat which could have significantly improved the performance...




Wednesday, June 13, 2007

Time to Embrace New Data Parallel Programming Models?

In "The Rise and Fall of High Performance Fortran: An Historical Object Lesson" there are several lessons to be learned, but the one of most relevance to the development and adoption of functional programming languages in finance is the promise offered by data parallel architectures.

Data parallelism is the ability to run sequences of instructions, broadcast to processing units in lock-step, on different elements of data in parallel. This was first implemented in hardware in the 1960s and was classified as SIMD - Single Instruction, Multiple Data.

This mechanism is now implemented in software as part of the Glasgow Haskell Compiler and is explored in detail in the presentation "Nested Data Parallelism in Haskell", where they state that it's the only way to take advantage of hundreds of cores and that it can even be run on GPUs. Like Handel-C, Haskell has par and seq constructs allowing the programmer to parallelise and sequentialise computations. They also introduce a concept called stream fusion, which is non-recursive, allowing data to be "bused" from one stream to another inline - now I'm not sure I fully understand the implications of this, but I'm sure going to find out.
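
For a flavour of the coarse-grained end of this in GHC (this is plain par/pseq sparking from Control.Parallel, not the nested data parallel library itself, and the naive Fibonacci is just a stand-in workload), here's a minimal sketch:

  import Control.Parallel (par, pseq)

  -- Naive Fibonacci as a stand-in workload. `par` sparks the first
  -- recursive call for evaluation on another core, while `pseq` forces
  -- the second on the current one before combining the results.
  parFib :: Int -> Integer
  parFib n
    | n < 25    = fib n                       -- stay sequential below a threshold
    | otherwise = x `par` (y `pseq` (x + y))
    where
      x = parFib (n - 1)
      y = parFib (n - 2)

  fib :: Int -> Integer
  fib n
    | n < 2     = fromIntegral n
    | otherwise = fib (n - 1) + fib (n - 2)

  main :: IO ()
  main = print (parFib 35)

Compile with ghc -threaded and run with +RTS -N to let the runtime spread the sparks across cores.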


Functional Programming Beer In The Evening Event Redux

Following our last meet-up last week to talk about Functional Programming in finance, I'm setting up another beer around Smithfield in London on 14th June 2007 with Dominic Steinitz, who has made a considerable contribution to functional programming via his Haskell crypto library and his paper in Trends in Functional Programming. We'll be talking about Haskell in particular, but also about Erlang and its use in service-based analytics.

Following our last meet, someone sent me a link to a PDF entitled Caml Trader: Adventures of a Functional Programmer on Wall Street by Yaron Minsky of Jane Street Capital, which confirms the rise of FP in finance. Worth a read.

So if you fancy joining us, we'll be in the Long Lane Pub in Long Lane near Smithfield from 6pm onwards. Drop me a note (rgb at enhyper.com) or call my mobile +44 791 505 5 three eight zero.

Monday, June 11, 2007

Skillsets for the HFF Future

So which languages are going to succeed in the world of High Frequency Finance? Well, you can bet your bottom dollar it's not going to be C# or C++. Both generate buggy and leaky solutions and are too reliant on third-party libraries of unknown provenance. You can also rule out the raft of scripting languages, no matter how much developers like them: Perl, Python, Ruby et al have good geek factor, but they don't cut it in production systems.

So we're left with C and the functional languages. C is considered a weird throwback to the 70s by most nascent programmers. What most of them do not realise is that the C# CLR, the Java Virtual Machine (parts of it), Ruby, Perl, Python, Apache, Linux, Solaris, C++ compilers and so on are all written in C. The main reasons for this are performance, simplicity and robustness. C has its share of problems, but in general it's a fairly good language.

So why are all the quantitative libraries in investment banks written in C++? And why are they mostly single-threaded? The answer is hubris on the part of the programmers and, as previously outlined, the difficulty of coding thread-safe libraries.

Well, times have changed, I'm afraid. C++ has to die because it does not translate into hardware - a route where tremendous performance gains are to be had. C is looking weak in light of the scalability models within the functional languages that can take advantage of multi-core. The future in the short term is C, but watch for the rise of Erlang and Haskell. Their time has come.








Sunday, June 10, 2007

News Analysis for Program Trading

As previously posted, I'm writing a paper on news-based program trading for the KM stream at the Operational Research Society's Annual Conference in September. This paper is the culmination of many years' research and interest in the area of news analysis, and I hope to show that the application of statistical techniques combined with visualisation can lead to an effective intelligence system which solves some of the conundrums facing traders and, for that matter, intelligence analysts.

The goal is to greatly shorten the time to disseminate events to the people who need to consume them, allowing them to act on this information. However, there's also an intention to analyse the likely outcome of this interaction and put in place a strategy to take advantage of this event. Another hypothetical outcome is that event "signatures" will be recognised and effects correlated in different sectors.

News Analysis

The first goal is to simplify the elements of news which we will analyse. To do this, I propose to model the way that people tend to read newspapers and select stories which interest them. Some read from front to back, others select favourite sections first, others, and I include myself here, read from back to front.

When we read, the first element to be considered is either the title or a picture. The writer of the article has to state the contents of the news aphoristically in an attempt to get the reader's interest. The title also contains other information such as people, places, sectors and amounts, so it is the key piece used for presentation to the end user.

The rest of the story consists of a series of sentences arranged into paragraphs. Within the story will also be the information we are interested in. The proximity of people to one another can be used to build a social network analysis graph. If two people are mentioned in the same sector (e.g. FX trading) they are related. If they are mentioned in the same publication they are related more closely. The same story, closer still. Same paragraph, even closer. Same sentence, the closest. From this we can draw a graph showing an individual's "social network". There's a very good example of this at www.namebase.org, where you can perform useful searches on people involved in the intelligence world based on their appearances in related publications.
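
To make the proximity scoring concrete, here's a minimal Haskell sketch of the idea; the weights, type names and example people are purely illustrative assumptions on my part, not a finished design:

  import qualified Data.Map as Map

  -- Illustrative proximity levels, weighted from loosest to tightest.
  data Proximity = SameSector | SamePublication | SameStory
                 | SameParagraph | SameSentence
    deriving (Eq, Ord, Show)

  weight :: Proximity -> Double
  weight SameSector      = 1
  weight SamePublication = 2
  weight SameStory       = 4
  weight SameParagraph   = 8
  weight SameSentence    = 16

  type Person = String
  type Graph  = Map.Map (Person, Person) Double

  -- Record one co-occurrence of two people at a given proximity;
  -- repeated mentions accumulate, strengthening the edge between them.
  related :: Proximity -> Person -> Person -> Graph -> Graph
  related p a b = Map.insertWith (+) (min a b, max a b) (weight p)

  main :: IO ()
  main = print $ related SameSentence    "Alice" "Bob"
               $ related SameStory       "Alice" "Bob"
               $ related SamePublication "Bob"   "Carol" Map.empty

Run over a corpus of stories, the accumulated edge weights give the graph from which the social network can be drawn.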

Topographical Mapping

News also contains physical places. Mapping individuals, companies, sectors, amounts to physical location can reveal useful information and is a technique much used in policing.

Categorisation

Categorisation is something humans do every day and is fundamental to our heuristic judgement. Humans are very good at it; what they're not so good at, however, is dealing with something which falls into multiple categories.


To be continued shortly...

The Case for Asynchronous Logging

It is common practice for federated systems to maintain separate logfiles to assist in finger-pointing should an error occur in production. However, this duplication of effort is unsustainable in the world of High Frequency Finance (HFF), where messaging volumes are approaching 400K messages per second, leading to a rethink and perhaps a spirit of cooperation between data sinks and sources.
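
To put a rough number on that duplication (the bytes-per-line figure is my own assumption, purely for illustration):

  400,000 msgs/s × ~250 bytes per log line ≈ 100 MB/s
  100 MB/s × 8-hour session ≈ 2.9 TB per day, per copy of the log

Logging at both sender and receiver doubles that, before you even consider backups.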

I propose that it's time to assign the responsibility for logging to one party, not both, and to log asynchronously. Deciding who carries the responsibility, however, is not obvious. To understand the problem, let's look at the issues involved (or, if you're of the glass-half-empty persuasion, who gets the blame). Here's an example of a possible route between two applications:

  • Memory/Disk/SAN
  • Sender Application
  • Application proxy
  • TCP/IP Stack
  • Software Firewall
  • Hardware NIC
  • Network infrastructure (various routers/switches/firewalls, lan/wan etc)
  • Receiver's NIC
  • Software Firewall
  • TCP/IP Stack
  • Application proxy
  • Receiver Application
  • Memory/Disk/SAN
As you can see, there's quite a lot to go wrong. Let's now analyse where to perform the logging.

Sender Logging


If we rely on the sender, there's the immediate advantage that the sender has to account for the logfile space, access control and maintenance. However, from a consumer's point of view, that means a lack of control, and potentially the case where you require a log and it has been deleted or is offline. From an audit point of view, you have increased the external dependency and hence the risk.

From the sender's perspective, consumer lifecycle management also becomes slightly more difficult, as you now have to poll your customers to see if they are still consuming your data; it's not unknown for applications to be turned off without the feeds being turned off, due to a lack of knowledge of who to contact.

Receiver Logging

With receiver logging we have, effectively, a forensic record of the transfer across the stack, and we have control over the logfile lifecycle. It seems strange to state the obvious, but for higher performance you should log to local disk, not NFS or SAN storage, and then back up the log files to resilient storage.
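
As a sketch of what asynchronous, receiver-side logging might look like (a minimal Haskell illustration under my own simplifying assumptions about file naming and shutdown; a production system would batch, rotate and flush explicitly), the hot path pays only for a channel write while a dedicated thread drains the channel to local disk:

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Concurrent.Chan (newChan, readChan, writeChan)
  import Control.Monad (forever)
  import System.IO

  -- Start a logger thread that drains a channel to a local file.
  -- Callers pay only the cost of writeChan on the hot path.
  startLogger :: FilePath -> IO (String -> IO ())
  startLogger path = do
    ch <- newChan
    h  <- openFile path AppendMode
    hSetBuffering h LineBuffering    -- kept simple; a real system would batch writes
    _  <- forkIO $ forever $ readChan ch >>= hPutStrLn h
    return (writeChan ch)

  main :: IO ()
  main = do
    logMsg <- startLogger "receiver.log"
    mapM_ (\i -> logMsg ("received msg " ++ show i)) [1 .. 1000 :: Int]
    threadDelay 500000               -- crude: give the logger time to drain before exit

The same shape works in C with a ring buffer and a writer thread; the point is that the receive path never blocks on disk.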

Service Level Monitoring

A nice addition is to write service level monitoring as part of the productionised system. In this way you can monitor the normal performance of the system and build a predictive capability for application performance.

Conclusion

Asynchronous logging has the potential to save considerable disk space and processor time whilst reducing maintenance overhead. The receiver/data sink is the right place to log, as it tests the circuit between sender and receiver and puts the management of log files in the domain of the application, which is where it belongs from a resource, audit and service level management perspective.