Monday, May 14, 2012

PAPI - Performance API

PAPI

The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis including hand tuning, compiler optimization, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development towards alleviating commonly occurring bottlenecks in high performance computing.

Thursday, February 23, 2012

netmap - user space NIC ring buffer

netmap looks promising but it's about to be blown away by the ability to inject packets into L3 cache in the next iteration of Intel chips which have DCA - direct cache access

Monday, January 09, 2012

LEON3


The LEON3 is a synthesisable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. The model is highly configurable, and particularly suitable for system-on-a-chip (SOC) designs. The full source code is available under the GNU GPL license, allowing free and unlimited use for research and education. LEON3 is also available under a low-cost commercial license, allowing it to be used in any commercial application to a fraction of the cost of comparable IP cores. The LEON3 processor has the following features:
  • SPARC V8 instruction set with V8e extensions
  • Advanced 7-stage pipeline
  • Hardware multiply, divide and MAC units
  • High-performance, fully pipelined IEEE-754 FPU
  • Separate instruction and data cache (Harvard architecture) with snooping
  • Configurable caches: 1 - 4 ways, 1 - 256 kbytes/way. Random, LRR or LRU replacement
  • Local instruction and data scratch pad RAM, 1 - 512 Kbytes
  • SPARC Reference MMU (SRMMU) with configurable TLB
  • AMBA-2.0 AHB bus interface
  • Advanced on-chip debug support with instruction and data trace buffer
  • Symmetric Multi-processor support (SMP)
  • Power-down mode and clock gating
  • Robust and fully synchronous single-edge clock design
  • Up to 125 MHz in FPGA and 400 MHz on 0.13 um ASIC technologies
  • Fault-tolerant and SEU-proof version available for space applications
  • Extensively configurable
  • Large range of software tools: compilers, kernels, simulators and debug monitors
  • High Performance: 1.4 DMIPS/MHz, 1.8 CoreMark/MHz (gcc -4.1.2)

The LEON3 processor is distributed as part of the GRLIB IP library, allowing simple integration into complex SOC designs. GRLIB also includes a configurable LEON3 multi-processor design, with up to 4 CPU's and a large range of on-chip peripheral blocks.

Tuesday, November 22, 2011

Waters European Trading Architecture Summit 2011

Some feedback from this event which I attended today.

Event http://events.waterstechnology.com/etas
Infrastructure Management: Reducing costs, Improving performance, Professor Roger Woods, Queens University, Belfast
Prof Woods gave an impassioned talk about a tool that he has developed which takes c++, allows you to navigate the code and identify subsystems which you can target to run on hardware or emulation of hardware.
  • He worked on the JP Morgan collaboration with Maxellor and was bullish about the technology.
  • Two years from pilot to production.
  • Developed at tool that allows identification of sections that are suitable for FPGA
  • Key issue: programming FPGA bitstreams (http://en.wikipedia.org/wiki/Bitstream) - took six months
  • C++ is translated into C (manually) before being cross compiled into Java which is what the Maxellor compiler requires.
  • This is to remove c++ abstraction which "kills parellisation" (see slides)
  • Focus was hardware FFT - all other logic in software - comms via FPGA bitstream
In summary:
  • ideal for risk calculation and monte carlo where algorithm does not change.
  • C++ legacy code does not parallelise easily and is not a candidate for FPGA
  • Three year dev cycle.
  • Complex, manual process
  • JPM own 20% of Maxellor
Resources
This continued to a panel hosted by Chris Skinner

Panel: The Parallel Paradigm Shift: are we about to enter a new chapter in the algorithmic arms race
Moderator: Chris Skinner, Panel: Prof Woods, Steven Weston, Global Head of Analytics, JPM. Andre Nedceloux, Sales guy, Excelian
  • FPGA Plant needs to be kept hot to achieve best latency. To keep FPGA busy you need a few regular host cores loading work onto them.
  • Programming/debugging directly in VHDL is ‘worse than a nightmare’, don’t try.
  • Isolate the worst performing pieces, (Amdahl’s law) de-abstract and place on FPGA, they call each of the isolated units a ‘kernel’ .
  • Compile times are high for Maxeler compiler to output VHDL, 4 hours for a model on a 4 core box.
  • Iterative model for optimisation and implementation. They improved both the mathematics in the models and the implementation onto FPGA – ie, consider it not just a programming problem, but also a maths modeling one.
  • They use python to manage the interaction with the models (e.g pulling reports)
  • Initially run a model on the FPGA hosts and then incrementally update it through the day - when market data or announcements occur.
  • No separate report running phase – it is included in the model run and report is kept in memory. Data only written out to a database at night time, if it is destroyed then it can be re-created.
  • Low-latency is no longer a competitive advantage but now a status quo service for investment banking.
  • Requires specialist non-general/outsourced programmers required who can understand hardware and algorithms who work alongside the business.
Panel

How low can you go? Ultra-low-latency trading

Moderator: David Berry

Members: Jogi Narain, CTO, FGS Capital LLP, Benjamin Stopford, Architect, RBS. Chris Donan, Head of Electronic Trading - Barcap.

This was a well run panel with some good insights from Chris Donan in particular:

  • Stock programmers don't understand the stack from network to nic to stack to application and the underlying hardware operations
  • Small teams of experienced engineers produce the best results
  • Don't develop VHDL skills in house - use external resources.
  • Latency gains correlate to profitability
  • FPGA is good for market data (ie fixed problem) and risk
  • Software parallelism is the future.

Sunday, June 12, 2011

HFT World 2011 and intro

Finally my first post to the Enhyper blog, on a rainy Sunday afternoon. I've just returned from speaking at HFT World 2011 in Amsterdam. Turnout wasn't as large as I expected, however, there was some good discussion around what the future is likely to hold.

I shared some of my views on embracing the end-to-end principle with Mike O'Hara from the High Frequency Trading Review, and I'll return to this post to set some of those views down in blog space a bit later. My background's in IP/telecoms systems design, so my views of HFT and its technology tend to be coloured by that.

Friday, June 10, 2011

Thomson Reuters Expert Session

I was kindly invited to give an expert session by Thomson Reuters when I was between assignments. I gave an hour presentation which has been edited into four sessions:


On the FX Business Model http://thomsonreuters.na4.acrobat.com/p72695781/

On FX and the OTC Market http://thomsonreuters.na4.acrobat.com/p97524929/

On High Frequency Trading http://thomsonreuters.na4.acrobat.com/p67792422/

On FX Strategies http://thomsonreuters.na4.acrobat.com/p10287782/


I'm now gainfully employed so back to silent running for me.