In the search for low latency, minimising hops and stack traversals is key. So here's a fairly obvious idea, perhaps, but one I've yet to encounter in the real world.
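To make the hop-counting argument concrete, here's a rough sketch that treats end-to-end latency as the sum of per-hop costs. The hop names and the costs are my own placeholders in arbitrary units, not measurements of any real system; the point is only that removing traversals shrinks the total.

```python
def path_latency(hops):
    """Sum the per-hop costs along one message path.

    `hops` is a list of (name, cost) pairs; costs are in arbitrary units.
    """
    return sum(cost for _name, cost in hops)


# Hypothetical paths: every name and number here is an illustrative assumption.
conventional_path = [
    ("NIC in", 1),
    ("kernel network stack", 5),
    ("user-space engine", 2),
    ("kernel network stack", 5),
    ("NIC out", 1),
]

on_card_path = [
    ("card optical port in", 1),
    ("on-card engine", 2),
    ("card optical port out", 1),
]

if __name__ == "__main__":
    print("conventional:", path_latency(conventional_path))
    print("on-card:     ", path_latency(on_card_path))
```

Swap in measured figures for your own stack and the same sum shows where the time actually goes.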
If I were designing the pragmatic ideal trading engine, i.e. one which would be able to sit in today's data center, I'd be looking at an add-in-card-based solution with through-the-air optical inter-card interconnects, similar to the SLI/CrossFire approach taken to link multiple graphics cards.
External interfaces would be multiple fibre ports, 10GE-capable at present rates, so that I could accept multiple MPI IO streams for configurable bandwidth delivery.
Persistence would be staged from high-speed RAM through to SSD (or perhaps memristor storage when it hits the mainstream).
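As a toy illustration of staged persistence, here's a minimal sketch in which records land in a bounded in-memory tier first and the oldest ones spill to a file standing in for the SSD tier. The `StagedStore` class, its method names, and the newline-delimited file format are all hypothetical choices of mine, and a real engine would do this in hardware or with far more care about ordering and durability.

```python
import os
import tempfile
from collections import deque


class StagedStore:
    """Toy two-tier store: a bounded RAM stage that spills to a file."""

    def __init__(self, path, ram_capacity=4):
        self.path = path                  # file standing in for the SSD tier
        self.ram_capacity = ram_capacity  # max records held in the fast tier
        self.ram = deque()                # fast tier, oldest record on the left

    def append(self, record):
        """Accept a record (bytes without newlines) into the fast tier."""
        self.ram.append(record)
        if len(self.ram) > self.ram_capacity:
            self._spill(keep=self.ram_capacity)

    def flush(self):
        """Push everything still in RAM down to the slow tier."""
        self._spill(keep=0)

    def _spill(self, keep):
        # Move the oldest records to the file until only `keep` remain in RAM.
        with open(self.path, "ab") as f:
            while len(self.ram) > keep:
                f.write(self.ram.popleft() + b"\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to commit to the device

    def persisted(self):
        """Return the records currently held in the slow tier, oldest first."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            return f.read().splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = StagedStore(os.path.join(tmp, "orders.log"), ram_capacity=4)
        for i in range(6):
            store.append(b"order-%d" % i)
        print("in RAM:", list(store.ram))
        print("on disk:", store.persisted())
```

The same shape extends to more tiers: each stage keeps the hot data and hands the rest down to the next, slower one.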
Processing would be an SoC with DSP, GPGPU, MMC CISC integrated on chip, with multi-gigabit optical quantum-dot interconnects.
Ideally the server chassis holding the cards would be in an exchange-mart (sic) where multiple exchanges would co-locate, probably somewhere cold, geodesic, with sustainable power (Iceland?).