Catching up on some reading this morning, I picked up a series of articles from the Mar/April edition of ACM Queue. One topic in particular caught my eye: CUDA, which Nvidia released last year. I read the article "Scalable Parallel Programming with CUDA", which can be found here.
The article identifies three key abstractions: hierarchical thread groups, shared memories and barrier synchronisation. CUDA is based on C with extensions for parallelisation, much like Handel-C. The difference is that Handel-C targeted FPGAs, whilst CUDA targets GPUs with their built-in floating point capability. The article includes simple, straightforward code examples showing parallel threading and memory sharing. Memory was always the issue in my mind with FPGA: the leap of faith with Handel-C was what to do with the data set you generated in a Monte Carlo simulation.
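To make the three abstractions concrete, here is a minimal sketch of my own (not code from the article) that sums an array. The names `blockSum`, `sumOnGpu` and `kThreadsPerBlock` are hypothetical, and error checking on the CUDA calls is left out to keep it short. The grid of blocks is the thread hierarchy, `cache` is the per-block shared memory, and `__syncthreads()` is the barrier.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; the tree reduction below needs a power of two.
constexpr int kThreadsPerBlock = 256;
static_assert((kThreadsPerBlock & (kThreadsPerBlock - 1)) == 0,
              "block size must be a power of two");

// Each block sums its own slice of `in` and writes one partial sum to `out`.
__global__ void blockSum(const float* in, float* out, int n) {
    // Shared memory: one copy per block, visible only to that block's threads.
    __shared__ float cache[kThreadsPerBlock];

    int tid = threadIdx.x;                    // index within the block
    int gid = blockIdx.x * blockDim.x + tid;  // index within the whole grid

    cache[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // barrier: every thread in the block has stored its value

    // Tree reduction: halve the number of active threads on each pass.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();  // barrier: this pass is complete before the next starts
    }

    if (tid == 0) {
        out[blockIdx.x] = cache[0];
    }
}

// Host-side helper: copies data over, launches the kernel, adds up the partials.
float sumOnGpu(const std::vector<float>& host) {
    assert(!host.empty());
    int n = static_cast<int>(host.size());
    int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreadsPerBlock>>>(dIn, dOut, n);

    // Only one float per block comes back to the host.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

Calling `sumOnGpu` on a `std::vector<float>` of a thousand ones launches four blocks, and the host only has to add up four partial sums.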
This question has been perplexing developers on the CUDA forums at Nvidia too - but it looks like there's been progress, as outlined in this Monte Carlo Options Pricing paper on the Nvidia developer site. However, the algorithm outlined in the paper is trivial; the secret is the generation of quasi-random numbers, which enables quick convergence, followed by filtration close to the data so you're not schlepping large lumps of data around unnecessarily.
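As a rough illustration of that idea, here is a sketch of my own for pricing a European call, and it is not the paper's code. I am assuming a base-2 van der Corput sequence as a stand-in for whatever quasi-random generator the paper uses, and the names `callPayoffPartialSums`, `priceEuropeanCall` and `kThreads` are hypothetical. Each thread accumulates its own payoffs, each block reduces them in shared memory, and only one number per block travels back to the host. Error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size, power of two
constexpr int kBlocks = 64;    // hypothetical grid size

// Each block writes the sum of the call payoffs it simulated to partial[blockIdx.x].
__global__ void callPayoffPartialSums(double s0, double strike, double rate,
                                      double vol, double maturity,
                                      unsigned int nPaths, double* partial) {
    __shared__ double cache[kThreads];

    double drift = (rate - 0.5 * vol * vol) * maturity;
    double diffusion = vol * sqrt(maturity);

    // Grid-stride loop: each thread handles several paths and keeps a local sum.
    double local = 0.0;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < nPaths;
         i += gridDim.x * blockDim.x) {
        // Assumed generator: van der Corput base 2, i.e. the bit-reversed index
        // scaled into (0, 1). The half-step offset keeps u away from 0 and 1.
        double u = (__brev(i) + 0.5) / 4294967296.0;
        double z = normcdfinv(u);  // map the uniform point to a standard normal
        double terminal = s0 * exp(drift + diffusion * z);
        local += fmax(terminal - strike, 0.0);
    }

    // Reduce close to the data: combine the block's results in shared memory.
    cache[threadIdx.x] = local;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];
    }
}

// Host-side helper: launches the kernel and discounts the average payoff.
double priceEuropeanCall(double s0, double strike, double rate, double vol,
                         double maturity, unsigned int nPaths) {
    assert(nPaths > 0);

    double* dPartial = nullptr;
    cudaMalloc(&dPartial, kBlocks * sizeof(double));
    callPayoffPartialSums<<<kBlocks, kThreads>>>(s0, strike, rate, vol, maturity,
                                                 nPaths, dPartial);

    // Only kBlocks doubles are copied back, however many paths were simulated.
    std::vector<double> partial(kBlocks);
    cudaMemcpy(partial.data(), dPartial, kBlocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dPartial);

    double sum = 0.0;
    for (double p : partial) {
        sum += p;
    }
    return std::exp(-rate * maturity) * sum / nPaths;
}
```

For a single-step European option one quasi-random dimension is enough, which is why a plain van der Corput sequence will do here; a path-dependent payoff would need a multi-dimensional sequence instead. A call such as `priceEuropeanCall(100.0, 100.0, 0.05, 0.2, 1.0, 1u << 20)` should land close to the Black-Scholes value for the same inputs.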
Then the next logical step is to make this a service. According to a quant chum of mine, the appetite is reckoned to be about 5 trillion simulations per day in the average organisation. Combine this with S3 for asynchronous storage and you have the makings of a nice little business, I think.