NVIDIA - CUDA Toolkit for options pricing


Older and Wiser
500,000 options in 1.5 seconds; it's not a realistic case, because you will need cudaMemcpy for all your data

I lied, I'm sorry. It really takes 1 minute to send the data across the wire to the server via HTTP services, calculate the implied vol, and get the result back. This is no fake; this is real in our case.
OK, with a quad-core computer, how long does it take? With an 8-core?
Because in this example, the GPU's power is cancelled out by transferring the data for the 500,000 options...
We use CUDA extensively in our main program Zonar for options calculations and in the portfolio system.
Based on the experiences we got and speed improvements, we are now implementing CUDA in an algorithmic trading solution.
Also nice is that version 2 now works with VS 2008.

Used in Zonar:
SoftCapital - Zonar, learn more


Bastian Gross

German Mathquant
Does anyone have experience with Star-P?
Star-P™
Star-P and some thoughts on MonteCarlo + Cuda

Is the question about Star-P relating to Star Systems and their multi-FPGA offerings? If so, I have looked at those but was limited by bus bandwidth and general complexity. At the time, now about 3 years ago, I decided to push for FPGA cards on 8-lane PCIe etc. instead.

Another message below related the difficulties with Sobol for MC and how the overhead of copying the numbers across to the GPU was prohibitive. I found that just generating the Sobols on the card is fine. I first did this for FPGAs, but the principle is the same: the modified Sobol (Antonov-Saleev) will do the job very quickly -- I did 1 per clock on a Xilinx 4 FPGA for 32-bit. I used different generators per sequence and used da Silva's heuristics for the generating polynomials, backed with selection via simulation runs (of up to 3 weeks!) to reject those with high auto-/cross-correlations. I ended up with a battery of about 10,000 Sobol variants that I could use to provide up to 10,000 independent, minimally correlated low-discrepancy sequences. All could be done in parallel, so you can get enormous speed-ups that way. If you want random entry into the sequences, then use a random parameter to start your initial values from; otherwise you can be repeatable, which is good for testing. That random parameter can come from a Mersenne Twister if required. Obviously MT is no use over 624 dimensions, hence the effort in getting 10,000 usable Sobol variants with low auto-/cross-correlations for high-dimensional work. For low-D problems MT is OK.
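For the curious, the Antonov-Saleev update itself is tiny: each draw is one XOR with a direction number indexed by the rightmost zero bit of the counter. A minimal one-dimensional sketch in Python (the trivial degree-1 direction numbers are used here for illustration only; the battery described above uses distinct, da Silva-selected polynomials per sequence):

```python
def sobol_1d(n_points):
    """One-dimensional Sobol draws via the Antonov-Saleev (Gray code) update."""
    v = [1 << (31 - j) for j in range(32)]   # direction numbers v_j = 2^(31-j)
    x, out = 0, []
    for n in range(n_points):
        out.append(x / 2.0**32)              # scale the 32-bit integer to [0, 1)
        c = 0
        while (n >> c) & 1:                  # c = index of rightmost zero bit of n
            c += 1
        x ^= v[c]                            # the whole update: one XOR per draw
    return out

print(sobol_1d(4))   # [0.0, 0.5, 0.75, 0.25]
```

A fixed starting value keeps the stream repeatable for testing; seeding the initial `x` from a Mersenne Twister draw gives the random entry into the sequence mentioned above.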

I am putting these same generators onto a Tesla now... I too think OpenCL looks the way to go, but it's early days...

- kieron


Director, Wasserman Trading Floor/Subotnick Center
This just arrived in my inbox. No endorsements implied.

A Seminar on GPU-Accelerated Derivative Pricing and Risk Models
Presented by SciComp Inc. and NVIDIA Corporation
GPU-accelerated Monte Carlo derivative pricing and risk models run 30X-200X faster than serial code. No CUDA/parallel programming expertise required.

Attend this free seminar to learn about:
• Automatically generating parallelized (GPU-accelerated) C/C++ source code for any Monte Carlo derivative pricing and risk model (including path dependency and early exercise)
• How you can easily generate CUDA-enabled code for your own proprietary pricing models
• How to reduce the cost of your compute farm by more than 10X
Sign up to see how easy it is to get started on accelerating your derivative pricing models.

New York City
When: Thursday, June 11, 2009
Reception to follow
Where: Doubletree Guest Suites Times Square
1568 Broadway
NY, NY 10036-8201
Times Square Room
RSVP: Please visit http://www.scicomp.com/seminars/NYC?LS=EMS276606. Deadline is Friday, May 29, 2009.

When: Monday, June 15, 2009
Reception to follow
Where: London Marriott West India Quay
22 Hertsmere Road, Canary Wharf
London, E14 4ED
Barbados Room
RSVP:. Deadline is Friday, June 5, 2009.
To attend, simply RSVP by clicking on the RSVP links by the deadlines indicated above. The seminar is free, but space is limited – sign up now!

I think this is the link to RSVP for New York:


Prof. H.
A little bump action here.

I got to tinker around with an NVIDIA Tesla machine earlier today (Teslas are cards that are basically CUDA-optimized GPUs, made specifically for GPU-computing purposes). I made a simple program to calculate every prime number from 1 to 10 billion, add the numbers up, and then find every prime number from 1 to that number. No real purpose; it sounds lame, but it's what I came up with in about 30 minutes.

The first run yielded the result in just under 2 seconds (1.59 seconds to be exact). This was after skimming through a CUDA optimization tutorial. Then someone who uses the Tesla machine daily looked at my code, tweaked it and ran it, and the results were instantaneous (we could only measure each process's duration to two decimal places). The same program on my quad-core desktop computer took roughly a minute and a half. This thing was absolutely phenomenal and I'd love to work with it some day on Wall St.

EDIT: I used it at the University of Maryland - College Park, where I'm currently an undergrad; the PC belonged to a grad student who is using it to work with and study compression.

Daniel Duffy

C++ author, trainer
You create random numbers with a Mersenne Twister on the GPU card

Is this code re-entrant? One issue is *serial equivalence*, does the GPU code give *exactly* the same prices as the 1-core solution? If not, why not?

Here's a simple exercise:

1. Define a float f = 1.0
2. Increment it by 1, 10^8 times
3. Print the result

4. Do the same exercise using a double
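The punchline of this exercise can be checked without a GPU: a single-precision float carries a 24-bit significand, so the running sum stops growing at 2^24 = 16,777,216, far short of 10^8 + 1, while a double (53-bit significand) gets the exact answer. A minimal sketch (the `f32` helper is mine, using ctypes to round to single precision):

```python
import ctypes

def f32(x):
    """Round a Python float (a C double) to IEEE-754 single precision."""
    return ctypes.c_float(x).value

# Once the float accumulator reaches 2^24, adding 1.0 is absorbed entirely,
# so the loop in the exercise plateaus there instead of reaching 10^8 + 1:
assert f32(2.0**24) == 16777216.0
assert f32(2.0**24 + 1.0) == 16777216.0   # increment lost in single precision
assert 2.0**24 + 1.0 == 16777217.0        # double precision carries it exactly
```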
Bloomberg Uses GPUs to Speed Up Bond Pricing

Two-factor model for calculating hard-to-price asset-backed securities now runs on graphics processing units paired with Linux servers.

Each night, Bloomberg calculates pricing for 1.3 million hard-to-price asset-backed securities such as collateralized mortgage obligations (including cash flows, key rate duration and such). Since 1996, the market news giant has performed these calculations using single-factor stochastic models based on Monte Carlo simulations, on a farm of Linux servers in its data centers in New York and New Jersey. "These models are ideal for doing things in parallel, and we did parallelize them over traditional x86 Linux computers," says CTO Shawn Edwards.

In 2005, Bloomberg released a more precise two-factor model that calibrates itself to the current volatility surface, but only for ad-hoc, on-demand pricing, not overnight batch mode. (The previous model ran in both ad hoc and batch mode.) "This model was more expensive to run and we ran it when people asked for it," Edwards says. "These securities are being held in large portfolios, and there was client demand for us to use this better model for our overnight pricing."

In early 2008, Bloomberg considered scaling up its Linux farm to accommodate this customer demand. "It turned out that in order to compute everything within that eight-hour window, we would need to go from 800 cores to 8,000 cores," Edwards says. "That's a lot of servers, about 1,000. We could do it, but it doesn't scale very well. If we wanted to use it for other ideas, we were faced with having to pile on more and more computers. That's when the idea came in for GPU computing."

A programmer on Edwards' staff suggested trying to run the models on graphics processing units (GPUs). (GPUs or graphics cards are specialized chips that run inside PCs to display 2D and 3D graphics. They tend to contain hundreds of floating point processors that are good at handling mathematically intensive and parallel processes such as Monte Carlo simulations.) The programmer ran a proof of concept in March 2008 using the cash flow generation part of the algorithm and showed a dramatic increase in performance. That programmer now runs the team of technologists that work on the bond pricing system.

Bloomberg went live in 2009 running its two-factor models on a farm of traditional servers paired with nVidia Tesla GPUs. Instead of having to scale up to 1,000 servers, Bloomberg is using 48 server/GPU pairs.

Bloomberg and nVidia engineers worked together to get the pricing software to run on the GPUs. "The underlying math and algorithms are proprietary to Bloomberg," says Andy Keane, general manager, Tesla supercomputing at nVidia. "We provide training and expertise to make the Bloomberg software GPU-compatible. There's a bit of a wall between the two to protect Bloomberg's intellectual property." Rewriting, restructuring and testing the code to run over the GPUs took about a year. "This service is mission-critical to our customers; they rely on it to make decisions, so we had an extensive testing period," Edwards says.

Part of the pricing application, data gathering, doesn't lend itself well to GPU computing, Edwards notes, because it can't be parallelized. The x86 servers also prepare the problems to be parallelized. But about 90% of the work does run on the GPU platform, he says.

"Overall, we've achieved an 800% performance increase," Edwards says. "What used to take sixteen hours we're computing in two hours." The GPUs are high speed, running double-precision mathematics at 16 teraflops. (A teraflop is equivalent to a trillion floating point operations per second.) And the firm is a little greener now: the server/GPU pairs consume one-third of the energy the 1,000 servers would have required, and less data center space is occupied. Cost-wise, the GPU project was equivalent to scaling up the Linux farm, Edwards says.

In the future, Bloomberg plans to run other types of calculations, such as pricing of other types of derivatives and portfolio valuations, on GPUs.

"One of the challenges Bloomberg always faces is that we have very large scale," Edwards says. "We're serving all the financial and business community and there are a lot of different instruments and models people want calculated. This is a nice tool in our toolkit that we're looking to apply in different places."

Bloomberg Uses GPUs to Speed Up Bond Pricing by Wall Street & Technology
> 9/28/09 Update: NVIDIA has released the industry's first publicly available OpenCL GPU drivers for Windows and Linux, as well as an OpenCL Visual Profiler and SDK code samples.
>
> OpenCL Download Survey

Now is the time to start learning OpenCL. There will probably be some run-time platform-dependent differences between NVIDIA and ATI, something along the lines of what Lugh talks about.
I will rewrite my LU matrix decomposition with row pivoting and Successive Over-Relaxation algorithms for the GTX 295... curious about the performance results.
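For reference, the serial baseline being ported here is plain LU with partial (row) pivoting. A minimal pure-Python sketch of the textbook algorithm (function names are mine; this is not the GTX 295 port):

```python
def lu_decompose(a):
    """LU decomposition with partial (row) pivoting, in pure Python.

    Returns (lu, perm): lu packs L below the diagonal (unit diagonal
    implied) and U on/above it; perm records the final row order, so
    the permuted rows of a factor as L times U.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy
    perm = list(range(n))
    for k in range(n):
        # partial pivoting: bring the largest |entry| in column k to the diagonal
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]        # multiplier, stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm
```

The row swaps are exactly what makes the GPU port interesting: the pivot search is a reduction across a column, which parallelizes very differently from the rank-1 update.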


Vice President
nVidia GT300’s Fermi architecture unveiled

The Fermi architecture natively supports C [CUDA], C++, DirectCompute, DirectX 11, Fortran, OpenCL, OpenGL 3.1 and OpenGL 3.2. Now, you've read that correctly - Fermi comes with support for native execution of C++. For the first time in history, a GPU can run C++ code with no major issues or performance penalties, and when you add Fortran or C to that, it is easy to see that GPGPU-wise, nVidia did a huge job. You can see more information here
I have written pricing code for CUDA, been to the IBM Cell processor workshop in Austin, and looked into FPGA development. I also spent 5+ years on x86 microprocessor design using Verilog, so hardware design on an FPGA is not foreign to me.

The numerical differences between platforms are quite scary. Everyone who is looking at CUDA tends to focus on single-precision, not double-precision, performance. More importantly, GPUs do not fully support IEEE 754. Different NVIDIA GPU cards support different levels of IEEE 754, and the level of IEEE 754 support differs between single and double precision on the same GPU chip! Also, as I mentioned in another post today, an invalid memory access in CUDA, as of last year, does not cause an exception or crash the program. The damned program just keeps on running, accessing invalid numbers. Personally, I would prefer a crash to trading on nonsense numbers.

IBM Cell is even more complicated. CUDA has quite a simple programming model: you have a bunch of memory, you load it with values, and you have a kernel that operates over that data. Your kernel is implicitly threaded; hell, you don't even have control over how the threads are created or how they are queued. It's a completely different game for the Cell. You have to manage EVERYTHING on the Cell yourself: you have to write code to communicate with memory and between SPEs. Also, the Cell has an unusual memory architecture - it has no cache, which means you have to maintain cache coherency yourself. Who wants to do that?

FPGAs? Forget about it: it's dead. Xilinx and Mentor Graphics have really dropped the ball on the hardware design software. Although new FPGA chips can run at 500 to 600 MHz today, the hardware compilers can barely generate designs that go above 300 or 400 MHz. We are nearing the age of 40- or 80-core processors and gigahertz GPUs; why would we care about 400 MHz FPGAs? BTW, unless your design is designed and partitioned properly, a single line of code can cause a multi-day recompilation. Ouch. Also, most guys who demo financial code running on FPGAs do not make the whole design single-precision compliant: the intermediate calculations are stored in widths other than single precision, maybe bigger or maybe smaller. This is because these designers are counting each and every bit, and the whole program needs to be shoe-horned into the little FPGA chip. How do you explain that your option prices are different and are based on IEEE "Bob"?

An interesting real-world example: my coworker and I implemented an Asian option pricing program. I wrote mine in CUDA and he wrote his for the IBM Cell. His program ran faster, maybe 2x, but if you examined the code, you would notice some interesting things. The IBM Cell code was roughly 1/4 the size: he called all sorts of macros, while in CUDA you have to write a lot of setup and tear-down at the GPU/CPU boundary. The problem with the Cell code is that it resembled a device driver with a lot of I/O transfers. He spent an inordinate amount of time figuring out how to align memory reads/writes along the proper quad-word boundaries so his program wouldn't crash. Boring. The body of my code resembles standard C code, which is quite powerful because it makes sense to other developers.

Also, if IBM updates the Cell with one more SPE, the Cell code needs to be modified. If NVIDIA adds 20 or 80 more "cores", the CUDA code doesn't change. Nice.

What about ECC? All our Dell boxes have ECC memory at a 50% premium over non-ECC memory. NVIDIA cards have no ECC. Cell does. FPGAs? No way; they don't have enough silicon to do double precision, let alone store an extra parity bit. Does the integrity of your numbers' bit pattern matter to you? Obviously not to Bloomberg.

I feel vindicated today. I found out that IBM is dropping the Cell processor, as reported in this Ars Technica article.

One particular point that I brought up appears in the article:
The differences between Intel's future CPU/Larrabee hybrid and IBM's Cell may seem small, but they're critical. Cell's smaller floating-point cores are not general-purpose; they're specialized and implement their own instruction set. These small cores also don't have cache coherency or a real virtual memory implementation. Rather, they have "local store" pools of programmer-managed memory that make them a huge pain to program for. So in terms of programmability, the difference between Cell and a CPU/Larrabee hybrid is night and day.

Boys and girl(s)*, it's 2010. Do we want to spend the next two years before the end of the world writing cache coherency routines in our analytics code? I heard one IBM developer implemented some code that steals some memory from each SPE to build a coherent cache, but seriously, this should be built into the hardware.

---------- Post added at 02:04 PM ---------- Previous post was at 01:41 PM ----------

You create random numbers with a Mersenne Twister on the GPU card

Is this code re-entrant? One issue is *serial equivalence*, does the GPU code give *exactly* the same prices as the 1-core solution? If not, why not?

Probably never. If you look at every single NVIDIA CUDA example, they implement everything two ways: once for the CPU and once for the GPU. At the end of each example, they compare the run-times so they can compute the speed-up. Also, there is a simple test to see if the GPU result is close to the CPU result within some epsilon. If the numbers were exact, they wouldn't need an epsilon. Just my observation.

Also, in all their MC simulation code, they always round the user-specified number of sims up to a power of 2 that will occupy all the "units." In CUDA, doing extra work sometimes doesn't take any time at all if you are waiting anyway. For example, you might want to run 1,000 sims but you will get 1,024. Kinda annoying if you are trying to match MC-based option prices.
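Both behaviours, the power-of-2 padding and the epsilon comparison, are easy to sketch (helper names are mine; this merely mimics what the SDK samples do, it is not NVIDIA's code):

```python
def round_up_pow2(n):
    """Smallest power of two >= n: how the samples pad the simulation count."""
    return 1 << (n - 1).bit_length()

def close_enough(gpu_result, cpu_result, eps=1e-6):
    """SDK-style check: relative error within eps, not bitwise equality."""
    return abs(gpu_result - cpu_result) <= eps * max(abs(cpu_result), 1.0)

print(round_up_pow2(1000))   # 1024: ask for 1,000 sims, get 1,024
```

So a CPU baseline run with exactly 1,000 paths can never match the GPU's 1,024-path price bit-for-bit, which is precisely the annoyance described above.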
I too was recently heavily involved in writing pricing code in CUDA, and have also been doing lots of HPC work on other platforms in the past couple of years, so here are some opinions of mine:

I'd say that going with NVIDIA hardware and the CUDA API is still the safest option for anyone looking for a speed-up. As mentioned above, IBM Cell is dead (and it was ugly as hell to program). Larrabee is going to be heavily delayed (by at least a year), and overall its future is questionable too (although the Intel guys seem to be changing Larrabee's focus from GPU to HPC accelerator). As for OpenCL (also mentioned in some posts) - I had an opportunity to do lots of work in OpenCL recently, and this thing is crap at the moment. The programming model is OK, rather similar to CUDA, but the drivers/tools implementation is awful. With NVIDIA's OpenCL drivers, you get several times slower performance than for the same CUDA code; with AMD the performance is also very hard to get (the AMD guys have some very fast hardware in their offerings, but overall they still seem unconvinced about using GPUs for HPC, so the effort they put in is far behind what NVIDIA is doing). So overall, the state of OpenCL is the same as the state of CUDA approximately 2.5 years ago: it's certainly going to improve, but I see no reason to waste time waiting for it, especially as it appears that write-OpenCL-once-then-run-everywhere is a myth - you just have to tweak for each platform separately, so this really has no advantage over just deciding on NVIDIA hardware and sticking with CUDA.

There exist many other efforts to provide a higher-level paradigm for GPU/accelerator programming. One example is Matlab plugins, like the above-mentioned Jacket from AccelerEyes (there are others, like GPUmat); I was involved both in the implementation and the usage of something alike, and I'd say these won't fly either: it's very hard to match Matlab routines in semantics and numerical precision and still keep the GPU utilized efficiently. Furthermore, there is lots of work on providing extensions for general-purpose languages that would semi-automatically parallelize given sequences of code for execution on an accelerator (GPU or other type). For example, a recent release of the Portland Group suite of compilers offers something alike for Fortran (although I didn't like it - too much OpenMP-like stuff for me; on the other side, I really liked an alternative capability offered by the same release of their compiler tools, which is to write CUDA kernels in Fortran, together with having all of the CUDA runtime functions available through nice native Fortran syntax). For C++, RapidMind was providing an automated translation platform, and it was rather mature (if I remember correctly, they supported all of multi-core CPU, GPU, and Cell), but they were recently acquired by Intel, so I'd expect the soon-to-be-released-in-beta Intel Ct platform to be much like this, so it may be interesting to take a look into it.

As far as FPGAs are concerned, I wouldn't agree with DailyVaR - I think there is lots of potential in FPGAs, especially regarding recent C-to-FPGA developments. The Impulse C offering is very mature - I experimented with it to some extent, and while the programming model is certainly even more complicated than for GPUs, the effort could definitely be worthwhile given the overall speed-up potential. Also, other vendors are starting to offer this kind of tools (like Mitrionics), so I'd expect this field to quickly mature into a viable alternative to using GPUs.

So - overall, lots of very interesting development is ongoing, but the problem is that the programming models are far from standardized, and it's hard to know which one will eventually win out as the de facto standard. Still, considerable speed-ups (and thus competitive advantage over competitors) can be achieved by employing accelerators even today, so I'd say investing in this kind of development is a must already; and I'd also re-state that going with NVIDIA hardware and the CUDA API is a pretty safe bet at the moment: the hardware is fast and improving (Fermi is going to bring some really nice improvements), the software stack is mature and stable, and there also exists a considerable pool of people knowledgeable in CUDA to hire from.
I'm interested in seeing some kind of hard numbers, benchmarks, sample code, or screenshots from real-life applications, not ones from sales brochures.
If one decides to go the NVIDIA+CUDA path, what would you say is the entry point, price-wise, to invest in this?


I was working as an external HPC consultant on this particular CUDA-based options pricing project, so at the moment I'm still bound by various NDAs etc. I can certainly provide you with contacts by PM if you would like to discuss the details with these guys (be prepared for some amount of marketing talk, though); but let me also state my experience regarding the performance improvement this way: the code I wrote made it possible (for both Monte Carlo and PDE solvers, implemented for European and American options only) to achieve speed-ups close to the ratio of the number of GPU cores to CPU cores, multiplied by the ratio of GPU frequency to CPU frequency. Which means: if I run the pricing code on a Tesla C1060, with 240 single-precision units running at 1.3 GHz, then the code is approximately 30 times faster than SSE-based CPU code (which means 4 single-precision lanes used) on the same machine, with the CPU running at 2.5 GHz. Now, I know it is somewhat like comparing apples and oranges, but the speed-up of 30x is what counts after all; and note on the other side that the CPU code had to be heavily tweaked to get to a fair comparison - the GPU code was like two orders of magnitude faster in my initial testing (what I'm trying to say is: yes, it takes lots of effort to get to really good GPU code, and thus utilize the GPU to its maximum potential, but on the other side, these days it is equally hard, if not harder, to write good CPU code). As I was able to witness similar performance improvements on some of my other projects, unrelated to quantitative finance (some numerical algorithms map really well onto the GPU, some others admittedly do not), I'm more than convinced that for a large part of the compute-intensive calculations in finance, GPUs could provide tremendous speed-ups.
However, the speed-ups in core computations certainly have to be estimated within the context of your complete application (the well-known Amdahl and Gustafson-Barsis laws can help in estimating the overall speed-up possible).
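That back-of-the-envelope estimate, and the Amdahl correction to it, can be written down directly (function names and the illustrative 90%-parallel fraction are mine; the C1060/SSE figures are from the post above):

```python
def core_ratio_estimate(gpu_units, gpu_ghz, cpu_lanes, cpu_ghz):
    """Kernel-only speed-up estimate: unit-count ratio times clock ratio."""
    return (gpu_units / cpu_lanes) * (gpu_ghz / cpu_ghz)

def amdahl_speedup(p, s):
    """Overall speed-up when a fraction p of the runtime is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)

# Tesla C1060: 240 single-precision units at 1.3 GHz, vs 4-wide SSE at 2.5 GHz
print(round(core_ratio_estimate(240, 1.3, 4, 2.5), 1))   # 31.2

# If, say, only 90% of the application runs on the GPU, Amdahl's law caps
# the overall gain well below the kernel speed-up (and below 10x, ever):
print(round(amdahl_speedup(0.9, 31.2), 2))   # 7.76
```

This is why profiling the whole application, not just the pricing kernel, matters before committing to a port.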

As far as the initial investment is concerned - again, it really depends; there are so many scenarios possible, and it's hard to tell. But if you are into options pricing, then I guess I could state with some degree of certainty that if you could bring some CUDA-knowledgeable guy(s) on board, then in 4-5 man-months you should be able to build the basics of an options pricing engine that would prove the concept and let you estimate whether it is worth proceeding with CUDA or not. The software tools needed are mostly free, and as far as the hardware investment is concerned, you could probably start with one multi-GPU machine for testing (see here for some options), and on the other side developers could work on ordinary machines (I was doing all of my development on higher-end Lenovo/HP laptops equipped with Quadro Mobile solutions). Add to this an estimate of how long it would take you to detect bottlenecks in your existing engine, and/or adapt this code so that the pricing algorithms are pluggable (so that you could switch back and forth between your existing CPU code and the newly written GPU code), and I think you should be pretty close to an estimate of the initial cost.