
NVIDIA CUDA Toolkit for options pricing

Wallstyouth
Vice President · Joined 5/25/07 · Messages: 116 · Points: 28
We've been looking at CUDA for an options pricing application we run here, and I was wondering how many other shops on the Street are using this toolkit.

What makes CUDA very attractive is that it executes code directly on the GPU instead of the CPU. Code executed on the GPU seems to run many times faster than on a traditional CPU.

Some good applications for CUDA seem to be (a Black-Scholes kernel sketch follows the links below):
Binomial Option Pricing
Black-Scholes Option Pricing
Monte-Carlo Option Pricing
Parallel Mersenne Twister (random number generation)
Parallel Histogram
Image Denoising
Sobel Edge Detection Filter
Computational Finance
CUDA Zone - resource for C developers of applications that solve computing problems
Learn More about CUDA - NVIDIA
CUDA - Wikipedia, the free encyclopedia
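
As a rough illustration (my sketch, not the SDK sample itself), a Black-Scholes call kernel in CUDA can be as simple as one option per thread, in single precision:

#include <cuda_runtime.h>
#include <math.h>

__device__ float normCdf(float x)                       // standard normal CDF via erfcf
{
    return 0.5f * erfcf(-x * 0.7071067811865475f);      // -x / sqrt(2)
}

__global__ void bsCall(const float *S, const float *K, const float *T,
                       float r, float sigma, float *price, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // one option per thread
    if (i >= n) return;
    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i]) / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    price[i] = S[i] * normCdf(d1) - K[i] * expf(-r * T[i]) * normCdf(d2);
}

// Host side: copy S, K, T to the device, then launch e.g.
// bsCall<<<(n + 255) / 256, 256>>>(dS, dK, dT, r, sigma, dPrice, n);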
 
We are planning to use it. We are hiring a consultancy company that uses it already. I will let you know.
 
I'm very interested, so tell me more!
 
I already implemented a Monte Carlo pricing engine using CUDA.
It is just a very simple exotic pricing engine. With 3 billion sample paths in just 20 seconds, I can accurately find the delta value to within 0.5% error. (I'm just using a very cheap GF8600 GPU, but with the most advanced chips I guess the speed can improve further....)
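
As a rough sketch of how such an engine can be structured (illustrative only, not the poster's code; the toy per-thread RNG, the bump size and the payoff are my assumptions), each thread simulates one terminal value and estimates delta by bump-and-revalue with common random numbers:

#include <cuda_runtime.h>
#include <math.h>

__device__ float gauss(unsigned int seed)                // toy LCG + Box-Muller, sketch only
{
    seed = 1664525u * seed + 1013904223u;
    float u1 = ((seed >> 8) + 0.5f) * (1.0f / 16777216.0f);
    seed = 1664525u * seed + 1013904223u;
    float u2 = ((seed >> 8) + 0.5f) * (1.0f / 16777216.0f);
    return sqrtf(-2.0f * logf(u1)) * cospif(2.0f * u2);
}

__global__ void deltaPaths(float S0, float h, float K, float r, float sigma,
                           float T, float *dDelta, int nPaths)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       // one path per thread
    if (i >= nPaths) return;
    float z  = gauss(i * 2654435761u + 12345u);
    float gr = expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
    float up = fmaxf((S0 + h) * gr - K, 0.0f);           // bumped payoff
    float dn = fmaxf( S0      * gr - K, 0.0f);           // base payoff, same draw
    dDelta[i] = expf(-r * T) * (up - dn) / h;            // per-path delta estimate
}

// The host averages dDelta over all paths (a reduction kernel or thrust::reduce).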
 
Would CUDA speed up the calculations required for non-linear regressions?
 
This is an interesting subject. I've only read a bit from the links in this thread.
In graphics and hardware research, GPGPU is a hot topic. The computing power is incredible, the "price" being that you have to write parallel code.
For option pricing (finite differences, trees, etc.) we would need to change the single-threaded implementation to leverage parallel execution. Maybe I will do some research in my "spare time"...
 
This is indeed an interesting subject. Recently, I had a chat with some guys in the mortgage group at Bloomberg while I was researching hardware acceleration options for data compression. It gave a good insight into the power and limitations of GPU computing. It is amazing to know that NVIDIA now makes "headless" GPU hardware specifically for financial applications, i.e., these GPUs don't have any video-out socket at all. So, technically speaking, they are not GPUs and they don't have anything to do with graphics.

Though GPUs have far more cores than CPUs, that is not the main reason for using them. In fact, the typical clock speed of a GPU core is much lower than today's CPU clock speeds. Secondly, most financial problems are very sequential. However, they are also repetitive: pricing a single security is sequential, but you can price more securities with more cores. A GPU's power really lies in its ability to handle floating point more efficiently and, more importantly, in its SIMD support (single instruction, multiple data). Suppose you have to add two vectors. A CPU will take linear time to execute the add, because you will have a loop in your code adding each element separately. A GPU, on the other hand, supports vector-add instructions which can typically add up to 128 elements in constant time.
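
To make the contrast concrete, a minimal sketch (mine, not from the post): the CPU version touches one element per loop iteration, while on the GPU each thread handles one element of the same add; how many get added per clock is a hardware detail.

#include <cuda_runtime.h>

void addOnCpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)                          // one element per iteration
        c[i] = a[i] + b[i];
}

__global__ void addOnGpu(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       // one element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch with e.g. addOnGpu<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);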

But all this power comes at a cost.
1) You lose portability. GPU code is very much tied to a specific vendor and hardware.
2) The programming paradigm is different. Once you are on a GPU, the OS has very little role in resource management, so applications have to manage resources like cores and the several types of memory and registers on the GPU themselves, and also make sure they are not stepping on each other's resources.
3) The amount of memory on a GPU is limited, so your data structures have to be more compact and less fragmented, and the application on the CPU will have to move bits and pieces to the GPU and drive the algorithm (see the sketch after this list).
4) Unless you are developing everything from scratch, integrating with existing code is going to be tricky and painful.
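
To illustrate point 3, a sketch of the usual host-driven pattern (kernel and function names are made up for illustration): the CPU allocates one batch-sized buffer on the device, streams data through it, and launches the kernel batch by batch.

#include <cuda_runtime.h>

__global__ void priceBatch(const float *dIn, float *dOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dOut[i] = dIn[i];                         // placeholder for the real pricing
}

void priceAll(const float *hostIn, float *hostOut, int total, int batch)
{
    float *dIn, *dOut;
    cudaMalloc(&dIn,  batch * sizeof(float));            // limited device memory:
    cudaMalloc(&dOut, batch * sizeof(float));            // only one batch lives on the card
    for (int off = 0; off < total; off += batch) {
        int n = (total - off < batch) ? total - off : batch;
        cudaMemcpy(dIn, hostIn + off, n * sizeof(float), cudaMemcpyHostToDevice);
        priceBatch<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
        cudaMemcpy(hostOut + off, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
}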
 
Response to NVIDIA CUDA

>But all this power comes at a cost.
>1) You lose portability. GPU code is very much tied to a specific vendor and hardware.

Moot point.... Nvidia is not going away anytime soon. There are many proprietary technical solutions implemented on a desk that are tied to a vendor. This is an argument often espoused by "Java" and, in general, open-source supporters. If this were truly a concern, the MSFT .NET framework would have gone the way of Windows ME. Moreover, Apple's Grand Central would not be getting the buzz it has been receiving in the GPU community.

>2) The programming paradigm is different. Once you are on a GPU, the OS has very little role in resource management, so applications have to manage resources like cores and the several types of memory and registers on the GPU themselves, and also make sure they are not stepping on each other's resources.

An argument often espoused by those who dabble in languages that run inside a virtual machine :). Herein lies the difference between a coder and a programmer. Managing memory, threads, etc. is tedious, but not hard to implement. The heavy lifting comes from designing the program or framework.


>3) The amount of memory on a GPU is limited, so your data structures have to be more compact and less fragmented, and the application on the CPU will have to move bits and pieces to the GPU and drive the algorithm.

It depends on the skills of the programmer. Unless you are loading an entire database into memory, the current memory on GPUs is more than adequate. And if you are loading a huge dataset onto the GPU, then you have to reconsider your program design.

The "headless" GPU cards and standalone systems come with 1GB - 4GB of ram. It is more than
enough to handle heavy computing.

>4) Unless you are developing everything from scratch, integrating with existing code is going to be tricky and painful.

CUDA is more or less C, which means it will talk to C++ programs (with some modifications), and if you look hard enough, you can find a way to wrap the interface for other languages (a sketch of the C++ route follows below).
On the enterprise level, if you have a robust messaging system (e.g., TIBCO Rendezvous), then this becomes a non-issue.
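
A sketch of that interop pattern (file and function names are illustrative): the kernel lives in a .cu file compiled with nvcc, and an extern "C" wrapper gives the existing C++ code a plain function to link against.

// pricing.cu -- compiled with nvcc
#include <cuda_runtime.h>

__global__ void scaleKernel(float *x, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

extern "C" void gpu_scale(float *hostX, float factor, int n)
{
    float *dX;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMemcpy(dX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(dX, factor, n);
    cudaMemcpy(hostX, dX, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
}

// legacy.cpp -- built with the existing toolchain, linked against the .cu object
extern "C" void gpu_scale(float *x, float factor, int n);
void revalue(float *prices, int n) { gpu_scale(prices, 1.01f, n); }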


The GPU is a great computing resource. It does take more effort in the design and coding of programs, but the pros far outweigh the cons.
 
As soon as I have some hard details I will try to post them here. We are working with CUDA at the moment.

BTW, it is really, really fast. Also, we are not talking about graphics cards but Tesla cards and Tesla machines.
 
>But all this power comes at a cost.
>1) You lose portability. GPU code is very much tied to a specific vendor and hardware.
>2) The programming paradigm is different. Once you are on a GPU, the OS has very little role in resource management, so applications have to manage resources like cores and the several types of memory and registers on the GPU themselves, and also make sure they are not stepping on each other's resources.
>3) The amount of memory on a GPU is limited, so your data structures have to be more compact and less fragmented, and the application on the CPU will have to move bits and pieces to the GPU and drive the algorithm.
>4) Unless you are developing everything from scratch, integrating with existing code is going to be tricky and painful.

1) NVIDIA CUDA, AMD Brook, and IBM's toolkit for the Cell are three possible ways to take this new GPU route (although the Cell is a little different).
However, OpenCL was launched today with the first header. I think it will be the solution in the future.
OpenCL - Wikipedia, the free encyclopedia
2) Right.
3) You can see this monster: http://www.nvidia.com/object/personal_supercomputing.html, but memory is a big problem for reads and writes.
As far as I am concerned, the big problem with the GPU is transferring data onto the card and then copying results back to the CPU.
For example, for a good Monte Carlo with Sobol sequences you need to have all your data on the GPU; otherwise, if you keep reading and writing that memory from the host, you lose the power of the GPU.
J
 
FYI, in preliminary tests from a vendor, they are able to calculate implied vol for 500,000 options in 1.5 seconds.

BTW, Brook has been around for some time but it has never caught on. GPU people are very secretive about their stuff, so I don't see cross-compatibility any time soon.
 
(1) OpenCL is an Apple initiative. Is there a Linux, Unix, or Win32/Win64 implementation?

(3) Please elaborate on the memory and system bus constraint. Do you expect to scale up to 16/32/64 GB and to allow for a 3 GHz memory pipe between the CPU, GPU, and RAM?
You can design your program around memory constraints. The bus speed of your system will allow your GPU to write to and from system RAM in an efficient manner, especially if you employ a PCIe 2.0 card and DDR3 RAM. The bottleneck is reduced to the point of being almost unnoticeable. You do not lose the power of the GPU; you get your results a little later, which to the end user is unnoticeable.
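
One way to design around the bus, as a sketch (illustrative names; assumes the host buffer is pinned and n is a multiple of 2 * chunk): asynchronous copies on two streams let transfers overlap with kernel execution instead of serializing with it.

#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];                       // stand-in for the real kernel
}

void run(float *h, int n, int chunk)                     // h must be allocated with cudaMallocHost
{
    float *d0, *d1;
    cudaMalloc(&d0, chunk * sizeof(float));
    cudaMalloc(&d1, chunk * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    for (int off = 0; off < n; off += 2 * chunk) {       // two chunks in flight at a time
        cudaMemcpyAsync(d0, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s0);
        work<<<(chunk + 255) / 256, 256, 0, s0>>>(d0, chunk);
        cudaMemcpyAsync(h + off, d0, chunk * sizeof(float), cudaMemcpyDeviceToHost, s0);
        cudaMemcpyAsync(d1, h + off + chunk, chunk * sizeof(float), cudaMemcpyHostToDevice, s1);
        work<<<(chunk + 255) / 256, 256, 0, s1>>>(d1, chunk);
        cudaMemcpyAsync(h + off + chunk, d1, chunk * sizeof(float), cudaMemcpyDeviceToHost, s1);
    }
    cudaDeviceSynchronize();                             // wait, then clean up
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
}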

Please elaborate on the "good montecarlo with sobol sequences" comment.


1) Just go to OpenCL. OpenCL is universal ;)
3) "The bus speed of your system will allow your GPU to write to and from system RAM in an efficient manner, especially if you employ a PCIe 2.0 card and DDR3 RAM."
OK, but if for each evaluation of your payoff you need to move GPU data back into CPU data, you will have a CUDA read and a CUDA write, which is expensive compared to the kernel itself.
Just to say that if you want the full power of the GPU you need to rebuild all your code on the GPU card, and that is expensive.
In my Monte Carlo example:
You generate random numbers with a Mersenne Twister on the GPU card.
If instead you want Sobol sequences, you will need to copy them to the GPU card, and then if you have a basket of 10 assets with 100 evaluation dates and 100,000 paths, those random numbers alone will take at least 1 GB of RAM.
Another example: with a BGM model, the drift computation is very expensive on a CPU but very, very fast on the GPU. However, if you need your future discount curve on the CPU to make another computation, you will need to copy the forward curve back to the CPU for each evaluation date, which is expensive, and so you lose the power of the GPU.
So for each example you must rebuild your code, and that is risky.

Consequently, CUDA is a good solution, but firms cannot invest in it because they would have to rebuild all their applications specifically for CUDA, which is dangerous. If one year later Brook is better, you will need to throw away the CUDA code and start again with Brook... and the same with the Cell.
OpenCL creates homogenisation and will be the solution.
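
A sketch of the "keep everything on the card" point above (illustrative only; it uses NVIDIA's CURAND library, a later addition to the toolkit than this thread, linked with -lcurand): the draws are generated on the device and consumed there, so only the final average ever crosses the bus.

#include <cuda_runtime.h>
#include <curand.h>
#include <math.h>

__global__ void payoff(const float *z, float *out, float S0, float K,
                       float r, float sigma, float T, int nPaths)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPaths) return;
    float ST = S0 * expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z[i]);
    out[i] = expf(-r * T) * fmaxf(ST - K, 0.0f);
}

void priceOnDevice(int nPaths)                           // nPaths: multiple of the Sobol dimension
{
    float *dZ, *dOut;
    cudaMalloc(&dZ,   nPaths * sizeof(float));
    cudaMalloc(&dOut, nPaths * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_QUASI_SOBOL32);  // quasi-random (Sobol) generator
    curandGenerateNormal(gen, dZ, nPaths, 0.0f, 1.0f);      // draws never leave the GPU

    payoff<<<(nPaths + 255) / 256, 256>>>(dZ, dOut, 100.f, 100.f, 0.03f, 0.2f, 1.f, nPaths);
    // ...reduce dOut on the device; only the final average is copied back to the host.

    curandDestroyGenerator(gen);
    cudaFree(dZ);
    cudaFree(dOut);
}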
 
>FYI, in preliminary tests from a vendor, they are able to calculate implied vol for 500,000 options in 1.5 seconds.
>BTW, Brook has been around for some time but it has never caught on. GPU people are very secretive about their stuff, so I don't see cross-compatibility any time soon.
For compatibility, just look at OpenCL; I think next year OpenCL will be a standard.
However, Intel with Larrabee will be a good competitor.
500,000 options in 1.5 seconds is not a real case, because you would need a cudaMemcpy for all your data: forwards, discounts, etc. It's the famous Black-Scholes example in the CUDA SDK, which never changes the input data.
When a vendor claims a gain of 100x compared to the CPU, it is generally a fake; when the vendor says 10x, it is more realistic.
 