Cloud services to backtest minute data

Hi Quantnet,

My current task is to backtest a 5-minute-bar trading strategy over 10 years. It uses regression and, as expected, it's very slow. At 1 hour for a single asset's 10-year result, it's obvious that I need more computational power. I have a few virtual machines at my disposal, but they won't cut it.

Question: Are there any free cloud services where I can rent machines and backtest my strategies? Something like AWS or Azure. The emphasis is on free. Scale is a priority here, so really I need something like 100 free virtual machines in order to say the implementation is worthwhile.

Relatedly, have any of you written your own linear regression function that performs orders of magnitude faster than a library? I'm using accord-framework.net. I'd also assume third-party libraries are already optimized for speed.

Donny
 
Actually, for simple rolling linear regression with a fixed number of variables, you can implement it in CUDA C. It will speed up your regression by at least an order of magnitude.
 
Thanks everyone. I heard of CUDA before. I think that's what I was looking for.

Out of the multitude of options (CUDA, own C# implementation, FPGA, Matlab, R), it looks like CUDA is the way to go.

I'll trust your word on the CUDA implementation and look into it.
 
Well, I actually wrote something like that a few years ago. I'll just paste it here in case it helps.

C++:
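#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE_PER_THREAD 16
#define BLOCK_SIZE 32

//PrintError was not included in the original post; this is an assumed minimal version
//that reports the last CUDA error (if any) together with a label
static void PrintError(const char* label){
    cudaError_t err=cudaGetLastError();
    if (err!=cudaSuccess) fprintf(stderr,"%s: %s\n",label,cudaGetErrorString(err));
}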
__global__ void RollingSimpleLinearRegression_kernel(const float* x_dev,const float* y_dev,float* beta_dev, float* alpha_dev, float* residual_dev, int n, int window){
    //Each thread fits one initial window, then rolls it forward up to ARRAY_SIZE_PER_THREAD steps
    size_t threadId=threadIdx.x + blockIdx.x * blockDim.x;
    size_t start=threadId*ARRAY_SIZE_PER_THREAD;

    //Nothing to do if the first output index (start+window) is past the end of the data;
    //this also keeps the reads of the initial window in bounds
    if (start+(size_t)window >= (size_t)n) return;

    float Sx=0;
    float Sy=0;
    float Sxx=0;
    float Sxy=0;
    float Syy=0; //Not needed for alpha/beta; kept from the original (useful for R^2)

    float x=0;
    float y=0;

    //First regression: accumulate the sums over [start, start+window)
    for (size_t i=0;i<(size_t)window;i++){
        x=x_dev[start+i];
        y=y_dev[start+i];
        Sx+=x;
        Sy+=y;
        Sxx+=x*x;
        Sxy+=x*y;
        Syy+=y*y;
    }
    //OLS slope and intercept; the sample count is window, not n
    float beta=(window*Sxy-Sx*Sy)/(window*Sxx-Sx*Sx);
    float alpha=(Sy-beta*Sx)/window;
    float residual=0;

    for (size_t i=0;i<ARRAY_SIZE_PER_THREAD;i++){
        size_t dataId=start+window+i;

        if (dataId < (size_t)n){
            x = x_dev[dataId];
            y = y_dev[dataId];
            //Residual of the new point against the fit on the preceding window
            residual = y - alpha - beta*x;
            //Output
            beta_dev[dataId] = beta;
            alpha_dev[dataId] = alpha;
            residual_dev[dataId] = residual;
            //Update metrics: add the new point, drop the oldest point in the window
            float oldX = x_dev[start+i];
            float oldY = y_dev[start+i];
            Sx += x - oldX;
            Sy += y - oldY;
            Sxx += x*x - oldX*oldX;
            Sxy += x*y - oldX*oldY;
            Syy += y*y - oldY*oldY;
            beta = (window*Sxy - Sx*Sy) / (window*Sxx - Sx*Sx);
            alpha = (Sy - beta*Sx) / window;
        }
    }

}
void RollingSimpleLinearRegression(float* x,float* y,float* beta, float* alpha, float* residual, int n, int window){
    //Need at least two points per window and one point after the first full window
    if (window < 2 || n <= window) return;

    cudaSetDevice(0);
    cudaFree(0); //Forces CUDA context creation up front

    cudaStream_t stream;
    cudaError_t result;

    result = cudaStreamCreate ( &stream) ;

    if (result!=cudaSuccess){
        PrintError("Create stream");
        return;
    }
    size_t vecSzInBytes_float=n * sizeof(float);

    //Allocate vectors in device memory
    float* x_dev=0;
    float* y_dev=0;
    float* alpha_dev=0;
    float* beta_dev=0;
    float* residual_dev=0;

    cudaMalloc(&x_dev               ,vecSzInBytes_float);
    cudaMalloc(&y_dev               ,vecSzInBytes_float);
    cudaMalloc(&alpha_dev           ,vecSzInBytes_float);
    cudaMalloc(&beta_dev            ,vecSzInBytes_float);
    cudaMalloc(&residual_dev        ,vecSzInBytes_float);

    //Zero the outputs on the same stream as the kernel; entries before index window stay 0
    cudaMemsetAsync(alpha_dev       ,0,vecSzInBytes_float,stream);
    cudaMemsetAsync(beta_dev        ,0,vecSzInBytes_float,stream);
    cudaMemsetAsync(residual_dev    ,0,vecSzInBytes_float,stream);

    cudaMemcpyAsync(x_dev          ,x,vecSzInBytes_float,cudaMemcpyHostToDevice,stream);
    cudaMemcpyAsync(y_dev          ,y,vecSzInBytes_float,cudaMemcpyHostToDevice,stream);

    cudaStreamSynchronize(stream);

    PrintError("cudaMemcpyAsync and cudaMemset");
    //Calculate number of blocks (integer ceiling division over the n-window outputs)
    size_t outputsPerBlock = (size_t)ARRAY_SIZE_PER_THREAD*BLOCK_SIZE;
    size_t noBlock = ((size_t)(n-window) + outputsPerBlock - 1)/outputsPerBlock;
    //Launch on the same stream that is synchronized below
    RollingSimpleLinearRegression_kernel<<<noBlock,BLOCK_SIZE,0,stream>>>(x_dev,y_dev,beta_dev,alpha_dev,residual_dev,n,window);
    cudaStreamSynchronize(stream);
    PrintError("Kernel");
    //Copy back to host
    cudaMemcpyAsync(beta            ,beta_dev,vecSzInBytes_float,cudaMemcpyDeviceToHost,stream);
    cudaMemcpyAsync(alpha           ,alpha_dev,vecSzInBytes_float,cudaMemcpyDeviceToHost,stream);
    cudaMemcpyAsync(residual        ,residual_dev,vecSzInBytes_float,cudaMemcpyDeviceToHost,stream);

    cudaStreamSynchronize(stream);

    //Free memory
    cudaFree(x_dev);
    cudaFree(y_dev);
    cudaFree(beta_dev);
    cudaFree(alpha_dev);
    cudaFree(residual_dev);

    cudaStreamDestroy(stream);
    PrintError("End function");
}
 
Any idea why there isn't a simple linear regression library for CUDA?

I've been searching the internet and only found:
https://devtalk.nvidia.com/default/topic/494482/linear-regression-cuda-code/

A guy asked the same thing but didn't get an answer.

I'm being pushed for results so if it'll take time, I need to go back to my manager and propose a tangent from our infrastructure to develop on CUDA, which may not be taken well.

Did a few more searches. Unless there's no such thing. I found something like this: Home

Basically, can I just go
CUDALinearModel aModel = new CUDALinearModel()
aModel.Regress(inputs, outputs)


Or will I have to learn this CUDA thing from scratch?
 
Last time I checked, there was no such library widely available (maybe there is a proprietary one). It would also be quite inefficient, since each regression would need a round trip: copy the data to GPU memory, run the regression, then copy the result back, i.e. no batch regression. If you only do simple linear regression, then I think my code is OK. Otherwise, you will have to do something along the lines of batch QR decomposition to speed up your multiple regression.
 
Thank you, BurgerKing. Actually, I did a quick prototype in Matlab to backtest 10 years of 5-minute bars with linear regression. You know what, it's obviously faster. So a quick question to confirm:

In general, which is faster at linear regression: scientific languages (Matlab, R) or C# with the Accord.NET Machine Learning Framework library?

My colleague's view is that scientific languages are meant for that sort of work. If what seems to be a 10x speed increase is real, I can quickly switch backtesting to Matlab, check results, then recode the strategy in C# for execution.
 
Actually, if you only do regression, MATLAB is highly optimized for that kind of work and is likely faster than the C# code (I have no idea how good R is, though). 10x is a bit too much, but maybe 2-3 times. I think you can first try a simple prototype with the MATLAB parfor command across all your CPU cores. If that's still not enough, then either set up a MATLAB computing cluster or write a MATLAB mex file to leverage CUDA in C, or both.

I think the trick is that when you backtest your strategy, you will likely have to do a lot of rolling regressions at 5-minute increments, so try to reuse or update the previous estimate online instead of redoing the whole regression.
 
I did a quick prototype to test speed for C#, Matlab and R. The setup is to regress y ~ x1 + x2, where y, x1 and x2 are each 8,000 randomly generated numbers, then repeat the regression 5,000 times.

Here's what I got. I'm sharing it with you since everyone else in Quantnet seems quiet.

Matlab: 17 seconds.
R: 55 seconds.
C#: 24 seconds.

Of course, you have to consider the time taken to generate the random numbers. I decided on new random numbers each pass just to be sure the language does a new regression in case the previous one was saved.
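
For reference, the same setup could be sketched in C++ roughly as follows. This is not the code behind the timings above. `fitTwoRegressors` and `timeRegressions` are made-up names, and solving the 3x3 normal equations directly is an assumption chosen for brevity (a library would more likely use QR). As in the test described, the random number generation is inside the timed loop.

```cpp
#include <array>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Fit y = b0 + b1*x1 + b2*x2 by solving the normal equations (X'X) b = X'y.
// Assumes the regressors are not collinear (no singularity check).
std::array<double, 3> fitTwoRegressors(const std::vector<double>& x1,
                                       const std::vector<double>& x2,
                                       const std::vector<double>& y) {
    double a[3][4] = {};  // augmented matrix [X'X | X'y]
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double row[3] = {1.0, x1[i], x2[i]};
        for (int r = 0; r < 3; ++r) {
            for (int c = 0; c < 3; ++c) a[r][c] += row[r] * row[c];
            a[r][3] += row[r] * y[i];
        }
    }
    // Gauss-Jordan elimination with partial pivoting
    for (int col = 0; col < 3; ++col) {
        int piv = col;
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[piv][col])) piv = r;
        for (int c = 0; c < 4; ++c) std::swap(a[col][c], a[piv][c]);
        for (int r = 0; r < 3; ++r) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (int c = col; c < 4; ++c) a[r][c] -= f * a[col][c];
        }
    }
    std::array<double, 3> b = {{a[0][3] / a[0][0], a[1][3] / a[1][1], a[2][3] / a[2][2]}};
    return b;
}

// Runs `reps` regressions on fresh random data each pass and returns the
// elapsed seconds. `checksum` collects one coefficient per pass so the
// compiler cannot discard the fits.
double timeRegressions(std::size_t nPoints, std::size_t reps, double& checksum) {
    std::mt19937 gen(42);
    std::normal_distribution<double> dist(0.0, 1.0);
    std::vector<double> x1(nPoints), x2(nPoints), y(nPoints);
    checksum = 0.0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < reps; ++r) {
        for (std::size_t i = 0; i < nPoints; ++i) {
            x1[i] = dist(gen);
            x2[i] = dist(gen);
            y[i] = dist(gen);
        }
        checksum += fitTwoRegressors(x1, x2, y)[1];
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `timeRegressions(8000, 5000, checksum)` would mirror the sizes used in the test. Timing the data generation separately would show how much of the total is regression and how much is random numbers.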

Looks like Matlab it is.

Though the problem I see here is that I'm not sure whether the differences would be amplified if I used actual data rather than random data. The scientific library might be able to pick up on the commonalities between passes and optimize further.

I guess that's why I'm a quantitative researcher and not a software engineer. Thanks for the good discussion. At least I have something to go back to my boss with.
 
@donny I think it should be faster with real data, because the data will be pre-allocated in memory, and MATLAB's memory performance is actually quite good for a scripting language.

@ExSan IMO, cloud computing has a very poor price/performance ratio, which means you won't be able to profit from mining in the cloud. Most people and companies use the cloud because it saves them maintenance and setup costs, not because of its performance.
 
What about bitcoin mining using the cloud?
This is a money-losing proposition. Even using ASICs (which I have done) won't justify the cost today. (It didn't justify the cost a year ago when I did it either, but it was a fun project.)
 
please enlighten me
ASICs ?
Application-specific integrated circuit - Wikipedia, the free encyclopedia
You can buy specific circuits to mine bitcoin. They have been around sort of cheap for a while. You can buy highly sophisticated ones or small ones depending on your taste. These are specifically designed to only solve the bitcoin problem so they don't waste time/energy in anything else. If you aren't doing ASIC miners, you can pretty much kiss the odds of finding a block goodbye.

Look below, read and enjoy:

Learn about Bitcoin mining hardware
Mining hardware comparison - Bitcoin Wiki
Non-specialized hardware comparison - Bitcoin Wiki
Bitcoin Mining Hardware - ASIC Bitcoin Miner - Butterfly Labs
Can Hobbyist Bitcoin Miners Still Make a Buck?
Blockchain Smashers
Don’t buy a Terraminer
 
thanks a lot !
 
Hmmm, I repeated the test using actual prices and the improvement wasn't as large. For linear regression on 8,000 data points, repeated 5,000 times using real AUDUSD and ES prices, Matlab took 70% of the time C# took. R was slower than both Matlab and C#.

And I went to my boss and said linear regression done on Matlab was faster than C#, and he was shocked. Argh.
 
did you use loops in your R version?
 

diegosanaz

BU MSMF
yeah, there are many tricks that could be used in R to improve performance, such as using some parallel computing library.
 
In R, you can't vectorize across regressions when each regression takes a whole vector as its input.

Or at least, as far as my knowledge of R goes, it can't be done without loops.

The problem is a loop of X iterations, where each iteration is a regression on Y points.
 