• C++ Programming for Financial Engineering
    Highly recommended by thousands of MFE students. Covers essential C++ topics with applications to financial engineering. Learn more Join!
    Python for Finance with Intro to Data Science
    Gain practical understanding of Python to read, understand, and write professional Python code for your first day on the job. Learn more Join!
    An Intuition-Based Options Primer for FE
    Ideal for entry level positions interviews and graduate studies, specializing in options trading arbitrage and options valuation models. Learn more Join!

S+ : Need help with big data set

Joined
1/2/07
Messages
128
Points
28
I am working with a data set that is set up as follows:

There are several thousand ticker symbols whose daily returns are binned in 18 sections throughout the day. There are between 25 - 30 days of data for each ticker.

I want to find the mean return for each symbol and create a data set that that will have one row for each ticker, and the mean return for each of the 18 bins.

I wrote a function to do this, but it is much too slow. Are there any native functions that will interpret "sections" based on data in a column (in my case, the 'SYMBOL' column), perform calculations across each section and then create a new spreadsheet with this data?

I appreciate any help, I have looked all over the internet and not found a solution to this problem.
 
are you using any of the function with built-in internal loops, like apply, lapply, etc. They are incredibly fast and efficient. Stay away from for-loop as much as possible.

SPLUS Library: Advanced functions

You can substitute function sum with mean

mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1

[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
[4,] 4 4 4 4

#row sums of mat1
apply(mat1, 1, sum)
[1] 4 8 12 16

#column sums of mat1
apply(mat1, 2, sum)
[1] 10 10 10 10
 
This is definitely on the right track. What I need though is a way to distinguish between the data for each symbol. The data is laid out like this:

| RETURNS
symbol date bin1 bin2 bin3...
1 Oct1 .002 .042 .220
1 Oct2 .052 .041 .020
1 Oct3 .034 .742 .245
1 Oct4 .234 .042 .223
1 Oct5 .230 .092 .292
2 Oct1 .023 .002 .220
2 Oct2 .094 .023 .928
2 Oct3 .950 .293 .032
2 Oct4 .908 .140 .402
.
.
.

Also, importantly, there are not a uniform number of rows per symbol, so I can't use a hard-coded value to refer to the number of values used to calculate the standard dev. (I know I said 'mean return,' but I meant standard deviation).
 
This is definitely on the right track. What I need though is a way to distinguish between the data for each symbol. The data is laid out like this:

| RETURNS
symbol date bin1 bin2 bin3...
1 Oct1 .002 .042 .220
1 Oct2 .052 .041 .020
1 Oct3 .034 .742 .245
1 Oct4 .234 .042 .223
1 Oct5 .230 .092 .292
2 Oct1 .023 .002 .220
2 Oct2 .094 .023 .928
2 Oct3 .950 .293 .032
2 Oct4 .908 .140 .402
.
.
.

Also, importantly, there are not a uniform number of rows per symbol, so I can't use a hard-coded value to refer to the number of values used to calculate the standard dev. (I know I said 'mean return,' but I meant standard deviation).

S-Plus is really flexible with vector and matrix operations. Here's an example:
> retmat = matrix(c(1, 0.002, 0.042, 0.22, 1, 0.052, 0.041, 0.02, 1, 0.034, 0.742, 0.245, 1, 0.234, 0.042, 0.223, 1, 0.23, 0.092, 0.292, 2, 0.023, 0.002, 0.22, 2, 0.094, 0.023, 0.928, 2, 0.95, 0.293, 0.032, 2, 0.908, 0.14, 0.402), nrow = 9, ncol = 4, byrow = T)

I did not include the date column
> retmat
[,1] [,2] [,3] [,4]
[1,] 1 0.002 0.042 0.220
[2,] 1 0.052 0.041 0.020
[3,] 1 0.034 0.742 0.245
[4,] 1 0.234 0.042 0.223
[5,] 1 0.230 0.092 0.292
[6,] 2 0.023 0.002 0.220
[7,] 2 0.094 0.023 0.928
[8,] 2 0.950 0.293 0.032
[9,] 2 0.908 0.140 0.402

to retrive the bin1 data (note: column 1 holds the symbol value)
> retmat[,2]
[1] 0.002 0.052 0.034 0.234 0.230 0.023 0.094 0.950 0.908

to get an indicator vector, which identifies rows containing data for stock #1
> retmat[,1]==1
[1] T T T T T F F F F

use this indicator vector to retrive a vector containing data belonging to stock #1
> retmat[retmat[,1]==1,2]
[1] 0.002 0.052 0.034 0.234 0.230

use mean() function to find the average for this stock in bin1
> mean(retmat[retmat[,1]==1,2])
[1] 0.1104

# this would let you loop through from stock 1 to 2 and get the average for each stock (bin 1)
for (i in 1:2)
{
avg = mean(retmat[retmat[,1]==i,2])
}

> avg
[1] 0.11040 0.49375
 
Back
Top