S+ : Need help with big data set

Joined
1/2/07
Messages
128
Points
28
I am working with a data set that is set up as follows:

There are several thousand ticker symbols whose daily returns are binned in 18 sections throughout the day. There are between 25 - 30 days of data for each ticker.

I want to find the mean return for each symbol and create a data set that that will have one row for each ticker, and the mean return for each of the 18 bins.

I wrote a function to do this, but it is much too slow. Are there any native functions that will interpret "sections" based on data in a column (in my case, the 'SYMBOL' column), perform calculations across each section and then create a new spreadsheet with this data?

I appreciate any help, I have looked all over the internet and not found a solution to this problem.
 
are you using any of the function with built-in internal loops, like apply, lapply, etc. They are incredibly fast and efficient. Stay away from for-loop as much as possible.

SPLUS Library: Advanced functions

You can substitute function sum with mean

mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1

[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
[4,] 4 4 4 4

#row sums of mat1
apply(mat1, 1, sum)
[1] 4 8 12 16

#column sums of mat1
apply(mat1, 2, sum)
[1] 10 10 10 10
 
This is definitely on the right track. What I need though is a way to distinguish between the data for each symbol. The data is laid out like this:

| RETURNS
symbol date bin1 bin2 bin3...
1 Oct1 .002 .042 .220
1 Oct2 .052 .041 .020
1 Oct3 .034 .742 .245
1 Oct4 .234 .042 .223
1 Oct5 .230 .092 .292
2 Oct1 .023 .002 .220
2 Oct2 .094 .023 .928
2 Oct3 .950 .293 .032
2 Oct4 .908 .140 .402
.
.
.

Also, importantly, there are not a uniform number of rows per symbol, so I can't use a hard-coded value to refer to the number of values used to calculate the standard dev. (I know I said 'mean return,' but I meant standard deviation).
 
This is definitely on the right track. What I need though is a way to distinguish between the data for each symbol. The data is laid out like this:

| RETURNS
symbol date bin1 bin2 bin3...
1 Oct1 .002 .042 .220
1 Oct2 .052 .041 .020
1 Oct3 .034 .742 .245
1 Oct4 .234 .042 .223
1 Oct5 .230 .092 .292
2 Oct1 .023 .002 .220
2 Oct2 .094 .023 .928
2 Oct3 .950 .293 .032
2 Oct4 .908 .140 .402
.
.
.

Also, importantly, there are not a uniform number of rows per symbol, so I can't use a hard-coded value to refer to the number of values used to calculate the standard dev. (I know I said 'mean return,' but I meant standard deviation).

S-Plus is really flexible with vector and matrix operations. Here's an example:
> retmat = matrix(c(1, 0.002, 0.042, 0.22, 1, 0.052, 0.041, 0.02, 1, 0.034, 0.742, 0.245, 1, 0.234, 0.042, 0.223, 1, 0.23, 0.092, 0.292, 2, 0.023, 0.002, 0.22, 2, 0.094, 0.023, 0.928, 2, 0.95, 0.293, 0.032, 2, 0.908, 0.14, 0.402), nrow = 9, ncol = 4, byrow = T)

I did not include the date column
> retmat
[,1] [,2] [,3] [,4]
[1,] 1 0.002 0.042 0.220
[2,] 1 0.052 0.041 0.020
[3,] 1 0.034 0.742 0.245
[4,] 1 0.234 0.042 0.223
[5,] 1 0.230 0.092 0.292
[6,] 2 0.023 0.002 0.220
[7,] 2 0.094 0.023 0.928
[8,] 2 0.950 0.293 0.032
[9,] 2 0.908 0.140 0.402

to retrive the bin1 data (note: column 1 holds the symbol value)
> retmat[,2]
[1] 0.002 0.052 0.034 0.234 0.230 0.023 0.094 0.950 0.908

to get an indicator vector, which identifies rows containing data for stock #1
> retmat[,1]==1
[1] T T T T T F F F F

use this indicator vector to retrive a vector containing data belonging to stock #1
> retmat[retmat[,1]==1,2]
[1] 0.002 0.052 0.034 0.234 0.230

use mean() function to find the average for this stock in bin1
> mean(retmat[retmat[,1]==1,2])
[1] 0.1104

# this would let you loop through from stock 1 to 2 and get the average for each stock (bin 1)
for (i in 1:2)
{
avg = mean(retmat[retmat[,1]==i,2])
}

> avg
[1] 0.11040 0.49375
 
Back
Top Bottom