
Analyzing ultra high frequency data sets


Vice President
I'm curious to know what underlying technologies most people here working in the industry are using to store and analyze very high frequency data sets. I've heard of many firms using proprietary file-based storage systems on ramdisks and such — is this true for most firms playing in this space? I'm generally referring to the folks who are analyzing a few hundred thousand data points in a single day for a given set of securities.

Also, what sort of non-generic tick filters are you using today? It's very easy to clean outliers and the obvious time/price and bid/ask abnormalities. I'm just curious to see what others consider "unclean" in their market feeds.
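For context, here's a minimal sketch of the kind of generic cleaning I mean — a trailing-median outlier check plus a crossed-quote check. The field names, window, and threshold are illustrative assumptions, not anyone's production filter:

```python
import statistics

def clean_ticks(ticks, window=50, k=8.0):
    """Split ticks into (clean, suspect). A tick is suspect if its
    quote is crossed (bid > ask) or its price deviates from a trailing
    median by more than k median-absolute-deviations."""
    clean, suspect = [], []
    for i, t in enumerate(ticks):
        # Obvious bid/ask abnormality: crossed market.
        if t["bid"] > t["ask"]:
            suspect.append(t)
            continue
        hist = [x["price"] for x in ticks[max(0, i - window):i]]
        if len(hist) >= 10:  # only test once we have some history
            med = statistics.median(hist)
            mad = statistics.median(abs(p - med) for p in hist) or 1e-9
            if abs(t["price"] - med) / mad > k:
                suspect.append(t)
                continue
        clean.append(t)
    return clean, suspect
```

The interesting part is everything this *doesn't* catch, which is really what I'm asking about.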

I don't mind high-level explanations, as I know many won't feel comfortable talking freely about this kind of stuff.


Director, Wasserman Trading Floor/Subotnick Center
At the Baruch Options Data Warehouse (optionsdata.baruch.cuny.edu) we capture between 5 and 6 billion messages a day from the seven US options market centers. Often this data reaches a peak of close to 800,000 messages per second. So I suppose this counts as "high frequency".

Since we mainly use this data for research, most of it sits on a RAID farm in compressed file sets. We go through an entire day's data set at least once a night and pull out a wide range of overall market statistics as well as market quality stats on all 200,000+ option series. Some of the studies we do can use these statistics rather than have to dig into the underlying data.
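The nightly pass can be thought of as a one-pass streaming aggregation, so a full day's data never has to sit in memory at once. A toy sketch of the idea — the record layout and the two statistics here are illustrative assumptions, not our actual job:

```python
from collections import defaultdict

def daily_quote_stats(quotes):
    """One pass over an iterable of (series, bid, ask) records,
    accumulating per-series quote counts and average spreads."""
    counts = defaultdict(int)
    spread_sums = defaultdict(float)
    for series, bid, ask in quotes:
        counts[series] += 1
        spread_sums[series] += ask - bid
    return {s: {"quotes": counts[s],
                "avg_spread": spread_sums[s] / counts[s]}
            for s in counts}
```

Because the input is an iterable, the same function works whether the records come from a decompressed file set or a live feed.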

When we want to study a particular set of option series at a more detailed level (e.g., every quote), we pull the series out using a distributed processing system that can spread parsing and analysis jobs out over the trading floor (at night).

I think you would likely find a similar parallel approach used in industry. Most firms likely study a subset of instruments, so the processing of this data can be done in a highly parallel fashion. For example, you could potentially build correlation matrices by throwing a pair of instruments onto an idle processor core, either within a server farm or out on the trading floor (at night).
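Since each instrument pair is an independent job, the fan-out is trivial to express. A sketch of the pattern (a thread pool here for portability — in practice you'd map the same function over a process pool or a farm of machines; the data shape is an illustrative assumption):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import statistics

def _pair_corr(pair):
    """Pearson correlation for one independent (instrument, instrument) job."""
    (name_a, xs), (name_b, ys) = pair
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return name_a, name_b, cov / (var_x * var_y) ** 0.5

def correlation_pairs(series, workers=4):
    """series: dict mapping instrument name -> list of returns.
    Each pair is a self-contained work unit that could be shipped
    to any idle core."""
    pairs = list(combinations(series.items(), 2))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(_pair_corr, pairs))
```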

The other direction to go is to look at a commercial analytics database/system (such as Vhayu) that can both analyze tick data on the fly as well as store the data for model building and other tasks. Again these systems can be configured in a highly parallel fashion using multiple feed handlers and analytics cores. In the case of Vhayu running under 64-bit Windows Server, it can cache data into as much RAM as you can fit into the machine. We have one Vhayu box with 48 GB of RAM that can fit several days of TAQ data in memory.

As you suggest, there is a lot of "noise" in such raw data sets, so for our options data we have statistical filters and exception reporting that alert us to suspect data. Having said that, I think I can count on one hand the kinds of data issues we see each day, and we already account for these using heuristics.

One thing to keep in mind is that if you spend a lot of time post-processing this data and build models on it, you need to be sure you can do the same kinds of processing in real-time when you put these models into production or your algorithms will barf when they get fed the bad ticks, etc.
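One way to make that guarantee concrete is to implement the filter as a small streaming object, so the identical code path runs in nightly post-processing and in the live feed handler. A sketch of the pattern — the EWMA form and thresholds are illustrative assumptions:

```python
class StreamingTickFilter:
    """One filter object, usable both when replaying historical files
    and when attached to a real-time feed, so research and production
    see identically cleaned data."""

    def __init__(self, alpha=0.05, k=10.0):
        self.alpha, self.k = alpha, k  # EWMA decay and rejection threshold
        self.mean = None
        self.dev = None

    def accept(self, price):
        """Return True if the tick looks clean. State is updated only
        on clean ticks, so a bad tick can't poison the estimates."""
        if self.mean is None:          # first tick seeds the state
            self.mean, self.dev = price, 0.0
            return True
        if self.dev > 0 and abs(price - self.mean) > self.k * self.dev:
            return False               # reject without updating state
        self.mean += self.alpha * (price - self.mean)
        self.dev += self.alpha * (abs(price - self.mean) - self.dev)
        return True
```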

Happy to chat further about what we do in our shop.


Prof. H.