OS DB for storing market update messages - Recommendations plz

Hi all,

was wondering if anyone can recommend or has experience with an open source database for storing and, especially, retrieving market depth and trade messages?

Thus far I am caching messages in memory until they grow to a certain size by creating an object of each message, push that object onto a list and then serialize that list using protobuf. The serialised lists then are zipped and stored onto the filesystem using some naming convention for later retrieval. That whole operation works actually quite well in terms of efficiency and storage space.

For analysis, I'd then simply deserialise the list into memory and use C# linq for queries. This works reasonably fast. For instances deserialising a list of about 2mil "message objects" takes about 4-6 seconds on a quad core with 16GB of RAM. Add another 10 seconds or less for returning the result of a particular query.

I'd like to get rid of the 'deserialising the list into memory' step and be able to query the file directly and am therefore looking at other methodologies. Thus far, I have looked at,

- Couple of OODBs,
- SQL Server and MySQL,
- KyotoCabinet

None of them were able to beat my benchmark, i.e. rather naive implementation described above. Currently, I am looking at HDF5, but given my previous experiences, I am not putting too much hope into it. Would be great if someone had any pointers. Note that commercial alternatives such as Kdb, OneTick, and the like are not an option at this stage.
I have read mixed reviews with respect to MonetDB's application to HF data and therefore never tried. Might give it a shot though. Anyone have experiences with MonetDB, Infobright and the like?
2 Million rows certainly isn't too big for modern database systems to handle. You may have fubar'd your tables and or queries.
This was just an example of one hour of market depth and trades messages. My experience is that if you have, say, a year of data relational DBs do not scale well and working with binary files is more suitable.
This was just an example of one hour of market depth and trades messages. My experience is that if you have, say, a year of data relational DBs do not scale well and working with binary files is more suitable.

I suppose this comment was a reply to my previous comment.

Your comment continues to lead me to believe you are doing something wrong. If you don't have relational data, you probably don't need to use a relational database.

Relational databases scale extremely well for large datasets (and I say this as someone who works on a system that indexes 2.6 Million documents per day), if you know what you are doing.

If you are talking Billions of data points, you are going to need something distributed. But if you are at that scale, its probably time to spring for something specialized.
I guess I should have clarified: the messages I am referring to are snapshots of the market order book at irregular-spaced, very short time intervals containing either depth updates (bid and ask volumes and prices up to ten levels each side) or trade updates.

What I meant to say is that relational DBs are inappropriate (rather than not scaling well) in terms of the space they require and slower at retrieving subsets (ie all data between some time range across various instruments) compared with simple and not very tweaked/ optimized binary file reads.

I happily admit not to be an expert when it comes to DBs, but my comparisons (Sql Server and MySQL vs binary file reads on ~16 mil records) confirmed this. I guess if I put more effort into my current 'access layer' to the binary files, I could get even better performance without throwing more hardware at the problem. From what I understand about hdf5, it seems to offer exactly that.

I have read about columnar DBs like Infobright and am curious to know whether anyone has actually used them for working with high frequency data? I have searched quite extensively but haven't seen any benchmark studies using HF data.
Have you already looked at Berkley DB? It's an open source alternative to kdb for storing real time data. My experience lies only with Sql Server so I cannot comment on Berkeley. I hear HDF5 is quite time consuming to master.

We'd like to know about your choice.
Havent tried BerkeleyDB but KyotoCabinet,

Kyoto Cabinet: a straightforward implementation of DBM

which is also a Key/Value type database. I think I have come across some test cases on the web where it actually did perform better. Havent thoroughly tested it, but just did some basic tests using the Hashtable class. I have yet to check its other implementations. At this stage I just want to test/ benchmark as much as possible to get an overview before committing too much time to a single system.

Eugene Krel

If you are doing academic research you can try using kdb, kx.com . The 32bit version is free to use for non-comercial purposes.


Older and Wiser
I have used Berkeley DB (aka BDB) in a production environment. It is really fast for Key-Value storage. You can create a pseudo relational schema (including a custom query language) on top of it by using Amazon Carbonado.

I have only heard good things about Tokyo Cabinet (it's used inside Google).
Thanks Alan and Eugene. Didnt know that the 32bit developer's version is for free - never thoroughly looked at it when I saw the price tag :S

KyotoCabinet is basically TokyoCabinet 2.0 from what I understand and developed in C++. It's actually quite straightforward to use, has decent docu, but not sure how a wide a user base. The pseudo-relationial scheme sounds interesting.

May I ask what sort of granularity your data had, what time frame and number of instruments it spanned? Also, could you give an example of how quickly you were able to retrieve record sets? That'd be appreciated, Thx!


Older and Wiser
it's daily market data. It covered all the equities markets (North America, UK, Europe, Latin America, Asia and Australia and Africa), all futures markets, all options markets, some fixed income (mostly US, UK, and EUR), all options markets. The total size was around 15GB per day. The data was used to do Portfolio Analytics. Retrieving recordsets was faster than optimized MySQL for the same use case (I did the test - I don't remember exact timings though).

One caveat though, BDB is no panacea. It has some quirks too.