Big Data in Finance

Technology has always been a driver in finance. That was true when, as the famous (if likely apocryphal) story goes, Nathan Rothschild (the eponymous founder of Rothschild Bank) used carrier pigeons to relay the news of Napoleon’s defeat at Waterloo to London in 1815, news that was obviously going to move the London markets, and an innovation that, at the time, shortened the transmission of the battle’s outcome from days to hours. Technology as a driver in finance is just as true today, perhaps even more so. And what’s really driving finance today, from a technology perspective, is Big Data (along with Big Compute, Machine Learning, Data Mining, and the Cloud, which often go hand in hand with Big Data).

Which raises the question: What should a modern day quant know about Big Data?

In many ways, this is related to the changing role of the quant. From my own experience of having been a quant for 20+ years, you have to reinvent yourself every three to five years or die. These days, the best quants I see are not just good at the quantitative stuff (math and the technical side of finance), but are also ace programmers, because the best and most useful quants build tools that can be used to make more money while better understanding and (we hope) managing risk. Now added to the mix is the role of Data Scientist. For a modern-day quant it’s going to be difficult to avoid financial Big Data. Or, turning that statement around, if you are a modern-day quant and you aren’t really rather good with Big Data, you are handicapping yourself. (I’d like to say “shooting yourself in the foot”; that may be a bit harsh, but not by much.) Adapt and prosper, or die; it’s your choice. Has life ever been any different for a quant?

So, what should the modern-day quant know about Big Data? I'll answer that by picking out the “peaks” of the Big Data landscape that are particularly relevant from a quant finance point of view.

However, before I do that, I would like to define what Big Data is and describe some characteristics of Big Data, which I hope will leave us in the position of knowing what we’re talking about. Or, at the very least, for me to know what I am talking about!

I’m going to give not one, but two, definitions of Big Data in finance. The first is from an end-user perspective and leverages Microsoft Excel’s role as the de facto front end of choice for trading desks, risk departments, and pretty much every layer of the financial organization, from front office to back office. If end users have data that doesn’t fit in Excel, or that takes Excel hours to process, you typically have a Big Data problem.

The other perspective on Big Data is from an IT department’s point of view, and it basically says that if you are looking at a data set and the first thing that comes to mind is “gosh, this belongs in Hadoop,” then you have a Big Data problem. Note that in the second definition, it is only necessary to initially think the data belongs in Hadoop, not that it really does belong in Hadoop (and much financial data doesn’t, but more on that later); it’s the sentiment that is at the core of the second definition. These are simple and practical working definitions of Big Data, and strangely enough, in more than a few years of working in financial Big Data, I’ve yet to hear anyone disagree with them as useful working definitions.

Now that we know, or at least can broadly agree on, what Big Data is, it’s time to explore the nature, or character, of data. Data, generally speaking, is characterized along four axes: volume, velocity, variety, and veracity, also known as the four V’s for obvious reasons. Volume is the quantity of data. Velocity is the rate at which data is arriving. Variety is how structured or unstructured the data is (in short, data complexity). And veracity is the quality and reliability of the data (guess what, data that is clean and easily and unambiguously interpreted is better!). Data that is considered “Big” along any of these axes is, by definition, Big Data. This is our third, and perhaps most generic, definition of Big Data.

There is also a characteristic of financial data that sets it apart from data in many other industries, and that is its relatively short “half-life.” The half-life of data is the time it takes for the economic value of the data to halve. To illustrate with a somewhat fatuous example, I’m going to give you the choice of two prices: 1) IBM’s stock price for yesterday or 2) IBM’s stock price for tomorrow. Any takers for yesterday’s stock price? No, I thought not. And a show of hands for tomorrow’s price? Yes, that’s more like it! Clearly, stock prices, generally speaking, have a rather short half-life. What this means from a practical point of view is that for much financial data, if you can’t use it within a small multiple of its half-life, you may as well throw it away. Its value can decay that quickly. There may be other reasons for storing the data, such as regulatory mandates, but you should always be mindful of the economic value of the data you are dealing with and treat it accordingly.
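To put a rough number on the idea, here is a minimal sketch that assumes, purely for illustration, that value decays exponentially with the stated half-life; real decay profiles will differ by data set and strategy.

```python
# Illustrative assumption: value decays exponentially, so after time t a data set
# with half-life h retains V(t) = V0 * 0.5 ** (t / h) of its original economic value.

def remaining_value(initial_value: float, half_life: float, elapsed: float) -> float:
    """Economic value left after `elapsed` time units, given an exponential-decay half-life."""
    return initial_value * 0.5 ** (elapsed / half_life)

# Example: a signal worth $1,000 with a one-day half-life is worth only $125 three days later.
print(remaining_value(1_000, half_life=1.0, elapsed=3.0))  # 125.0
```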

What’s driving Big Data? Financial markets across all times and all places are governed by two simple impulses: greed and fear. Or, to be more polite, opportunity and risk: the opportunity to make a dollar, and the chance to not lose a dollar. Companies in finance see great value in Big Data; otherwise they would stop using the stuff in a heartbeat. They also see the chance to better manage their risks using more data. And, as mentioned before, the regulators are hard at work with new and far-reaching mandates that generate, and require the storage of, vast amounts of new data. Technical capability also drives data growth, notably the falling price of disk storage; if we could not store and process Big Data economically, again, we’d stop doing it.

Also, I’ll give a piece of advice to budding and existing quants, and it is this: build a portfolio of tools that people love to use. Not only will this endear you, beyond measure, to your existing employer and users, it will also prove to be an invaluable resource when it comes to finding your next employer and next group of users. If you are a budding quant without an employer, then build some demo tools or contribute to open-source projects in the finance space. There is a world of difference, a gulf that is tremendously wide, between just talking about something and saying, “hey, look, if I can just flip open my laptop for a moment I can show you a real-time simulator I built for bank-wide CVA calculations that uses GPUs for compute acceleration” (or some other tool or technique that knocks their socks off). I know it’s good advice, because I have successfully used it many times myself.
Now, back to the “peaks” of the Big Data landscape.

The perspective I want to give is a combination of techniques and tools that a practicing quant should ideally have at their fingertips. You don’t have to be an expert in each, but you should know enough to know what techniques and tools to use in a given situation and to become expert when the need arises. In the case where you need to become expert in something very quickly, in the words of a former head trader I worked for: “You have a week!” On the trading desk I don’t think it has ever been otherwise.

In terms of techniques, there are a number of areas of importance. Data gathering, cleaning (also called “scrubbing”), normalizing (putting everything on the same apples-to-apples basis), storage, and management I will group together under “Data Programming.” “Data Insights” covers the ways of understanding the nature and character of the data you are dealing with; you need to understand your data before you can intelligently attack it with analysis. This “insights” step is often overlooked; “fools rush in” is the expression that comes to mind when people skip it. “Data Analysis” is extracting meaningful and actionable information from the data. This is the way I think when tackling Big Data problems; if you don’t find it useful, feel free to create your own framework. But one way or another, design and build a Big Data tool chain that works for you, because an ad hoc set of tools that you throw together for each project will leave you in a world of pain.

Data Programming
In many ways, this is the plumbing that supports everything else you want to do with data. Like real-world plumbing, you want it to be tight, clean, and of the right capacity. No one wants to be dealing with an ugly mess on the floor, or have to metaphorically put their hand in the toilet bowl to unblock things! This is something you just have to get right; otherwise you won’t get the insights and analysis from your data. Also, something that is very often overlooked is that few (if any) data sets are static; data is a dynamic and living thing, so an automated mechanism for updating your data set is a must, and that mechanism must be robust and scale as your data set grows.

In terms of gathering, cleaning, and normalizing your data, scripting languages are very useful. Languages such as Python have a rich set of libraries that make data manipulation, if not simple, then at least easier. And don’t be afraid to use older but still very useful and powerful tools such as Awk. You can use traditional languages such as C++, but that will be really productive only after you have built a data toolbox or developed a DSL (Domain Specific Language) for the purpose, in which case you have effectively created your own scripting language anyway (so why not just use something like Python and save yourself the effort?).
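To make the gather/clean/normalize step concrete, here is a minimal sketch in Python using pandas; the file name, column names, and static FX rates are assumptions made purely for illustration, not a recipe for your particular data.

```python
import pandas as pd

# Minimal gather/clean/normalize sketch; the file and column names are illustrative assumptions.
trades = pd.read_csv("trades.csv", parse_dates=["timestamp"])

# Clean: drop exact duplicates and rows with missing fields, and discard obviously bad values.
trades = trades.drop_duplicates()
trades = trades.dropna(subset=["price", "quantity"])
trades = trades[trades["price"] > 0]

# Normalize: one consistent symbol format and a single reporting currency (USD here).
trades["symbol"] = trades["symbol"].str.strip().str.upper()
fx_to_usd = {"USD": 1.0, "EUR": 1.08, "JPY": 0.0067}  # assumed static rates, for illustration only
trades["notional_usd"] = trades["price"] * trades["quantity"] * trades["currency"].map(fx_to_usd)

# Hand off to the storage/management layer (writing parquet needs pyarrow or fastparquet installed).
trades.to_parquet("trades_clean.parquet")
```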

Also, since I’ve now mentioned DSLs, I would like to say a few more things about them. DSLs for Big Data are incredibly powerful. They are a way of getting things done very quickly and succinctly and at a level of abstraction that end users can understand; this way you can provide Big Data tools for end users to use. This is true not just for the data programming part of the data tool chain, but anywhere in the chain of tools you use for data insights and analysis. Start building DSLs for Big Data and make your life, and the lives of your users, easier. People will love you for it!
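To give a flavor of what I mean, here is a toy internal DSL in Python: a small, chainable pipeline whose steps an end user can read aloud. The vocabulary (source, clean, summarize, run) is invented purely for illustration.

```python
# Toy internal DSL sketch: a chainable pipeline that reads almost like a sentence.

class Pipeline:
    def __init__(self):
        self.steps = []

    def source(self, loader):
        self.steps.append(("source", loader))
        return self

    def clean(self, rule):
        self.steps.append(("clean", rule))
        return self

    def summarize(self, reducer):
        self.steps.append(("summarize", reducer))
        return self

    def run(self):
        data = None
        for kind, fn in self.steps:
            data = fn() if kind == "source" else fn(data)
        return data

# Usage: the chain of steps is the point -- end users can follow it without reading the internals.
average_price = (Pipeline()
                 .source(lambda: [101.2, 99.8, None, 102.5])
                 .clean(lambda xs: [x for x in xs if x is not None])
                 .summarize(lambda xs: sum(xs) / len(xs))
                 .run())
print(average_price)
```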

For data storage and management, there are plenty of good databases and tools. You should build your own only if doing so is a real competitive advantage (emphasis on “real” here). There are plenty of in-memory, NoSQL, and relational databases (yes, some Big Data really does belong under the relational model!); you should find one where your data fits well. Just make sure that your choice here makes the downstream activities (insights and analysis) simple rather than hard.

In the area of data storage and management, one tool that I must mention specifically is Hadoop, which I will introduce through a story.

Some years ago I was working on Big Data on Wall Street and I would often ask, “Have you looked at Hadoop?” to which the response was nearly always “What’s Hadoop?” Fast-forward six months, and I would ask people “Are you working on Big Data?” (often the same people as before), and the answer would be “Yes, we have a Hadoop project!” In six months people had gone from not knowing what Hadoop was to Hadoop being synonymous with Big Data! Gosh, things move fast in finance.

Hadoop is not just a tool for storing and managing Big Data, it is in reality an ecosystem of tools that includes such things as machine learning (Mahout). Hadoop is simply a must-have skill for a quant these days; start learning it today.
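To give a feel for the programming model, here is a minimal Hadoop Streaming sketch in Python: a mapper and reducer that total traded volume per symbol. The input format (tab-separated symbol and quantity) is an assumption for illustration.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emit (symbol, quantity) pairs, one per input record.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        symbol, quantity = fields[0], fields[1]
        print(f"{symbol}\t{quantity}")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so a single pass can total the volume for each symbol.
import sys

current_symbol, total = None, 0.0
for line in sys.stdin:
    symbol, quantity = line.rstrip("\n").split("\t")
    if symbol != current_symbol:
        if current_symbol is not None:
            print(f"{current_symbol}\t{total}")
        current_symbol, total = symbol, 0.0
    total += float(quantity)
if current_symbol is not None:
    print(f"{current_symbol}\t{total}")
```

You would then launch the job with something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input trades -output volume_by_symbol, though the exact jar location and options vary by Hadoop distribution and version.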

Data Insights
Know your data.
That seems a sensible idea, but it’s amazing how many people jump into analysis without even the most basic knowledge of what they are dealing with. Before doing analysis on your data, and certainly before you start making important decisions with your data, you should have an intimate knowledge of all aspects and characteristics of your data. Think of it like this: if you were going to attack an enemy on a hilltop over open ground, wouldn’t you want to do some reconnaissance first? Data reconnaissance, if we can call it that, will give you a good picture of the battlefield before you advance.

Tools I find useful here are, again, the scripting languages and tools used for data programming. In addition, data visualization is a very powerful technique for getting a sense of what your data is about. Whereas a large data set presented as a table of numbers is largely incomprehensible to a human, the same data creatively displayed as a graphic (ideally one that is interactive and allows the user to zoom in and out, flip, and rotate) can convey meaning at all scales, large and small. Tools that are useful for this type of exploratory work include MATLAB, Mathematica, and R, the last of which is free and open source. These same tools are very good at extracting statistical and other summary measures from your data. You should also keep an eye out for new and useful tools that may make you more productive; this is general advice for the whole data tool chain. The Julia language is one such tool that comes to mind and is worth keeping an eye on.
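Here is the same kind of exploratory pass sketched in Python with pandas and matplotlib rather than R or MATLAB; the file and column names are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data reconnaissance sketch; the file and column names are illustrative assumptions.
prices = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

# Summary measures first: shape, missing values, and basic statistics.
print(prices.shape)
print(prices.isna().sum())
print(prices.describe())

# Then a picture: daily returns often reveal outliers, gaps, and regime changes
# that a table of numbers hides.
returns = prices["close"].pct_change()
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
returns.plot(ax=ax1, title="Daily returns")
returns.hist(bins=100, ax=ax2)
ax2.set_title("Return distribution")
plt.tight_layout()
plt.show()
```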

I’ll close the discussion on data insights with a word of caution. Data sets are so large these days that there is a danger of seeing patterns in the data that simply aren’t there. The term for this is apophenia. If you have ever looked at a cloud and seen a ship, a car, or a face that looks like your grandmother, you have experienced apophenia. The cloud has so many countless water particles that almost any pattern can be fitted to them just by altering your point of view. Make sure this doesn’t happen with you and your financial data. Apophenia in financial data is particularly prevalent when looking for profitable strategies from the data.
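A quick way to convince yourself of the danger is to backtest a large number of purely random “strategies” and then admire the best one; in the sketch below, any impressive result is, by construction, apophenia (or, more precisely, selection bias).

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_strategies = 252, 1000

# Purely random daily P&L for many "strategies": there is no signal here by construction.
pnl = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Naive annualized Sharpe ratio of each random strategy.
sharpe = pnl.mean(axis=1) / pnl.std(axis=1) * np.sqrt(252)

print(f"Best 'strategy' Sharpe: {sharpe.max():.2f}")   # typically well above 2
print(f"Average Sharpe:         {sharpe.mean():.2f}")  # close to 0, as it should be

# The best of 1,000 random strategies looks like a money machine; it is pure
# selection bias -- a pattern that simply isn't there.
```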

Data Analysis
This, frankly, is the purpose of Big Data. Data programming and data insights were just a way to get you here in an orderly fashion. Now it’s time to extract value from the data. The data tool chain all the way up to this point has been an expense; now it’s time for profit!

When you chose how to store and manage your Big Data (the data programming step), you will, ideally, have chosen a tool that makes the analysis easier. Here’s where the ecosystem of tools around something such as Hadoop pays big dividends. Not only does Hadoop provide good out-of-the-box tools for analysis, it also provides tools to build your own analysis tools. Hadoop is particularly strong in this area, but other NoSQL and relational tools are coming along very nicely too and are definitely worth looking at.

It’s also the case that Excel and R have become rather good front-ends to Big Data; Excel in particular is a comfortable and easy-to-use front-end for end users. And I will repeat a common theme throughout this article: if you build tools that give your end users easy access to Big Data (eliminating you as the bottleneck at each stage of the data tool chain), people will love you for it.
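As a small example of that theme, here is a sketch that reduces a large cleaned data set to a desk-level summary and hands it to end users in the front end they already live in. The aggregation and file names are assumptions carried over from the earlier data-programming sketch, and writing the workbook requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

# Sketch: turn a large cleaned data set into a small, pre-digested Excel workbook for end users.
trades = pd.read_parquet("trades_clean.parquet")  # output of the earlier data-programming sketch

summary = (trades
           .groupby(["symbol", trades["timestamp"].dt.date])
           .agg(total_notional_usd=("notional_usd", "sum"),
                trade_count=("notional_usd", "size"))
           .reset_index())

# Requires an Excel writer engine such as openpyxl.
summary.to_excel("daily_desk_summary.xlsx", index=False)
```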

Lastly, and this is again something all too often overlooked, analysis is itself a source of Big Data. Data feeds on data, and more Big Data is often the byproduct of Big Data. Indeed, sometimes your analysis may generate data sets that are larger than your original Big Data. Just make sure that the extra size is reflected in the added value it brings.

So, where are we today? Big Data is now a reality in finance and pervades every nook and cranny of financial institutions. IBM has estimated that 90% of the world’s data was created in the past two years alone, and there seems to be no end in sight to data growth. From this quant’s perspective, you had better get your Big Data skills up to snuff, and quickly.
To echo the words of my former head trader boss one more time: “You have a week!” So you’d better get started soon.

Andrew Sheppard started his career in finance as a quant at Bankers Trust working in London, then Tokyo, and finally in New York. Andrew has since worked as a consultant, chief quant, and CTO at various European and U.S. banks and a multi-billion dollar hedge fund. Since 2010 he has worked as a consultant exclusively in the areas of Big Data and Big Compute in finance and insurance.