How much data engineering do QRs do?

Hi there,

I am an applied math PhD student studying theoretical ML. It is theoretical because we mainly work with various computational models and make assumptions about the data (distributions, sample sizes, flows, etc.).

Anyway, I am now interested in getting into quant, but I find myself unprepared for the "engineering" parts of quant. Specifically, coding and data engineering.

Coding: I am familiar with standard ML frameworks (PyTorch, etc.), so I can implement basic things in Python. I am also familiar with the basics of algorithms and data structures at the undergrad level. But I know nothing about C++ and have very little experience in using computer programs to automate mundane tasks.

Data engineering: I took an intro course in the MMF program at my institution, and did some quant trading on my own. I might be wrong, but I find that, just as in "practical" ML, data engineering is very important. By data engineering, I mean things like "what kind of data to look at", "how to collect large amounts of data", and "how to structure the collected data".

For me, data engineering is an even higher barrier to entry than other quantitative problems. I have only interviewed with one firm so far. The technical questions were all interesting, but I found myself very unprepared for the questions related to data engineering. Perhaps they had higher expectations because of my background? But I am really more of a math person than a CS/ML person. I would have thought this would be the responsibility of quant developers.

So my question is: how much data engineering do real QRs do? What can an individual (not belonging to a group) do to become good at it? This really seems to be a new topic for interviews (at least there aren't many such questions in the green book).

A related question: Did you build alternative datasets for your own trades (before becoming a quant)? How did you do it?
 
But I know nothing about C++ and have very little experience in using computer programs to automate mundane tasks.
I would say QN C++

Not being able to program is like not having a driver's license IMHO.
 
Hi Daniel,

Thanks for the reply. Learning C++ is already an item on my to-do list, and I definitely plan to master the basics before applying to more serious positions. I will most likely do it by completing the QN C++ courses.

However, I am struggling with something bigger than not knowing C++, and that is not knowing how to "engineer".

To be fair, I can write prototypes for ML models in Python, and I can solve LeetCode-style questions. Recruiters have been lenient enough to let me do the coding exams in Python, so "not knowing C++" was never a problem, at least not explicitly. This may be because I was not interviewing for a quant dev role, or because they knew I was a theory person, so there was little point in delving into something I was not expected to be good at, or because they wanted a candidate with an ML background so badly that they could tolerate it as long as the candidate could work with ML codebases (many of which are written in Python, with C++ code underneath). I really don't know why, but nobody actually said "we decided to discontinue the process because you don't know C++".

I think the real problem is that I don't have experience with "real" programming. Prototypes and LeetCode problems are all low in complexity. For what I do, datasets are already processed and standardized. The relevant data structures are (mostly) already built. Hardware-level optimizations are already implemented in C++. And so on.

I am like a person who only knows how to cook pre-processed food: just follow the recipe. But that doesn't really work when substantial customization is needed. This is why I asked about "data engineering" in the title. I used to work under the assumption that a streaming system is already available and that "such and such" statistical properties are known. But things are not that simple, right? Especially if we are building an alternative dataset for new signals, where we need to design and build a streaming system tailored to the data and its source. I simply don't have this skill, and I feel like it is holding me back, because this seems to be the kind of thing they expect an ML person to be able to do.

So I was wondering if QRs also just cook with what they are given or if they need to actively work with the engineers to do the data engineering. If it is the latter, then I clearly need to become good at that and need to figure out how.

I am not saying C++ is less important; it is just that this (data engineering) is the roadblock I have already hit.
 