
Train/Test split


Playing with a Random Forest Classifier, I am wondering what could cause the test results in an 80:20 split to be better than in a 90:10 split?

With 2000+ data points:
- with an 80:20 split, considering only the test set, the model generates 150 signals with around 55% accuracy
- with a 90:10 split, considering only the test set, the model generates 77 signals with around 49% accuracy

From these results, it seems like the more data the model 'sees' in training, the worse it performs.



And with a 20:80 split, the model generates 784 signals with around 53% accuracy.

What could be the problem?
Erm... you are testing and comparing your strategy across different periods. You can't conclude that "the more the model 'sees', the worse it gets", because each split evaluates on a different period: the 80:20 test set covers the last 20% of your data, while the 90:10 test set covers only the last 10%. To test your hypothesis, train the model on different amounts of data but evaluate it on the same held-out period.
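A minimal sketch of that experiment, assuming a chronologically ordered dataset of ~2000 points (the feature matrix and labels here are synthetic placeholders; substitute your own signals and outcomes). The test period is fixed to the final 10% of observations, and only the amount of training history changes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data: replace with your ~2000-point feature
# matrix X and binary labels y, ordered in time.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) > 0.5).astype(int)

# Fix ONE test period: the final 10% of observations.
test_start = int(len(X) * 0.9)
X_test, y_test = X[test_start:], y[test_start:]

# Train on increasing amounts of history, always ending right
# before the (identical) test period.
for train_frac in (0.2, 0.5, 0.8, 0.9):
    n_train = int(len(X) * train_frac)
    train_start = max(test_start - n_train, 0)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_start:test_start], y[train_start:test_start])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"train fraction {train_frac:.0%}: accuracy {acc:.3f}")
```

Because every model is scored on the same 10% of the data, any accuracy differences now reflect the amount of training history rather than a change of evaluation period.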