Preprint / Version 1

A 15-Year Empirical Comparison of Deep Learning, Tree-Based, and Regression Models for Stock Market Forecasting

Authors

  • Samyak Jayanth, Polygence

DOI:

https://doi.org/10.58445/rars.3334

Keywords:

Computer Science, Machine Learning, Artificial Intelligence, Linear Regression, Financial Markets, Stock Market, Neural Networks

Abstract

This study conducts a comprehensive comparison of traditional machine learning, deep learning, and naïve baseline models for predicting daily stock closing prices, using twelve publicly traded equities from various sectors of the S&P 500 index: Tesla ($TSLA), JPMorgan Chase ($JPM), NVIDIA ($NVDA), UnitedHealth Group ($UNH), Alphabet ($GOOGL), General Electric ($GE), The Coca-Cola Company ($KO), ExxonMobil ($XOM), Duke Energy ($DUK), American Tower ($AMT), Linde plc ($LIN), and the SPDR S&P 500 ETF Trust ($SPY). The SPDR S&P 500 ETF Trust (hereafter SPY) serves as a highly liquid proxy for the S&P 500 index, closely mirroring its price movements. Five models (Linear Regression, Lasso Regression, Random Forest, XGBoost, and Long Short-Term Memory (LSTM) neural networks) are trained on identical historical market data with engineered technical features and evaluated against common baselines, including previous-close, previous-open, and midpoint estimators. Predictive accuracy is assessed using the average Mean Squared Error (MSE) and directional accuracy of each model across all twelve equities, with additional analysis via permutation feature importance to identify the key drivers of model performance. Results reveal that, despite the complexity of advanced models such as LSTM and XGBoost, simple linear and decision-tree-based approaches often outperform deep learning in this domain, while all models struggle to consistently surpass the basic baselines due to the efficient and noisy nature of financial markets. This research highlights the practical strengths and limitations of various modeling approaches for financial time series, offering guidance for researchers and practitioners interested in predictive modeling of asset prices.
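The evaluation protocol described in the abstract (MSE against a previous-close baseline, plus directional accuracy) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function name and toy price series are invented for demonstration.

```python
import numpy as np

def evaluate(predicted: np.ndarray, actual_close: np.ndarray,
             prev_close: np.ndarray) -> tuple[float, float, float]:
    """Score next-day close predictions against a previous-close baseline."""
    # Mean Squared Error of the model's predictions.
    mse_model = float(np.mean((predicted - actual_close) ** 2))
    # MSE of the naive baseline that simply repeats yesterday's close.
    mse_baseline = float(np.mean((prev_close - actual_close) ** 2))
    # Directional accuracy: fraction of days where the predicted move
    # (up or down relative to the previous close) matches the realized move.
    dir_acc = float(np.mean(
        np.sign(predicted - prev_close) == np.sign(actual_close - prev_close)
    ))
    return mse_model, mse_baseline, dir_acc

# Toy three-day series (illustrative values only).
prev_close = np.array([100.0, 101.0, 102.0])
actual     = np.array([101.0, 100.5, 103.0])
pred       = np.array([100.8, 101.2, 102.5])
mse_m, mse_b, acc = evaluate(pred, actual, prev_close)
print(mse_m, mse_b, acc)  # model MSE, baseline MSE, directional accuracy
```

A model only adds value here if its MSE beats `mse_baseline` and its directional accuracy meaningfully exceeds chance, which is the comparison the study performs across all twelve equities.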

References

Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2), 654–669. https://doi.org/10.1016/j.ejor.2017.11.054

Krauss, C., Do, X. A., & Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2), 689–702. https://doi.org/10.1016/j.ejor.2016.10.031

Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2015). The Probability of Backtest Overfitting. Journal of Computational Finance (Risk Journals). https://dx.doi.org/10.2139/ssrn.2326253

Twelve Data. (n.d.). Stock Market & Financial Data API. https://twelvedata.com/

Investing.com. (n.d.). S&P 500 Historical Data. https://www.investing.com/indices/us-spx-500-historical-data

Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://scikit-learn.org/

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems, 32, 8024–8035. https://pytorch.org/

Deniz, G. (2019). Random Forest [Figure]. Medium. https://medium.com/@denizgunay/random-forest-af5bde5d7e1e

Olah, C. (2015). Understanding LSTM Networks [Figure]. colah's blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51–56. https://doi.org/10.25080/Majora-92bf1922-00a

Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

Posted

2025-10-26