Predicting stock market crashes

An attempt with statistical machine learning techniques and neural networks


Published in Towards Data Science · 14 min read · Jan 14, 2019


With this blog post I introduce the design of a machine learning algorithm that aims to forecast stock market crashes based solely on past price information. I start with a quick background on the problem and then elaborate on my approach and findings. All the code and data are available on GitHub.

A stock market crash is a sharp, rapid drop in the total value of a market, with prices typically declining more than 10% within a few days. Famous examples of major stock market crashes are Black Monday in 1987 and the real estate bubble of 2008. A crash is usually attributable to the burst of a price bubble and takes the form of a massive sell-off that occurs when a majority of market participants try to sell their assets at the same time.

The occurrence of price bubbles implies that markets are not efficient. In inefficient markets prices do not always reflect fundamental asset values but are inflated or deflated based on traders’ expectations. These expectations are reinforced by traders’ subsequent actions, which further inflate (or deflate) prices. This leads to positive (or negative) price bubbles which eventually burst. This phenomenon was described by George Soros as reflexivity and is the basic assumption for forecasting methods used in technical analysis.

Today, there is not much debate over the existence of bubbles in financial markets. However, understanding these inefficiencies and predicting when price bubbles will burst is a highly difficult task. Imagine you could identify a bubble that is about to build up and predict when the market will crash. You would not only be able to make a profit while prices are increasing but also to sell at the right moment to avoid losses.

Some mathematicians and physicists have attempted to tackle this problem by investigating the mathematics behind price structures. One such physicist is Professor Didier Sornette, who has successfully predicted multiple financial crashes [1]. Sornette uses log-periodic power laws (LPPLs) to describe how price bubbles build up and burst. In essence, the LPPL fits the price movements leading up to a crash to a faster-than-exponentially increasing function with a log-periodic component (reflecting price volatility of increasing magnitude and frequency).

This is where the idea for this project comes from. If the recurring price structures found by these researchers exist, should it not be possible for a machine learning algorithm to learn those patterns and predict crashes? Such an algorithm would not need to know the underlying mathematical laws; instead it would be trained on data with pre-identified crashes and would identify and learn the patterns on its own.

Data and Crashes

The first step was to collect financial data and identify crashes. I was looking for daily price information from weakly correlated major stock markets; low cross-correlation is important for valid cross-validation and testing of the model. The matrix below shows the cross-correlation of daily returns from 11 major stock markets.

[Figure: cross-correlation matrix of daily returns across 11 major stock markets]

To avoid having any two data sets with a cross-correlation greater than 0.5 in my collection, I proceeded with only data from the S&P 500 (USA), Nikkei (Japan), HSI (Hong Kong), SSE (Shanghai), BSESN (India), SMI (Switzerland) and BVSP (Brazil).
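As a rough illustration, this selection step can be reproduced with pandas. The sketch below assumes the daily closing prices of all candidate markets sit in one DataFrame; the file name and column layout are hypothetical.

```python
import pandas as pd

# Hypothetical file: one column of daily closing prices per market index
prices = pd.read_csv("prices.csv", index_col="Date", parse_dates=True)

# Daily returns and their pairwise cross-correlation across markets
returns = prices.pct_change().dropna()
corr = returns.corr()
print(corr.round(2))  # keep only markets with all pairwise correlations below 0.5
```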

To identify crashes in each data set, I first calculated price drawdowns. A drawdown is a persistent decrease in price over consecutive days, from the last price maximum to the next price minimum. The example below shows three such drawdowns in the S&P 500 over the period from the end of July to mid-August 2018.

[Figure: three drawdowns in the S&P 500, late July to mid-August 2018]
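The drawdown calculation itself is straightforward. Below is a minimal sketch of how such consecutive-decline drawdowns could be computed with pandas; it is not necessarily the exact implementation used in the project.

```python
import pandas as pd

def drawdowns(prices: pd.Series) -> list:
    """Relative declines over runs of consecutive down days,
    from the last local price maximum to the next local minimum."""
    result = []
    peak = trough = prices.iloc[0]
    falling = False
    for price in prices.iloc[1:]:
        if price < trough:            # decline continues
            trough = price
            falling = True
        else:                         # rebound: close the current drawdown
            if falling:
                result.append((peak - trough) / peak)
            peak = trough = price
            falling = False
    if falling:                       # drawdown still open at the end of the series
        result.append((peak - trough) / peak)
    return result
```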

I considered two different methodologies for identifying crashes. The first follows a suggestion by Emilie Jacobsson [2], who defines crashes in each market as drawdowns in the 99.5% quantile. With this methodology I found drawdown thresholds that classify a crash ranging from around 10% for less volatile markets like the S&P 500 to more than 20% for volatile markets such as the Brazilian one. The second methodology follows a suggestion from Johansen and Sornette [3], who identify crashes as outliers, that is, drawdowns that lie far from the fitted Weibull distribution when the logarithm of the rank of drawdowns in a data set is plotted against drawdown magnitude.

[Figure: log rank of drawdowns plotted against drawdown magnitude, with fitted Weibull distribution]

I tested my algorithms with both crash identification methodologies and concluded that the first methodology (Jacobsson) is advantageous for two reasons. First, Sornette does not clearly state how much deviation from the Weibull distribution classifies a drawdown as a crash, so human judgement is necessary. Second, his methodology identifies fewer crashes, which leads to heavily imbalanced data sets and makes it harder to collect a sufficient amount of data for a machine learning algorithm to train on.
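Labelling crashes with the Jacobsson methodology then reduces to a quantile threshold. A minimal sketch, reusing the drawdowns() helper from above:

```python
import numpy as np

dd = np.array(drawdowns(prices))          # all drawdowns of one market
threshold = np.quantile(dd, 0.995)        # 99.5% quantile defines the crash threshold
crashes = dd[dd >= threshold]
print(f"crash threshold: {threshold:.1%}, crashes found: {len(crashes)}")
```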

With the collection of the seven data sets mentioned above I accumulated a total of 59,738 rows of daily stock prices and identified a total of 76 crashes.

Problem statement and feature selection

I formulated a classification problem with the goal of predicting, for each point in time (e.g. each trading day), whether or not a crash will occur within the next 1, 3, or 6 months.

If past price patterns are indicative of future price events, the relevant information for making a prediction on a certain day is contained in the daily price changes of all days prior to that day. Thus, to predict a crash on day t, the daily price change of every day prior to t could be used as a feature. However, because models presented with too many features tend to become slower and less accurate (“curse of dimensionality”), it makes sense to extract a few features that capture the essence of past price movements at any point in time. I therefore defined 8 time windows that measure mean price changes over the past year (252 trading days) for each day. I used increasing window sizes, from 5 days (leading up to day t) to 126 days (covering days t−252 to t−126), to get a higher resolution of price changes in more recent times. Because price volatility is not captured when price changes are averaged over multiple days, I added 8 features for the mean price volatility over the same time windows. For each data set I normalized the mean price changes and volatilities.
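A sketch of this feature construction is shown below. The exact window boundaries are an assumption loosely following the description (8 windows covering the past 252 trading days, with finer resolution for recent days), and the rolling standard deviation is used as the volatility measure.

```python
import pandas as pd

# Assumed window boundaries (days before t): small recent windows, one large 126-day window
windows = [(0, 5), (5, 10), (10, 21), (21, 42), (42, 63), (63, 94), (94, 126), (126, 252)]

returns = prices.pct_change()
features = {}
for start, end in windows:
    win = returns.shift(start).rolling(end - start)
    features[f"ret_{start}_{end}"] = win.mean()   # mean price change over the window
    features[f"vol_{start}_{end}"] = win.std()    # volatility over the window

X = pd.DataFrame(features).dropna()
X = (X - X.mean()) / X.std()                      # normalize per data set
```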

To evaluate the feature selection, I performed a logistic regression and analyzed the regression coefficients. Each logistic regression coefficient corresponds to a change in log odds: the logarithm of how the odds (the ratio of the probability of a crash versus no crash) change when the associated feature changes and all other features are held constant. For the plot below I transformed the log odds into odds. Odds greater than 1 indicate that the crash probability increases as the corresponding feature increases.

[Figure: odds derived from the logistic regression coefficients for each feature]
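With scikit-learn, the transformation from coefficients to odds is a one-liner. This is a hedged sketch assuming X holds the normalized window features from above and y the binary crash labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(X, y)
odds = np.exp(clf.coef_[0])               # coefficients are log odds; exponentiate to get odds
for name, o in zip(X.columns, odds):
    print(f"{name}: odds {o:.2f}")        # odds > 1: feature increases crash probability
```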

The coefficient analysis shows that the volatility over the past several days is the strongest indicator of an upcoming crash. A recent price increase, however, does not seem to indicate a crash. This is surprising at first glance, because a bubble is typically characterized by an exponential increase in price. However, many of the identified crashes did not occur immediately after a price peak; instead, prices had already been decreasing for some time leading up to the crash. A large price increase over the past 6 to 12 months does increase the likelihood of a predicted crash, indicating that a general long-term price increase makes a crash more likely and that price movements over longer time periods contain valuable information for crash forecasting.

Training, validation, and test set

I selected the S&P 500 data set for testing and the remaining 6 data sets for training and validation. I chose the S&P 500 for testing because it is the largest data set (daily price information since 1950) and contains the largest number of crashes (20). For training I performed 6-fold cross-validation. This meant running each model six times, each time using five data sets for training and the remaining one for validation.
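A minimal sketch of this leave-one-market-out cross-validation, assuming a dict of per-market feature/label pairs and scoring with the F-beta metric introduced in the next section:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score

# markets: dict mapping market name -> (X, y); an assumed data layout
scores = []
for val_name in markets:
    X_val, y_val = markets[val_name]
    X_train = pd.concat([X for name, (X, _) in markets.items() if name != val_name])
    y_train = pd.concat([y for name, (_, y) in markets.items() if name != val_name])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(fbeta_score(y_val, model.predict(X_val), beta=2))

print(sum(scores) / len(scores))          # mean validation score over the six folds
```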

Scoring

To evaluate the performance of each model I used the F-beta score. The F-beta score is the weighted harmonic mean of precision and recall. The beta parameter determines how precision and recall are weighted. A beta larger than one prioritizes recall and a beta smaller than one prioritizes precision.

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

I chose a beta of 2, which puts more emphasis on recall, meaning that an undetected crash is penalized more heavily than a predicted crash that did not occur. This makes sense under a risk-averse approach, assuming that failing to predict a crash that occurs has more severe consequences (loss of money) than expecting a crash that does not occur (missing out on potential profits).
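In scikit-learn this metric is available directly; a small illustrative example with made-up labels:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]            # actual crash labels (toy data)
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]            # model predictions (toy data)
print(fbeta_score(y_true, y_pred, beta=2))   # beta=2 weights recall more heavily than precision
```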

Regression models, Support vector machines and Decision trees

I started with linear and logistic regression models. Regression models find the optimal coefficients of a function by minimizing the difference between the predicted and the actual target variable over all training samples. While linear regression estimates a continuous target variable, logistic regression estimates probabilities and is therefore generally better suited for classification problems. However, when I compared the prediction results of both models, logistic regression outperformed linear regression only in some cases. While this came as a surprise, it is important to note that even though logistic regression may provide a better fit for estimating crash probabilities, the sub-optimal fit of a linear regression is not necessarily a disadvantage in practice if the chosen threshold effectively separates the binary predictions. This threshold was optimized to maximize the F-beta score on the training set.
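A sketch of such a threshold search on the continuous linear regression output; the grid of 200 candidate thresholds is an arbitrary choice for illustration, and X_train/y_train/X_val are assumed to come from the cross-validation split above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import fbeta_score

reg = LinearRegression().fit(X_train, y_train)
train_scores = reg.predict(X_train)

# Pick the cut-off that maximizes the F-beta score on the training set
candidates = np.linspace(train_scores.min(), train_scores.max(), 200)
best = max(candidates,
           key=lambda t: fbeta_score(y_train, train_scores >= t, beta=2, zero_division=0))

crash_signal = reg.predict(X_val) >= best     # binary crash predictions on the validation set
```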

Next, I tested Support Vector Machines (SVMs). SVMs use a kernel function to project the input features into a higher-dimensional space and determine a hyperplane that separates positive from negative samples. Important parameters to consider are the penalty parameter C (a measure of how much misclassifications should be avoided), the kernel function (polynomial or radial basis function), the kernel coefficient gamma (which determines how far the influence of a single training sample reaches) and the class weight (which determines how to balance positive versus negative predictions). The best SVM models achieved scores similar to those of the regression models, which makes the regression models preferable as they train much faster. Decision trees were not able to perform at the same level as any of the other tested models.
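For reference, a hedged scikit-learn sketch of the kind of SVM configuration described above; the parameter values are illustrative, not the tuned ones.

```python
from sklearn.svm import SVC

# RBF kernel with class weighting to counter the imbalance between crash and non-crash days
svm = SVC(C=1.0, kernel="rbf", gamma="scale", class_weight="balanced")
svm.fit(X_train, y_train)
y_pred = svm.predict(X_val)
```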

Recurrent Neural Networks

The next step was to implement recurrent neural networks (RNNs). As opposed to traditional machine learning algorithms and traditional artificial neural networks, recurrent neural networks take into account the order in which they receive a sequence of input data, which allows information to persist. This seems like a crucial characteristic for an algorithm that deals with time series data such as daily stock returns. It is achieved through loops that connect the cells, so that at time step t the input is not only the feature xₜ but also the output of the previous time step, hₜ₋₁. The figure below illustrates this concept.

[Figure: an unrolled recurrent neural network, with the output of each time step fed into the next]

However, a major issue with regular RNNs is that they struggle to learn long-term dependencies. If there are too many steps between xₜ₋ₙ and hₜ, hₜ might not be able to learn anything from xₜ₋ₙ. To address this, Long Short-Term Memory networks (LSTMs) were introduced. In essence, LSTMs pass not only the output of the previous cell, hₜ₋₁, but also a “cell state”, cₜ₋₁, into the next cell. The cell state is updated at each step based on the inputs (xₜ and hₜ₋₁) and in turn updates the output hₜ. In each LSTM cell, four neural network layers govern the interactions between the inputs xₜ, hₜ₋₁, cₜ₋₁ and the outputs hₜ and cₜ. For a detailed description of the LSTM cell architecture please refer to colah’s blog [4].

[Figure: architecture of an LSTM cell with inputs xₜ, hₜ₋₁, cₜ₋₁ and outputs hₜ, cₜ]

RNNs with LSTM are able to detect relationships and patterns that simple regression models cannot find. So if an RNN LSTM were able to learn the complex price structures that precede crashes, should such a model not outperform the previously tested models?

To answer that question I implemented two different RNNs with LSTM using the Python library Keras and went through rigorous hyper-parameter tuning. The first decision was the length of the input sequence for each layer. The input sequence for each time step t consists of the daily price changes of a sequence of days leading up to t. This number has to be chosen with care, since longer input sequences require more memory and slow down computation.

In theory, an RNN LSTM should be able to find long-term dependencies; however, with an LSTM implementation in Keras, the cell states are only passed from one sequence to the next if the parameter stateful is set to true. In practice this implementation is cumbersome: to prevent the network from learning long-term dependencies that span different data sets and epochs during training, I implemented manual resetting of the state whenever the training data switches data sets. This algorithm did not deliver strong results, so I instead set stateful to false but increased the sequence length from 5 to 10 time steps and supplied the network with additional sequences of average price changes and mean volatilities from time windows reaching from 10 days up to 252 trading days back (similar to the features selected for the previously tested models).

Finally, I tuned hyper-parameters and tried different loss functions, numbers of layers, numbers of neurons per layer, and dropout versus no dropout. The best performing RNN LSTM is a sequential model with two LSTM layers of 50 neurons each, an Adam optimizer, a binary cross-entropy loss function and a sigmoid activation function for the last layer.
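A minimal Keras sketch of that final architecture; the number of input features per time step is a placeholder, not the actual figure used in the project.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

seq_len, n_features = 10, 17          # n_features is an assumed placeholder

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(seq_len, n_features)))
model.add(LSTM(50))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train_seq, y_train_seq, epochs=10, batch_size=64)
```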

Evaluation

While hyper-parameter tuning, longer input sequences and the additional long-term features led to faster training (optimal results on the validation set after around 10 epochs), none of the RNN LSTM models was able to outperform the previously tested models.

[Figure: precision and recall of the tested models for 1, 3 and 6 month crash prediction]

The plot above shows the precision and recall performance of the different models. Different colors indicate different models and different shapes indicate a different predictive variable (crash within 1, 3 or 6 months). The bar plot below visualizes the F-beta scores of all models for 1, 3 and 6 month crash prediction. “Random” stands for the expected performance of a model with no predictive power that predicts a crash as often as the tested models.

[Figure: F-beta scores of all models for 1, 3 and 6 month crash prediction, compared with a random benchmark]

The best results show an F-beta score of 0.41, 0.37 and 0.29 for the prediction of a crash within 6, 3 and 1 month respectively. Precision ranged from 12% to 16% and recall from 45% to 71%. This means that while around 50% of the crashes are detected, around 85% of the crash signals are “false alarms”.

Conclusion

First the bad news. The RNN LSTM is seemingly not able to learn complex price patterns that would enable it to outperform the simpler regression models. This suggests that there are no complex price patterns that occur before all (or almost all) crashes but do not occur otherwise. This does not mean that Sornette’s hypothesis of crashes preceded by the characteristic price patterns fit by the log-periodic power law is invalid. It does mean, however, that if such patterns exist, (1) these patterns also occur in cases where they are not followed by a crash, (2) there are many crashes that are not preceded by these patterns, or (3) there is not enough data for an RNN to learn these patterns. While more data would definitely provide more clarity, part of the problem might be a combination of (1) and (2). Sornette fits log-periodic power laws to certain crashes identified as outliers, but not to all crashes of similar drawdown magnitude. To build an algorithm that finds the crashes described by Sornette, the training data would need to be labeled specifically with only the crashes that fit these patterns. This might improve the identification of those crashes but would not help with (2), since crashes of a different type would still not be expected to be detected. However, given sufficient data and a large enough list of identified crashes, it might well be worth rerunning the RNN LSTM models.

The good news is that simple price patterns, defined through long-term changes in price and changes in volatility, do seem to occur regularly before crashes. The best models were able to learn these patterns and forecast a crash significantly better than a comparable random model. For example, for a crash prediction within 3 months the best regression model achieved a precision of 0.15 and a recall of 0.59 on the test set, while a comparable random model with no predictive power would be expected to achieve a precision of 0.04 and a recall of 0.16. The results look similar for 1 month and 6 month crash predictions, with the F-beta score being best for the 6 month prediction and worst for the 1 month prediction. Whether these results are good enough to optimize an investment strategy is debatable. However, risk-averse investors might well allocate their portfolio positions more conservatively if the discussed regression crash indicator continuously warns of an upcoming crash.

[Figure: S&P 500 price index and crash predictor output around the 1962 and 1974 crashes]

A look at the test data price charts and the crash predictor at the time of crashes showed that while some crashes were detected remarkably well, others occurred with little or no warning from the crash predictor. The figure above shows an example of an undetected crash (in 1962) and three consecutive, fairly well predicted crashes (in 1974). That some crashes are detected better than others is in line with the assumption that certain typical price patterns precede some, but not all, crashes. The different algorithms mostly struggled with the same crashes, which is why I did not attempt to combine different models.

By taking a weighted average of the binary crash predictions over the past 21 days (with more recent predictions weighted more heavily), the logistic regression model estimates the likelihood of a crash in the S&P 500, as of November 5, 2018, at 98.5% within 6 months, 97% within 3 months and 23% within one month. Having read this study, I leave it up to you what to do with this information.
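The smoothing over the last 21 trading days could look like the following sketch; the linear weighting scheme and the variable name are assumptions.

```python
import numpy as np

preds = daily_crash_predictions[-21:]   # binary daily predictions, oldest first (assumed variable)
weights = np.arange(1, 22)              # most recent day gets the highest weight
likelihood = np.average(preds, weights=weights)
print(f"crash likelihood: {likelihood:.1%}")
```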

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

References

[1] Didier Sornette, Why Stock Markets Crash (book).

[2] Emilie Jacobsson, How to Predict Crashes in Financial Markets with the Log-Periodic Power Law.

[3] Anders Johansen and Didier Sornette, Large Stock Market Price Drawdowns Are Outliers (2001).

[4] Understanding LSTM Networks, colah’s blog.
