Hacker News
Stock Price Prediction with Big Data and Machine Learning (eugenezhulenev.com)
97 points by reality on Nov 25, 2014 | 26 comments



I read this last night and thought it was a great writeup.

Meta note: I especially like how the code was intertwined with descriptive text, like an R Knitr file. It makes it easy to follow along and verify that what the author says he's doing is actually what he's doing :)

A few issues about using this in production, none of which are intended to slight the article's author or his work.

1) The biggest issue for me has always been speed. In this instance he's using one symbol; imagine trying to do this against 1000 symbols with real depth-of-market data and not just the pre-canned market data he's using. It's a lot more data to parse and classify. You can start to see why many HFT systems are more of an IT and programming endeavor than a quantitative one, not that the quantitative portion isn't important :)

2) Just being able to detect which way the stock will tick doesn't really help as much as you'd think.

Assume that you can correctly identify the direction of the next tick 100% of the time. To make money off of this information you need to:

- Be able to do this classification and send the order to market faster than it takes to receive the next tick (n+1), a difficult task for most stocks that trade in major US markets.

- Get to the top of the order book, again a difficult task as the bid/ask spread is already tight, and sometimes at its penny limit.

- Get someone to fill your order, again a factor of being at the top of the order book.

- Identify whether tick n+2 is going in the same or the opposite direction. If you guess wrong, you lose. If you guess right, you still need to be able to exit your position.

As always, if you are capable of doing this kind of work and able to work in Canada, I'd love to chat with you! Heck, even if you are just interested in machine learning and the markets, I'll make time to chat.


For this very reason I always try to stay in the "mid to low frequency" spectrum.

Put very bluntly, in high-frequency domains you compete with fast people (and they have resources), but in mid- or low-frequency domains you compete with smart people. The problem is much harder, but for that reason it's also much easier :)

This is not to say that I'm a profitable quant (because I'm not), but I think my chances are much higher in the area I chose.


Exactly. The main problem is that it might be much easier to make such predictions a posteriori than to use such algorithms in real-time bidding...

The question that seems interesting: is it possible to guess the price movement at some t + delta moment, where delta could be, for instance, 0.1 sec? Or would it be completely unpredictable?


I would also be very interested in seeing this applied to predictions at t + delta. I will probably attempt it over the Christmas break.
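
A minimal sketch of how I'd set up the labels (hypothetical pandas tick data, not the article's code): mark each tick by whether the last known mid price delta later is higher.

    import pandas as pd

    # ticks: DataFrame sorted by a DatetimeIndex, with a 'mid' column (made-up layout)
    def label_future_move(ticks, delta="100ms"):
        # last known mid price at t + delta, via a forward-fill lookup on the time index
        future_mid = ticks["mid"].reindex(
            ticks.index + pd.Timedelta(delta), method="ffill"
        ).to_numpy()
        out = ticks.copy()
        out["y"] = (future_mid > out["mid"].to_numpy()).astype(int)  # 1 = up, 0 = flat/down
        return out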


Great! Let me/us know your results!


> imagine trying to do this against 1000 symbols

How about doing this not with 1000 equities but with, say, SPY options, or any other high-volume options on CBOE? As long as your derivatives are ATM, or at most two strikes from ATM on either side, and open interest is high, it would work OK.


You can use the advanced technique to uncover heuristics or simplifications that are faster to trade. Conclusions that take a fine instrument to reach may not need such a fine tool to act upon.

I've only had a glance, but it looks very well done.


Thanks for the nice overview ;)


From a 10,000-foot view: if you actually manage to build something like this that you can use to trade, why would you ever publish it? No matter how small an edge you get over the rest of the market, you can turn that into a huge amount of money, so why reveal it? The corollary is that if something like this is published, it means it doesn't actually work in practice.

I did a little bit of BTC trading, and I thought I had an interesting idea. I traded on one of the smaller exchanges but used Mt. Gox as the oracle to predict which way the price would move. The basic idea worked pretty well: there was correlation. The problem ended up being my order-placing algorithm, which would actually do the wrong thing on very large/fast swings.

I think that idea may be interesting to apply to correlated stocks. If you detect a price drop in the stock of a lithium mining company, you might predict a drop for Apple, since Apple uses lithium batteries in its devices.
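
A crude way to check whether one series actually leads the other (a sketch over hypothetical resampled price series; the names are made up, this is not what I actually ran):

    import pandas as pd

    # px_lead, px_lag: price Series on a shared time index (e.g. 1-second bars)
    def lead_lag_corr(px_lead, px_lag, max_lag=30):
        r_lead, r_lag = px_lead.pct_change(), px_lag.pct_change()
        # correlation of the leader's past returns with the follower's current returns
        return {k: r_lead.shift(k).corr(r_lag) for k in range(1, max_lag + 1)}

    # consistently non-zero correlation at small positive lags suggests a usable lead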


> why would you ever publish it?

Suppose you have a miraculous trading algorithm that you believe makes somewhat above-market returns with somewhat below-market risks, but you have only a small amount of capital you're happy to gamble with.

Then what you can do by keeping your algorithm to yourself is: have investments that perform slightly better than most other people's. That's nice, for sure, but getting rich that way takes a long time unless you start out quite rich (or gamble a lot and get lucky, but that's also a way to get poor).

On the other hand, what you get by publishing it might be a lucrative job offer at a hedge fund or investment bank, who hope you can use the skills you've just demonstrated to get them an extra 0.1% of return on their $10B pot (an extra $10M a year). Now you have the opportunity to apply the same techniques to thousands of times more money than before, along with much more data, a server room full of hardware for backtesting, and other smart people who will look at your clever ideas and maybe notice if there's a big mistake or omission. Of course most of the gains from your miraculous algorithm now go to the investors, and most of the rest probably go to other people with more seniority, but you are still likely to get rich faster and more reliably that way.

I make no claim about the particular algorithm here or the person who published it. But the above seems to me like a pretty plausible reason why someone might prefer to publish, even if they have good reason to think their algorithm works.


I suppose this is the argument made for closed source over open source software as well. There are benefits to opening up an algorithm, since the current algorithm is almost certainly not optimal. Of course, if you believe the market already prices in all available information, then sharing an algorithm shouldn't really have an effect beyond a small amount of rent extraction.


I don't know if that's a fair comparison. You can open source just about anything in software, because execution matters a whole lot more than the core idea (to a first approximation). For example, if Netflix open sourced their recommendation algorithm, do you think they'd go out of business because another movie streaming service would pop up overnight and take over? No; Netflix has name recognition, etc.

On the other hand, there is very little barrier to entry if you have a winning stock trading algorithm. All you really need is money, which is easy to acquire (for this purpose at least).


To add to the other comments' explanation:

Imagine you have algos A and B, where A is better than B.

Your top competitor has algo B, and everyone else has C.

If you release B to the public, every C becomes a B and competes directly with your strongest competitor.

(pure speculation)


The problem is that "70% accuracy" doesn't mean much, since he's framed it as a classification problem when it's really more of a regression problem.
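
To make that concrete, a toy example with made-up numbers: 70% directional accuracy can still lose money if the misses are larger than the hits, which is exactly what a regression framing (predicting the size of the move) would surface.

    import numpy as np

    rng = np.random.default_rng(0)
    hits = rng.uniform(0.0001, 0.0005, size=700)     # 70% of ticks: small wins
    misses = -rng.uniform(0.0005, 0.0020, size=300)  # 30% of ticks: larger losses
    total_return = np.concatenate([hits, misses]).sum()
    print("accuracy:", 700 / 1000, "total return:", total_return)  # 0.7, but negative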

Plus, the data used is only for a couple of days.

And finally: if someone had an ML model to reliably make money on the stock market, they wouldn't be writing about it; they'd be laughing all the way to the bank.


Many years ago I was a trader. I recall watching analysts "explaining" what had just happened in the markets. The question that naturally arose was why they weren't speaking from their yachts.


The simple answer is that you don't have the same information in front of you that you did at the time.

In hindsight, we can easily identify what is and is not relevant or influential, whereas in real time anything could prove to be relevant or influential.


The last point is especially salient. Except there's no "if." Major firms are already doing this (and not talking about it).


ThinkOrSwim has had a feature like this called ThinkAI for many years. Personally, I think it's not better than random.

Years of thinking and ruminating and learning (and investing) on the subject have left me solidly in the "random walk" camp. At any given point, a stock is equally likely to go up or down. There's a small upward bias in the market (greater than inflation), and I reason that it's the premium offered over debt to compensate for the higher risk of equity.

If you plot the number of consecutive up-days and down-days, you get a roughly normal-looking distribution, skewed to the right and with fat tails.
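
If anyone wants to reproduce that plot, a quick sketch (assuming a pandas Series of daily closing prices; nothing here is from the article):

    import numpy as np

    # closes: pandas Series of daily closing prices, in date order
    def run_lengths(closes):
        signs = np.sign(closes.pct_change().dropna())   # +1 up day, -1 down day
        run_id = (signs != signs.shift()).cumsum()      # new id every time the sign flips
        return signs.groupby(run_id).size()             # length of each run

    # run_lengths(closes).value_counts().sort_index() is the distribution to plot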

That said, my own belief and experience suggest that I can consistently press a small edge, which is why I gravitate towards options and futures (highly leveraged, high notional value). I don't think that would be possible without a "Portfolio Margin" account.


Then how do you explain the success of companies like Jane Street? Their business is built on predicting near-future stock prices. They are pretty open about it; they even have tech talks on YouTube.


Part of a stock's future price is based on many features that are not in the dataset. In addition, a stock's current price is based on predictors which interact chaotically. Your predictor has to take into account that other predictors are watching in real time and trying to take advantage of it. It seems that the very attempts to predict the stock market are what make it random.


The reason they get good, enticing, but unactionable predictions is that they assume they can trade instantaneously at the quoted price. They neglect the time it takes their order to reach the matching engine, by which point the opportunity will be gone often enough to make the strategy unprofitable.

In fact the authors of the original paper seem completely clueless about this point: "For example, since the prediction time of AAPL, 0.0311ms, is less than 0.0612ms, which is the time difference between the upward spread crossing events from the Row k −1 to Row k + 4 in Table 1, the model could in principle perform fast enough to influence corresponding trading decisions"

As if their order could reach the matching engine in 30 µs...


Once again: if it worked, it would not be a scientific paper freely available on GitHub. The creators would be using it to make money, keeping mum all the while.


While I 100% agree with the sentiment, there are freely published strategies for beating 80% of market participants (namely, buying and holding low-cost index funds with rebalancing) that appear to be resilient to disclosure. Yet many, many people ignore this edge over their peers.


It will be another failed experiment.

A stock price is a stochastic process. It's unpredictable for the most part (of course, if there is news with a big impact on a company, the stock price usually reflects that, but still).

It's certainly a nice experiment, but don't expect to get rich from it (the opposite is more likely).


Out of curiosity, what percentage of out-of-sample variance explained would convince you that a predictive model had some power? 0.1%? 1%? 2%?
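
For the avoidance of doubt, by "variance explained" I mean something like this, computed only on data the model never saw:

    import numpy as np

    def out_of_sample_r2(y_true, y_pred):
        # 1 - MSE(model) / MSE(always predicting the mean); some people
        # benchmark against the training-set mean instead of the test-set mean
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot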


If I understand this article correctly (I only skimmed it), he hasn't normalized the data. So if the training data and validation data share a similar trend, the results will be invalid.
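
What I mean by normalizing, as a minimal sketch (detrending with log returns before the train/validation split; this is my own illustration, not the author's code):

    import numpy as np

    def to_log_returns(prices):
        # removes the shared price level/trend that a model could otherwise "learn"
        prices = np.asarray(prices, dtype=float)
        return np.diff(np.log(prices))

    # split by time *after* the transform, e.g.:
    # r = to_log_returns(prices); train, valid = r[:split], r[split:]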

edit: It's two days of data? For one stock? If you are interested in this stuff, this is not an article worth reading.



