Facebook's Prophet package aims to provide a simple, automated approach to the prediction of a large number of different time series. The package employs an easily interpreted, three-component additive model fit using Stan (maximum a posteriori by default, with full Bayesian sampling as an option). In contrast to some other approaches, the user of Prophet might hope for good performance without tweaking a lot of parameters. Instead, hyper-parameters control how likely those parameters are a priori, and the Bayesian fitting tries to sort things out when the data arrives.
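For orientation, here is the kind of hands-off workflow the package advertises. This is only a sketch on synthetic data; the 'ds'/'y' column names follow Prophet's documented convention.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # formerly: from fbprophet import Prophet

# Two years of synthetic daily data with a mild trend and a yearly cycle.
t = np.arange(730)
df = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=730, freq="D"),
    "y": 10 + 0.01 * t + np.sin(2 * np.pi * t / 365.25)
         + np.random.normal(scale=0.2, size=730),
})

m = Prophet()  # default priors: the advertised "no tweaking" mode
m.fit(df)      # fit via Stan (MAP by default; mcmc_samples enables full sampling)
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```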

Judged by popularity, this is surely a good idea. Facebook's Prophet package has been downloaded 13,698,928 times according to pepy. It tops the charts, or at least the one I compiled here, where hundreds of Python time series packages were ranked by monthly downloads. Download numbers are easily gamed and deceptive, but nonetheless, Prophet is surely the most popular standalone Python library for automated time series analysis.

Prophet's Claims, and Lukewarm Reviews

The funny thing, though, is that if you poke around a little you'll quickly come to the conclusion that few people who have taken the trouble to assess Prophet's accuracy are gushing about its performance. The article by Hideaki Hayashi is somewhat typical, insofar as it tries to say nice things but struggles. Hayashi notes that out-of-the-box, "Prophet is showing a reasonable seasonal trend unlike auto.arima, even though the absolute values are kind of off from the actual 2007 data." In the same breath, however, the author observes that telling ARIMA to include a yearly cycle turns the tables. With that hint, ARIMA easily beats Prophet in accuracy - at least on the one example he looked at.
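Hayashi worked in R with auto.arima, but the "hint" is easy to replicate in Python. A rough analogue using statsmodels on monthly data - the data and the ARIMA orders below are my own illustrative choices, not his:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Eight years of illustrative monthly data with a yearly cycle.
idx = pd.date_range("2000-01-01", periods=96, freq="MS")
y = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
              + np.random.normal(scale=2, size=96), index=idx)

# Without the hint: a plain non-seasonal ARIMA.
plain = SARIMAX(y, order=(1, 1, 1)).fit(disp=False)

# With the hint: seasonal_order makes the yearly (period 12) cycle explicit.
seasonal = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)

print(plain.forecast(12))
print(seasonal.forecast(12))
```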

Professor Nikolaos Kourentzes benchmarked Prophet against several other R packages - namely the forecast and smooth packages, which you may have used, and also mapa and thief. His results are written up in this article, which uses the M3 dataset and mean absolute scaled error (link). His tone is more unsparing: "Prophet performs very poorly... my concern is not that it is not ranking first, but that at best it is almost 16% worse than exponential smoothing (and at worst almost 44%!)."
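For reference, MASE scales forecast errors by the in-sample error of a naive forecast, so values above 1 mean you'd have been better off doing (almost) nothing. A minimal numpy sketch:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error. m is the seasonal period;
    m=1 scales by the one-step naive (random walk) forecast."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale
```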

What's Up with the Top Dog?

Is this a case of Facebook's brand and marketing catapulting a mediocre algorithm to prominence? Or perhaps it is the echo-chamber effect (lots of people writing how-to articles on Medium?). Let's not be too quick to judge. Perhaps those yet to be impressed by Prophet are not playing to its strengths, which are listed on Facebook's website. The software is good for "the business forecast tasks we have encountered at Facebook" and that, according to the site, means hourly, daily or weekly observations with strong multiple seasonalities.

In addition, Prophet is designed to deal with holidays known in advance, missing observations and large outliers. It is also designed to cope with series that undergo regime changes, such as a product launch, and that face natural limits due to market saturation. These effects might not be well captured by other approaches. It doesn't seem unreasonable, then, to imagine that Prophet could work well on the domain it was built for. It is presumably under these conditions that the claim can be made, as it is in a 2017 Facebook blog post, that "Prophet's default settings produce forecasts that are often [as] accurate as those produced by skilled forecasters, with much less effort."
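For the record, those features are exposed directly in Prophet's API. A sketch, with a made-up holiday and a made-up saturation cap of 100 (the sales.csv file and its contents are hypothetical):

```python
import pandas as pd
from prophet import Prophet

# Holidays known in advance, with a window for lingering effects.
holidays = pd.DataFrame({
    "holiday": "product_launch",
    "ds": pd.to_datetime(["2019-03-01"]),
    "lower_window": 0,
    "upper_window": 7,   # effect assumed to linger for a week
})

df = pd.read_csv("sales.csv")        # hypothetical file with 'ds' and 'y' columns
df["ds"] = pd.to_datetime(df["ds"])
df["cap"] = 100.0                    # made-up market saturation level

# Logistic growth handles the natural limit; the cap must also be
# supplied for the forecast horizon.
m = Prophet(growth="logistic", holidays=holidays)
m.fit(df)
future = m.make_future_dataframe(periods=30)
future["cap"] = 100.0
forecast = m.predict(future)
```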

The claim is pretty bold - as bold as Prophet itself, as we shall see. Not only does the software supposedly beat benchmarks (though none are explicitly provided), it also beats human experts. Presumably, those human experts are able to use competing software in addition to drawing lines by hand ... but what were they using, one might wonder? The same blog post suggests "as far as we can tell there are few open source software packages for forecasting in Python."

Here I'm sympathetic, conscious of the officious policing of firewalls that can occur at large companies. And that statement was made in 2017, I believe, though even accounting for the date, the lack of objective benchmarking strikes me as a tad convenient. My listing of Python time series packages is fairly long, as noted, though of course many have come along in the last three years.

Still, it shouldn't be too hard to find something to test Prophet against, should it? A recent note suggests that Prophet performs well in a commercial setting, but - you guessed it - does not explicitly provide a comparison against other Python packages. Nor here. An article by Navratil Kolkova is quite favorable too (pdf). The author notes that the results are relatively easy to interpret - which is certainly true. But was performance compared to anything? I'll let you guess.

You will have surmised by now that the original Prophet paper, Forecasting at Scale by Taylor and Letham, is also blissfully comparison-free (pdf). The article appears, slightly modified, in volume 72 of The American Statistician, 2018, so perhaps my expectations are unreasonable (pdf). The Prophet methodology is plausible, it must be said, and the article has been cited 259 times. The authors explain the tradeoffs well, and anyone looking to use the software will understand that this is, at heart, a low-pass filter. You get what comes with that.

Objectively measured hard-to-beat accuracy might not be part of that bargain. I'm old school perhaps, but I think that's a highly relevant criterion. And I'm thankful not to be the only person dying of curiosity when it comes to the matter of whether the world's number one Python time series prediction library can actually ... you know ... predict stuff.

But having done a few too many Google searches on this topic, I'm starting to anticipate depressing headlines. For instance, a paper considering Prophet by Jung, Kim, Kwak and Park comes with the title A Worrying Analysis of Probabilistic Time-series Models for Sales Forecasting (pdf). Yes, I think we know where that is heading. As the spoiler suggests, things aren't looking rosy for Facebook's flagship library. The authors list Facebook's Prophet as the worst performing of all the algorithms they tested. Oh boy.

Ah, you object, but under what metric? Maybe the scoring rule used was unfair, and not well suited to sales of Facebook portals? That may be, but according to those authors Prophet was the worst uniformly across all metrics - last in every race. Those criteria included RMSE and MAPE, as you would expect, but also mean normalized quantile loss, where (one might have hoped) the Bayesian approach could yield better distributional predictions than the alternatives. The authors' explanation is, I think, worth reproducing in full.
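For those keeping score at home, the quantile (pinball) loss underlying that last metric is simple enough. A minimal numpy sketch of the core penalty; the paper reports a mean normalized variant:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss at quantile q: under-prediction is penalized with
    weight q, over-prediction with weight 1 - q."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))
```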

"The patterns of the time series are complicated and change dynamically over time, but Prophet follows such changes only with the trend changing. The seasonality prior scale is not effective, while higher trend prior scale shows better performance. There exist some seasonality patterns in the EC dataset, but these patterns are not consistent neither smooth. Since Prophet does not directly consider the recent data points unlike other models, this can severely hurts performance when prior assumptions do not fit."
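The prior scales the authors mention map onto Prophet's constructor arguments, with the "trend prior scale" corresponding, I believe, to changepoint_prior_scale. The values below are purely illustrative, not recommendations:

```python
from prophet import Prophet

m = Prophet(
    changepoint_prior_scale=0.5,   # default 0.05; higher lets the trend bend more readily
    seasonality_prior_scale=10.0,  # default 10.0; lower shrinks the seasonal terms
)
```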

In recent times, attention has turned to the prediction of COVID-19 rather than product cycles. But again, Papastefanopoulos, Linardatos and Kotsiantis (paper) find Prophet underperforms ARIMA. Stick to TBATS, their results advise. A similar finding is relayed by Kumar and Susan (pdf), and there's no love either from Vishvesh Shah in his master's thesis comparing SARIMA, Holt-Winters, LSTM and Prophet. Therein, Prophet is the least likely to perform best on any given time-series task. LSTMs won out twice as often, and both were soundly beaten by the tried and tested SARIMA.

Sales data with regularities and holidays is the wheelhouse for Prophet, one would think, but an excellent (and rather hilarious) Kaggle kernel by "Mysterious Ben" found that a "dumb model", in his words, easily outperformed Prophet in predicting store item demand. This was true even though the data was pre-regularized, and not actually real-world data (aside: it seems you can't trust Kaggle to give you real data; stick to microprediction.org if you want real "real" data). More surprisingly, Prophet's performance deteriorated when holidays were added to the model.
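I don't know the exact form of Mysterious Ben's dumb model, but a representative dumb baseline for daily store demand might be nothing more than day-of-week averages. My guess, not his code:

```python
import pandas as pd

def dumb_forecast(df, horizon_days=90):
    """df has a DatetimeIndex and a 'demand' column; the forecast for
    any future date is just the historical mean for that day of week."""
    dow_means = df.groupby(df.index.dayofweek)["demand"].mean()
    future = pd.date_range(df.index[-1] + pd.Timedelta(days=1),
                           periods=horizon_days, freq="D")
    return pd.Series(dow_means.loc[future.dayofweek].to_numpy(), index=future)
```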

Grocery sales are better suited to ARIMA than Prophet, according to Hariharan (article), and the woes continue for Prophet in the paper Cash Flow Prediction: MLP and LSTM Compared to ARIMA and Prophet by Weytjens, Lohmann and Kleinsteuber (download). I've included their summary table. Compared to the other papers, this one is relatively favorable - as far as a head-to-head with ARIMA is concerned. However, as you can see, neural networks easily best both Prophet and ARIMA - at least in their setup.