We review the most important statistical ideas of the past half century, which we categorize as: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, Bayesian multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss key contributions in these subfields, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science.
A lot has happened in the past half century! The eight ideas reviewed below represent a categorization based on our experiences and reading of the literature and are not listed in chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics. The present review is intended to cover the territory and is influenced not just by our own experiences but also by discussions with others; nonetheless, we recognize that any short overview will be incomplete, and we welcome further discussion from other perspectives.
Each of these ideas has pre-1970 antecedents, both in the theoretical statistics literature and in the practice of various applied fields. But each has developed enough in the past 50 years to have become something new.
We begin with a cluster of different ideas that have appeared in statistics, econometrics, psychometrics, epidemiology, and computer science, all revolving around the challenges of causal inference, and all in some way bridging the gap between, on one hand, naive causal interpretation of observational inferences and, on the other, the recognition that correlation does not imply causation. The key idea is that causal identification is possible, under assumptions, and that one can state these assumptions rigorously and address them, in various ways, through design and analysis. Debate continues on the specifics of how to apply causal models to real data, but the work in this area over the past 50 years has allowed much more precision on the assumptions required for causal inference, and this in turn has stimulated work in statistical methods for these problems.
Different methods for causal inference have developed in different fields. In econometrics, the focus has been on structural models and their implications for average treatment effects (Imbens and Angrist 1994); in epidemiology, on inference with observational data (Greenland and Robins 1986); in psychology, on the importance of interactions and varying treatment effects (Cronbach 1975); and in statistics, on matching and other approaches to adjust for and measure differences between treatment and control groups (Rosenbaum and Rubin 1983). In all this work, there has been a common thread of modeling causal questions in terms of counterfactuals or potential outcomes, a big step beyond the earlier standard approach, which did not clearly distinguish between descriptive and causal inferences. Key developments include Neyman (1923), Welch (1937), Haavelmo (1943), and Rubin (1974); see Heckman and Pinto (2015) for some background and VanderWeele (2015) for a recent review.
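As a deliberately simplified illustration of the potential-outcome framing, the following simulation sketch defines two potential outcomes for each unit, assigns treatment in a confounded way, and compares the naive difference in means with an estimate that adjusts by stratifying on the confounder. The data-generating process, the variable names (y0, y1, x, z), and the stratification adjustment are all illustrative choices, standing in for the matching and adjustment methods cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A binary confounder x affects both treatment assignment and the outcome.
x = rng.binomial(1, 0.5, n)

# Potential outcomes for each unit; the true average treatment effect is 2.
y0 = 1.0 + 3.0 * x + rng.normal(0, 1, n)
y1 = y0 + 2.0

# Confounded assignment: units with x = 1 are more likely to be treated.
z = rng.binomial(1, 0.2 + 0.6 * x)
y_obs = np.where(z == 1, y1, y0)   # only one potential outcome is ever observed

naive = y_obs[z == 1].mean() - y_obs[z == 0].mean()

# Adjust by comparing treated and control within strata of x, then averaging.
strata_effects = [y_obs[(z == 1) & (x == v)].mean() - y_obs[(z == 0) & (x == v)].mean()
                  for v in (0, 1)]
adjusted = np.average(strata_effects, weights=[np.mean(x == v) for v in (0, 1)])

print(f"true ATE = 2.0, naive = {naive:.2f}, adjusted = {adjusted:.2f}")
```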
The purpose of the aforementioned methods is to define and estimate the effect of some specified treatment or exposure, adjusting for biases arising from imbalance, selection, and measurement errors. Another important area of research has been in causal discovery, where the goal is not to estimate a particular treatment effect but rather to learn something about the causal relations among several variables. There is a long history of such ideas using methods of path analysis, from researchers in various fields of application such as genetics (Wright 1923), economics (Wold 1954), and sociology (Duncan 1975); as discussed by Wermuth (1980), these can be framed in terms of simultaneous equation models. Influential recent work in this area has linked to probabilistic ideas of graphical models (Spirtes, Glymour, and Scheines 1993; Heckerman, Geiger, and Chickering 1995; Peters, Janzing, and Schölkopf 2017). An important connection to psychology and computer science has arisen based on the idea that causal identification is a central task of cognition and thus should be a computable problem that can be formalized mathematically (Pearl 2009). Path analysis and causal discovery can be framed in terms of potential outcomes, and vice versa (Morgan and Winship 2014). However formulated, ideas and methods of counterfactual reasoning and causal structure have been influential within statistics and computer science and also in applied research and policy analysis.
A trend of statistics in the past 50 years has been the substitution of computing for mathematical analysis, a move that began even before the onset of “big data” analysis. Perhaps the purest example of a computationally defined statistical method is the bootstrap, in which some estimator is defined and applied to a set of randomly resampled datasets (Efron 1979; Efron and Tibshirani 1993). The idea is to consider the estimate as an approximate sufficient statistic of the data and to consider the bootstrap distribution as an approximation to the sampling distribution of the estimate. At a conceptual level, there is an appeal to thinking of prediction and resampling as fundamental principles from which one can derive statistical operations such as bias correction and shrinkage (Geisser 1975).
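The following sketch shows the nonparametric bootstrap in its simplest form: resample the data with replacement, recompute the estimator on each resampled dataset, and treat the spread of the bootstrap replicates as an approximation to the sampling distribution. The simulated dataset, the choice of the median as the estimator, and the number of replicates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=200)   # illustrative dataset

def estimator(data):
    return np.median(data)                 # any statistic could be used here

B = 5000
boot = np.array([estimator(rng.choice(y, size=len(y), replace=True))
                 for _ in range(B)])

# Bootstrap standard error and a simple percentile interval.
se = boot.std(ddof=1)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate = {estimator(y):.2f}, se = {se:.2f}, 95% interval = ({lo:.2f}, {hi:.2f})")
```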
Antecedents include the jackknife and cross-validation (Quenouille 1949; Tukey 1958; Stone 1974; Geisser 1975), but there was something particularly influential about the bootstrap idea in that its generality and simple computational implementation allowed it to be immediately applied to a wide variety of applications where conventional analytic approximations failed; see, for example, Felsenstein (1985). Availability of sufficient computational resources also helped as it became trivial to repeat inferences for many resampled datasets.
The increase in computational resources has made other related resampling and simulation-based approaches popular as well. In permutation testing, resampled datasets are generated by breaking the (possible) dependency between the predictors and target by randomly shuffling the target values. Parametric bootstrapping, prior and posterior predictive checking (Box 1980; Rubin 1984), and simulation-based calibration all create replicated datasets from a model instead of directly resampling from the data. Sampling from a known data-generating mechanism is commonly used to create simulation experiments to complement or replace mathematical theory when analyzing complex models or algorithms.
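A minimal sketch of a permutation test along these lines: the target values are shuffled to break any dependence on the predictor, and the observed statistic is compared with its distribution under this null. The simulated data and the choice of the correlation coefficient as the test statistic are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)           # weak true association, for illustration

observed = np.corrcoef(x, y)[0, 1]

# Null distribution: shuffle y so that any dependence on x is broken.
null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(10_000)])

p_value = np.mean(np.abs(null) >= np.abs(observed))
print(f"observed correlation = {observed:.3f}, permutation p-value = {p_value:.3f}")
```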
A major change in statistics since the 1970s, coming from many different directions, is the idea of fitting a model with a large number of parameters—sometimes more parameters than data points—using some regularization procedure to get stable estimates and good predictions. The idea is to get the flexibility of a nonparametric or highly parameterized approach, while avoiding the overfitting problem. Regularization can be implemented as a penalty function on the parameters or on the predicted curve (Good and Gaskins 1971).
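As a small sketch of the penalized-fitting idea, the example below fits a linear model with more coefficients than observations by adding a quadratic (ridge) penalty, which makes the otherwise underdetermined least-squares problem stable. The simulated data and the fixed penalty value are illustrative; in practice the amount of regularization would itself be chosen from the data, for example by cross-validation or a hierarchical model.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 200                              # more parameters than data points
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, -1, 0.5]       # only a few coefficients matter
y = X @ beta_true + rng.normal(size=n)

lam = 10.0                                  # penalty strength (illustrative choice)

# Ridge estimate: minimize ||y - X b||^2 + lam * ||b||^2, which has a closed form.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ordinary least squares is not unique when p > n; the penalized fit is stable.
print("largest estimated coefficients (absolute value):",
      np.round(np.sort(np.abs(beta_hat))[-5:], 2))
```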
Early examples of richly parameterized models include Markov random fields (Besag 1974), splines (Wahba and Wold 1975; Wahba 1978), and Gaussian processes (O’Hagan 1978), followed by classification and regression trees (Breiman et al. 1984), neural networks (Werbos 1981; Rumelhart, Hinton, and Williams 1987; Buntine and Weigend 1991; MacKay 1992; Neal 1996), wavelet shrinkage (Donoho and Johnstone 1994), lasso, horseshoe, and other alternatives to least squares (Dempster, Schatzoff, and Wermuth 1977; Tibshirani 1996; Carvalho, Polson, and Scott 2010), and support vector machines (Cortes and Vapnik 1995) and related theory (Vapnik 1998).
The 1970s also saw the start of the development of Bayesian nonparametric priors on infinite-dimensional families of probability models (Müller and Mitra 2013), such as Dirichlet processes (Ferguson 1973), Chinese restaurant processes (Aldous 1985), Polya trees (Lavine 1992; Mauldin, Sudderth, and Williams 1992), and Pitman-Yor processes (Pitman and Yor 1997), with many other examples since then. All these models have the feature of expanding with sample size, with parameters that do not always have a direct interpretation but rather are part of a larger predictive system. In the Bayesian approach, the prior can first be considered in a function space, with the corresponding prior for the model parameters then derived indirectly.
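The following small simulation, included for illustration, draws cluster assignments from a Chinese restaurant process and shows the "expanding with sample size" property: the number of occupied clusters grows roughly logarithmically as more observations arrive. The concentration parameter alpha and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 1.0                                  # concentration parameter (illustrative)
n = 1000

table_counts = []                            # number of customers at each table
for i in range(n):
    # A new customer joins an existing table with probability proportional to
    # its size, or starts a new table with probability proportional to alpha.
    probs = np.array(table_counts + [alpha], dtype=float)
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    if k == len(table_counts):
        table_counts.append(1)
    else:
        table_counts[k] += 1

print(f"{len(table_counts)} clusters after {n} observations "
      f"(roughly alpha * log n = {alpha * np.log(n):.1f} expected)")
```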
Many of these models had limited usage until enough computational resources became easily available. Overparameterized models have continued to be developed in image recognition (Wu, Guo, and Zhu 2004) and deep neural nets (LeCun, Bengio, and Hinton 2015; Schmidhuber 2015). Hastie, Tibshirani, and Wainwright (2015) have framed much of this work as the estimation of sparse structure, but we view regularization as being more general in that it also allows dense models to be fit to the extent supported by the data.
Along with a proliferation of statistical methods and their application to larger datasets, researchers have developed methods for tuning, adapting, and combining inferences from multiple fits, including stacking (Wolpert 1992), Bayesian model averaging (Hoeting et al. 1999), boosting (Freund and Schapire 1997), and gradient boosting (Friedman 2001). These advances have been accompanied by an alternative view of the foundations of statistics based on prediction rather than modeling (Breiman 2001).
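As an illustrative sketch of stacking in the spirit of Wolpert (1992), the example below fits two simple candidate models, forms their predictions on held-out data, and chooses nonnegative combination weights by minimizing held-out squared error. The single train/holdout split (rather than full cross-validation), the polynomial candidate models, and the use of scipy's nonnegative least squares are convenience choices for the sketch.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * x**2 + rng.normal(0, 0.3, n)

# Two simple candidate models: a linear fit and a cubic polynomial fit.
def fit_poly(x_tr, y_tr, degree):
    coefs = np.polyfit(x_tr, y_tr, degree)
    return lambda x_new: np.polyval(coefs, x_new)

# Split into training and holdout sets; in practice cross-validation would be used.
idx = rng.permutation(n)
train, hold = idx[:300], idx[300:]
models = [fit_poly(x[train], y[train], d) for d in (1, 3)]

# Held-out predictions from each model form the design matrix for the weights.
P = np.column_stack([m(x[hold]) for m in models])
w, _ = nnls(P, y[hold])                    # nonnegative stacking weights
w /= w.sum()                               # normalize to a convex combination

print("stacking weights:", np.round(w, 2))
```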