## Variable Selection in the Presence of Massive Data

- 02/06/2016

**Published In**

- Big Data
- Analytics
- Business Intelligence

Data scientists are always stressing over the “best” approach to variable selection, particularly when faced with massive amounts of information, a not uncommon occurrence these days. "Massive" by today's standards means terabytes of data and tens, if not hundreds, of millions of features or predictors. There are many reasons for this, but the reality is that a single, canonical answer does not exist. There are as many approaches as there are statisticians, since every statistician and their sibling has a POV or a paper on the subject.

For years, there have been rumors that Google uses all available features in building its predictive algorithms. To date, however, no disclaimers, explanations or working papers have emerged that clarify or dispute this rumor. Not even their published patents shed light on it. As a result, to the best of my knowledge, no one external to Google knows what they are doing.

One of the biggest problems in predictive modeling is the conflation of classic hypothesis testing and careful model specification with pure data mining. The classically trained can get quite dogmatic about the need for "rigor" in model design and development. The fact is that when confronted with massive numbers of candidate predictors and multiple possible targets or dependent variables, the classic framework neither works, holds nor provides useful guidance. Numerous recent papers delineate this dilemma, from Chattopadhyay and Lipson's brilliant paper Data Smashing: Uncovering Lurking Order in Data (available here), which notes, "The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning," to last year's AER paper on Prediction Policy Problems by Kleinberg, et al. (available here), which makes the case for data mining and prediction as useful tools in economic policy making, citing instances where "causal inference is not central, or even necessary."

The bigger, $64,000 question, though, is the broad shift in thinking and the challenges to the classic hypothesis-testing framework implicit in, e.g., this Edge.org symposium on "obsolete" scientific thinking (available here), as well as this recent article by Eric Beinhocker on the "new economics," which presents some radical proposals for integrating widely disparate disciplines such as behavioral economics, complexity theory, predictive model development, and network and portfolio theory as a platform for policy implementation and adoption (available here). Needless to say, these discussions go far beyond merely statistical concerns and suggest that we are undergoing a fundamental shift in scientific paradigms. The shifting views are as fundamental as the distinction between reductionist, Occam's Razor-style model building and Epicurus' expansive Principle of Plenitude, or multiple explanations, which roughly states that if several findings explain something, retain them all (see, e.g., here).

Of course, guys like Beinhocker are totally unencumbered by practical, in-the-trenches issues regarding applied statistical solutions to this evolving paradigm. With respect to the nitty-gritty questions of ultra-high-dimensional variable selection, there are many viable approaches to model building leveraging, e.g., Lasso, LAR, stepwise algorithms or "elephant models” that use all of the available information. The reality is that, even with AWS or a supercomputer, you can't use all of the available information at the same time – there simply isn’t enough RAM to load it all in. What does this mean? Workarounds have been proposed, e.g., the NSF's Discovery in Complex or Massive Datasets: Common Statistical Themes, as well as "divide and conquer" or "bags of little jackknife" algorithms for massive data mining – see, e.g., Wang, et al.'s paper A Survey of Statistical Methods and Computing for Big Data (available here) and Leskovec, et al.'s book Mining of Massive Datasets (available here).
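To make the first family of approaches concrete, here is a minimal, from-scratch Lasso fit via cyclic coordinate descent – a toy sketch in numpy, not a production solver, and the function name, penalty value and synthetic data are all illustrative:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.
    Assumes the columns of X are (approximately) standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every feature's contribution except j's.
            resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ resid
            # Soft-threshold: small correlations are shrunk exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
true_beta = np.zeros(50)
true_beta[:3] = [4.0, -3.0, 2.0]          # only 3 of 50 features matter
y = X @ true_beta + 0.1 * rng.standard_normal(200)

beta = lasso_cd(X, y, lam=50.0)
# Typically only the true predictors survive; the rest shrink exactly to zero.
print((np.abs(beta) > 1e-8).sum())
```

The point of the sketch is the selection behavior itself: the L1 penalty zeroes out most coefficients, which is exactly why Lasso-type methods are attractive when candidate predictors number in the millions.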

There are now literally hundreds, if not thousands, of papers that deal with various aspects of these challenges, proposing widely differing analytic engines at their core: so-called “D&C" or "BLJ" algorithms; unsupervised "deep learning" models; random matrix theory applied to massive covariance construction; Bayesian tensor models; classic, supervised logistic regression; and more. Fifteen or so years ago, the debate largely focused on the relative merits of hierarchical Bayesian solutions vs frequentist finite mixture models. In a paper addressing these issues, Ainslie, et al. (available here) came to the conclusion that the differing theoretical approaches, in practice, produced largely equivalent results, with the exception of problems involving sparse and/or high-dimensional data, where HB models had the advantage. Today, with the advent of D&C workarounds, any arbitrage HB models may have historically enjoyed is rapidly being eliminated.

The basic logic of these D&C workarounds is, by and large, an extension of Breiman's famous random forest technique, which relied on bootstrapped resampling of observations and features. Breiman did his work in the late 90s on a single CPU, when massive data meant a few dozen gigs and a couple of thousand features processed over a couple of thousand iterations. On today's massively parallel, multi-core platforms, it is possible to run algorithms analyzing terabytes of data containing tens of millions of features, building millions of "RF" mini-models in a few hours. Theoretically, it’s possible to build models from petabytes of data with these workarounds, but present IT platforms and systems won’t execute that yet – to the best of my knowledge (if anyone knows where this is being done and how, please feel free to share that information).
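The Breiman-style resampling behind these workarounds can be sketched in a few lines. This is a deliberately simplified illustration – tiny least-squares fits standing in for the mini-models, with function names and parameters that are mine, not any particular package's API:

```python
import numpy as np

def fit_minimodels(X, y, n_models=50, n_rows=200, n_feats=5, seed=0):
    """Fit many tiny least-squares mini-models, each on a bootstrap
    sample of rows and a random subset of features."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    models = []
    for _ in range(n_models):
        rows = rng.choice(n, size=n_rows, replace=True)      # bootstrap observations
        feats = rng.choice(p, size=n_feats, replace=False)   # random feature subset
        coef, *_ = np.linalg.lstsq(X[np.ix_(rows, feats)], y[rows], rcond=None)
        models.append((feats, coef))                         # store the "parameters"
    return models

def predict(models, X):
    """Roll up the ensemble by averaging mini-model predictions."""
    return np.mean([X[:, f] @ c for f, c in models], axis=0)

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(1000)

models = fit_minimodels(X, y)
preds = predict(models, X)
```

Because each mini-model touches only a small block of rows and columns, the fits are embarrassingly parallel – which is precisely what makes the approach viable on terabyte-scale data where no single machine can hold the full design matrix.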

There are any number of important questions coming out of all of this. One has to do with a concern over a loss of precision due to the approximating nature of these workarounds. This issue has been addressed by Chen and Xie in their paper, A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data (available here), where they conclude that these approximations are indistinguishable from "full information" models.
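The flavor of their result can be shown with a toy split-and-conquer on ordinary least squares – a simplified stand-in for Chen and Xie's actual estimator, which combines the chunk fits with more careful weighting:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 10
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# "Full information" fit on all n rows at once, for reference.
full_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Split-and-conquer: fit OLS on each of k disjoint chunks,
# then average the per-chunk coefficient vectors.
k = 20
chunk_coefs = [
    np.linalg.lstsq(Xc, yc, rcond=None)[0]
    for Xc, yc in zip(np.array_split(X, k), np.array_split(y, k))
]
combined = np.mean(chunk_coefs, axis=0)

# The averaged estimate lands very close to the full fit.
print(float(np.max(np.abs(combined - full_coef))))
```

For well-behaved problems like this one, the combined estimator loses essentially nothing relative to the full fit, while never requiring more than one chunk in memory at a time.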

A second concern, which, to the best of my knowledge, hasn't been adequately addressed in the literature, has to do with what is done with the results (i.e., the "parameters") from potentially millions of predictive mini-models once the workarounds have been rolled up and summarized. In other words, how does one execute something as simple as "scoring" new data with these results? Are the mini-model coefficients to be saved and stored, or does one simply rerun the D&C algorithm on new data?

In his book, Numbers Rule Your World (available here), Kaiser Fung describes the dilemma Netflix faced when presented with the ensemble of 104 models handed over by the winners of their competition. The winners had, indeed, minimized the MSE vs all other competitors, but this translated into an improvement of only a few decimal places in accuracy on the 5-point, Likert-type rating scale used by their movie recommender system. In addition, the IT maintenance required for this small ensemble of models cost much more than any savings seen from the "improvement" in model accuracy.

Then there's the whole question of whether "optimization" is even possible with information of this magnitude. For instance, Emanuel Derman, the physicist and financial engineer, suggests in his book My Life as a Quant that optimization is an unsustainable myth, at least in financial engineering.

Finally, questions concerning relative feature importance with massive numbers of features have yet to be addressed.

There are no easy answers to the questions concerning the need for variable selection, and the new challenges opened up by the current, Epicurean workarounds remain to be resolved. The bottom line is that we are all data scientists now.

Source: Originally posted anonymously by the author to a thread on Stack Exchange's statistical Q&A site, Cross Validated (www.stats.stackexchange.com).


#### Thomas Ball

Advanced Analytics Professional at Thomas Ball

Opinions expressed by Grroups members are their own.

