The volume, variety, and velocity of big data have opened up a new world of possibilities for human progress. Finding new cures for cancer, discovering unknown particles, and even preventing crime are just a few of the promises this new world holds.
These exciting prospects have led some to believe that big data, artificial intelligence (AI), the internet of things (IoT), and cloud computing will be sufficient to solve any number of technological, societal, and scientific challenges. On this view, the scientific method, statistical thinking, domain expertise, and even common sense no longer play an important role in progress and human thriving, a position exemplified by “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” a 2008 article by Chris Anderson, then editor in chief of Wired.1
Yet the absence of fundamentals such as good modelling and statistical thinking has frequently led to wrong, biased results, sometimes with dire consequences. Researchers at Duke University published papers suggesting that DNA biomarkers could form the basis for creating individual treatments for breast cancer, then went on to treat women in trials based on their findings. These clinical trials, however, tragically failed to produce the expected results, with more women dying than the models predicted. Two statisticians who investigated the study discovered structural errors in the way the data had been prepared and analyzed.2
A lighter example of what can happen when good statistical thinking is absent occurred with Amazon’s algorithmically generated list price of USD 23.7M for the book The Making of a Fly, a scientific text on genetics that contains neither an original Picasso nor a hidden cache of diamonds. The absurd price resulted from a price war between booksellers’ algorithms; humans ultimately interceded and set the book’s price back to its fair value of USD 106.
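The feedback loop behind such a price spiral can be sketched in a few lines. The example below is a minimal illustration, not a reconstruction of the booksellers' actual algorithms: the function name `simulate_price_war`, the two multipliers, the starting price, and the optional `max_price` sanity cap are all hypothetical values chosen to show how two simple repricing rules compound when neither has a ceiling.

```python
def simulate_price_war(start_price, undercut=0.998, markup=1.27,
                       rounds=50, max_price=None):
    """Simulate two hypothetical repricing rules reacting to each other.

    Seller A always prices slightly below seller B (factor `undercut`).
    Seller B always prices well above seller A (factor `markup`).
    If undercut * markup > 1, both prices grow every round unless a
    sanity cap (`max_price`) is applied.
    """
    price_a = price_b = start_price
    history = []
    for _ in range(rounds):
        price_a = undercut * price_b   # A reacts to B's latest price
        price_b = markup * price_a     # B reacts to A's new price
        if max_price is not None:
            # A simple check that a human or a rule could impose
            price_a = min(price_a, max_price)
            price_b = min(price_b, max_price)
        history.append((price_a, price_b))
    return history


# Illustrative run: the numbers here are assumptions, not the real listings.
uncapped = simulate_price_war(100.0, rounds=50)
capped = simulate_price_war(100.0, rounds=50, max_price=500.0)
print(f"Seller B after 50 rounds without a cap: {uncapped[-1][1]:,.2f}")
print(f"Seller B after 50 rounds with a cap:    {capped[-1][1]:,.2f}")
```

Even a crude upper bound, or a plot of the price history, would flag the runaway growth long before the price becomes absurd.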
These instances are by no means the only refutations of Mr. Anderson’s declaration that the scientific method is obsolete. In the 10 years since the article’s publication, evidence has abounded that sound statistical practices such as ensuring data quality, designing experiments correctly, establishing a clear methodology for approaching problems, and applying domain-specific knowledge are essential to the success of any AI project and, on a larger scale, to the success of organization-wide digital transformation.
The Statistics Division of the American Society for Quality offers this definition of statistical thinking:
Statistical thinking is a philosophy, not a number of sequential steps. It is foremost, how we approach and think about a problem and not about algorithms, equations or even data. It is the capacity to understand a challenge and translate it into a Data Science methodology. Having data without domain specific knowledge is much less likely to produce meaningful and actionable results.3
When applying AI to a big data problem, consider the guidelines below to employ statistical thinking and avoid the kind of bad results produced by the Duke cancer researchers and Amazon’s algorithm.
- Clearly articulate the business or scientific problem you want to solve and define the scope of the project correctly.
- Invest time in translating the problem you want to solve into a data science methodology starting with a correct data model. Determine the answers to questions such as: How is the data being collected? What are the process hierarchies? How were the measurements physically collected?
- Do not take for granted the quality of the data. A good rule of thumb is to spend about 85 percent of your project time understanding, cleaning, and transforming your data.
- Start with the basics: descriptive statistics and graphs. Calculating measures of location, variation, and range, as well as plotting the data in meaningful graphs, should be part of any analytics workflow. This work will allow your team to detect problems in the data such as unexpected trends and spurious associations that will affect the AI learning process and in turn lead to biased results.
- Prepare your studies with a sound design of trials and experiments. The endgame of a study is detecting, as accurately as possible, the effect of the parameters under scrutiny. To make this possible, the design must dilute, as far as possible, all variation due to parameters that were measured but whose effect is not of interest, as well as variation due to known and unknown parameters that were not measured.
- Carefully consider not only explanatory variables but also response variables. Big data problems often entail data that come from a variety of very different sources, with different stakeholders having different objectives and perceptions about the potential outcomes.
- Use hypothesis testing, with control groups, whenever possible. Resist the temptation to jump directly to common performance metrics (e.g., reducing churn by 5 percent); using the scientific method provides valuable insights about the process being modelled and ends up saving time.
- Validate your models. Validation is a crucial part of any AI project, especially in predictive analytics. The goal of a model is to produce accurate predictions for future data rather than a perfect fit to the training data that is available.
- Update models to ensure they stand the test of time. Data frameworks may work appropriately when models are delivered, yet the assumptions under which those models are developed change with time. Customers start behaving differently, new parameters influence outcomes, and organizations and society change.
- Do not underestimate the importance of domain expertise. Problems have a history; it is essential to the success of any AI endeavour to embed domain knowledge to avoid naive approaches.
- Implement procedures to monitor whether the AI model and framework outcomes are contributing to the goals for which they were created.
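Several of the guidelines above (descriptive statistics, hypothesis testing with a control group, and validating on data held out from training) can be sketched in a few lines. The following is a minimal illustration using only the Python standard library. The function names, the choice of a permutation test as the hypothesis test, and the example numbers are assumptions made for illustration; a real project would choose its tests and validation scheme to suit the data and the design of the study.

```python
import random
import statistics


def describe(values):
    """Basic measures of location, variation, and range for one variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def permutation_test(control, treatment, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the observed difference (treatment minus control) and the
    share of random relabelings whose difference is at least as extreme.
    """
    rng = random.Random(seed)
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = (statistics.mean(pooled[n_control:])
                - statistics.mean(pooled[:n_control]))
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and set aside a fraction never used for training."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical measurements for a control group and a treated group.
control = [10, 11, 9, 10, 12, 10, 11, 9]
treatment = [20, 21, 19, 22, 20, 21, 19, 20]

print(describe(control))
print(describe(treatment))

difference, p_value = permutation_test(control, treatment)
print(f"difference in means: {difference}, permutation p-value: {p_value}")

train_rows, test_rows = holdout_split(list(range(20)))
print(f"{len(train_rows)} rows for training, {len(test_rows)} held out")
```

The permutation test is used here because it makes few distributional assumptions and shows the role of the control group directly; the held-out rows stand in for the future data on which a model's predictions should be judged.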
Achieving viable results with data science depends on adherence to the scientific method and cannot be achieved merely by crunching data with powerful central processing units and graphics processing units. Underestimating the importance of sound knowledge, common sense, and human creativity can lead to inconclusive, biased, dangerous, and occasionally humorous conclusions.
3. Statistics Division of the American Society for Quality. Glossary and Tables for Statistical Quality Control. 3rd ed. Milwaukee, WI: Quality Press; 1996.