# BIG DATA CAN LIE: SIMPSON’S PARADOX

Mind Matters | 4/15/2019 | Staff

There are many ways data analytics can lead to wrong conclusions. Sometimes a dishonest data cruncher interprets data to further her agenda. But misleading data can also come from curious flukes of statistics. Simpson’s Paradox [1, 2] is one of these flukes.

Here’s an example. Baseball player Babe has a better batting average [3] than Mickey in both April and May. So, in terms of batting average, Babe is a better baseball player than Mickey. Right?

No.

It turns out that Mickey’s combined batting average for April and May can be higher than Babe’s. In fact, Babe can have a better batting average than Mickey every month of the baseball season and Mickey may still be a better hitter. How? That’s Simpson’s Paradox.

Here are the data for the two consecutive months:

Over April and May combined, Mickey and Babe each had 100 at bats. In April, Mickey had 45 hits in 90 at bats and Babe had 8 hits in 10. Their batting averages (hits divided by at bats, times 1000) were 500 for Mickey and 800 for Babe, so Babe’s April average was much higher. The same is true for May: Babe’s 27 hits in 90 at bats give her a batting average of 300, while Mickey’s 2 hits in 10 at bats give him a lower average of 200. But the aggregate data tell a different story. Mickey’s total of 47 hits in 100 at bats gives him an average of 470, while Babe’s 35 hits give her an average of only 350.
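The arithmetic is easy to check with a short Python sketch. The hit and at-bat counts are the ones quoted in the text; the names `records`, `batting_average`, and `season_average` are illustrative and not from the article.

```python
# (hits, at_bats) per month, using the figures quoted in the text
records = {
    "Mickey": {"April": (45, 90), "May": (2, 10)},
    "Babe": {"April": (8, 10), "May": (27, 90)},
}


def batting_average(hits, at_bats):
    """Hits per at bat, scaled by 1000 and rounded to a whole number."""
    return round(1000 * hits / at_bats)


def season_average(months):
    """Pool hits and at bats across months before dividing."""
    total_hits = sum(hits for hits, _ in months.values())
    total_at_bats = sum(at_bats for _, at_bats in months.values())
    return batting_average(total_hits, total_at_bats)


for player, months in records.items():
    monthly = {m: batting_average(h, ab) for m, (h, ab) in months.items()}
    print(player, monthly, "combined:", season_average(months))
```

The combined figure is not the mean of the two monthly averages. It weights each month by its at bats, so Mickey’s 90 at bats in his strong month count for far more than his 10 at bats in his weak one, and the reverse holds for Babe.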

Simpson’s Paradox illustrates the importance of human interpretation of the results of data mining. Nobel Laureate Ronald Coase said, “If you torture the data long enough, it will confess.” Unsupervised Big Data can torture numbers inadvertently because artificial intelligence is ignorant of the significance of the data.

What’s Going On?

Data viewed in the...
(Excerpt) Read more at: Mind Matters