In this article we are going to learn about Simpson’s Paradox and explore it with Python programming.
Simpson’s Paradox is where data appears to tell a different story when treated as a whole than when its component parts are considered. There are some famous cases of this paradox which you can read about on Wikipedia. The purpose of this article is to explore the paradox by means of a specific example and the use of Python programming.
The data for this example is taken from a YouTube video by a Harvard Lecturer as part of a Statistics 110 course. It is a fictitious example about the success of operations performed by two doctors from the Simpsons TV show.
The table below shows an example of Simpson’s Paradox arising from some fictitious data for the success of operations performed by two doctors from the Simpsons TV show.
Success Rates for Operations by Two Fictitious Doctors
Dr Hilbert | Dr Nick | |
---|---|---|
Heart Surgery | 70/90 (77%) | 2/10 (20%) |
Band Aid Removal | 10/10 (100%) | 81/90 (90%) |
Total | 80/100 (80%) | 83/100 (83%) |
Study the table and see if you can spot the “paradox”.
The issue lies in that fact that the individual success rates for Dr Hilbert are higher for both heart surgery and for band aid removal, yet somehow Dr Nick has a higher overall success rate!
How can this be?
In general the issues involved when Simpson’s Paradox is present can go quite deep, as there may be several factors at play. However, for this article, we are going to stick with the basics. From the point of view of arithmetic, the problem can be seen like this:
Think about the data in terms of 4 regions
A | B
-----
C | D
If we have (A > B) & (C > D)
then (A + C) > (B + D)
.
The paradox arises due to the aggregation step – A + C
are added as if when adding fractions, a/b + c/d
were equal to (a + b) / (c + d)
.
If that’s too complex to follow, just observe that we are not adding the percentages for each column directly, as in the table below.
Dr Hilbert | Dr Nick | |
---|---|---|
Heart Surgery | 70/90 (77%) | 2/10 (20%) |
Band Aid Removal | 10/10 (100%) | 81/90 (90%) |
Total | 160/90 (177.78%) | 99/90 (110%) |
Adding the columns as shown above would give a different impression of what the data conveys. Although a 177.78%
success rate may not make much sense mathematically, it may give a clearer picture of how the performance of the two doctors compares. This is not the only alternative way to aggregate the data though. For example, average ratios could be used (88.5%
for Dr Hilbert, 55%
for Dr Nick) or weighted averages, which take into account which operation is more difficult.
There is no one correct interpretation of the data when Simpson’s Paradox is present!
The moral of the story is, when working with data, think very carefully about how it is comprised. Sometimes looking at aggregated data is useful but in other situations it can obscure the what is really happening.
Python Code Listing for Detecting Simpson’s Paradox
Below is a Python program which can detect Simpson’s Paradox.
import numpy as np
import pandas as pd
def aggregate_data(df, conversion_col, factor_1_col, factor_2_col):
"""
Takes a frame of individual-level data and aggregates it for Simpson's Paradox detection.
"""
df_ = df[[conversion_col, factor_1_col, factor_2_col]]
gb = df_.groupby([factor_1_col, factor_2_col]).agg([np.sum, lambda x: len(x)])
# gb index is currently MultiIndex.
gb.columns = [conversion_col, "total"] # rename columns for aggregated data.
return gb.reset_index()
def simpsons_paradox(df, conversion_col, total_col, factor_1_col, factor_2_col):
"""
Determine if simpsons paradox is present.
"""
# Find the global optimal:
gbs = df.groupby(factor_1_col).sum()
print("## Global rates (%): ")
print(round((gbs[conversion_col] / gbs[total_col] * 100), 2))
print()
global_optimal = (gbs[conversion_col] / gbs[total_col]).argmax()
# Check for optimal via segments
df_ = df.set_index([factor_2_col, factor_1_col])
rates = (df_[conversion_col] / df_[total_col]).unstack(-1)
print("## Local rates (%):")
print(round(rates * 100, 2))
print()
# Find the local optimals
local_optimals = rates.apply(lambda x: x.argmax(), 1)
if local_optimals.unique().shape[0] > 1:
print("## Simpson's Paradox not detected.")
print("## Segmented rates do not have a consistent optimal choice.")
print("## Local optimals:")
print(local_optimals)
print("## Global optimal: ", global_optimal)
return False
local_optimal = local_optimals.unique()[0]
print("## Global optimal: ", global_optimal)
print("## Local optimal: ", local_optimal)
if local_optimal != global_optimal:
print("## Simpson's Paradox detected.")
return True
else:
print("## Simpson's Paradox not detected.")
return False
if __name__ == "__main__":
# Generate data
d = []
d += ([('Dr Hilbert', 'Heart Surgery', 1)] * 70) # successful heart surgery
d += ([('Dr Hilbert', 'Heart Surgery', 0)] * (90 - 70)) # unsuccessful heart surgery
d += ([('Dr Hilbert', 'Band Aid Removal', 1)] * 10)
d += ([('Dr Hilbert', 'Band Aid Removal', 0)] * (10 - 10))
d += ([('Dr Nick', 'Heart Surgery', 1)] * 2)
d += ([('Dr Nick', 'Heart Surgery', 0)] * (10 - 2))
d += ([('Dr Nick', 'Band Aid Removal', 1)] * 81)
d += ([('Dr Nick', 'Band Aid Removal', 0)] * (90 - 81))
df = pd.DataFrame(d, columns=['doctor', 'operation', 'success'])
gb = aggregate_data(df, 'success', 'doctor', 'operation')
simpsons_paradox(gb, 'success', 'total', 'doctor', 'operation')
Some of the code is quite involved. There are some comments to help you understand what is going on, but if you are having trouble with any particular part, let me know and I’ll try and help. It is definitely a good idea to tweak the data and see what effect it has. You could also try with different data, such as that from the famous kidney stone treatment example on Wikipedia.
This article has explored Simpson’s Paradox – what it is, and how to use Python to detect whether it exists in provided data. I hope you have found it interesting and helpful.
Happy computing!