Healthcare Standards: On the Power Distribution of Healthcare Data Breaches

Friday, February 3, 2012

A tweet, an article and a report all recently came to my attention on the impact of healthcare data breaches, specifically the number of patients affected. As a math and statistics hobbyist, I wanted to look at the data for myself, because I found the reported impacts rather sensational.

One of the things that I suspected about data breaches is that their distribution by size follows some sort of power law. These events, after all, are somewhat like other disasters (and also some non-disasters). One would expect the number of people affected by them, or their overall cost, to be inversely related to their frequency. I took the data and classified the breaches by size, then counted the number of breaches in each size bucket. The size buckets I used were (approximately):

500-1,500
1,500-5,000
5,000-15,000
15,000-50,000
50,000-150,000
150,000-500,000
500,000-1,500,000
1,500,000-5,000,000

I started at 500 because the public data includes only breaches affecting 500 or more patients. I picked these ranges because their midpoints are spaced evenly on a log scale. When I plotted the frequencies on a log-log graph and computed a power trendline, this is what I got:

I won't say this is proof of my thesis, because I took the easy way out and let Excel compute the power trend, rather than using an appropriate estimation technique. What I saw was actually good enough for me to assume some sort of power law distribution.
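The bucketing and trendline steps can be sketched in a few lines of Python. This is an illustration, not the spreadsheet that produced the graph. The function names are made up for this example, the buckets are treated as half-open ranges, and using the geometric midpoint of each bucket is an assumption about what "midpoint" means on a log scale. The fit is an ordinary least-squares line through the logged points, which is how a spreadsheet power trendline is typically computed.

```python
import math
from bisect import bisect_right

# Bucket edges from the post (number of patients affected).
EDGES = [500, 1_500, 5_000, 15_000, 50_000, 150_000,
         500_000, 1_500_000, 5_000_000]


def bucket_counts(sizes):
    """Count breaches per bucket; each bucket is the half-open range [low, high)."""
    counts = [0] * (len(EDGES) - 1)
    for s in sizes:
        if EDGES[0] <= s < EDGES[-1]:
            counts[bisect_right(EDGES, s) - 1] += 1
    return counts


def bucket_midpoints():
    """Geometric midpoint of each bucket (assumed meaning of 'midpoint' on a log scale)."""
    return [math.sqrt(lo * hi) for lo, hi in zip(EDGES, EDGES[1:])]


def fit_power_trend(xs, ys):
    """Least-squares fit of log(y) = log(c) + b*log(x); returns (c, b).

    Points with y == 0 are skipped because log(0) is undefined.
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if y > 0]
    n = len(pts)
    mx = sum(lx for lx, _ in pts) / n
    my = sum(ly for _, ly in pts) / n
    sxx = sum((lx - mx) ** 2 for lx, _ in pts)
    sxy = sum((lx - mx) * (ly - my) for lx, ly in pts)
    b = sxy / sxx
    return math.exp(my - b * mx), b
```

Given a list of breach sizes, `fit_power_trend(bucket_midpoints(), bucket_counts(sizes))` returns the coefficient and exponent of the trendline. Because the buckets grow in width along with the sizes, the exponent fitted to bucket counts is not the same number as the exponent of the underlying size distribution, which is one reason this kind of fit is only a rough check.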

Those who work with power laws should be familiar with the relationship between power laws, Pareto distributions and Zipf distributions. I suspect my hypothesis can be refined even further and is worthy of a paper. As I said, I'm a hobbyist, not a professional mathematician. I'll skip the paper for now, as there are too many other things on my plate. I do invite others to take a look at it, and if you do write a paper as a result of this post, please let me know about it.
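For anyone who does take up that invitation, one commonly used alternative to a log-log trendline is the maximum-likelihood estimate of the exponent for a continuous power law (a Pareto tail). The sketch below is that swapped-in technique, not anything computed for this post. It assumes the sizes can be treated as continuous above a cutoff `x_min` of 500, and the helper that generates test data produces synthetic numbers, not real breach records.

```python
import math
import random


def pareto_mle_alpha(sizes, x_min=500):
    """Maximum-likelihood estimate of alpha for a density p(x) ~ x**(-alpha), x >= x_min.

    Returns (alpha, approximate standard error). The matching Pareto tail
    P(X >= x) falls off as x**(-(alpha - 1)).
    """
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


def synthetic_sizes(n, alpha, x_min=500, seed=0):
    """Draw n SYNTHETIC sizes from a Pareto tail by inverse-transform sampling."""
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], so the power below is always defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]
```

Running the estimator on `synthetic_sizes(...)` with a known exponent is a quick sanity check before pointing it at real data. A proper analysis would also test whether a power law is a better fit than alternatives such as a log-normal.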

Having reached this conclusion, the next question I wanted to answer was whether data breaches are increasing or decreasing. The overall impact on patients in 2011 was certainly larger, as the report suggests, but this doesn't really indicate where the trend is going. Because of the long-tailed distribution, some events (e.g., breaches affecting a million or more patients) are expected to occur very infrequently, which means that they won't show up in year-over-year statistics frequently enough to judge their impact.
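To put rough numbers on "very infrequently", here is a small sketch under stated assumptions: breach sizes follow a Pareto tail above 500 with density exponent `alpha`, and the number of very large breaches in a year is approximately Poisson. Both are modeling assumptions, and `alpha` and `breaches_per_year` are hypothetical inputs to be supplied, not values taken from the breach data.

```python
import math


def tail_probability(x, alpha, x_min=500):
    """P(size >= x) for a Pareto tail with density exponent alpha, x >= x_min."""
    return (x / x_min) ** (-(alpha - 1.0))


def expected_per_year(x, breaches_per_year, alpha, x_min=500):
    """Expected number of breaches per year of size x or larger."""
    return breaches_per_year * tail_probability(x, alpha, x_min)


def prob_none_in_year(x, breaches_per_year, alpha, x_min=500):
    """Poisson approximation to the chance that a year has no breach of size >= x."""
    return math.exp(-expected_per_year(x, breaches_per_year, alpha, x_min))
```

When `expected_per_year` comes out well below one for some size, most years will contain no breach that large, and a year that does contain one will dominate that year's patient total.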

What I did next was plot the trend in the number of breaches by size, and got this:

What this graph shows is that the trend for all breach sizes except the very largest is headed downward. There are only three breaches in the 1.5 to 5 million range, which is barely enough to even think about trending, so it's really hard to say what the trend is for breaches that infrequent.
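A per-bucket trend like the one described above can be approximated by counting breaches per year in each bucket and taking the least-squares slope of count against year. The sketch below assumes a hypothetical input format of `(year, patients_affected)` pairs; the function names are illustrative and this is not the spreadsheet used for the graph.

```python
from bisect import bisect_right
from collections import defaultdict

# Bucket edges from the post (number of patients affected).
EDGES = [500, 1_500, 5_000, 15_000, 50_000, 150_000,
         500_000, 1_500_000, 5_000_000]


def yearly_counts(records):
    """Map bucket index -> {year: count} from (year, patients_affected) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for year, size in records:
        if EDGES[0] <= size < EDGES[-1]:
            table[bisect_right(EDGES, size) - 1][year] += 1
    return table


def trend_slope(counts_by_year, years):
    """Least-squares slope of count against year (breaches per year, per year).

    Years missing from counts_by_year are treated as zero breaches.
    """
    ys = [counts_by_year.get(y, 0) for y in years]
    n = len(years)
    mx = sum(years) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in years)
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, ys))
    return sxy / sxx
```

A negative slope means fewer breaches per year in that bucket over time. With only a handful of years, and only a few events in the largest bucket, the slope is very sensitive to any single breach, which is the same caution that applies to reading the graph.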

Overall, the number of breach reports is also trending downwards: