The Supreme Court was in the news today for deciding that it was constitutional to collect DNA via cheek swab from people under arrest.
In 2009, Maryland collected DNA via cheek swab from Alonzo King and found that it matched evidence from an unsolved rape case six years earlier. He was later convicted of the rape based on that evidence.
Using some basic statistical techniques, we can show how DNA evidence is often badly misunderstood, and how this decision could cause many more justice problems in the future, even aside from the Fourth Amendment concerns.
First, it might be interesting to ask yourself some hypothetical questions. If a suspect's DNA matches the DNA collected at a crime scene, would you say it is likely that that person committed the crime?
Second, what if it were 100% likely that the DNA at the crime scene was from whoever actually committed the crime, and not an innocent third party?
Third, what if it were 100% likely there were no errors in the DNA gathering - either at the crime scene, or from the suspect, and if the DNA signature was extremely uncommon among the general population?
I haven't done any surveys myself, but my suspicion is that most of our CSI-watching population would consider the suspect guilty.
In statistics, there is a formula known as Bayes' Theorem. It describes how to relate a piece of evidence to a hypothesis. It looks like this:
p(H|e) = p(e|H) p(H) / ( p(e|H) p(H) + p(e|~H) p(~H) )
You don't have to look at it too closely if you don't want, but it is a formula for calculating the likelihood of a hypothesis, given (the vertical bar | means "given") a piece of evidence.
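For readers who prefer code to notation, here is a minimal Python sketch of the same formula. The function name `posterior` and its parameter names are my own labels rather than anything standard, and the numbers in the usage line are purely illustrative.

```python
def posterior(p_e_given_h, p_e_given_not_h, p_h):
    """Return p(H|e) for a yes/no hypothesis using Bayes' Theorem.

    p_e_given_h     -- p(e|H): chance of seeing the evidence if H is true
    p_e_given_not_h -- p(e|~H): chance of seeing the evidence if H is false
    p_h             -- p(H): chance that H is true before seeing the evidence
    """
    p_not_h = 1.0 - p_h
    numerator = p_e_given_h * p_h
    denominator = numerator + p_e_given_not_h * p_not_h
    return numerator / denominator


# Illustrative numbers only: evidence that shows up 90% of the time when H
# is true, 10% of the time when it is false, and a 50/50 starting point.
example = posterior(0.9, 0.1, 0.5)
print(example)
```

With a 50/50 starting point, the evidence pulls the answer to roughly 90%. The interesting cases are the ones where the starting point is nowhere near 50/50.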
I don't know the details of the Alonzo King case and if there was also other evidence linking him to the crime, but it is easy to imagine a hypothetical case where a DNA match is the only evidence of a crime. So let's imagine a hypothetical crime with a DNA match and no other evidence. The evidence is that the DNA collected at the crime scene matches the suspect's DNA, and the hypothesis is that the suspect committed the crime.
First, what is p(e|H)? In other words, what is the probability of that evidence existing if the suspect truly did commit the crime? Well, we will assume that the DNA gathering process is perfect. No errors in gathering the cheek swab, no errors in gathering the DNA at the crime scene, 100% likelihood that the crime scene DNA is actually from the assailant and not from an innocent third party, and 100% likelihood that the DNA from the crime scene truly does match the signature of the suspect. In short, p(e|H) = 100%.
What about p(e|~H)? In other words, what is the probability that the evidence exists if the suspect did NOT commit the crime? Well, DNA is pretty accurate. After a quick Google search, I'm seeing estimates that the chance of a coincidental DNA match is around 1 in 2 million. In other words, only 1 in 2 million people might share the same DNA signature, given the size of our DNA data sets. So, p(e|~H) = 0.0000005, or 0.00005%.
It looks pretty damning so far, but here's the tricky bit - what is p(H)? In other words, what is the probability that the suspect committed the crime, aside from the DNA match?
And this is exactly where the problem can happen. In our hypothetical case, there is no other evidence. There's no regional correlation, the suspect and the victim don't know each other, and there's no motive. If we restrict it to the United States, we might say that there is roughly a 1 out of 100 million chance that the suspect committed the crime. So, p(H) = 0.00000001, or 0.000001%. In Bayesian statistics, this is known as the prior probability, or more loosely, the prior odds.
This hypothetical matches the three questions listed above, the scenario that I believe would lead most people to assume guilt. But if you plug these numbers into the formula above, then you get only a 1.96% chance that the suspect committed the crime. And this is because the prior odds were so unlikely - even less likely than the DNA match itself.
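If you want to check that figure yourself, here is a short Python sketch that plugs the three numbers from this hypothetical into Bayes' Theorem. The variable names are my own, and the inputs are the rough estimates used in this post, not real case data.

```python
# Inputs taken from the hypothetical above (rough estimates, not case data).
p_e_given_h = 1.0                 # perfect DNA process: a guilty suspect always matches
p_e_given_not_h = 1 / 2_000_000   # chance that an innocent person matches anyway
p_h = 1 / 100_000_000             # prior: nothing else ties the suspect to the crime

# Bayes' Theorem for a yes/no hypothesis.
numerator = p_e_given_h * p_h
denominator = numerator + p_e_given_not_h * (1.0 - p_h)
p_h_given_e = numerator / denominator

print(f"{p_h_given_e:.2%}")  # prints 1.96%
```

The denominator is dominated by the innocent-match term: a 1 in 2 million coincidence is about fifty times more likely than the 1 in 100 million prior, and that ratio is what drags the result down to about 2%.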
The most common error in Bayesian statistics is forgetting to take prior odds into account. Statistics are counter-intuitive. A rock solid DNA match does not necessarily imply guilt.
Now it is also true that if the prior odds are more likely, this can change the formula's behavior dramatically. If it can be established that the crime was definitely committed by a smaller pool of people, and the suspect is part of that pool, then all of a sudden a DNA match can make it very likely that the suspect is guilty. For instance, using the above odds, if the suspect is one out of only 10,000 people that could have committed the crime, then that DNA match will mean that it is 99.5% likely that the suspect committed the crime.
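To see how strongly the answer depends on the size of the suspect pool, here is a small Python sketch that repeats the calculation for several pool sizes. It assumes, as a simplification of my own, that every person in the pool is equally likely to be the culprit before the DNA match; the function name `posterior_for_pool` and the list of pool sizes are just for illustration.

```python
def posterior_for_pool(pool_size, p_random_match=1 / 2_000_000):
    """Chance of guilt after a DNA match, if the culprit is one of pool_size people.

    Assumes a perfect DNA process (p(e|H) = 1) and that everyone in the
    pool is equally likely to be the culprit beforehand (p(H) = 1 / pool_size).
    """
    p_h = 1.0 / pool_size
    numerator = 1.0 * p_h
    denominator = numerator + p_random_match * (1.0 - p_h)
    return numerator / denominator


# Illustrative pool sizes, from "anyone in the country" down to a small group.
for pool_size in (100_000_000, 1_000_000, 10_000, 100):
    print(f"{pool_size:>11,} people: {posterior_for_pool(pool_size):.4f}")
```

The same DNA match goes from nearly worthless to nearly conclusive as the pool shrinks, which is why the other evidence that narrows the pool matters so much.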
I do not know the general competence level of prosecutors and defense attorneys out there, but it seems to me that it would be easy for juries and lawyers to misunderstand the math in this way. If they find a DNA match, they may think that that alone is sufficient to establish guilt - and it would be intuitively easy to convince juries of it too. The accuracy of DNA can be seen as so sacred that if someone asks "well what OTHER evidence do you have?", they can come across as a ridiculous apologist.
There are plenty of other potential problems with using DNA to prove guilt, of course - there is so much room for error. But all of that is separate from the problems that can arise even when the science and the math are done correctly. If you convict 200 people, and each has a 99.5% chance of being guilty, then on average one of them will actually be innocent!
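That last bit of arithmetic can be checked with a few lines of Python. This sketch assumes the 200 cases are independent of one another, which is my simplification and not something real cases would guarantee; the variable names are my own.

```python
convictions = 200
p_guilty = 0.995  # chance that each convicted person is actually guilty

# Average number of innocent people among the convictions.
expected_innocent = convictions * (1 - p_guilty)

# Chance that at least one of them is innocent, assuming independent cases.
p_at_least_one_innocent = 1 - p_guilty ** convictions

print(f"expected innocent: {expected_innocent:.2f}")
print(f"chance of at least one innocent: {p_at_least_one_innocent:.1%}")
```

Under that independence assumption, the average works out to one innocent person per 200 convictions, and the chance that at least one of the 200 is innocent comes out a bit under two in three.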