Conformal Prediction

July 30, 2009

In terms of experience, a strange event (one that is unlikely to happen) is rare; otherwise we would not call it strange. Likewise, a normal event is frequent; otherwise it would not be normal. Suppose you have experience of throwing a fair die but no idea of probabilities: if you threw the die ten times and got ten sixes, you would find this very strange. It can happen, but it is a rare event. By contrast, you would expect to get a random-looking sequence of numbers (e.g. [6,2,5,6,5,5,5,5,1,3]) when you throw the die 10 times. Why do you find this sequence more normal than the sequence of ten sixes? Well, consider this sequence: [3,1,5,6,4,2,5,1,4,6], or this one: [1,1,1,4,5,3,2,6,4,2]. The group of sequences in which the numbers appear in a “random” way is much larger than the group of sequences which contain only sixes (in fact, the latter group contains exactly one sequence). Consequently, a sequence of all sixes is far less probable than a sequence from the “random” group, and so you observe “random” sequences many more times as you build up your experience of dice throwing.

We use our experience in our everyday life to make simple decisions. Sometimes we are so confident about our decisions that we don’t even think about them. For example, when you walk into your home, your mind has already decided what it is about to observe behind the door, because you observe the same thing every day (we can say here that you are 100% sure about the observation, since you were 100% correct previously). Once you walk in and observe the same thing, everything is normal and you don’t pay attention to it. However, if a thief took your TV set, you would observe something strange about the scene (the fact that the TV is missing!) and you would notice. Now your record of correct observations has dropped, and maybe your confidence too.

Our confidence in our decisions is related to the number of times we have made correct decisions in similar past situations. If I see black clouds, then I can say with some confidence that it is going to rain, since in the past black clouds were followed by rain (it’s a normal event). The more normal the event, the more confident we are that it is going to happen. Likewise, the stranger the opposite outcome would be (e.g. no rain when we see black clouds), the more confident we are that the expected outcome is the one that will happen.

In the field of Artificial Intelligence (AI) we try to mimic and build intelligent behavior into machines (because we would like to have intelligent machines for some reason). Machines can now make decisions based on past experience (we feed the machine with lots of past data). We call this field “Machine Learning”, as the machine “learns” about the data and then makes decisions based on what it has learnt. When we say “learn” we mean that the algorithm builds a model of the data (see for example Artificial Neural Networks), which is something we believe our brain also does somehow when we learn something.

A little bit of maths

We have also formalized the idea of confidence in the decisions (or predictions) of the machine, using the “Conformal Prediction” framework, which is based on the theory of randomness and the idea of conformity or non-conformity (strangeness). For example, suppose I have 1000 pictures of the sky for the last 1000 days. For each day, I measure how black the clouds are, on a scale from 0 to 10 (0 if there are no clouds at all, 1 if the clouds are white, 2 if there is some gray, and so on). I also label each picture with “NR” if it didn’t rain, or with “R” if it did rain. Then, I give each day a strangeness score, which depends on how black the clouds were and whether it rained or not. That is, if it rained and there were no clouds at all (very strange, isn’t it?) then the strangeness score should be the highest. Thus, writing $x_i$ for the cloud blackness on day i and $\alpha_i$ for its strangeness score, the score I would give to a rainy day would be something like:

$$\alpha_i = 10 - x_i$$

On the other hand, if it didn’t rain, the strangeness score $\alpha_i$ would just be the cloud blackness $x_i$ of that day:

$$\alpha_i = x_i$$

So now I have defined a strangeness measure (we call it a non-conformity measure). I can use it in the following way to predict whether it is going to rain or not. Take a picture of the sky on the current day. Assume first (in scientific words, make a hypothesis) that it is going to rain, no matter what. Calculate the strangeness score $\alpha_n$ of this day from its cloud blackness $x_n$ as:

$$\alpha_n = 10 - x_n$$

Then use the following p-value function to calculate how likely it is that it is going to rain:

$$p(R) = \frac{1}{n} \sum_{i=1}^{n} I_{\alpha_i \ge \alpha_n}$$

where $I_{\alpha_i \ge \alpha_n}$ is the indicator function, equal to 1 if $\alpha_i \ge \alpha_n$ and equal to 0 otherwise. This function calculates the number of days that were at least as strange as the current day (under the assumption that it will be a rainy day, and counting the current day itself), over the total number of days n. If there are no clouds at all and I assume that it is going to rain, then the score $\alpha_n$ would be very high (10). If this kind of strangeness has never happened before, the value p(R) would be the lowest possible, since all the other scores would be lower than $\alpha_n$. That is: p(R) = 1/n (because only $\alpha_n \ge \alpha_n$).
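The two steps so far (score each day, then count how many days are at least as strange as today) can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the scoring rule `10 - blackness` for rainy days is one plausible reading of the description above, and the small history `past_days` is made up rather than real data.

```python
def strangeness(blackness, label):
    """Non-conformity score of one day.

    blackness: cloud darkness on the 0-10 scale used in the text.
    label: "R" if it rained, "NR" if it did not.
    Assumed rule: rain under a clear sky scores 10, no rain under
    a clear sky scores 0.
    """
    if label == "R":
        return 10 - blackness
    return blackness


def p_value(past_days, blackness, assumed_label):
    """Fraction of days (past days plus today) at least as strange as today."""
    alpha_n = strangeness(blackness, assumed_label)
    scores = [strangeness(b, lab) for b, lab in past_days] + [alpha_n]
    return sum(1 for a in scores if a >= alpha_n) / len(scores)


# Hypothetical toy history of (blackness, label) pairs.
past_days = [(0, "NR"), (1, "NR"), (2, "NR"), (3, "NR"),
             (5, "R"), (6, "R"), (8, "R"), (9, "R"), (10, "R")]

# Today the sky is clear (blackness 0); try both hypotheses.
print("p(R)  =", p_value(past_days, 0, "R"))
print("p(NR) =", p_value(past_days, 0, "NR"))
```

With nine past days plus today, n is 10, so the rain hypothesis gets the smallest p-value available (1/10) and the no-rain hypothesis gets the largest (1).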

Now let’s assume (make another hypothesis) that it is not going to rain. We calculate the strangeness score $\alpha_n$ from the cloud blackness $x_n$ of the current day as

$$\alpha_n = x_n$$

Since this is the same picture, the score now is very low (0). If we apply the same function

$$p(NR) = \frac{1}{n} \sum_{i=1}^{n} I_{\alpha_i \ge \alpha_n}$$

we expect p(NR) to get a much higher value, as it is more likely that many other days would be stranger than 0. In fact, 0 is the lowest value possible here and that would give us the highest p-value possible, which is 1.

Now we have two p-values, p(R) and p(NR), which tell us how likely it is that it is going to rain or not going to rain respectively. One is very low and the other is very high. If you take the hypothesis that gives the highest p-value, then you have a prediction for the day.

But there is more. How confident can you be about your prediction? As I mentioned earlier, the more normal the event and the stranger the opposite event is, the more confident you are about your prediction. Here, we can use the second largest p-value to define our confidence: the smaller the second largest p-value, the more confident we are about the prediction given by the largest one. Of course, we don’t just use this because we said so. The p-value function satisfies the following property, where p is the p-value of the true assumption:

$$P(p \le \varepsilon) \le \varepsilon$$

which says that the probability of the p-value of the true assumption being less than or equal to any number ε (between 0 and 1) is less than or equal to ε.
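The property can be checked exactly on a tiny example. The sketch below assumes n distinct strangeness scores whose n! orderings are all equally likely (which is what the i.i.d. assumption gives us when there are no ties), enumerates every ordering, and computes the exact probability that the last day's p-value is at most ε. The function names and the choice n = 5 are mine.

```python
from fractions import Fraction
from itertools import permutations


def p_value_of_last(scores):
    """p-value of the last score: share of scores at least as large as it."""
    alpha_n = scores[-1]
    return Fraction(sum(1 for a in scores if a >= alpha_n), len(scores))


def prob_p_at_most(eps, n):
    """Exact P(p <= eps) over all n! equally likely orderings of n distinct scores."""
    orderings = list(permutations(range(n)))
    hits = sum(1 for s in orderings if p_value_of_last(s) <= eps)
    return Fraction(hits, len(orderings))


n = 5
for eps in (Fraction(1, 10), Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)):
    prob = prob_p_at_most(eps, n)
    print(f"eps = {eps}: P(p <= eps) = {prob}, bound holds: {prob <= eps}")
```

In every case the probability stays at or below ε, and it only reaches ε when ε is a multiple of 1/n.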

What? I will repeat this: the probability that the correct assumption (R or NR) will give a p-value less than or equal to a number ε is itself less than or equal to ε. This property comes from the i.i.d. assumption: that our data are independent and identically distributed (the weather example may not be truly i.i.d., but many other real-world data are). Now, if I set ε to be my second largest p-value (which is p(R) = 1/n in our example), then I can say that, if rain were in fact the true outcome, a p-value this small would occur with probability at most 1/n. With an experience of, say, n = 2000 days, 1/n = 0.0005, that is, 0.05%. So the probability of being wrong about my prediction that it is not going to rain is at most 0.05%. Therefore my wrong predictions will be bounded by this number, and I can finally say that my confidence in the prediction that it is not going to rain is 1 - 0.0005 = 99.95%.

With this method the confidence adjusts automatically, so that in the long run the error rate will not exceed one minus the confidence level. For example, if there are a few clouds in the image, the confidence of the prediction will drop a little, depending on the past data (e.g. if a few clouds have caused rain in the past, then p(R) would increase and thus the confidence in the prediction NR would decrease).
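The arithmetic of this example is short enough to write down as code. The sketch below takes the two p-values as given (1/n for rain and 1 for no rain, with n = 2000 as in the example), picks the label with the largest p-value and reports one minus the second largest p-value as the confidence. The function name is my own.

```python
def predict_with_confidence(p_values):
    """Return (prediction, confidence) from a dict mapping label -> p-value.

    prediction: the label with the largest p-value.
    confidence: 1 minus the second largest p-value.
    """
    ranked = sorted(p_values, key=p_values.get, reverse=True)
    prediction = ranked[0]
    confidence = 1 - p_values[ranked[1]]
    return prediction, confidence


n = 2000
p_values = {"R": 1 / n, "NR": 1.0}
prediction, confidence = predict_with_confidence(p_values)
print(prediction, f"{confidence:.2%}")
```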

You can try this at home

Throw a die 100 times (or 1000 times if you are really bored) and write down the sequence of numbers. Calculate the strangeness of each throw as 1 - [the number of times its number appears in the sequence over the total number of throws]. Then try to predict the next throw: assume the value 1 and calculate its strangeness, then find its p-value. Assume the value 2 and calculate its strangeness, then find its p-value. Do this for all the possible values. Then find the largest p-value and the second largest p-value. Your prediction is the value which gives the highest p-value, but what is your confidence?
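If you would rather let the computer throw the die, here is a minimal Python sketch of the same exercise. It assumes that the hypothesised next throw is appended to the sequence before the frequencies are counted (the recipe above leaves this open), and it uses a seeded pseudo-random generator in place of a real die. The function name is mine.

```python
import random
from collections import Counter


def dice_p_values(throws, faces=range(1, 7)):
    """p-value of each candidate face for the next throw."""
    result = {}
    for candidate in faces:
        sequence = throws + [candidate]   # hypothesise the next throw
        counts = Counter(sequence)
        # Strangeness of a throw = 1 - relative frequency of its face.
        scores = [1 - counts[face] / len(sequence) for face in sequence]
        alpha_n = scores[-1]
        result[candidate] = sum(1 for a in scores if a >= alpha_n) / len(sequence)
    return result


rng = random.Random(0)   # fixed seed so the run can be repeated
throws = [rng.randint(1, 6) for _ in range(100)]

p_values = dice_p_values(throws)
ranked = sorted(p_values, key=p_values.get, reverse=True)
print("p-values:  ", p_values)
print("prediction:", ranked[0])
print("confidence:", 1 - p_values[ranked[1]])
```

For a fair die the faces appear about equally often, so expect the second largest p-value to be large and the confidence to be low, which is exactly what should happen when the next throw cannot really be predicted.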

Conclusion

In the Conformal Prediction framework, when we say that the machine has 95% confidence of being correct about its predictions or decisions, it is implied that the rate of wrong predictions will be bounded by 5%. In addition, if the machine is not certain about a single prediction, then it can provide a “predictive region” with alternative predictions in order to satisfy a given confidence level.
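A predictive region can be sketched as the set of labels whose p-value is larger than one minus the required confidence level. The p-values below are invented for illustration (a clear sky and a sky with a few clouds), and the function name is mine.

```python
def predictive_region(p_values, confidence_level):
    """Labels that cannot be ruled out at the given confidence level."""
    epsilon = 1 - confidence_level
    return {label for label, p in p_values.items() if p > epsilon}


# Hypothetical p-values for two different skies.
clear_sky = {"R": 0.0005, "NR": 1.0}
few_clouds = {"R": 0.12, "NR": 0.60}

print(predictive_region(clear_sky, 0.95))    # a single label: a certain prediction
print(predictive_region(few_clouds, 0.95))   # both labels: the machine is not certain
```

Asking for a higher confidence level can only make the region larger, never smaller.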

The ongoing research focuses on how to improve the confidence of the machine to the maximum level possible. We try different Machine Learning methods and we combine them to define non-conformity measures. We apply our method to medical diagnosis (e.g. using DNA data to predict cancer, using ultrasound images of carotid plaque to predict stroke, using MRI data to predict psychological disease, using X-ray data to predict osteoporosis, etc.). Another interesting application is to predict opponent hand ranges in poker, complemented with confidence and the fact that the error rate will be bounded (very important in poker if you want to make a long-term profit as the professional players do).

