Summer 09 Revision COMP10412 AI Part 2: Probabilistic Reasoning

In the previous post we talked about probability theory, and talked about agents having a probability distribution to represent their degrees of belief.  We also talked about the axioms K1 and K2, that is:

K1 – If a is always true, then p(a) = 1.
K2 – If a AND b are never true at the same time, then the probability of a OR b being true is p(a) + p(b).

These axioms become more important when we talk about the idea of a ‘dutch book’.

Dutch Books – Why Use Probability Distributions?

Lets consider an artificially intelligent agent walks into a betting shop.  No, this isn’t the start of some bad joke. In this betting shop, the agent will bet on propositions such as ‘Horse A will win the horse race’.  The tickets pay out £1 if the proposition is true, and no money if the proposition is not true.  But what price should the agent buy the ticket for?  This depends upon his degree of belief in the proposition. If the agent’s degree of belief, for example, is 0.6 that the horse will win, then he will be willing to pay 60p on the ticket.

But there’s two horse races going on tomorrow. Let ‘a’ and ‘b’ be the propositions ‘Horse A will win the first race’ and ‘Horse B will win the second race’.  Our agent has been programmed with the following degrees of belief:\

p(a) = 0.6
p(b) = 0.5
p(neither a nor b) = 0.25
p(not both a and b) = 0.7

Based on these beliefs, our agent takes out the following four bets:

Bet 1 pays £1 if horse ‘A’ wins the first race.  This ticket costs 60p.
Bet 2 pays £1 if horse ‘B’ wins the second race.  This ticket costs 50p.
Bet 3 pays £1 if horses ‘A’ and ‘B’ both lose their races.  This ticket costs 25p.
Bet 4 pays £1 as long as not both ‘A’ and ‘B’ win their races.  This ticket costs 70p.

So our agent has spent a total of 60+50+25+70 = £2.05

But there’s something weird about these bets.  Lets analyse it more closely:

If both horse A and horse B win, then our agent wins bets 1 and 2, but loses bets 3 and 4.  He wins £2.
If horse A wins and horse B loses, then our agent wins bets 1 and 4, but loses bets 2 and 3.  He wins £2.
If horse A loses and horse B wins, then our agent wins bets 2 and 4, but loses bets 1 and 3.  He wins £2.
If both horse A and horse B lose, then our agent wins bets 3 and 4, but loses bets 1 and 2.  He wins £2.

In any case, he wins £2.  But that can’t be right, he spent £2.05 on the ticket, so no matter what the outcome is, he’s lose money!

We call this situation a ‘Dutch Book’. And the reason that a Dutch Book was able to be made against the agent is all because of his initial degrees of belief! They violate the axioms of probability theory, K1 and K2, in some way.

In general, if an agents degrees of belief violate K1 and K2, then a Dutch Book can be made against it.  If an agents degrees of belief conform to K1 and K2, then a Dutch Book is impossible.  This example of Dutch Books shows us why we should represent our degrees of belief as probability distributions.

Why Conditionalise Probability Distributions?

Alright, lets take another example.  Let ‘a’ be the proposition that horse A wins the first race, and ‘b’ be the proposition that horse B wins the second race.  The ticket now says:

You get a payout of £1 if both horse A and horse B win.  You get no payout if horse A loses and horse B wins.  You get your money back if horse B loses.

How much would you spend on that ticket? This number is your conditional degree of belief in ‘a’ given ‘b’.

The proposal is that if an agents degrees of belief are given by a probability distribution, p, then its conditional degree of belief in ‘a’ given ‘b’ should be given by p(a|b) when p(b) is not 0.

Now lets consider the following scenario.  ‘a’ and ‘b’ are propositions as before.  p is our initial probability distribution, and pa is the probability distribution after the first race but before the second race, if a has won.

The agent buys the following tickets:

Bet 1: £1 if a and b both win.  0 if a wins but b loses.  Money back if a loses.  This ticket costs £p(b|a).
Bet 2: £x if a wins, otherwise nothing.  This ticket costs £(x * p(a))

After the first race, if a has won then the agent has already won £x, and has the following bet that used to be Bet 1:

£1 if b wins, otherwise nothing.

The agent then sells this ticket for £pa(b)

Now the payouts for the agent are as follows:

If a has won, then the payout will be his winnings on bet 2 (x*(1-p(a))) [ie. the winnings (£x) minus the amount spent on the ticket (x*p(a)], minus the amount he spent on bet 1 (p(a|b)), plus the amount he sold bet 1 for after race 1 (pa(b)).  So his total winnings are… x – xp(a) + pa(b) – p(b|a).

If a loses, then he has lost bet 2 and got his money back on bet 1.  So his total losses are the amount that he spent on bet 2, which was x * p(a).

If a loses, then his payoff is going to be negative, ie. he’s made a loss, as long as x is greater than 0 (Which it probably will be).

If a wins, then his payoff is going to negative if:

x – xp(a) + pa(b) – p(b|a) < 0
x (1 – p(a)) + pa(b) – p(b|a) < 0
x (1 – p(a)) < p(b|a) – pa(b)
x < (p(b|a) – pa(b)) / (1 – p(a))

So if we choose a value of ‘x’ between 0 and (p(b|a) – pa(b)) / (1 – p(a)), then we have made a dutch book on this agent, because no matter whether a wins or loses, he will always have a negative payoff.

Notice the numerator: (p(b|a) – pa(b)).  If the agents degrees of belief DO conform to axioms K1 and K2, then these two terms will be equal to each other and it will read x<0.  So in order to create a dutch book, a value of x must be chosen that’s greater than 0 and less than 0.  Clearly, this isn’t possible.

In this scenario, a dutch book can be avoided by making the agents degree of belief in p(b|a) equal to pa(b). This shows why we need to conditionalise, because if we don’t, then we’re going to end up in a dutch book situation.

The Monty Hall Paradox

This is a famous paradox, that shows how you have to be careful with Bayesian Updating.

An agent is on a TV competition.  At the end of the show, the host, Monty Hall, shows you over to three doors.  Behind one of the doors is a car, and behind the other two are goats.  Clearly, the aim is to pick the door with the car.  The doors are labelled A, B, and C.

You are asked to select a door, and the prize you get is behind the door you choose.  But after you’ve chosen a door, Monty goes and opens one of the other doors.  Behind that door is a goat.  Then he gives you the opportunity to either switch your selection, or stick with your selection.  The question: Do you stick, or switch?

To answer this, lets consider the three propositions, ‘A’, ‘B’, and ‘C’, which represent the propositions that the car is behind the respective door.

The probability that the car is behind A and not B is equal to the probability that the car is behind door A, obviously, which is equal to 1/3.  The probability that the car is NOT behind door B is 1 – p(B), which, obviously, is 2/3, because it could be behind A or C, each with a 1/3 probability.

So we chose door A to begin with, then Monty went and opened door B.  We now have the information that it is NOT B.  We can use Bayes Theorem to update our probability distribution as such:

p(A|~B) = p(A AND NOT B) / p(NOT B) = (1/3) / (2/3) = 1/2.  So this is telling us that the probability that it is behind door A now is 1/2, and the probability that it is behind door C now is 1/2.

But… wait.  If we run a simulation, with the computer randomly switching or sticking, then we find that the switchers win about 2/3 of the time, and the stickers win about 1/3 of the time… Why is this?!

The solution is to get the right representation.  Lets say that K~B is the proposition ‘I KNOW the car is not behind door B’.

So, we start out, we choose A.  Now, we can work out p(K~B|A), ie. the probability that door B will be revealed to not have the car behind it, given that the car IS behind door A.  If the car is behind door A, Monty will either open door B or door C.  So the probability that we will KNOW that the car is not behind door B is 1/2.

If, on the other hand, the car ISN’T behind door A, then it will either be behind door B or door C.  Monty will open one of these two doors.  So again, p(K~B|~A), ie. the probability that door B will be revealed to now have the car behind it, given that the car is NOT behind door A, will be 1/2.

From these two pieces of data, we can work out, using Bayes Theorem, the value of p(A|K~B), as follows:

p(A|K~B) = (p(K~B|A)p(A) / (p(K~B|A)p(A) + p(K~B|~A)(1-p(A)))
p(A|K~B) = (1/2 * p(A)) / (1/2 * p(A) + 1/2 * (1 – p(A)))

Obviously p(A) will be 1/3, because it could be behind any door.

p(A|K~B) = (1/2 * 1/3) / (1/2 * 1/3 + 1/2 * 2/3)
p(A|K~B) = (1/6) / (3/6)
p(A|K~B) = 1/3

So the probability of the car being behind door A, given that we know that the car is NOT behind door B, is 1/3.  This means, because there is a partition, and we know the probability of it being behind door B is 0, then the probability of the car being behind door C must be 2/3.

If you want to know more about the Monty Hall problem, check the wikipedia page for it out here.  It’s quite cool.  And there’s some proofs on there a LOOOT more convincing than mine.

Summer 09 Revision COMP10412 AI Part 1: Robot Localisation

Artificial Intelligence is the field of computer science that concerns itself with getting better and more positive responses to the question ‘Can machines think?’.  Over the years there have been a number of approaches to artificial intelligence, the current best method that is growing in popularity is the use of probability theory in AI, and this probabilistic reasoning is spreading to almost all areas of Artificial Intelligence.

Basics of Probability Theory

A probability distribution is a function that takes a proposition and gives a probability in the range of 0 and 1.  For example, say you have a proposition that “If I toss a coin, it will come up tails”.  If you put this in the probability distribution, you should get an output of 0.5.  Or you could use a two-headed coin, which would have a different probability distribution that would give an output of 0.

There are two basic facts about probability distributions which make it a probability distribution, given propositions ‘a’ and ‘b’.

K1 – If a is always true, then p(a) = 1.
K2 – If a AND b are never true at the same time, then the probability of a OR b being true is p(a) + p(b).

p(a) can represent an agents degree of belief that a is true.  If p(a) = 1, then the agent believes a is always true.  If p(a) = 0, then the agent believes a is always false.

K1 and K2 are intuitively reasonable and can allow us to prove many more things about probabilities, such as:

The probability of a not happening is 1 – p(a).
If b is always true when a is true, then p(a) is less than or equal to p(b).
If a and b are always either true or false at the same time, then p(a) = p(b).
The probability of a OR b being true equals p(a) + p(b) – p(a AND b).

Partitions

Say you have a number of propositions, a1, a2, a3, a4, etc.

These propositions are ‘mutually exclusive’ if no two of the propositions is ever true at the same time.
These propositions are ‘jointly exhaustive’ if the sum of p(a1) and p(a2) and p(a3) etc. is 1.
These propositions form a ‘partition’ if they are both mutually exclusive and jointly exhaustive.

An example of a partition could be the propositions ‘The coin will land on heads’ and ‘The coin will land on tails’.  p(heads) = 0.5 and p(tails) = 0.5.  They can never be true at the same time so they are mutually exclusive, and they add to 1 so they are jointly exhaustive.  Therefore, these two propositions form a partition.

Conditional Probability

Conditional probability is when you have two propositions, and know that one is true, and want to know the probability of the other GIVEN that the other is true.

p(a|b) can be read as ‘the probability of a given that b is known to be true’.  It can be worked out with the formula p(a|b) = p(a AND b) / p(b).

Another way of thinking about it is that the proposition (a|b) is the degree of belief that an agent would have in a if and when it got the information that b was certainly true.  Some basic facts that follow from this are the following:

p(a|b) is always between 0 and 1, because it is a proposition.
If p(a) is 0, then p(a|b) = 0 as well, because it doesn’t matter whether b is true, a will never be true.
If b is true whenever a is true, then p(a|b) = 1, because if b is known to be true then a must be true.
If b and c are always true or false at the same time, then p(a|b) = p(a|c).

Conditionalising

So we have the probability distribution p, and we can use that to find the probability of something being true.  But we’re going to define another probability distribution, pb, to be the probability distribution assuming that b is true (given that p(b) is greater than 0).  So basically, pb(a) = p(a|b).  This is the probability distribution obtained by ‘conditionalising’ on b.  Whenever an agent receives a piece of information, it conditionalises it and works out a new probability distribution in this way.

Note that it’s also true that (pc)b(a) = pb AND c(a) = (pb)c(a), so it doesn’t matter in what order we conditionalise the information, if an agent receives more than one piece of information at a time.  The fundamentals of probabilistic reasoning in AI is that an agent keeps receiving data and keeps working out a new probability distribution based on the previous probability distribution and the information that it has received.

Bayes Theorem

The next theorem is probably the most important, but most simple, equation in probability theory.

p(b|a) = p(a|b)p(b) / p(a).

Or in its alternative, often more useful, form:

p(bi|a) = p(a|bi)p(bi) / ( p(a|b1)p(b1) + p(a|b2)p(b2) + … + p(a|bn)p(bn) )

This is typically applied when a is an observation statement and b1, b2, …, bn are all different hypotheses that can explain that observation.  We’ll see what this means in practical terms in the next section on robot localisation.  For now, just know that this is a very useful theorem.

Robot Localisation

Say we have a robot, in a room.  The room has obstacles in.  The robot has a map in its memory that tells it where the obstacles are, but the robot doesn’t know where it is in the room.  The robot has three sensors on the front, left, and right sides.  It can move forward, or it can turn left and right.

First lets talk about the surroundings.  Lets make it a coordinate system, so the robot is in a square room, from (0,0) in one corner to (99,99) in the opposite corner.  In addition to this, we have to know which direction the robot is facing, so we’ll also split up the direction into the range (0,99), where 0 is him facing one way, increasing as he turns more clockwise.

So lets let lijt be the proposition that ‘the robot is currently in location (i,j) facing in direction t’.

The collection of propositions l0,0,0, l0,0,1, l0,0,2, …, l99,99,98, l99,99,99 form a partition.  The robot can be in any one of 100×100x100 = 1,000,000 poses, but it must be in one and only one.  So these propositions all add up to 1.

We can always get less accurate information, say, li,j which represents the probability of it being in position (i,j) regardless of the direction it is facing is equal to the total sum of li,j,0, li,j,1, …, li,j,99.

Initial State of Belief

So now we know how the robot will represent its belief of what pose it is in, we now consider what the initial data should be.  For simplicity, we assume that the robot has no idea where it is in the room.  It could be in any pose, with equal probability.  However, the robot knows that there are objects in the room and it knows where the objects in the room are.  So it knows it cannot be on top of an object, obviously.  So it needs to be set up with its initial beliefs such that all positions where it knows there is an object must have a probability of 0 of it being there.  The poses that represent it NOT being on an object must all have the same probability and form a partition.

This might seem a bit complicated, but consider it smaller scale, say a room that’s only 2×2. If there are no objects, the probability of it being in any coordinate is 1/4 because there are 4 coordinates.  If there is something blocking the coordinate (1,1) then clearly it can’t  be on (1,1), and there are 3 coordinates that it can be on, so each will have a probability of 1/3.  It’s basically the same as this but on a 100×100 grid, and taking direction into account as well.  Some coordinates have a probability of 0, but all the rest have the same probability and form a partition.

Sensor Readings

So we’ve set the initial state, now we need to consider how we’re going to update the robot’s belief state so that when it gets information, it updates the probability of where it is accordingly.  We’ve already said that the robot has 3 sensors, one on the front, right, and left.  So lets say that ‘o’ represents the proposition that a certain sensor has reported a distance.  For example, ‘Left sensor reports a distance of 32 to nearest object or wall’.

But sensors are not accurate, they are noisy.  So we need to model the values that the sensors return.  We’ll say that they are representative of a normal distribution, and the sensor returns an integer between 0 and 2x (where x is the true distance), with mean x and standard deviation of about x/3.

Now we’ve been given an observation, o, we need to update our probability distribution, p, so that it takes o into account, ie. we need to work out the probability distribution po.  Remember that po(li,j,t) = p(li,j,t|o). So we need to work this value out for every single value of i, j, and t, and update the 100×100x100 matrix storing our degrees of belief of where we are.

Using the alternative form of Bayes Theorem, we can see that:

p(li,j,t|o) = p(o|li,j,t)p(li,j,t) / ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’)

And we have to work this value out for every single one of our 1,000,000 poses.  Which is a pretty ugly formula, but it makes sense.  Lets go through it term by term, starting with the top.  The top consists of p(o|li,j,t)p(li,j,t).

p(li,j,t) is the robots current belief that it is in pose i,j,t, so we can simply extract that value from the 100×100x100 matrix for each value.

p(o|li,j,t) is the probability of observation o, if we assume that the robot is currently in pose i,j,t.  This is tricky and depends on the accuracy of the sensors, but we now have a model for this.  Say, for example, that pose i,j,t is in the middle of the room facing a wall 50 squares away, and the forward sensor returns a value of 2, then this value will be very low because the probability of the sensor returning a value of 2 given that the robot is 50 squares away from the wall is very very low, on a normal distribution.

The bottom is the sum of, for all values of i,j,t, the above two values.  So first you would loop through the matrix and work out p(o|li,j,t)p(li,j,t) for every possible pose i,j,t.  Then you would sum up these numbers, and that would be the bottom of the equation every time.  ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’) is a constant for each observation o.

Robot Actions

But the robot is not static, the robot can move forward, and turn left and right, as we’ve seen.  Lets use ‘a’ to represent the proposition of the robot performing an action, such as ‘Move forward 10 squares’.  It may not be as simple as that.  For example, what if the floor is wet, or the motors are running out of power, or the robot runs into a wall.  It may run less than or greater than the 10 squares it’s supposed to.  We can model this with another normal distribution.

So if we have li,j,t representing the robots belief that it is at position (i,j,t), we can use l`i,j,t to represent the proposition “After the action, the robots new pose will be (i,j,t)”.

The engineer programming the robot should be able to estimate the probabilities p(l`i,j,t|a AND li’,j’,t’).

To update the matrix in response to an action, we can use the following formula:

p(l`i,j,t|a) = ∑i’,j’,t’ p(l`i,j,t|li’,j’,t’ AND a)p(li’,j’,t’).

Where p(l`i,j,t|a) means the probability of the robots pose being (i,j,t) given that the action a has happened, p(li’,j’,t’) is the probability of the robot being in post (i’,j’,t’), and p(l`i,j,t|li’,j’,t’ AND a) is the probability of the robot being in pose (i,j,t) given that action a has happened AND the robot used to be in position (i’,j’,t’).

Summary

So to sum up,the robot should start with an equal belief that it could be in any position.

The robot has a 3 dimensional matrix, 100×100x100, storing its x-coordinate, y-coordinate, and direction.

Sensor readings and actions are not static, they may be less or more than what it should be.

The robots belief matrix is updated every time there is an observation or the robot follows some action.

To update on an observation this formula should be followed:

p(li,j,t|o) = p(o|li,j,t)p(li,j,t) / ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’)

And to update on an action this formula should be followed:

p(l`i,j,t|a) = ∑i’,j’,t’ p(l`i,j,t|li’,j’,t’ AND a)p(li’,j’,t’).