Summer 09 Revision COMP10412 AI Part 2: Probabilistic Reasoning

In the previous post we talked about probability theory, and talked about agents having a probability distribution to represent their degrees of belief.  We also talked about the axioms K1 and K2, that is:

K1 – If a is always true, then p(a) = 1.
K2 – If a AND b are never true at the same time, then the probability of a OR b being true is p(a) + p(b).

These axioms become more important when we talk about the idea of a ‘dutch book’.

Dutch Books – Why Use Probability Distributions?

Lets consider an artificially intelligent agent walks into a betting shop.  No, this isn’t the start of some bad joke. In this betting shop, the agent will bet on propositions such as ‘Horse A will win the horse race’.  The tickets pay out £1 if the proposition is true, and no money if the proposition is not true.  But what price should the agent buy the ticket for?  This depends upon his degree of belief in the proposition. If the agent’s degree of belief, for example, is 0.6 that the horse will win, then he will be willing to pay 60p on the ticket.

But there’s two horse races going on tomorrow. Let ‘a’ and ‘b’ be the propositions ‘Horse A will win the first race’ and ‘Horse B will win the second race’.  Our agent has been programmed with the following degrees of belief:\

p(a) = 0.6
p(b) = 0.5
p(neither a nor b) = 0.25
p(not both a and b) = 0.7

Based on these beliefs, our agent takes out the following four bets:

Bet 1 pays £1 if horse ‘A’ wins the first race.  This ticket costs 60p.
Bet 2 pays £1 if horse ‘B’ wins the second race.  This ticket costs 50p.
Bet 3 pays £1 if horses ‘A’ and ‘B’ both lose their races.  This ticket costs 25p.
Bet 4 pays £1 as long as not both ‘A’ and ‘B’ win their races.  This ticket costs 70p.

So our agent has spent a total of 60+50+25+70 = £2.05

But there’s something weird about these bets.  Lets analyse it more closely:

If both horse A and horse B win, then our agent wins bets 1 and 2, but loses bets 3 and 4.  He wins £2.
If horse A wins and horse B loses, then our agent wins bets 1 and 4, but loses bets 2 and 3.  He wins £2.
If horse A loses and horse B wins, then our agent wins bets 2 and 4, but loses bets 1 and 3.  He wins £2.
If both horse A and horse B lose, then our agent wins bets 3 and 4, but loses bets 1 and 2.  He wins £2.

In any case, he wins £2.  But that can’t be right, he spent £2.05 on the ticket, so no matter what the outcome is, he’s lose money!

We call this situation a ‘Dutch Book’. And the reason that a Dutch Book was able to be made against the agent is all because of his initial degrees of belief! They violate the axioms of probability theory, K1 and K2, in some way.

In general, if an agents degrees of belief violate K1 and K2, then a Dutch Book can be made against it.  If an agents degrees of belief conform to K1 and K2, then a Dutch Book is impossible.  This example of Dutch Books shows us why we should represent our degrees of belief as probability distributions.

Why Conditionalise Probability Distributions?

Alright, lets take another example.  Let ‘a’ be the proposition that horse A wins the first race, and ‘b’ be the proposition that horse B wins the second race.  The ticket now says:

You get a payout of £1 if both horse A and horse B win.  You get no payout if horse A loses and horse B wins.  You get your money back if horse B loses.

How much would you spend on that ticket? This number is your conditional degree of belief in ‘a’ given ‘b’.

The proposal is that if an agents degrees of belief are given by a probability distribution, p, then its conditional degree of belief in ‘a’ given ‘b’ should be given by p(a|b) when p(b) is not 0.

Now lets consider the following scenario.  ‘a’ and ‘b’ are propositions as before.  p is our initial probability distribution, and pa is the probability distribution after the first race but before the second race, if a has won.

The agent buys the following tickets:

Bet 1: £1 if a and b both win.  0 if a wins but b loses.  Money back if a loses.  This ticket costs £p(b|a).
Bet 2: £x if a wins, otherwise nothing.  This ticket costs £(x * p(a))

After the first race, if a has won then the agent has already won £x, and has the following bet that used to be Bet 1:

£1 if b wins, otherwise nothing.

The agent then sells this ticket for £pa(b)

Now the payouts for the agent are as follows:

If a has won, then the payout will be his winnings on bet 2 (x*(1-p(a))) [ie. the winnings (£x) minus the amount spent on the ticket (x*p(a)], minus the amount he spent on bet 1 (p(a|b)), plus the amount he sold bet 1 for after race 1 (pa(b)).  So his total winnings are… x – xp(a) + pa(b) – p(b|a).

If a loses, then he has lost bet 2 and got his money back on bet 1.  So his total losses are the amount that he spent on bet 2, which was x * p(a).

If a loses, then his payoff is going to be negative, ie. he’s made a loss, as long as x is greater than 0 (Which it probably will be).

If a wins, then his payoff is going to negative if:

x – xp(a) + pa(b) – p(b|a) < 0
x (1 – p(a)) + pa(b) – p(b|a) < 0
x (1 – p(a)) < p(b|a) – pa(b)
x < (p(b|a) – pa(b)) / (1 – p(a))

So if we choose a value of ‘x’ between 0 and (p(b|a) – pa(b)) / (1 – p(a)), then we have made a dutch book on this agent, because no matter whether a wins or loses, he will always have a negative payoff.

Notice the numerator: (p(b|a) – pa(b)).  If the agents degrees of belief DO conform to axioms K1 and K2, then these two terms will be equal to each other and it will read x<0.  So in order to create a dutch book, a value of x must be chosen that’s greater than 0 and less than 0.  Clearly, this isn’t possible.

In this scenario, a dutch book can be avoided by making the agents degree of belief in p(b|a) equal to pa(b). This shows why we need to conditionalise, because if we don’t, then we’re going to end up in a dutch book situation.

The Monty Hall Paradox

This is a famous paradox, that shows how you have to be careful with Bayesian Updating.

An agent is on a TV competition.  At the end of the show, the host, Monty Hall, shows you over to three doors.  Behind one of the doors is a car, and behind the other two are goats.  Clearly, the aim is to pick the door with the car.  The doors are labelled A, B, and C.

You are asked to select a door, and the prize you get is behind the door you choose.  But after you’ve chosen a door, Monty goes and opens one of the other doors.  Behind that door is a goat.  Then he gives you the opportunity to either switch your selection, or stick with your selection.  The question: Do you stick, or switch?

To answer this, lets consider the three propositions, ‘A’, ‘B’, and ‘C’, which represent the propositions that the car is behind the respective door.

The probability that the car is behind A and not B is equal to the probability that the car is behind door A, obviously, which is equal to 1/3.  The probability that the car is NOT behind door B is 1 – p(B), which, obviously, is 2/3, because it could be behind A or C, each with a 1/3 probability.

So we chose door A to begin with, then Monty went and opened door B.  We now have the information that it is NOT B.  We can use Bayes Theorem to update our probability distribution as such:

p(A|~B) = p(A AND NOT B) / p(NOT B) = (1/3) / (2/3) = 1/2.  So this is telling us that the probability that it is behind door A now is 1/2, and the probability that it is behind door C now is 1/2.

But… wait.  If we run a simulation, with the computer randomly switching or sticking, then we find that the switchers win about 2/3 of the time, and the stickers win about 1/3 of the time… Why is this?!

The solution is to get the right representation.  Lets say that K~B is the proposition ‘I KNOW the car is not behind door B’.

So, we start out, we choose A.  Now, we can work out p(K~B|A), ie. the probability that door B will be revealed to not have the car behind it, given that the car IS behind door A.  If the car is behind door A, Monty will either open door B or door C.  So the probability that we will KNOW that the car is not behind door B is 1/2.

If, on the other hand, the car ISN’T behind door A, then it will either be behind door B or door C.  Monty will open one of these two doors.  So again, p(K~B|~A), ie. the probability that door B will be revealed to now have the car behind it, given that the car is NOT behind door A, will be 1/2.

From these two pieces of data, we can work out, using Bayes Theorem, the value of p(A|K~B), as follows:

p(A|K~B) = (p(K~B|A)p(A) / (p(K~B|A)p(A) + p(K~B|~A)(1-p(A)))
p(A|K~B) = (1/2 * p(A)) / (1/2 * p(A) + 1/2 * (1 – p(A)))

Obviously p(A) will be 1/3, because it could be behind any door.

p(A|K~B) = (1/2 * 1/3) / (1/2 * 1/3 + 1/2 * 2/3)
p(A|K~B) = (1/6) / (3/6)
p(A|K~B) = 1/3

So the probability of the car being behind door A, given that we know that the car is NOT behind door B, is 1/3.  This means, because there is a partition, and we know the probability of it being behind door B is 0, then the probability of the car being behind door C must be 2/3.

If you want to know more about the Monty Hall problem, check the wikipedia page for it out here.  It’s quite cool.  And there’s some proofs on there a LOOOT more convincing than mine.

Summer 09 Revision COMP10412 AI Part 1: Robot Localisation

Artificial Intelligence is the field of computer science that concerns itself with getting better and more positive responses to the question ‘Can machines think?’.  Over the years there have been a number of approaches to artificial intelligence, the current best method that is growing in popularity is the use of probability theory in AI, and this probabilistic reasoning is spreading to almost all areas of Artificial Intelligence.

Basics of Probability Theory

A probability distribution is a function that takes a proposition and gives a probability in the range of 0 and 1.  For example, say you have a proposition that “If I toss a coin, it will come up tails”.  If you put this in the probability distribution, you should get an output of 0.5.  Or you could use a two-headed coin, which would have a different probability distribution that would give an output of 0.

There are two basic facts about probability distributions which make it a probability distribution, given propositions ‘a’ and ‘b’.

K1 – If a is always true, then p(a) = 1.
K2 – If a AND b are never true at the same time, then the probability of a OR b being true is p(a) + p(b).

p(a) can represent an agents degree of belief that a is true.  If p(a) = 1, then the agent believes a is always true.  If p(a) = 0, then the agent believes a is always false.

K1 and K2 are intuitively reasonable and can allow us to prove many more things about probabilities, such as:

The probability of a not happening is 1 – p(a).
If b is always true when a is true, then p(a) is less than or equal to p(b).
If a and b are always either true or false at the same time, then p(a) = p(b).
The probability of a OR b being true equals p(a) + p(b) – p(a AND b).

Partitions

Say you have a number of propositions, a1, a2, a3, a4, etc.

These propositions are ‘mutually exclusive’ if no two of the propositions is ever true at the same time.
These propositions are ‘jointly exhaustive’ if the sum of p(a1) and p(a2) and p(a3) etc. is 1.
These propositions form a ‘partition’ if they are both mutually exclusive and jointly exhaustive.

An example of a partition could be the propositions ‘The coin will land on heads’ and ‘The coin will land on tails’.  p(heads) = 0.5 and p(tails) = 0.5.  They can never be true at the same time so they are mutually exclusive, and they add to 1 so they are jointly exhaustive.  Therefore, these two propositions form a partition.

Conditional Probability

Conditional probability is when you have two propositions, and know that one is true, and want to know the probability of the other GIVEN that the other is true.

p(a|b) can be read as ‘the probability of a given that b is known to be true’.  It can be worked out with the formula p(a|b) = p(a AND b) / p(b).

Another way of thinking about it is that the proposition (a|b) is the degree of belief that an agent would have in a if and when it got the information that b was certainly true.  Some basic facts that follow from this are the following:

p(a|b) is always between 0 and 1, because it is a proposition.
If p(a) is 0, then p(a|b) = 0 as well, because it doesn’t matter whether b is true, a will never be true.
If b is true whenever a is true, then p(a|b) = 1, because if b is known to be true then a must be true.
If b and c are always true or false at the same time, then p(a|b) = p(a|c).

Conditionalising

So we have the probability distribution p, and we can use that to find the probability of something being true.  But we’re going to define another probability distribution, pb, to be the probability distribution assuming that b is true (given that p(b) is greater than 0).  So basically, pb(a) = p(a|b).  This is the probability distribution obtained by ‘conditionalising’ on b.  Whenever an agent receives a piece of information, it conditionalises it and works out a new probability distribution in this way.

Note that it’s also true that (pc)b(a) = pb AND c(a) = (pb)c(a), so it doesn’t matter in what order we conditionalise the information, if an agent receives more than one piece of information at a time.  The fundamentals of probabilistic reasoning in AI is that an agent keeps receiving data and keeps working out a new probability distribution based on the previous probability distribution and the information that it has received.

Bayes Theorem

The next theorem is probably the most important, but most simple, equation in probability theory.

p(b|a) = p(a|b)p(b) / p(a).

Or in its alternative, often more useful, form:

p(bi|a) = p(a|bi)p(bi) / ( p(a|b1)p(b1) + p(a|b2)p(b2) + … + p(a|bn)p(bn) )

This is typically applied when a is an observation statement and b1, b2, …, bn are all different hypotheses that can explain that observation.  We’ll see what this means in practical terms in the next section on robot localisation.  For now, just know that this is a very useful theorem.

Robot Localisation

Say we have a robot, in a room.  The room has obstacles in.  The robot has a map in its memory that tells it where the obstacles are, but the robot doesn’t know where it is in the room.  The robot has three sensors on the front, left, and right sides.  It can move forward, or it can turn left and right.

First lets talk about the surroundings.  Lets make it a coordinate system, so the robot is in a square room, from (0,0) in one corner to (99,99) in the opposite corner.  In addition to this, we have to know which direction the robot is facing, so we’ll also split up the direction into the range (0,99), where 0 is him facing one way, increasing as he turns more clockwise.

So lets let lijt be the proposition that ‘the robot is currently in location (i,j) facing in direction t’.

The collection of propositions l0,0,0, l0,0,1, l0,0,2, …, l99,99,98, l99,99,99 form a partition.  The robot can be in any one of 100×100x100 = 1,000,000 poses, but it must be in one and only one.  So these propositions all add up to 1.

We can always get less accurate information, say, li,j which represents the probability of it being in position (i,j) regardless of the direction it is facing is equal to the total sum of li,j,0, li,j,1, …, li,j,99.

Initial State of Belief

So now we know how the robot will represent its belief of what pose it is in, we now consider what the initial data should be.  For simplicity, we assume that the robot has no idea where it is in the room.  It could be in any pose, with equal probability.  However, the robot knows that there are objects in the room and it knows where the objects in the room are.  So it knows it cannot be on top of an object, obviously.  So it needs to be set up with its initial beliefs such that all positions where it knows there is an object must have a probability of 0 of it being there.  The poses that represent it NOT being on an object must all have the same probability and form a partition.

This might seem a bit complicated, but consider it smaller scale, say a room that’s only 2×2. If there are no objects, the probability of it being in any coordinate is 1/4 because there are 4 coordinates.  If there is something blocking the coordinate (1,1) then clearly it can’t  be on (1,1), and there are 3 coordinates that it can be on, so each will have a probability of 1/3.  It’s basically the same as this but on a 100×100 grid, and taking direction into account as well.  Some coordinates have a probability of 0, but all the rest have the same probability and form a partition.

Sensor Readings

So we’ve set the initial state, now we need to consider how we’re going to update the robot’s belief state so that when it gets information, it updates the probability of where it is accordingly.  We’ve already said that the robot has 3 sensors, one on the front, right, and left.  So lets say that ‘o’ represents the proposition that a certain sensor has reported a distance.  For example, ‘Left sensor reports a distance of 32 to nearest object or wall’.

But sensors are not accurate, they are noisy.  So we need to model the values that the sensors return.  We’ll say that they are representative of a normal distribution, and the sensor returns an integer between 0 and 2x (where x is the true distance), with mean x and standard deviation of about x/3.

Now we’ve been given an observation, o, we need to update our probability distribution, p, so that it takes o into account, ie. we need to work out the probability distribution po.  Remember that po(li,j,t) = p(li,j,t|o). So we need to work this value out for every single value of i, j, and t, and update the 100×100x100 matrix storing our degrees of belief of where we are.

Using the alternative form of Bayes Theorem, we can see that:

p(li,j,t|o) = p(o|li,j,t)p(li,j,t) / ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’)

And we have to work this value out for every single one of our 1,000,000 poses.  Which is a pretty ugly formula, but it makes sense.  Lets go through it term by term, starting with the top.  The top consists of p(o|li,j,t)p(li,j,t).

p(li,j,t) is the robots current belief that it is in pose i,j,t, so we can simply extract that value from the 100×100x100 matrix for each value.

p(o|li,j,t) is the probability of observation o, if we assume that the robot is currently in pose i,j,t.  This is tricky and depends on the accuracy of the sensors, but we now have a model for this.  Say, for example, that pose i,j,t is in the middle of the room facing a wall 50 squares away, and the forward sensor returns a value of 2, then this value will be very low because the probability of the sensor returning a value of 2 given that the robot is 50 squares away from the wall is very very low, on a normal distribution.

The bottom is the sum of, for all values of i,j,t, the above two values.  So first you would loop through the matrix and work out p(o|li,j,t)p(li,j,t) for every possible pose i,j,t.  Then you would sum up these numbers, and that would be the bottom of the equation every time.  ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’) is a constant for each observation o.

Robot Actions

But the robot is not static, the robot can move forward, and turn left and right, as we’ve seen.  Lets use ‘a’ to represent the proposition of the robot performing an action, such as ‘Move forward 10 squares’.  It may not be as simple as that.  For example, what if the floor is wet, or the motors are running out of power, or the robot runs into a wall.  It may run less than or greater than the 10 squares it’s supposed to.  We can model this with another normal distribution.

So if we have li,j,t representing the robots belief that it is at position (i,j,t), we can use l`i,j,t to represent the proposition “After the action, the robots new pose will be (i,j,t)”.

The engineer programming the robot should be able to estimate the probabilities p(l`i,j,t|a AND li’,j’,t’).

To update the matrix in response to an action, we can use the following formula:

p(l`i,j,t|a) = ∑i’,j’,t’ p(l`i,j,t|li’,j’,t’ AND a)p(li’,j’,t’).

Where p(l`i,j,t|a) means the probability of the robots pose being (i,j,t) given that the action a has happened, p(li’,j’,t’) is the probability of the robot being in post (i’,j’,t’), and p(l`i,j,t|li’,j’,t’ AND a) is the probability of the robot being in pose (i,j,t) given that action a has happened AND the robot used to be in position (i’,j’,t’).

Summary

So to sum up,the robot should start with an equal belief that it could be in any position.

The robot has a 3 dimensional matrix, 100×100x100, storing its x-coordinate, y-coordinate, and direction.

Sensor readings and actions are not static, they may be less or more than what it should be.

The robots belief matrix is updated every time there is an observation or the robot follows some action.

To update on an observation this formula should be followed:

p(li,j,t|o) = p(o|li,j,t)p(li,j,t) / ∑i’,j’,t’ p(o|li’,j’,t’)p(li’,j’,t’)

And to update on an action this formula should be followed:

p(l`i,j,t|a) = ∑i’,j’,t’ p(l`i,j,t|li’,j’,t’ AND a)p(li’,j’,t’).

Summer 09 COMP10042 Computation 1: Patterns

In the last blog post we looked at what the various components in the module, mostly languages, symbols, and alphabets.  Now we’re going to look in more detail at how to define a language from symbols.

To define languages more easily we’re going to look at patterns.  Patterns can also be called regular expressions.  Computers, after all, can’t recognise set notation symbols such as {, }, or ∈.

The Kleene Star

So lets say we want the computer to recognise all instances of 01 repeated, so it would match 01, 0101, 010101, 01010101, 0101010101, etc.  We notice that we need to concatenate 01 an arbitrary number of times.  We can do this with (01)*.  This is a pattern.  The * is called the Kleene Star and will apply an arbitrary number of concatentations to whatever precedes it, in this case 01.  Now the computer just has to compare: Is the first string equal to 0?  Is the next equal to 1?  And so on until the end of the string.

Other examples of patterns could be 1*0, which is the language of all strings consisting of any number of 1s, followed by a 0.  Or there may be two Kleene Stars, such as the pattern 1*0*.  This is any number of 1s followed by any number of 0s.

Alternative

So we’ve looked at using the Kleene Star to create languages.  But that’s not enough for some patterns.  For example, what if we wanted the language of all strings consisting of just 0s and just 1s, ie. the language {0, 00, 000, 1, 11, 111, 1111, 0000, …}

We can also use the alternative, |, to represent a choice.  For example, 0|1 is the language consisting of the string 0 and the string 1.  The language 0*|1* represents the language described above – All strings that are either an arbitrary number of 0s or an arbitrary number of 1s.

Building up Patterns

We can define patterns recursively in the following way:

Base cases:

The character ø is the empty pattern.
The character ϵ is a pattern consisting of the empty word.
Every letter from a specified alphabet ∑ is a pattern.

Recursive cases:

If p and q are both patterns, then so is (pq).  This is concatenation.
If p and q are both patterns, then so is (p|q).  This is the alternative.
If p is a pattern, then so is (p*).  This is the kleene star.

Matching words with Patterns

In order to match a word with a pattern, we can do it recursively, as follows:

Base cases:

The empty word ϵ matches the pattern ϵ.
The pattern p is a single character and s is that character.

Recursive cases:

The pattern p is a concatenation p=(p1p2) and there are words s1 and s2 such that s1 matches p1, s2 matches p2, and s = s1s2.
The pattern p is an alternative p=(p1|p2) and s matches p1 or p2, or both.
The pattern p is of the form p = (q*) and s can be written as s = s1s2s3…sn such that s1, s2, s3, …, sn all match q.

(Note that no word matches the empty pattern ø).

Languages of Patterns

Given a pattern, p, we can refer to the language of all words that match that pattern as L(p).  Or, again, we can define the language recursively, as follows:

Base cases:

L(ø) = ø.
L(ϵ) = {ϵ}.
L(x) = {x} for all x in ∑.

Recursive cases:

L(p1p2) = L(p1)L(p2)
L(p1|p2) = L(p1) ∪ L(p2)
L(p*) = (L(p))

A regular language is any language that can be described by a pattern. Note that there are some languages that can’t be represented by a pattern, such as the language of all strings that consist of an arbitrary number of 0s followed by the SAME number of 1s, ie. the language {01, 0011, 000111, 00001111, …}.  This is because there is no way of remembering how many 0s there were.  We will find other ways to express these languages later, in Context Free Grammars.

So from simple patterns we’ve shown how to build them up into more complex patterns, then we saw how to recursively tell if a word matches a pattern, and finally how to show and build up the languages of a pattern.  This basically concludes what patterns are, and you should by now have some idea of how to build up a pattern given a rule.  There’s a number of examples to ensure you understand in Andrea’s notes on page 19.

And of course, as usual, there are the fantastic Badgerati’s notes available at http://badgerati.wordpress.com/2009/05/11/comp10042-computation-regular-expressions/

Summer 09 COMP10042 Computation 1: Sets & Languages

In order for a computer to organise a textual input, the computer has to be able to recognise certain strings, such as, in parsing a programming language, it has to be able to find ‘if’ or ‘while’.

The first part of Computation deals with how we describe to a computer what string to look for, how to come up with the right description to solve the problem, and how to do more advanced things such as check whether a piece of code is correctly formed, making sure that opening and closing brackets match, or checking that a webpage is written in valid HTML.

Symbols, Words and Strings – Definitions

Input to a computer is given in symbols, such as letters and numbers. A collection of symbols is an alphabet.  An alphabet is usually denoted by the symbol ∑.

Symbols can be combined into strings, which can be called ‘words’.  We can have words consisting of any number of symbols, including zero.  A word with zero symbols is called the empty string, and is referred to by ϵ.

A space can be a symbol, and we can use the symbol ı_ı to show it. (Though in these blog posts, I’ll just use _ since there’s no single character for ı_ı).

The length of a string can be shown by a modulus symbol.  For example, |s| is the length of the string s.  |ϵ| = 0, obviously, because it is the empty string.

Languages

A language is a set of words that can be created with an alphabet. Some examples of languages are the empty language, {}, the language consisting of the empty word {ϵ}, or languages consisting of some number of strings, {ϵ,1,aba,abba,abbba,abbbba}.  A language is referred to with the symbol L.

We can use set notation to define a language.  For example, say we wanted the language of all words that could be made with the alphabet of the symbols ‘a’, ‘b’, and ‘c’.  We can define this language as {s | s is a word over the alphabet {a,b,c}}.  The | should be read as ‘where’, so this is ‘All languages consisting of the word ’s’, where s is a word over the alphabet {a,b,c}’

Concatenation

Two strings can be concatenated, that is, put together.  So the strings Hel and lo could be concatenated into the single string Hello.  To show this, we use a concatenation symbol as such: Hel · lo = Hello.

For multiple concatentation, we can use a power symbol.  For example, h³ = hhh. h° = ϵ (ie. the string consisting of 0 numbers of h).  We can also use n to denote an arbitrary number of concatenations.  For example, an is the language {ϵ, a, aa, aaa, aaaa, aaaaa, …}.

Using set notation, we can write it as L = {an | n ∈ N}.  This can be read as ‘The language L consists of all strings an where n is a whole number’.  We could also exclude the empty string by saying ‘where n is a positive whole number’ with the symbol N+.

We can also concatenate two languages as follows: L1 · L2 = {st | s ∈ L1 and t ∈ L2}.  ie. ‘the set of all concatenated strings where s is a string from L1 and t is a string from L2′.

Summer 09 COMP10052 Distributed Systems 1: VoIP 1

What is a distributed system?

Put simply, a distributed system is a number of pieces of hardware and software that communicate with each other but use different processors.  They can communicate over a network such as a LAN, or the internet. Distributed systems are subject to a number of problems, however, such as variable latency (Two messages sent between two computers may not take the same amount of time to arrive), and unpredictable failures such as a message not being sent.

In this posting we’ll be investigating some of the issues of VoIP, transmitting voice over the internet.  We’ll look at how it works, some of the problems involved with it, and how to overcome those problems.

Basics of sound and voice – Analogue

Voice is just variation in air pressure, travelling from a speaker to a listener at the speed of sound. A microphone can convert this pressure into voltage, and the computer can store it as a sound waveform. This waveform is the sum of a number of sine waves of different frequencies, measured in Hz.

Sound Waveform
Humans can hear up to 20kHz. CDs typically have a ‘bandwidth’ (frequency range) of 50Hz to 20 kHz. Phone lines typically have a ‘bandwidth’ of 300Hz to 3.4kHz, which explains why speech over a telephone sounds a lot more unnatural.

Old Fashioned Phone System

The old fashioned phone system used analogue transmission. The wires carried different voltages depending on the air pressure in the receiver. There was very little delay – about 1 millisecond per 100 miles.

In the 1960s, digital transmission started between exchanges, and most phone speech is now transmitted digitally. Analogue transmission is still used between the exchange and the house. Circuit switching is still used in a connection oriented manner – setting up a connection between the sender and the recipient.

Basics of sound and voice – Digital

In order to transmit something digitally, there has to be some way it can be represented using 0s and 1s.  In order to do this, an analogue signal from the microphone has to be ’sampled’.  This means the program has to take the input regularly and convert all of the data since the last sample into binary.  The more often a signal is sampled, the better the quality.

CDs sample at around 44100 times per second, and store 16 bits per sample.  Phones sample less frequently, at only 8000 times per second, and only store 8 bits per sample, which means that the phone line is transmitting 64 kb/s of sound data. This is a famous standard for digitising speech called ITU-G711.

Unfortunately, 64 kb/s is far too high for mobile phones, and for VoIP.  This is why other standards have to be used.  A mobile phone will usually use a 9.6kb/s standard.

Picture 4

Buffering

The sound cards themselves control the sampling rates (If the processors controlled them, they’d have to be interrupted 8000 times a second, which would slow the computer down considerably).  They have a separate clock, and use buffers to store the inputs and outputs.  A buffer is basically an array that is filled by the processor and emptied by the sound card, or vice versa.

Imagine a leaky bucket filled up by a water tap.  The inputs from the processor fill up the bucket every time the processor isn’t busy.  These inputs then come out of the bottom of the bucket through a hole in the same order that they came in, in a constant stream, despite the input from the processor being intermittent.

It’s very important that the bucket doesn’t empty or overflow.  Thinking about the sound, if the bucket emptied then there would be no more sound until more was put in the bucket.  If the bucket overflowed then some sound would be lost.

Problems

But there’s a problem.  The sampling rates are controlled by crystals, and are accurate to 0.01%.  An 8000Hz sampling rate could in fact be 7999Hz or 8001 Hz.  This is a problem, because this could mean one system is getting an extra 2 samples per second.

Slower clocks can run out of samples (buffer underflow), faster clocks could get too many samples (buffer overflow).  This is a fundamental problem with distributed systems, because there is no global clock that defines how fast everything goes.  Everything works at their own speed.

If we add a network in, we introduce even more problems.  Networks send data in a number of packets, and there’s a danger of packets being lost (not making it to the recipient) or delayed (arriving after one that was sent later).  So, we need to set up a communication.  Lets first look at protocol layers.

TCP/IP Protocol Layers

The TCP/IP Protocol layers are described below, at which different protocols are relevant.

TCPIP Protocols

1) The top-most layer is the Application layer.  Protocols here include http, POP3, FTP, etc.  These are protocols in which two applications can talk to each other over a distributed system.

2) The second later is called the Transport layer.  Protocols here include TCP and UDP.  These are the basis for the protocols in the Application layer, and are closer to how data is actually sent.

3) Lower down there is the Network layer.  In this layer the IP protocol is of particular relevance.  This is the basis for the protocols in the Transport layer.  TCP and UDP have to use the IP protocol to communicate.

4 and 5) Even lower there is the Data-Link layer, and at the bottom is the Physical layer.  These layers are concerned with physical protocols, such as Ethernet (for wired networks) or IEEE802.11 (for wireless networks).  It is on these protocols that IP depends.

So as we move down from the Application layer to the Physical layer, we can see that messages are passed down.  A message using an Application protocol has to be handled by a Transport protocol, which has to be handled by a Network protocol, which has to be handled by a Physical protocol.

Lets look at some of these layers in more detail.

Network layer (Internet Protocol)

The Internet Protocol, or IP for short, deals with the routing of IP packets.  A packet is an amount of data that is sent in one part.

IPPacket

A packet consists of a header of around 20 bytes, and the data which can be any length.  The header contains information about the packet such as the header length, the data length, the ‘check-sum’, the source IP address, destination IP address, etc.  Anything that needs to be known to get the packet intact to the right place.

A checksum is a number of extra bits that can detect errors in the data.  This is sometimes referred to as a cyclic redundancy check (CRC).  In IP packets, a checksum algorithm is as follows:

1) Split up the data into 16 bit ‘words’.
2) Add together the binary values of all of these words.
3) If the result is longer than 16 bits, take anything past 16 bits and add it to the result.

This results in the 16 bit checksum.

The CRC is a check to make sure that the header is correct.  Suppose that the header is the decimal number 149.  To find the CRC, divide it by, say, 7, and express the remainder in binary.  This gives 110 (6) as the CRC, so those are the check bits.

The same division is done at the receiving end.  If the remainder is different, then we know that something is wrong and an error has occurred.  The number ‘7′ would have to be agreed upon in advance.

This does not detect ALL errors.  Any combination of errors that add or subtract multiples of 7 from the header wouldn’t be detected as they have the same remainder, but the majority of errors will be found.  In practice, a much higher number than 7 is used, and binary arithmetic is used rather than simple integer division.

These checks allow an error in an IP header to be detected.  If an error has occurred, the data is simply discarded.  Errors are rare in wired networks, but very frequent in wireless networks.

The IP protocol is connectionless – it just sends a message one way then forgets about it.  It’s unreliable, no guarantees that the message was sent.  Data can easily be lost, delayed, or damaged.  IP is a fundamental part of the internet, and many private networks.

Data-Link and Physical layers

The physical layer protocols just sends relevant voltage pulses of 0s and 1s along physical wires or wirelessly.  This is where bit errors can occur.

The data link layer is responsible for finding and possibly correcting bit errors.

Transport Layer: TCP, UDP, RTP, and RTCP

The transport layer has protocols which use the IP layer to achieve data transfer in a way that is more suitable for application layer protocols.  The most important of these are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).  Two others that are adapted to real time applications such as VoIP are RTP (Real Time Protocol) and RTCP (Real Time Control Protocol).

TCP uses IP to create a connection-oriented transmission.  This is slower, but more reliable, and should be used for data that must be sent but delay won’t matter too much.  Port numbers are used to tell the difference between messages arriving, and things such as sequence numbers if a number of packets is being sent are implemented.

TCP is reliable because of a mechanism for acknowledging receipt of a packet, and retransmitting packets if necessary. This can take time and can create delays, so TCP is not really suited to VoIP.

UDP is simple, connectionless, and unreliable.  You send a message then forget about it.  It sends the source port number, the destination port number, the length of the data, a check, and the data itself.  It’s used for applications which don’t need or don’t have time to wait for acknowledgment or retransmission.  UDP could be used in VoIP because voice is not as sensitive to bit errors and lost packets.

RTP & RTCP is useful because it allows the transmitter to know how many packets are getting through, and what the delay is.  It’s specifically designed for real time applications like VoIP.  A time stamp is added to a UDP packet.  RTCP on the receivers end can send reports back to the sender every 5 seconds or so, giving a general idea of how many packets are being lost.

Summer 09 Revision COMP10020 Maths Part 1: Functions

What is a function?

Functions are things that take inputs and produce outputs. All functions have a certain well defined set of inputs that can be accepted by the function, and when the function is run multiple times with the same input then it will always deliver the same output.

An example of a function is 2x. It has a well defined set of inputs (the set of all numbers) and if you run it multiple times with the same number you’ll always get the same output. 2->4, 3->6, 9->18, etc.

Function notation

The inputs and outputs of a function are both sets, referred to as the ’source’ and the ‘target’ respectively. (Or the ‘domain’ and the ‘codomain’). A function can be shown as follows: ƒ:S -> T where ƒ is the function, S is the source, and T is the target. The output value of the function ƒ given the input value of x can also be written as ƒ(x). The behaviour of the function can be written alongside the function as follows: x|->2x (For the function ƒ(x)=2x)

Function Equivalence

For two functions ƒ1:S1->T1 and ƒ2:S2->T2 to be equal (ƒ1=ƒ2), they must satisfy three conditions. They must have the same source, S1 = S2. They must also have the same target, T1 = T2. Finally, they must both have the same behaviour, ie. ƒ1(x)=ƒ2(x) for all possible input values.

Identity Function

For any possible input set there is a function ƒ:S->S, ie. every element of S, if input into function ƒ, will return itself. This is called the identity function, and can be denoted by 1s or ids.

Injection, Surjection, and Bijection

A function ƒ:S->T can have certain properties. Injectivity, surjectivity, and bijectivity are properties that functions can have.

A function is surjective if for every element in the target, there is at least one element in the source that maps onto the target.
A function is injective if for every element in the target there is at most one element in the source that maps onto the target.
A function is bijective if it is both injective and surjective, ie. for every element in the target there is one and only one element in the source that maps onto the target.

The following diagrams show an injective, a surjective, and a bijective function, respectively.

Examples

Assume that the source and the target are the sets of all real numbers.

Injective: ‘For every element in the target there is at most one element in the source that maps onto the target.’

  • ƒ(x) = 2x+1 (There is no number that can be output from two different inputs)
  • ƒ(x) = ex

Surjective: ‘For every element in the target there is at least one element in the source that maps onto the target.’

  • ƒ(x) = 2x+1 (For any possible output, there is always an input that can produce it)
  • ƒ(x) = x

Bijective: ‘For every element in the target there is one and only one element in the source that maps onto the target.’

  • ƒ(x) = 2x+1 (As you have seen above, it is both surjective and injective.)
  • ƒ(x) = x

Composition

If we take two functions, ƒ:R->S1 and g:S2->T, we can say they are ‘composable’ when S1 = S2. This means there is a composite function h:R->T. We can denote this by h = g ∘ ƒ. In terms of behavious, h(x) = g(ƒ(x)).  h(x) has the source R, which is also the source of ƒ, and the target T, which is also the target of g.  It can also be written as gƒ.

g∘ƒ:R->T. Note that it is now impossible to determine what set S was from the composite h.

A function composed with the identity function is equal to itself, that is to say that ƒ∘1S = ƒ.

Double Composites

Composite functions can be found for multiple functions as well. Consider the following functions: ƒ:S->L, g:L->R, h:R->T. It’s possible to find a single function that has a source of S and the target of T. This will be given by hg∘ƒ.

Composite functions are associative, ie. ((hg)∘ƒ) = (h∘(g∘ƒ)). For this reason, it is usually written without the brackets as hg∘ƒ.  Or even more commonly, just as hgƒ.

And if you want more notes, check out the notes of Badgerati.

Long time since I’ve touched this..

I just got a comment, comments are rare for me. It regards physics. I’ve not touched physics in a very long time, but I’ll try to answer it as best as possible.

“how does the variation of amplitude with distance from the source differ for a progressive wave and a stationary wave? i hope that u wud find me an ans for this question as soon as possible”

(I’ve used standing wave instead of stationary wave because that’s what I’m used to, and the fact that I’m not sure if it’s stationary or stationery and can’t be bothered googling.)

Amplitude is the maximum displacement from equilibrium that a point on the wave can have. In a progressive wave, every point on the wave from the source to the end of the wave has the exact same amplitude, as if you consider each point, it is moving up and down the same amount as the point next to it, but is slightly out of phase. Thus, the amplitude from the source in a progressive wave is a constant value.

It’s different for standing waves, because they have nodes and antinodes. At antinodes, they have an amplitude of zero. At nodes, they have their maximum amplitude, and their amplitude is different for all points between the nodes and antinodes. The source will usually [always? Can't remember {No, not always, because you can make a standing wave with a slinky by having the node at one end.}] be at an antinode. So in a standing wave, from the source, the amplitude is not a constant. From the source, the amplitude will increase (in a sinusoidal manner(?)) until the first node, and then decrease (in a sinusoidal manner(?)) until the second antinode. This then repeats based on which harmonic the standing wave is at (ie. how many nodes it has).

So to sum up: The amplitude from the source in a progressive wave is a constant value, whereas in a standing wave the amplitude varies. From the source, the amplitude will increase in a sinusoidal manner until the first node, and then decrease again to zero amplitude in a sinusoidal manner to the next antinode.

To put it simply though, if the question’s a lot easier than I’m making it out to be, “Progressive waves have a constant amplitude while in a standing wave, the amplitude is variable depending on where the point lies between a node and an antinode.”

I’d illustrate this with graphs and charts and all manner of flashy things, but I’m far too lazy busy for that.

I hope that this answers your question. Actually, I hope more that I’m right, it’s been a while since I did this kind of thing. It sounds right, anyway. The kind of thing I’d say.

Further Pure Mathematics MEI 2 – Hyperbolic Functions

Hyperbolic Functions

Sinh, cosh, tanh are hyperbolic functions.

Sinh x = 0.5(e^x-e^-x)
Cosh x = 0.5(e^x+e^-x)
Tanh x = (e^2x-1)/(e^2x+1)

To integrate hyperbolic functions, put them in exponential form or assume multiple of 1 and integrate by parts.

Differentiation of inverse hyperbolic functions (arsinh, arcosh, artanh) is referenced in the formula booklet.

Differential of cosh = sinh. Differential of sinh = -cosh. This is opposite to normal trigonometrical differentiatials.

Further Pure Mathematics MEI 2 – Matrices

Matrices

A 3×3 matrix is a matrix of the form: (a,b,c|d,e,f|g,h,i)

The determinant of a 3×3 matrix can be found by Sarrus’ Method: (aei+bfg+cdh)-(ceg+fha+ibd)

To find the inverse M^-1 of a 3×3 matrix M, a number of steps have to be followed:

- Find the determinant
- Find the cofactors of every cell in the matrix (This means the determinant of the 2×2 matrix left after eliminating the row and column of the cell of which the cofactor is being found)
- Alter the signs of the cofactors according to the matrix (+,-,+|-,+,-|+,-,+)
- Transpose the matrix so that each row becomes a column and vice versa, so (a,b,c|d,e,f|g,h,i) becomes (a,d,g|b,e,h|c,f,i).
- Divide the resultant matrix by the determinant of the original matrix

The matrix M-(lamda)I = (a-lamba,b,c|d,e-lamda,f|g,h,i-lamda).  The characteristic equation of a matrix is equating the determinant to 0, ending up with an equation in terms of lamda.

Eigenvalues are values of lamda for which det(M-(lamda)I)=0.

To obtain eigenvectors, first obtain simultaneous equations as follows:

ax+by+cz=(lamda)x
dx+ey+fz=(lamda)y
gx+hy+iz=(lamda)z

These can be solved to find the relationship between x, y, and z, and this can be put into the form of a 3×1 eigenvector.

To verify if a matrix is an eigenvector of a matrix, postmultiply the matrix by the eigenvector.  If the output is a scalar multiple of the eigenvector, then it is valid.

Once eigenvalues and eigenvectors are known, it is possible to combine them together like so:

Eigenvectors (a,b,c) (d,e,f) (g,h,i)
Eigenvalues j, k, l respectively

P = (a,d,g|b,e,h|c,f,i)
D = (j,0,0|0,k,0|0,0,l)

Now the matrix M can be written as PDP^-1.  This is useful for finding powers of M as M^n = PD^nP^-1 and D is a diagonal matrix so D^n = (j^n,0,0|0,k^n,0|0,0,l^n)

Cayley Hamilton theorem states that every square matrix satisfies its own characteristic equation.

Further Pure Mathematics MEI 2 – Complex Numbers

Complex Numbers

A complex number is a number of the form a+bj where j is sqrt(-1).

It can be written as a+bj (Real and Imaginary parts), as c(cos d + j sin d) (Where c is the modulus and d is the argument, c cos d is the real part and c sin d is the imaginary part), or as ce^(jd).

cos theta + j sin theta can be written as e^(j(theta)).

An argand diagram is a graph of real part (x-axis) against imaginary part (y-axis).

The modulus of an imaginary number is its distance from the origin to its point on an argand diagram, and can be found by sqrt(a^2+b^2) or c.

The argument of a complex number is the angle that it makes, in radians, with the x axis. Real numbers have an argument of 0 as they all lie on the x axis. The argument can be found by arctan(b/a) or d.

If w is a complex number a+bj then w* is the complex conjugate a-bj.

j^2 = -1

(cos theta + j sin theta)^n = cos n(theta) + j sin n(theta). This is known as De Moivre’s theorem.

To find the roots of a complex number of the form a+bj, the modulus and argument must be worked out. There are as many solutions as the root number (eg. to find the nth roots of a+bj then there are n answers). The modulus of all the roots is the same and is the nth root of the modulus of the complex number. The argument of the first root can be found by taking the argument of the complex number and dividing it by n. Then to find each of the other arguments for the other roots, add (2pi/n) to the argument each time.

If given two series where:

C = a cos theta + b cos 2theta + c cos 3theta …
S = d sin theta + e sin 2theta + f sin 3theta …

To find another equation for C or S, first multiply S by j and then add C, so that you end up with an infinite series C+jS. Then group together common values of cos and sin theta and use De Moivre’s theorem to group them together into a single series. This series should be recognisable as being arithmetic or geometric in the formula booklet, which can then be used to sum the infinite series.