Summer 09 COMP10042 Computation 1: Patterns

In the last blog post we looked at what the various components in the module, mostly languages, symbols, and alphabets.  Now we’re going to look in more detail at how to define a language from symbols.

To define languages more easily we’re going to look at patterns.  Patterns can also be called regular expressions.  Computers, after all, can’t recognise set notation symbols such as {, }, or ∈.

The Kleene Star

So lets say we want the computer to recognise all instances of 01 repeated, so it would match 01, 0101, 010101, 01010101, 0101010101, etc.  We notice that we need to concatenate 01 an arbitrary number of times.  We can do this with (01)*.  This is a pattern.  The * is called the Kleene Star and will apply an arbitrary number of concatentations to whatever precedes it, in this case 01.  Now the computer just has to compare: Is the first string equal to 0?  Is the next equal to 1?  And so on until the end of the string.

Other examples of patterns could be 1*0, which is the language of all strings consisting of any number of 1s, followed by a 0.  Or there may be two Kleene Stars, such as the pattern 1*0*.  This is any number of 1s followed by any number of 0s.

Alternative

So we’ve looked at using the Kleene Star to create languages.  But that’s not enough for some patterns.  For example, what if we wanted the language of all strings consisting of just 0s and just 1s, ie. the language {0, 00, 000, 1, 11, 111, 1111, 0000, …}

We can also use the alternative, |, to represent a choice.  For example, 0|1 is the language consisting of the string 0 and the string 1.  The language 0*|1* represents the language described above – All strings that are either an arbitrary number of 0s or an arbitrary number of 1s.

Building up Patterns

We can define patterns recursively in the following way:

Base cases:

The character ø is the empty pattern.
The character ϵ is a pattern consisting of the empty word.
Every letter from a specified alphabet ∑ is a pattern.

Recursive cases:

If p and q are both patterns, then so is (pq).  This is concatenation.
If p and q are both patterns, then so is (p|q).  This is the alternative.
If p is a pattern, then so is (p*).  This is the kleene star.

Matching words with Patterns

In order to match a word with a pattern, we can do it recursively, as follows:

Base cases:

The empty word ϵ matches the pattern ϵ.
The pattern p is a single character and s is that character.

Recursive cases:

The pattern p is a concatenation p=(p1p2) and there are words s1 and s2 such that s1 matches p1, s2 matches p2, and s = s1s2.
The pattern p is an alternative p=(p1|p2) and s matches p1 or p2, or both.
The pattern p is of the form p = (q*) and s can be written as s = s1s2s3…sn such that s1, s2, s3, …, sn all match q.

(Note that no word matches the empty pattern ø).

Languages of Patterns

Given a pattern, p, we can refer to the language of all words that match that pattern as L(p).  Or, again, we can define the language recursively, as follows:

Base cases:

L(ø) = ø.
L(ϵ) = {ϵ}.
L(x) = {x} for all x in ∑.

Recursive cases:

L(p1p2) = L(p1)L(p2)
L(p1|p2) = L(p1) ∪ L(p2)
L(p*) = (L(p))

A regular language is any language that can be described by a pattern. Note that there are some languages that can’t be represented by a pattern, such as the language of all strings that consist of an arbitrary number of 0s followed by the SAME number of 1s, ie. the language {01, 0011, 000111, 00001111, …}.  This is because there is no way of remembering how many 0s there were.  We will find other ways to express these languages later, in Context Free Grammars.

So from simple patterns we’ve shown how to build them up into more complex patterns, then we saw how to recursively tell if a word matches a pattern, and finally how to show and build up the languages of a pattern.  This basically concludes what patterns are, and you should by now have some idea of how to build up a pattern given a rule.  There’s a number of examples to ensure you understand in Andrea’s notes on page 19.

And of course, as usual, there are the fantastic Badgerati’s notes available at http://badgerati.wordpress.com/2009/05/11/comp10042-computation-regular-expressions/

Summer 09 COMP10042 Computation 1: Sets & Languages

In order for a computer to organise a textual input, the computer has to be able to recognise certain strings, such as, in parsing a programming language, it has to be able to find ‘if’ or ‘while’.

The first part of Computation deals with how we describe to a computer what string to look for, how to come up with the right description to solve the problem, and how to do more advanced things such as check whether a piece of code is correctly formed, making sure that opening and closing brackets match, or checking that a webpage is written in valid HTML.

Symbols, Words and Strings – Definitions

Input to a computer is given in symbols, such as letters and numbers. A collection of symbols is an alphabet.  An alphabet is usually denoted by the symbol ∑.

Symbols can be combined into strings, which can be called ‘words’.  We can have words consisting of any number of symbols, including zero.  A word with zero symbols is called the empty string, and is referred to by ϵ.

A space can be a symbol, and we can use the symbol ı_ı to show it. (Though in these blog posts, I’ll just use _ since there’s no single character for ı_ı).

The length of a string can be shown by a modulus symbol.  For example, |s| is the length of the string s.  |ϵ| = 0, obviously, because it is the empty string.

Languages

A language is a set of words that can be created with an alphabet. Some examples of languages are the empty language, {}, the language consisting of the empty word {ϵ}, or languages consisting of some number of strings, {ϵ,1,aba,abba,abbba,abbbba}.  A language is referred to with the symbol L.

We can use set notation to define a language.  For example, say we wanted the language of all words that could be made with the alphabet of the symbols ‘a’, ‘b’, and ‘c’.  We can define this language as {s | s is a word over the alphabet {a,b,c}}.  The | should be read as ‘where’, so this is ‘All languages consisting of the word ’s’, where s is a word over the alphabet {a,b,c}’

Concatenation

Two strings can be concatenated, that is, put together.  So the strings Hel and lo could be concatenated into the single string Hello.  To show this, we use a concatenation symbol as such: Hel · lo = Hello.

For multiple concatentation, we can use a power symbol.  For example, h³ = hhh. h° = ϵ (ie. the string consisting of 0 numbers of h).  We can also use n to denote an arbitrary number of concatenations.  For example, an is the language {ϵ, a, aa, aaa, aaaa, aaaaa, …}.

Using set notation, we can write it as L = {an | n ∈ N}.  This can be read as ‘The language L consists of all strings an where n is a whole number’.  We could also exclude the empty string by saying ‘where n is a positive whole number’ with the symbol N+.

We can also concatenate two languages as follows: L1 · L2 = {st | s ∈ L1 and t ∈ L2}.  ie. ‘the set of all concatenated strings where s is a string from L1 and t is a string from L2′.