comp314, lecture 7

regular expressions, theory and practice

lecturer: daniel sandler

I. THEORY

A. THE CHOMSKY LANGUAGE HIERARCHY

Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Recursive ⊂ Recursively Enumerable

B. ALPHABETS

An alphabet ∑ is a finite set of symbols

A string ω over ∑ is a sequence of FINITE length of symbols from ∑

The empty string ε has no symbols; it is a string over every alphabet, and it is the only string over the empty alphabet ∑ = ∅.

∑* is the set of all strings ω over ∑.

A language L over alphabet ∑ is any subset of ∑* (L ⊆ ∑*).

Kleene set operations:

L^0 = { ε }
L^1 = L
L^2 = L • L
L^3 = L • L • L
L^k = { ω1ω2…ωk such that ωi in L forall 1 ≤ i ≤ k }
    = L • L • … • L
      `-----v-----'
            k

L* = UNION of all L^i i≥0
   = L^0 ∪ L^1 ∪ L^2 ∪ L^3 …

        (closed under concatenation •)

L+ = UNION of all L^i i≥1
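
The Kleene operations above can be tried out on small finite languages. The following is a minimal sketch, not part of the lecture: the class name KleeneOps and its methods are hypothetical, languages are modelled as Java sets of strings, and since L* is infinite whenever L contains a non-empty string, only the finite union L^0 ∪ … ∪ L^maxK is computed.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of the lecture.
final class KleeneOps {
    // Concatenation of languages: every string of l1 followed by every string of l2.
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new LinkedHashSet<>();
        for (String u : l1) {
            for (String v : l2) {
                out.add(u + v);
            }
        }
        return out;
    }

    // L^k, built up from L^0 = { epsilon } (epsilon is the Java empty string "").
    static Set<String> power(Set<String> l, int k) {
        Set<String> out = new LinkedHashSet<>();
        out.add("");
        for (int i = 0; i < k; i++) {
            out = concat(out, l);
        }
        return out;
    }

    // Finite approximation of L*: the union of L^0, L^1, ..., L^maxK.
    static Set<String> starUpTo(Set<String> l, int maxK) {
        Set<String> out = new LinkedHashSet<>();
        for (int k = 0; k <= maxK; k++) {
            out.addAll(power(l, k));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> l = new LinkedHashSet<>();
        l.add("a");
        l.add("bb");
        System.out.println(power(l, 2));     // [aa, abb, bba, bbbb]
        System.out.println(starUpTo(l, 2));  // 7 strings, including the empty string
    }
}
```

Dropping k = 0 from the loop in starUpTo gives the corresponding approximation of L+.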

III. REGULAR LANGUAGES

Regular expressions over alphabet ∑:

1. ∅ is a regexp (regardless of ∑)
2. σ is a regexp forall σ ∈ ∑ (σ == symbol)
3. ε (empty) is a regexp
4. if r1 and r2 are regexps, so are
    - (r1 + r2)  (PCRE: r1|r2)
    - (r1 • r2)  (PCRE: r1r2)
    - (r1*)
5. Nothing else.

So regardless of ∑, you get two REs for free: ∅ and ε.

The language of some expression E, denoted L(E):

1. if E = ∅ then L(E) = ∅
2. if E = ε then L(E) = { ε }
3. if E = σ (σ in ∑) then L(E) = { σ }
4. if E = (E1 + E2) then L(E) = L(E1) UNION L(E2)
5. if E = (E1 • E2) then L(E) = L(E1) • L(E2)
6. if E = (E1*) then L(E) = (L(E1)*)

Definition: A language L is a REGULAR LANGUAGE if there exists a regular expression E such that L(E) = L.
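
The six cases of L(E) translate almost line for line into a recursive evaluator. The sketch below is an illustration under stated assumptions, not the lecture's own code: the class Re and its factory methods are hypothetical names, and because L(E) can be infinite, lang(maxLen) returns only the strings of L(E) whose length is at most maxLen.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical mini-AST for illustration; names are not from the lecture.
abstract class Re {
    // All strings of L(E) with length <= maxLen (maxLen >= 0 assumed).
    abstract Set<String> lang(int maxLen);

    static Re empty() {            // case 1: L(E) is the empty set
        return new Re() {
            Set<String> lang(int maxLen) { return new TreeSet<>(); }
        };
    }

    static Re eps() {              // case 2: L(E) = { epsilon }
        return new Re() {
            Set<String> lang(int maxLen) {
                Set<String> s = new TreeSet<>();
                s.add("");
                return s;
            }
        };
    }

    static Re sym(char c) {        // case 3: L(E) = { c }
        return new Re() {
            Set<String> lang(int maxLen) {
                Set<String> s = new TreeSet<>();
                if (maxLen >= 1) s.add(String.valueOf(c));
                return s;
            }
        };
    }

    static Re alt(Re a, Re b) {    // case 4: union
        return new Re() {
            Set<String> lang(int maxLen) {
                Set<String> s = a.lang(maxLen);
                s.addAll(b.lang(maxLen));
                return s;
            }
        };
    }

    static Re cat(Re a, Re b) {    // case 5: concatenation
        return new Re() {
            Set<String> lang(int maxLen) {
                return concat(a.lang(maxLen), b.lang(maxLen), maxLen);
            }
        };
    }

    static Re star(Re a) {         // case 6: Kleene star
        return new Re() {
            Set<String> lang(int maxLen) {
                Set<String> base = a.lang(maxLen);
                Set<String> out = new TreeSet<>();
                out.add("");       // L^0 = { epsilon }
                Set<String> next = concat(out, base, maxLen);
                // Keep appending strings of L(a) until nothing new fits under the bound.
                while (out.addAll(next)) {
                    next = concat(out, base, maxLen);
                }
                return out;
            }
        };
    }

    // Concatenation of two languages, discarding strings longer than maxLen.
    static Set<String> concat(Set<String> x, Set<String> y, int maxLen) {
        Set<String> out = new TreeSet<>();
        for (String u : x) {
            for (String v : y) {
                if (u.length() + v.length() <= maxLen) out.add(u + v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // E = ((a + b)*) . b : strings over {a, b} that end in b
        Re e = cat(star(alt(sym('a'), sym('b'))), sym('b'));
        System.out.println(e.lang(2));   // [ab, b, bb]
    }
}
```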

IV. SOME EXAMPLES

...


Regular expressions in practice

Introduction

How many people had heard of regexps before this lecture?

How many before college?

Why?

(PHP, perl, python)

ed

It all started with UNIX ed, by Ken Thompson (one of the first two "Bearded Unix Dudes", in part because he and Dennis Ritchie WROTE UNIX at Bell Labs). ed contained the first widely used regular expression engine intended for human/interactive text processing.

→ grep. (etym: the ed command g/PAT/p, i.e. g/re/p: globally search for a regular expression and print)

$ grep PATTERN [file ...]

The syntax we use today is due to Ken Thompson. Note that in the context of ed-style regexps, we are no longer matching the entire string but searching for a matching region anywhere in the input string (the input usually being a single line of text); when several regions match, the leftmost one wins and is extended as far as possible. (See ^ and $ for how to constrain matches to those which use up the whole string.)

The input to the matcher is always a whole line of text in ed/sed/grep.
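
The difference between searching a line and matching all of it can be sketched in Java, used here only as a stand-in for grep (Java's engine is a different implementation, but it agrees with grep on these simple patterns). The class SearchVsMatch and the helper firstMatch are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the lecture.
final class SearchVsMatch {
    // Returns the first region of line that matches regex, or null if there is none.
    static String firstMatch(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "aabbbcc";
        System.out.println(firstMatch("b+", line));     // bbb  (found in the middle of the line)
        System.out.println(firstMatch("^b+$", line));   // null (anchors demand the whole line)
        System.out.println(firstMatch("^b+$", "bbb"));  // bbb
    }
}
```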

.   any character in the alphabet, which is ASCII or UTF-8 or whatever
x   the literal character x
x*  0 or more x's
x+  1 or more x's
^   beginning of string (zero-length match)
$   end of string (zero-length match)
[...]  character set
[^...] inverse character set
[a-z]  character set, range abbreviation
E1|E2  disjunction of patterns E1 and E2
\(...\)  "capture" -- lets you group expressions for order-of-ops, but also, capture for later
x\{i,j\} match x i to j times (inclusive)
\t, \n, etc.  C-esque special chars

Corollaries:

.*  0 or more .'s (any string)
^$  the empty string


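Later additions:

x?  0 or 1 x's
(), {}  grouping and repetition counts without the backslashes

(In POSIX terms, x+, x?, E1|E2 and the unbackslashed (), {} belong to the "extended" syntax, e.g. grep -E.)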
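
A few of the constructs listed above, exercised in a short sketch. It uses Java's dialect, which writes groups and repetition counts without backslashes; the class SyntaxTour and its helpers are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the lecture.
final class SyntaxTour {
    // True if regex matches somewhere inside line (grep-style search).
    static boolean found(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    // Text captured by the first (...) group, or null if the regex does not match.
    static String firstCapture(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(found("[a-z]+", "COMP314"));      // false: no lowercase letter
        System.out.println(found("[^0-9]", "314"));          // false: every character is a digit
        System.out.println(found("^[0-9]{3}$", "314"));      // true: exactly three digits
        System.out.println(found("colou?r", "color"));       // true: the u is optional
        System.out.println(found("^(cat|dog)s?$", "dogs"));  // true
        System.out.println(found("^$", ""));                 // true: the empty string
        System.out.println(firstCapture("([a-z]+)([0-9]+)", "comp314")); // comp
    }
}
```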

Perl

And then there was Perl.

Perl has syntax that makes babies cry, but it was a huge step up from shell scripting or sed/awk scripting, and much easier to use than C for string-processing tasks (not to mention: no compilation!).

Its dialect of regexps was cooked into a C library (PCRE, "perl-compatible regular expressions") and informs everything that came afterward, including Java.

PCRE improvements on Thompson's RE language:

x*?   match x* non-greedily
\d     any decimal digit
\D     any character that is not a decimal digit
\s     any whitespace character
\S     any character that is not a whitespace character
\w     any "word" character
\W     any "non-word" character
\b    word boundary
\1    Capture Backreferences (uh oh: no longer regular!)
(?imsx)  Options: case insensitive, multiline, dotall, extended
(?:foo)  Group foo without capturing
lookaheads, lookbehinds, ...
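
Several of these additions can be seen side by side in a small sketch. It runs on Java's java.util.regex engine rather than the PCRE library itself, on the assumption (true for the constructs used here) that the two dialects agree; the class PcreFeatures and the helper first are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the lecture.
final class PcreFeatures {
    // Returns the first region of input that matches regex, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";
        System.out.println(first("<b>.*</b>", html));         // greedy: <b>one</b><b>two</b>
        System.out.println(first("<b>.*?</b>", html));        // non-greedy: <b>one</b>
        System.out.println(first("\\d+", "room 314"));        // 314
        System.out.println(first("\\bcat\\b", "concat"));     // null: no word boundary before "cat"
        System.out.println(first("(\\w)\\1", "hello"));       // ll (backreference to group 1)
        System.out.println(first("(?i)PERL", "I like perl")); // perl (case-insensitive option)
        System.out.println(first("(?:ab)+", "xababy"));       // abab (grouping without capture)
    }
}
```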

More: PCRE reference

Interactive demo

Playing with regular expressions:

  1. start vim
  2. type <ESC>:set incsearch
  3. type a / and then enter a regexp (note that you'll need to add a backslash to parens and pipes; this is the ed/sed/vi way, sadly)

Regular expressions in Java

Very similar to PCRE.

The crusty API: Pattern & Matcher

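import java.util.regex.*;

// regex and str are Strings supplied by the caller
Pattern E = Pattern.compile(regex);

Matcher M = E.matcher(str);

// find() advances to the next matching region each time it is called
while (M.find()) {
   // MatchResult interface, implemented by M
   String matched = M.group();  // the whole matched region
   String g1 = M.group(1);      // first capture; regex must contain at least one (...) group
}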

The simpler API: Pattern.matches(regex, str) (true only if the ENTIRE string matches, unlike find())

The simplest API: str.matches(regex) (same whole-string semantics) ; str.split(regex)
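
A short sketch of the one-line APIs. The class SimpleApis and its two helper methods are hypothetical names for illustration; the calls they wrap (Pattern.matches, String.matches, String.split) are the standard ones.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the lecture.
final class SimpleApis {
    // True only if the ENTIRE string consists of decimal digits.
    static boolean isAllDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    // Splits on commas, swallowing any whitespace around each comma.
    static String[] splitCsv(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("314"));                        // true
        System.out.println(isAllDigits("comp314"));                    // false: "comp" is not digits
        System.out.println("comp314".matches("[a-z]+\\d+"));           // true
        System.out.println(Arrays.toString(splitCsv("a, b,c ,d")));    // [a, b, c, d]
    }
}
```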

Pattern.quote(str) → RE-safe literal
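
Pattern.quote matters whenever the text to search for comes from outside the program and may contain metacharacters. A minimal sketch, with the class QuoteDemo and the helper containsLiteral as hypothetical names:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the lecture.
final class QuoteDemo {
    // True if text contains needle literally, even when needle has regex metacharacters.
    static boolean containsLiteral(String text, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unquoted, the dot is a metacharacter and matches the b in "abc".
        System.out.println(Pattern.compile("a.c").matcher("abc").find()); // true
        // Quoted, only a literal "a.c" counts.
        System.out.println(containsLiteral("abc", "a.c"));                // false
        System.out.println(containsLiteral("see a.c here", "a.c"));       // true
    }
}
```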

See the javadocs for java.util.regex.Pattern for a reference to the Java regexp dialect.

