Regular Expressions

Regular Expressions stand on the language automaton and formal languages. The idea of Regular Expression (RE) similar to the idea of Knuth Morris Pratt (KMP) Algorithm, but the goal of Regular Expressions is try to find such a given pattern in a string while KMP only capable to find such substring. To do that, we need more abstract machine to capture expressive pattern that support necessary logic such as ‘or’ and repetition. We can construct such abstract machine by implementing Nondeterministic Finite Automaton (NFA) on Digraph structure. A term nondeterministic refers that digraph might contains multiple path contrast to the construction of DFA which only has single path.

Similar to arithmetic symbol in mathematic, we can define such similar pattern in the languages as syntax. Table below give four patterns with examples:

or = |

Name Notation Match
Concatenation {A B} AB
Or A or B A, B
Closure AB* A, AB, ABBB
Parentheses C (AC or B) D CACD, CBD

Table 1: Language patterns

Another important feature of language is shortcut expression. Using a certain shortcut can save us from being type unnecessary long expression. Table below give example of shortcuts:

or = |

Name Notation Example Shortcut for Match
Wildcard . A . B
Specified Set Enclosed in [] [AEIOU]*
Range Enclosed in [] separated by - [A-Z]
Complement Enclosed in [] separated by ^ [^AE or OU]*
At least 1 + (AB)+ (AB)(AB)* AB ABABAB
O or 1 ? (AB)? ‘any’ or AB ‘any’ AB
Specific Count in {} (AB){3} (AB)(AB)(AB) ABABAB
Range Range in {} (AB){1-2} (AB) or (AB)(AB) AB ABAB

Table 2: Shortcut expression

Building an NFA corresponding to an RE is similar to constructing arithmetic expression using Djikstra’s two-stack algorithm. The differences to Djikstra are:

  • RE does not have an explicit operator for concatenation.

  • RE have a unary operator, for closure (*)

  • RE have only one binary operator, for or (|)

The basic NFA construction consist a digraph G and epsilon-transition (transition in edge without scanning text). Below the description of basic rule to construct NFA for RE:

  • Concatenation is every match transition.

  • Use stack to handle Parentheses: Push each left parenthesis on the stack and pop each time encountered with right parenthesis.

  • Handle Closure in two condition: After single character or after right parenthesis.

  • Handle Or expression by implementing two epsilon-transition.

An image below illustrated a full NFA of RE (.AB((C|DE)F)*G)

Below the implementation of NFA of RE in java:

public class NFA
{
    private char[] re;        // Match transitions
    private Digraph G;        // Epsilon transitions
    private int M;            // Number of states

    public NFA(String regexp)
    {
        Stack<Integer> ops = new Stack<Integer>();
        re = regexp.toCharArray();
        M = re.length;
        G = new Digraph(M+1);

        for (int i = 0; i < M; i++) {
            int lp = i;

            if (re[i] == '(' || re[i] == '|')
                ops.push(i);
            else if (re[i] == ')') {
                int or = ops.pop();
                if (re[or] == '|') {
                    lp = ops.pop();
                    G.addEdge(lp, or+1);
                    G.addEdge(or, i);
                } else {
                    lp = or;
                }
            }

            if (i < M-1 && re[i+1] == '*') {
                G.addEdge(lp, i+1);
                G.addEdge(i+1, lp);
            }

            if (re[i] == '(' || re[i] == '*' || re[i] == ')')
                G.addEdge(i, i+1);
        }
    }

    public boolean recognizes(String txt)
    {
        Bag<Integer> pc = new Bag<Integer>();
        DirectedDFS dfs = new DirectedDFS(G, 0);

        for (int v = 0; v < G.V(); v++)
            if (dfs.marked(v)) pc.add(v);

        for (int i = 0; i < txt.length(); i++) {
            Bag<Integer> match = new Bag<Integer>();

            for (int v : pc)
                if (v < M)
                    if (re[v] == txt.charAt(i) || re[v] == '.')
                        match.add(v+1);

            pc = new Bag<Integer>();
            dfs = new DirectedDFS(G, match);

            for (int v = 0; v < G.V(); v++)
                if (dfs.marked(v)) pc.add(v);
        }

        for (int v : pc)
            if (v == M) return true;

        return false;
    }
}

The critical transition of constructing NFA is when we dealing with right parenthesis. Similar to Djikstra two-stack, we need to store every meta character “*, |, (” into stack. When hit right parenthesis character ‘)’, we popped out item in the stack, but if popped out character is ‘|’, we need extra handling to do or operation which is has two epsilon-transition.

The total running time of our implementation of Regular Expression using NFA would be MN which is M time for constructing NFA and N time for doing scanning over text.

results matching ""

    No results matching ""