Implementing a regular expression to state machine parser. (part 1: introduction)

The idea behind this post is the generation of regular expressions. After writing a regular regular expression i read a book about data structures. In this book the author writes about implementing a regular expression parser with the help of a state maschine. The basic idea is to build a state maschine out of the regex expression, and then let the regex text flow through that state machine. If the text passes the state machine it corresponds with the regex expression, if not it throws an error. That principle can be transformed to a simple custom language. So in this post i first will show how to implement a subset of the regular expressions. In a second post i will describe the implementation of the RegexToStateMachine project. Then i will show how to transform a custom language that has control commands to send and receive data to a state machine. This control commands will be parsed and checked with the generated state maschine.

Now i will start with theoretical description of the process of generating a state machine out of a subset of the  regular expressions. This are the definied commands:

  1. c – Matches any literal character c
  2. . – Matches any single character
  3. * – Matches one or more occurrences of the previous character or characters
  4. () – Brackets to group the characters
  5. | – Logical or

First we have to think about how we could build a dynamic state machine out of a given set of characters. That is very simple. We parse each character and build a State object for it and we build a state transition from the current state to the next state. If we do so we get a state maschine where we have only one state transition from one state to another state.

For example if we have the regular expression:

aabba

we have the state maschine:

aabba

Now we have the text “aabba”, that text would flow through that state maschine and would throw now error. Every other text would fail and throw an error. But what is with regular expressions that contains repeating. For example the following expression.

a* (a can be repeated n times)

If we want a state machine that works for such a regular expression we need a different state transition. The following graphic shows how that state transition has to look.

singe-charcter closure

The state * has a jump back to the previous character. If we have the text “aaaaa” it would work with that state machine. But what is with the following regular expression.

(abc)*

For that expression the state machine has to look like the one on that following graphic. So we need to include a jump back to the last open bracket. .

closue expression

If we have the text “abcabcabc” it would work with that state machine. Now we have to check how the | operator that stands for logical or works. If we have the following regular expression.

(ab|cd)

We need the following state machine with the following transitions. So the state machine has to jump to the next character behind the pipe operator. .

or expression

If we have a program that can dynamical build a state machine for all that operations we have to ensure that we can combine all that operations so that the following regular expression would also work.

(ab|ef)*

The following graphic shows the state machine that combines the or (|) operation with the repeat (*) operation.

repeat and or
As we see there are many texts that would fit that regular espression. For example “abab” or “efefef”. Now we have definied the operations for the regular expression to state machine parser. So with this transformations we now know how the state machine has to look for the defined commands. What we now need is a program that converts the expression in the corresponding state machine and checks if a given text flows through the generated state machine.

The description of a possible implementation is shown in the second part of that post series. The source code is available at RegexToStateMachine.