Compiling - Lexical Analysis
Description
Using regular grammar is enough to describe a token. More concisely, we can use regex (regular expression) to describe a token.
Regular Definition: A recursive regex of symbols and regex, for example, unsigned integer can be defined like:
$$ digit = [0-9] \ unsigned = digit {digit} $$
Recognising Tokens
Finite Automata (FA) can be used to recognise tokens. FA is a 5-tuple:
$$ FA = (Q, \Sigma, \delta, q_0, F) $$
where $Q$ is a set of states, $\Sigma$ is a set of input symbols, $\delta$ is a transition function, $q_0$ is the start state, and $F$ is a set of final states.
The transition function can be described by a transition table or a transition diagram. Diagrams are more intuitive for humans, while tables are the best data structure for computers.
The input strings that accepted by an FA are called the language of the FA.
FA will not stop unless there’re no more input symbols, even it reaches a final state.
FA can be deterministic (DFA) or non-deterministic (NFA). DFA has only one transition for each symbol, while NFA can have multiple transitions for a symbol. Formally, in DFA, $\delta: S \times \Sigma \rightarrow S$, and in NFA, $\delta: S \times \Sigma \rightarrow 2^S$. DFA and NFA is in fact equivalent, and NFA can be converted to DFA by the subset construction algorithm.
Besides, conventional NFA cannot transit to another state without input symbol, so we can add $\varepsilon$-transitions to NFA. $\varepsilon$-transitions are transitions without consuming input symbols. NFA with $\varepsilon$-transitions can be converted to NFA without $\varepsilon$-transitions by the $\varepsilon$-closure algorithm, so it’s also equivalent to DFA.
DFA Implementation
// DFA.cc: A simple implementation of Deterministic Finite Automata
#include <iostream>
#include <map>
#include <set>
#include <vector>
class DFA {
private:
std::set<int> states;
std::set<char> alphabet;
std::map<std::pair<int, char>, int> transition;
int initial_state;
std::set<int> accept_states;
public:
DFA(const std::set<int> &states, const std::set<char> &alphabet,
const std::map<std::pair<int, char>, int> &transition, int initial_state,
const std::set<int> &accept_states)
: states(states), alphabet(alphabet), transition(transition),
initial_state(initial_state), accept_states(accept_states) {}
bool run(const std::string &input) {
int current_state = initial_state;
for (char c : input) {
if (alphabet.find(c) == alphabet.end()) {
return false;
}
auto key = std::make_pair(current_state, c);
if (transition.find(key) == transition.end()) {
return false;
}
current_state = transition.at(key);
}
return accept_states.find(current_state) != accept_states.end();
}
};
Equivalence Of FA & Regex
FA and regex are equivalent, and we can convert regex to NFA by the Thompson’s construction algorithm, and convert DFA to regex by the state elimination algorithm.
The equivalence of FA and regex are the basis of the lexical analysis: Regex $\rightarrow$ $\varepsilon$-NFA $\rightarrow$ DFA.
Lex: Lexical Analysis Generator
Lex is a lexical analysis generator. It takes a regex and generates a DFA. Lex consists of a Lex language and a Lex compiler.
Lex language defines the tokens and the actions to be performed when a token is recognised. Lex compiler generates a DFA from the Lex language.
As DFA should run by a program, so in fact, Lex compiler outputs a C code that implements the DFA. The C code can be compiled and linked with the main program. The main program can call the DFA to recognise tokens.
Lex is, to some extent, a “meta-programming” tool.