Compiling - Syntax Analysis

Top-Down Parsing

Top-down parsing is based on derivation.

At each step of a derivation, two choices have to be made:

  • Which non-terminal to expand
  • Which production to use

Left-most derivation is one way to derive a string from a grammar: at every step it expands the left-most non-terminal in the current sentential form. Right-most derivation works the same way, but always expands the right-most non-terminal.

The reverse of a left-most derivation is a right-most reduction, and vice versa: the reverse of a right-most derivation is a left-most reduction.
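
For example, with the expression grammar $E \to E + T \mid T$, $T \to \mathbf{id}$, the string $\mathbf{id} + \mathbf{id}$ has the left-most derivation

$$E \Rightarrow E + T \Rightarrow T + T \Rightarrow \mathbf{id} + T \Rightarrow \mathbf{id} + \mathbf{id}$$

and the right-most derivation

$$E \Rightarrow E + T \Rightarrow E + \mathbf{id} \Rightarrow T + \mathbf{id} \Rightarrow \mathbf{id} + \mathbf{id}$$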

Recursive Descent Parsing

Recursive descent parsing is a general top-down parsing method that uses a group of procedures to parse a string, one procedure per non-terminal. It is called a recursive descent parser because the procedure for one non-terminal may call the procedures for other non-terminals (including itself).
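
As a minimal sketch, here is a recursive descent parser for the tiny hypothetical grammar $S \to (\,S\,)\,S \mid \varepsilon$ (balanced parentheses, chosen only for illustration and not taken from these notes), with one procedure per non-terminal:

```python
# A minimal recursive descent parser for the hypothetical grammar
#   S -> ( S ) S | ε        (balanced parentheses)
# One procedure per non-terminal; S() calls itself, which is what makes
# the method "recursive descent".

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def match(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t!r}, got {peek()!r}")
        pos += 1

    def S():
        # S -> ( S ) S   when the lookahead is '('; otherwise S -> ε
        if peek() == '(':
            match('(')
            S()
            match(')')
            S()
        # ε-production: consume nothing

    S()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse(list("(()())")))  # True
```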

Problems of Top-Down Parsing

In top-down parsing, the parser must choose the correct production to expand a non-terminal. When a non-terminal has several productions that share a common prefix, the parser cannot decide which one to use from the next input token alone; it may have to backtrack, which is inefficient.

If there is a production such as $A \to A\alpha$ (or $A$ derives a string beginning with $A$ through several productions), the parser will keep expanding $A$ forever without consuming any input. This is called left recursion.

Eliminating Left Recursion

  1. Direct Left Recursion: Assume we have productions $A \to A\alpha \mid \beta$ ($\alpha \ne \varepsilon$, $\beta$ does not start with $A$). We can rewrite them as $A \to \beta A'$ and $A' \to \alpha A' \mid \varepsilon$ (a worked example follows this list).
  2. Indirect Left Recursion: Expand the non-terminals to turn an indirect left recursion to a direct left recursion, then eliminate it.
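
For example, applying rule 1 to the left-recursive expression grammar $E \to E + T \mid T$ (here $\alpha = {+}\,T$ and $\beta = T$) gives

$$E \to T\,E' \qquad E' \to +\,T\,E' \mid \varepsilon$$

so the parser first consumes a $T$ and only afterwards decides, from the lookahead, whether a $+$ follows.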

Left Factoring

If a non-terminal has multiple productions that share a common prefix, we can factor the common prefix out and move the remaining alternatives into a new non-terminal. The choice between the alternatives is then postponed until after the prefix has been matched, so the originally non-deterministic decision becomes deterministic.
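
For example (a standard illustration, not from these notes), the productions $S \to \mathbf{if}\ E\ \mathbf{then}\ S\ \mathbf{else}\ S \mid \mathbf{if}\ E\ \mathbf{then}\ S$ share the prefix $\mathbf{if}\ E\ \mathbf{then}\ S$; factoring it out gives

$$S \to \mathbf{if}\ E\ \mathbf{then}\ S\ S' \qquad S' \to \mathbf{else}\ S \mid \varepsilon$$

and the decision is postponed until the parser sees whether the next token is $\mathbf{else}$.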

Non-Deterministic Parsing

Even after eliminating left recursion and common prefixes, the parse may still be non-deterministic when some productions begin with a non-terminal, because the parser cannot tell which production applies by looking at the production itself. Predictive parsing is used to solve this.

Predictive Parsing

Predictive parsing is a top-down parsing method that looks one or more tokens ahead to decide which production to use, avoiding backtracking.

LL(1) Grammar

S-grammar: A grammar in which every production body begins with a terminal, and the productions of the same non-terminal begin with different terminals (so their first sets are disjoint).

Follow set: The set of terminals that can appear immediately after a non-terminal in the derivation.

Select set: For a production $A \to \alpha$, the set of terminals on which the parser should choose this production: $FIRST(\alpha)$, plus $FOLLOW(A)$ if $\alpha$ can derive $\varepsilon$.

q-grammar: A grammar in which every production body either begins with a terminal or is $\varepsilon$ (i.e. has the form $A \to \varepsilon$), and the select sets of the productions of the same non-terminal are pairwise disjoint.

First set: The set of terminals that can appear at the beginning of a string derived from a non-terminal.

The first set (for example, $FIRST(X)$) is constructed by repeating the following steps until no new symbols can be added (a Python sketch of this computation follows the list):

  1. If $X$ is a terminal, $FIRST(X) = \{X\}$.
  2. If $X \to \varepsilon$ is a production, add $\varepsilon$ to $FIRST(X)$.
  3. If $X$ is a non-terminal and $X \to Y_1Y_2\cdots Y_k$ is a production, add $FIRST(Y_1) - \{\varepsilon\}$ to $FIRST(X)$; if $Y_1$ can derive $\varepsilon$, also add $FIRST(Y_2) - \{\varepsilon\}$, and so on. If every $Y_i$ can derive $\varepsilon$, add $\varepsilon$ to $FIRST(X)$.
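
A rough Python sketch of this fixed-point computation is below. The grammar representation (a dict mapping each non-terminal to a list of productions, where `[]` stands for an $\varepsilon$-production) and the example grammar are assumptions made for illustration, not fixed by these notes.

```python
# Sketch: FIRST-set computation following the three rules above.
# Grammar format (assumed): {non-terminal: [production, ...]}, where each
# production is a list of symbols and [] represents an ε-production.

EPS = "ε"

def first_sets(grammar, terminals):
    first = {x: {x} for x in terminals}          # rule 1: FIRST(terminal) = {terminal}
    first.update({a: set() for a in grammar})
    changed = True
    while changed:                                # repeat until no set grows
        changed = False
        for a, prods in grammar.items():
            for prod in prods:
                before = len(first[a])
                if not prod:                      # rule 2: A -> ε
                    first[a].add(EPS)
                else:                             # rule 3: A -> Y1 Y2 ... Yk
                    for y in prod:
                        first[a] |= first[y] - {EPS}
                        if EPS not in first[y]:
                            break
                    else:                         # every Yi can derive ε
                        first[a].add(EPS)
                changed |= len(first[a]) != before
    return first

grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["id"]],
}
print(first_sets(grammar, {"+", "id"}))
# FIRST(E) = {'id'}, FIRST(E') = {'+', 'ε'}, FIRST(T) = {'id'}
```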

LL(1) grammar: A grammar in which every pair of productions sharing the same left-hand side, $A \to \alpha \mid \beta$, meets the following conditions:

  1. $FIRST(\alpha)$ and $FIRST(\beta)$ have no common terminals.
  2. If $\alpha$ can derive $\varepsilon$, then $FIRST(\beta)$ and $FOLLOW(A)$ have no common terminals, and vice versa.
  3. At most one of $\alpha$ and $\beta$ can derive $\varepsilon$.

Then, we can construct a FOLLOW set for each non-terminal by repeating the following steps until every FOLLOW set is stable (a Python sketch follows the list):

  1. Put $\$$ (the end-of-input marker) in $FOLLOW(S)$, where $S$ is the start symbol.
  2. If there is a production $A \to \alpha B \beta$, add $FIRST(\beta) - \{\varepsilon\}$ to $FOLLOW(B)$.
  3. If there is a production $A \to \alpha B$, or $A \to \alpha B \beta$ where $\beta$ can derive $\varepsilon$, add $FOLLOW(A)$ to $FOLLOW(B)$.
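
Continuing the previous sketch (same assumed grammar format, reusing `EPS`, `first_sets`, and `grammar`), the FOLLOW computation might look like this:

```python
# Sketch: FOLLOW-set computation following the three rules above.
# Reuses EPS, first_sets() and grammar from the FIRST-set sketch.

END = "$"   # end-of-input marker

def follow_sets(grammar, terminals, start):
    first = first_sets(grammar, terminals)
    follow = {a: set() for a in grammar}
    follow[start].add(END)                       # rule 1: $ in FOLLOW(start symbol)
    changed = True
    while changed:                                # repeat until stable
        changed = False
        for a, prods in grammar.items():
            for prod in prods:
                for i, b in enumerate(prod):
                    if b not in grammar:          # only non-terminals get FOLLOW sets
                        continue
                    before = len(follow[b])
                    beta = prod[i + 1:]
                    nullable = True
                    for y in beta:                # rule 2: add FIRST(β) - {ε}
                        follow[b] |= first[y] - {EPS}
                        if EPS not in first[y]:
                            nullable = False
                            break
                    if nullable:                  # rule 3: β is empty or derives ε
                        follow[b] |= follow[a]
                    changed |= len(follow[b]) != before
    return follow

print(follow_sets(grammar, {"+", "id"}, "E"))
# FOLLOW(E) = {'$'}, FOLLOW(E') = {'$'}, FOLLOW(T) = {'+', '$'}
```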

We can compute the SELECT set of each production as follows:

  1. If there is a production $A \to \alpha$, add $FIRST(\alpha) - \{\varepsilon\}$ to $SELECT(A \to \alpha)$.
  2. If there is a production $A \to \alpha$ and $\alpha$ can derive $\varepsilon$, also add $FOLLOW(A)$ to $SELECT(A \to \alpha)$.

Then, we can construct a parsing table $M$ for the LL(1) grammar: for every production $A \to \alpha$ and every terminal $a \in SELECT(A \to \alpha)$, set $M[A, a] = A \to \alpha$. If the grammar is LL(1), no table cell receives more than one production.
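
A sketch that builds on the `first_sets`/`follow_sets` helpers above (same assumptions) to compute the SELECT sets and fill the table:

```python
# Sketch: SELECT sets and the LL(1) parsing table.
# Reuses EPS, first_sets(), follow_sets() and grammar from the sketches above.

def select_sets(grammar, terminals, start):
    first = first_sets(grammar, terminals)
    follow = follow_sets(grammar, terminals, start)
    select = {}
    for a, prods in grammar.items():
        for prod in prods:
            s, nullable = set(), True
            for y in prod:                   # FIRST(α) - {ε}
                s |= first[y] - {EPS}
                if EPS not in first[y]:
                    nullable = False
                    break
            if nullable:                     # α can derive ε: add FOLLOW(A)
                s |= follow[a]
            select[(a, tuple(prod))] = s
    return select

def ll1_table(grammar, terminals, start):
    table = {}
    for (a, prod), sel in select_sets(grammar, terminals, start).items():
        for t in sel:                        # M[A, t] = A -> α for every t in SELECT
            assert (a, t) not in table, "not LL(1): table conflict"
            table[(a, t)] = list(prod)
    return table

print(ll1_table(grammar, {"+", "id"}, "E"))
# {('E','id'): ['T',"E'"], ("E'",'+'): ['+','T',"E'"], ("E'",'$'): [], ('T','id'): ['id']}
```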

Recursive Predictive Parsing

Recursive predictive parsing is based on recursive descent parsing, but each procedure decides which production to use by checking the lookahead token against the select sets (equivalently, the parsing table), so it never needs to backtrack.

Non-Recursive Predictive Parsing

A non-recursive predictive parser explicitly maintains a stack to simulate the procedure calls of recursive descent parsing, and is driven by the parsing table.
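
A sketch of such a table-driven parser, reusing the `ll1_table` helper and grammar from the sketches above (again an illustration under those assumptions):

```python
# Sketch: non-recursive (table-driven) predictive parsing with an explicit stack.
# Reuses END, ll1_table() and grammar from the sketches above.

def ll1_parse(grammar, terminals, start, tokens):
    table = ll1_table(grammar, terminals, start)
    tokens = list(tokens) + [END]
    stack = [END, start]                        # explicit stack replaces recursion
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in grammar:                      # non-terminal: consult the table
            prod = table.get((top, look))
            if prod is None:
                raise SyntaxError(f"no production for ({top}, {look})")
            stack.extend(reversed(prod))        # push right-hand side, leftmost on top
        else:                                   # terminal or $: must match the input
            if top != look:
                raise SyntaxError(f"expected {top!r}, got {look!r}")
            pos += 1
    return pos == len(tokens)

print(ll1_parse(grammar, {"+", "id"}, "E", ["id", "+", "id"]))  # True
```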

Bottom-Up Parsing

Bottom-up parsing is based on reduction: it constructs the parse tree from the leaves up to the root. A bottom-up parse performs a left-most reduction, which is the reverse of a right-most derivation.
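
For example, with the grammar $E \to E + T \mid T$, $T \to \mathbf{id}$, the input $\mathbf{id} + \mathbf{id}$ is parsed bottom-up by the left-most reductions

$$\mathbf{id} + \mathbf{id} \;\to\; T + \mathbf{id} \;\to\; E + \mathbf{id} \;\to\; E + T \;\to\; E$$

which, read in reverse, is exactly the right-most derivation shown earlier.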
