Monday, October 22, 2012

Parsing 101 with the Infragistics Parsing Framework


This is the first in a series of posts I am going to write about the parsing framework shipped as a CTP in the 12.2 release of our WPF and Silverlight products. This framework is used internally by the new XamSyntaxEditor control on those platforms. This post is an introduction both to parsing in general and to the type of parser we provide.

Introduction

The Infragistics parsing framework allows you to define the grammar for a language from which a parser will be created to analyze documents in that language. If you are already familiar with parsing and LR parsers, you can just read the “Supported Languages”, “Ambiguities” and “Performance” sections.

Note: it is helpful to think of a document as a single horizontal sequence of text rather than text which goes from top to bottom as it does in an editor. This is because I will be introducing concepts such as “top-down parsing” and “bottom-up parsing” which relate to the top and bottom of a tree structure and not the top and bottom of a document in an editor.

Lexical Analysis

The first two phases required for analyzing a document in a certain language are lexical analysis and syntax analysis. I am not going to get into too much detail on lexical analysis, but basically, it is the process of scanning a document and grouping characters into meaningful units called tokens. By doing so, lexical analysis imposes a structure on the string of characters of a document to create a string of tokens. For example, the lexical analysis of a C# document might create tokens for keywords, identifiers, integral constants, braces, comments, whitespace, and other elementary units of the C# language.

Most C# tokens are significant, but tokens such as whitespace or comments are not. The significance of each token is defined by the grammar, so in some languages (such as VB), whitespace can be significant. The syntax analysis phase only considers significant tokens. This makes the job of the grammar writer much easier because he or she does not need to define all the areas in which comments and whitespace are allowed; doing so might make a grammar definition overly complicated. From now on, I may refer to lexical analysis as lexing or to the lexical analyzer as the lexer.
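To make the idea of lexing concrete, here is a minimal, hypothetical lexer for a tiny expression language (identifiers, numbers, + and *). It is only a sketch of the concept, not the Infragistics parsing framework's API; the TokenType, Token, and TinyLexer names are invented for this example. Note how whitespace still produces tokens, but they are flagged as insignificant so a later phase can skip them.

using System;
using System.Collections.Generic;

enum TokenType { Identifier, Number, Plus, Star, Whitespace }

class Token
{
    public TokenType Type;
    public string Text;
    public bool IsSignificant;

    public Token(TokenType type, string text, bool isSignificant)
    {
        Type = type;
        Text = text;
        IsSignificant = isSignificant;
    }
}

static class TinyLexer
{
    // Scans the input left to right and groups characters into tokens.
    public static IEnumerable<Token> Tokenize(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            int start = i;
            if (char.IsWhiteSpace(c))
            {
                while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
                // Whitespace is still produced as a token, but it is marked
                // insignificant so the syntax-analysis phase can ignore it.
                yield return new Token(TokenType.Whitespace, text.Substring(start, i - start), false);
            }
            else if (char.IsLetter(c))
            {
                while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
                yield return new Token(TokenType.Identifier, text.Substring(start, i - start), true);
            }
            else if (char.IsDigit(c))
            {
                while (i < text.Length && char.IsDigit(text[i])) i++;
                yield return new Token(TokenType.Number, text.Substring(start, i - start), true);
            }
            else if (c == '+') { i++; yield return new Token(TokenType.Plus, "+", true); }
            else if (c == '*') { i++; yield return new Token(TokenType.Star, "*", true); }
            else throw new Exception("Unexpected character '" + c + "' at position " + i);
        }
    }
}

Tokenizing the string "count + 10 * 2" with this sketch yields the significant tokens count, +, 10, *, and 2, plus four insignificant whitespace tokens; the syntax analysis phase described next would only look at the significant ones.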

Syntax Analysis

Similar to how the lexer imposes a structure on a string of characters, syntax analysis, or parsing, imposes a structure on the string of tokens created by the lexer. The structure created by the parser in this phase is a parse tree, or concrete syntax tree, representing the grammatical structure of the tokens. How that structure is formed depends on the rules defining what is allowed by the language being parsed. Here is a typical parse tree for a C# document:

[image: a typical parse tree for a C# document]

...
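To make the tree structure concrete, here is a rough sketch of what a parse tree node could look like in code. This is purely illustrative and not the Infragistics framework's object model; the ParseNode name and shape are invented for the example. Each node is labeled with a grammar symbol and owns an ordered list of children, with the lexer's tokens sitting at the leaves.

using System.Collections.Generic;

// Illustrative node type: a grammar symbol plus ordered child nodes.
class ParseNode
{
    public string Symbol { get; }
    public List<ParseNode> Children { get; } = new List<ParseNode>();

    public ParseNode(string symbol, params ParseNode[] children)
    {
        Symbol = symbol;
        Children.AddRange(children);
    }
}

class Example
{
    static ParseNode BuildAdditionTree()
    {
        // The expression "a + b" might be structured roughly like this:
        return new ParseNode("AddExpression",
            new ParseNode("Identifier"),    // a
            new ParseNode("OperatorToken"), // +
            new ParseNode("Identifier"));   // b
    }
}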

In my previous post I mentioned ISVs branching out. Case in point...

Would you ever have expected something like this from Infragistics 10, 5, 3 years ago? Neither would I have. I think it's great seeing them extend themselves this way, from things like this, to providing support for other platforms, to thinking outside the VS box. Don't get me wrong, I love the VS box, but a strong ISV community is good for everyone...
