Parsing simple grammars using only a single Regex

This example tokenizes a typical web search query grammar: search words, parentheses, the boolean operators and/or/not, and quoted literals with embedded quotes (written as a doubled quote):

 

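    // Group 0 is the entire match; the other group numbers are noted inline.
    Regex reTokens = new Regex(@"(?six)
        (                                           # group 1: one capture per token, in input order
            (?<literal>""(?:[\w\s] | """")*"")        # group 2
          | (?<not>(?<=\s)not\s+ | -\s* | !\s*)     # group 3
          | (?<or>\s+or\s+ | \s*,\s* | \s*\|\s*)    # group 4
          | (?<and>\s+and\s+ | \s*&\s* | \s+)       # group 5
          | (?<lparen>\s*\(\s*)                     # group 6
          | (?<rparen>\s*\)\s*)                     # group 7
          | (?<word>[\w\*]+)                        # group 8
        )*
        ");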

 

This regex splits the entire input into tokens in a single match. Groups[1] ends up holding one Capture per token, in input order. Groups 2 through 8 correspond to the token types, each holding a Capture for every token of that type. The drawback is that while iterating over Groups[1] there is no direct way to tell which token type a capture belongs to. To recover it, build a captureMap from capture index to token type:

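    // 'input' is the search text to tokenize (the variable name is assumed here).
    Match m = reTokens.Match(input);

    // Map the start index of each capture to the name of the group that captured it.
    Hashtable captureMap = new Hashtable();
    foreach (string tokenType in reTokens.GetGroupNames())
        if (!Regex.IsMatch(tokenType, @"^\d+$")) // Groups 0 & 1 end up with names "0" and "1".
            foreach (Capture c in m.Groups[tokenType].Captures)
                captureMap.Add(c.Index, tokenType);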
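To see what the map holds, here is a minimal, self-contained sketch that runs the same pattern over a sample query and prints each Groups[1] capture next to the token type looked up by index. The class name CaptureMapDemo, the helper names Run and BuildCaptureMap, and the sample query are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Collections;
using System.Text.RegularExpressions;

static class CaptureMapDemo
{
    // Same pattern as above: unnamed group 1 wraps the named token groups.
    static readonly Regex reTokens = new Regex(@"(?six)
        (
            (?<literal>""(?:[\w\s] | """")*"")
          | (?<not>(?<=\s)not\s+ | -\s* | !\s*)
          | (?<or>\s+or\s+ | \s*,\s* | \s*\|\s*)
          | (?<and>\s+and\s+ | \s*&\s* | \s+)
          | (?<lparen>\s*\(\s*)
          | (?<rparen>\s*\)\s*)
          | (?<word>[\w\*]+)
        )*
        ");

    // Hypothetical helper: runs the tokenizer pattern over a query.
    public static Match Run(string query)
    {
        return reTokens.Match(query);
    }

    // Maps the start index of every token to the name of the group that captured it.
    public static Hashtable BuildCaptureMap(Match m)
    {
        Hashtable captureMap = new Hashtable();
        foreach (string name in reTokens.GetGroupNames())
            if (!Regex.IsMatch(name, @"^\d+$"))   // skip the unnamed groups "0" and "1"
                foreach (Capture c in m.Groups[name].Captures)
                    captureMap.Add(c.Index, name);
        return captureMap;
    }

    static void Main()
    {
        Match m = Run("cat or dog");
        Hashtable map = BuildCaptureMap(m);

        // Walk the tokens in input order and look up each one's type by index.
        foreach (Capture c in m.Groups[1].Captures)
            Console.WriteLine("{0,2}  {1,-5} [{2}]", c.Index, map[c.Index], c.Value);
    }
}
```

For the query "cat or dog" the map should contain three entries: a word at index 0, an or at index 3 (the capture includes the surrounding spaces), and a word at index 7.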

And then construct a vector of tokens:

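    // One array slot per token: use Captures.Count (Group.Length is the character length of the last capture).
    Token[] tokens = new Token[m.Groups[1].Captures.Count];
    int i = 0;
    foreach (Capture c in m.Groups[1].Captures) {
        string tokenType = (string) captureMap[c.Index];
        if (tokenType == "literal" || tokenType == "word")
            tokens[i++] = new TokenValue(tokenType, c.Value); // value-bearing tokens keep their text
        else
            tokens[i++] = new Token(tokenType);
    }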
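The snippets above use Token and TokenValue without defining them, and they do not check whether the match actually consumed the whole input. The following sketch fills both gaps under stated assumptions: the two class definitions, the QueryTokenizer.Tokenize wrapper, and the choice to throw FormatException are all hypothetical, not the original author's code. Because the pattern can match the empty string, Match always succeeds at position 0, so a match shorter than the input is the sign that tokenizing stopped at an unrecognized character.

```csharp
using System;
using System.Collections;
using System.Text.RegularExpressions;

// Hypothetical token types; the post does not show their definitions.
class Token
{
    public readonly string Type;
    public Token(string type) { Type = type; }
    public override string ToString() { return Type; }
}

class TokenValue : Token
{
    public readonly string Value;
    public TokenValue(string type, string value) : base(type) { Value = value; }
    public override string ToString() { return Type + "(" + Value + ")"; }
}

static class QueryTokenizer
{
    static readonly Regex reTokens = new Regex(@"(?six)
        (
            (?<literal>""(?:[\w\s] | """")*"")
          | (?<not>(?<=\s)not\s+ | -\s* | !\s*)
          | (?<or>\s+or\s+ | \s*,\s* | \s*\|\s*)
          | (?<and>\s+and\s+ | \s*&\s* | \s+)
          | (?<lparen>\s*\(\s*)
          | (?<rparen>\s*\)\s*)
          | (?<word>[\w\*]+)
        )*
        ");

    public static Token[] Tokenize(string query)
    {
        Match m = reTokens.Match(query);

        // The match starts at 0; if it ends early, the rest of the input was not tokenized.
        if (m.Length != query.Length)
            throw new FormatException("Unrecognized input at position " + m.Length);

        // Capture index -> token type, as in the post.
        Hashtable captureMap = new Hashtable();
        foreach (string name in reTokens.GetGroupNames())
            if (!Regex.IsMatch(name, @"^\d+$"))
                foreach (Capture c in m.Groups[name].Captures)
                    captureMap.Add(c.Index, name);

        Token[] tokens = new Token[m.Groups[1].Captures.Count];
        int i = 0;
        foreach (Capture c in m.Groups[1].Captures)
        {
            string tokenType = (string) captureMap[c.Index];
            if (tokenType == "literal" || tokenType == "word")
                tokens[i++] = new TokenValue(tokenType, c.Value);
            else
                tokens[i++] = new Token(tokenType);
        }
        return tokens;
    }

    static void Main()
    {
        foreach (Token t in Tokenize("cat or dog"))
            Console.WriteLine(t);
    }
}
```

Note that a literal token's Value is the raw captured text, so it still includes the surrounding quotes and any doubled quotes; a later stage would need to strip and unescape them.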

 
