This example parses a typical web search grammar, with support for search words, parentheses, and/or/not boolean operators, and literals with embedded quotes:
Regex reTokens = new Regex(@"(?six)            # group 0: the whole match
    (                                          # group 1: every token, in order
     (?<literal>""(?:[\w\s] | """")*"")        # group 2
    |(?<not>(?<=\s)not\s+ | -\s* | !\s*)       # group 3
    |(?<or>\s+or\s+ | \s*,\s* | \s*\|\s*)      # group 4
    |(?<and>\s+and\s+ | \s*&\s* | \s+)         # group 5
    |(?<lparen>\s*\(\s*)                       # group 6
    |(?<rparen>\s*\)\s*)                       # group 7
    |(?<word>[\w\*]+)                          # group 8
    )*
");
This regex parses the entire input into tokens. After a match, Groups[1] holds one Capture per token, in input order. Groups 2 through 8 correspond to the individual token types, each holding a Capture for every token of that type. Drawback: you can't iterate through Groups[1] and easily determine each token's type. So build a captureMap from capture index to token type as follows:
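Match m = reTokens.Match(input); // input: the search string to tokenize (variable name assumed here)
Hashtable captureMap = new Hashtable();
foreach (string tokenType in reTokens.GetGroupNames())
    if (!Regex.IsMatch(tokenType, @"^\d+$")) // Skip groups 0 & 1, which end up with names "0" and "1".
        foreach (Capture c in m.Groups[tokenType].Captures)
            captureMap.Add(c.Index, tokenType); // Key: start offset of the token in the input.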
Then construct an array of tokens:
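// Token and TokenValue are application classes not shown here.
// Size the array by the number of captures; Group.Length is the character length of the last capture.
Token[] tokens = new Token[m.Groups[1].Captures.Count];
int i = 0;
foreach (Capture c in m.Groups[1].Captures) {
    string tokenType = (string) captureMap[c.Index];
    if (tokenType == "literal" || tokenType == "word")
        tokens[i++] = new TokenValue(tokenType, c.Value); // Only literals and words carry text.
    else
        tokens[i++] = new Token(tokenType);
}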
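The pieces above can be combined into one self-contained sketch. The Token and TokenValue classes are not shown in the original, so the definitions below are hypothetical stand-ins, as are the SearchTokenizer and Tokenize names. A generic Dictionary and List replace the Hashtable and fixed-size array purely for convenience; the approach is otherwise the same.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token classes; the original text does not define them.
public class Token
{
    public string Type { get; }
    public Token(string type) { Type = type; }
    public override string ToString() => Type;
}

public class TokenValue : Token
{
    public string Value { get; }
    public TokenValue(string type, string value) : base(type) { Value = value; }
    public override string ToString() => Type + ":" + Value;
}

public static class SearchTokenizer
{
    // Same pattern as above: group 1 wraps every token, groups 2-8 are the named types.
    static readonly Regex reTokens = new Regex(@"(?six)
        (
         (?<literal>""(?:[\w\s] | """")*"")
        |(?<not>(?<=\s)not\s+ | -\s* | !\s*)
        |(?<or>\s+or\s+ | \s*,\s* | \s*\|\s*)
        |(?<and>\s+and\s+ | \s*&\s* | \s+)
        |(?<lparen>\s*\(\s*)
        |(?<rparen>\s*\)\s*)
        |(?<word>[\w\*]+)
        )*");

    public static Token[] Tokenize(string input)
    {
        Match m = reTokens.Match(input);

        // Map each capture's start index to the name of the group that produced it.
        var captureMap = new Dictionary<int, string>();
        foreach (string name in reTokens.GetGroupNames())
        {
            if (Regex.IsMatch(name, @"^\d+$")) continue; // skip unnamed groups "0" and "1"
            foreach (Capture c in m.Groups[name].Captures)
                captureMap[c.Index] = name;
        }

        // Walk group 1 in input order and look up each token's type by index.
        var tokens = new List<Token>();
        foreach (Capture c in m.Groups[1].Captures)
        {
            string type = captureMap[c.Index];
            if (type == "literal" || type == "word")
                tokens.Add(new TokenValue(type, c.Value));
            else
                tokens.Add(new Token(type));
        }
        return tokens.ToArray();
    }
}
```

With these definitions, SearchTokenizer.Tokenize("cat or dog") yields three tokens: word:cat, or, word:dog. Because the pattern is unanchored, matching stops at the first character no alternative accepts; a caller that needs strict validation could compare m.Length with input.Length, which this sketch does not do.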