[[category:Technical guidelines]]
Revision as of 08:52, 30 September 2009
Introduction
Since the first version of the QSL grammar file, there have been many changes, and the current grammar is completely different from the original. Unlike other projects, a project that deals with a language is not easy to follow. To make things clear, I have written this article about the techniques we use to handle the QSL grammar.
Before we continue
Experience with language grammars helps, but I think a working knowledge of regular expressions is all you need to get into the QSL grammar file. You will also need the ANTLR library and ANTLRWorks.
Sequence of processing QSL script
The lexical analyzer, or lexer, breaks up the input stream into tokens. The parser feeds off this token stream and tries to recognize the sentence structure. Next, we translate the sentences we get from the parser into an abstract syntax tree (AST). Finally, we walk the AST and convert it into data objects. This article focuses on the lexical level, because that is where most of the problems we had to solve came from. (From my point of view, the parser grammar of QSL is pretty clean and easy to understand.)
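The four stages can be sketched in plain Java as follows. This is only a toy illustration: the `key = value` syntax, the `Assign` node, and every method name are assumptions made for the example, not the real QSL grammar or the classes ANTLR generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy pipeline. The "key = value" syntax and all names are illustrative
// assumptions, not the real QSL grammar or the ANTLR-generated classes.
final class PipelineSketch {

    // AST node for one recognized sentence: key = value
    static final class Assign {
        final String key;
        final String value;
        Assign(String key, String value) { this.key = key; this.value = value; }
    }

    // Stage 1 (lexer): break the input stream into tokens; "\n" marks end of line.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        for (String line : input.split("\n")) {
            for (String piece : line.trim().split("\\s+")) {
                if (!piece.isEmpty()) tokens.add(piece);
            }
            tokens.add("\n");
        }
        return tokens;
    }

    // Stages 2 and 3 (parser + AST): recognize "key = value" sentences and build nodes.
    static List<Assign> parse(List<String> tokens) {
        List<Assign> ast = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (tokens.get(i).equals("\n")) { i++; continue; } // skip blank lines
            if (i + 3 >= tokens.size()
                    || !tokens.get(i + 1).equals("=")
                    || !tokens.get(i + 3).equals("\n")) {
                throw new IllegalArgumentException("expected 'key = value' at token " + i);
            }
            ast.add(new Assign(tokens.get(i), tokens.get(i + 2)));
            i += 4;
        }
        return ast;
    }

    // Stage 4: walk the AST and fill the data object.
    static Map<String, String> toDataObject(List<Assign> ast) {
        Map<String, String> data = new LinkedHashMap<>();
        for (Assign a : ast) data.put(a.key, a.value);
        return data;
    }

    static Map<String, String> run(String script) {
        return toDataObject(parse(lex(script)));
    }

    public static void main(String[] args) {
        System.out.println(run("title = Survey\nlang = en\n"));
    }
}
```

In the real project the lexer, parser and tree construction come from the ANTLR-generated code; only the order of the stages is meant to carry over from this sketch.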
Structure of QSL grammar file
Normally, the lexer grammar and the parser grammar live in separate files, but in QSL we use a combined grammar file: all lexer rules and parser rules are defined in the same file. This is the original structure, and we decided to keep it because, in my experience, a single file is easier to work with in ANTLRWorks.
From the bottom of the QSL lexer
The lexer has many modes. They work like switches that direct the input stream so that only the tokens possible in the current context can match. In addition, one flag plays the most important role in the grammar: lineMode.
Logic of lineMode flag
This flag is true while we scan the input line by line, looking for either a keyword or a plain text line (question text, answer option text or sub question text). lineMode is false while the input cursor is inside a block (properties or an expression).
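A minimal sketch of how such a flag could be tracked, assuming that `{` opens a block and `}` closes it. The article does not name the real QSL block delimiters, so those two characters, the class name and the method name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LineModeSketch {

    // Returns the value of lineMode at the start of every line of the input.
    // Assumption: '{' opens a block and '}' closes it; blocks are not nested.
    static List<Boolean> lineModeAtLineStarts(String input) {
        List<Boolean> result = new ArrayList<>();
        boolean lineMode = true;    // we start outside any block
        boolean atLineStart = true;
        for (char c : input.toCharArray()) {
            if (atLineStart) {
                result.add(lineMode); // record the flag before reading the line
                atLineStart = false;
            }
            if (c == '{') {
                lineMode = false;     // cursor is now inside a block
            } else if (c == '}') {
                lineMode = true;      // back to line-by-line scanning
            } else if (c == '\n') {
                atLineStart = true;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lineModeAtLineStarts("What is your age?\n{\nmin = 18\n}\nYes\n"));
    }
}
```

Only lines that start while the flag is true are candidates for keyword or text-line matching; lines inside the block are left to the block rules.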
Question text is not wrapped
Question text (including answer option text and sub question text) was wrapped in '##' in the very first version of QSL. In a later update, we decided to remove the '##'. This was a big change, and it affected the whole strategy of the lexer.
Algorithm of lexer
There is no explicit way to distinguish question text from a keyword, and this leads to our special algorithm. When lineMode is true and we are at the first position of a line, we try to match every keyword and symbol to find the correct one. If no keyword matches, the line is considered a line of question text.
Note that ANY_CHARACTER is the token that represents a line of text.
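The decision described above can be sketched like this. The keyword list is hypothetical (the article does not list the real QSL keywords), and the real lexer makes this decision inside ANTLR-generated code rather than with a simple loop.

```java
import java.util.Arrays;
import java.util.List;

final class LineClassifierSketch {

    // Hypothetical keyword set, used only for illustration.
    static final List<String> KEYWORDS = Arrays.asList("GROUP", "QUESTION", "END");

    // Decides which token a line starts with when the cursor is at column 0.
    static String classify(String line, boolean lineMode) {
        if (!lineMode) {
            return "IN_BLOCK"; // block rules take over; not covered by this sketch
        }
        for (String keyword : KEYWORDS) {
            if (line.startsWith(keyword)) {
                return keyword; // the first keyword that matches wins
            }
        }
        return "ANY_CHARACTER"; // nothing matched: a plain line of text
    }

    public static void main(String[] args) {
        System.out.println(classify("QUESTION q1", true));
        System.out.println(classify("How old are you?", true));
    }
}
```

With this scheme, a text line such as "END of the survey?" is classified as the hypothetical END keyword rather than as text, because the keyword check always runs first.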
Negative effect
Although this algorithm works at the moment, it still has some potential problems:
What if question text contains a QSL keyword?
Question text is HTML content, so it can contain almost any character or word. What happens when a new text line begins with a QSL keyword? Our beloved algorithm breaks: the line is matched as a keyword instead of text. In practice this case is rare, but logically it is a potential problem that needs to be considered.
The complexity of DFA decision making
The LINE token is a special token in the QSL grammar, and our algorithm is centered on it. The problem is that the DFA decision making for this token is complicated.
Whenever we extend the grammar (by adding a new keyword, for example), the complexity increases, and this leads to the "Code too large" problem.
Heavy use of semantic predicates
Recall the purpose of the lexical level: it analyzes the input and returns tokens to the parser. In the QSL lexer, almost every token has a semantic predicate (we use conditions and flags to direct the input cursor). Most of the predicates we use in the lexer encode knowledge that belongs to the parser (we decide on a token based on its role in a complete sentence). In my experience, this confuses people who are new to the QSL grammar and makes the lexer grammar hard to understand.
Problems identified when working with the QSL grammar in a Java environment
"Code too large"
Cause
The Java compiler cannot compile a class if any of its methods is larger than 64 kB of bytecode. Sometimes a rule is so complex that the code generated for it exceeds 64 kB. When that happens, the generated file does not compile and we cannot use it for our purposes.
Solution
Composite grammar
With this approach, we split the grammar file into several smaller files. It is the solution mentioned on the official ANTLR page. Unfortunately, we tried it without success.
Write our own lexer code for particular tokens
This solution works very well when we combine it with semantic predicates in the grammar file. If we can use our flags to direct the input cursor onto a path where only one token can match, hand-written matching code is a very effective way to solve the "Code too large" problem. When working on the project "VN2615QNR - Condition for QSL language", we used ANTLR API methods to handle three tokens: EXPRESSION, ANY_CHARACTER and COMMENT. This can take a lot of time, but it also brings much useful experience.
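A hand-written matcher for a single token might look like the sketch below. It is plain Java with no ANTLR dependency, and it assumes a `/* ... */` comment syntax that this article does not confirm for QSL. The `la` and `consume` helpers are local stand-ins for the lookahead and consume operations that an ANTLR input stream offers; the class and method names are hypothetical.

```java
final class HandWrittenCommentMatcher {

    private final String input;
    private int pos;

    HandWrittenCommentMatcher(String input, int start) {
        this.input = input;
        this.pos = start;
    }

    // Looks ahead i characters (1-based) without consuming; -1 means end of input.
    private int la(int i) {
        int index = pos + i - 1;
        return index < input.length() ? input.charAt(index) : -1;
    }

    private void consume() {
        pos++;
    }

    // Returns the comment including its delimiters, or null if none starts here.
    // Assumption: comments look like /* ... */ and do not nest.
    String matchComment() {
        if (la(1) != '/' || la(2) != '*') {
            return null;
        }
        int start = pos;
        consume();
        consume();
        while (la(1) != -1 && !(la(1) == '*' && la(2) == '/')) {
            consume();
        }
        if (la(1) == -1) {
            throw new IllegalStateException("unterminated comment");
        }
        consume();
        consume();
        return input.substring(start, pos);
    }

    int position() {
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(new HandWrittenCommentMatcher("/* hi */ rest", 0).matchComment());
    }
}
```

The matching is a short loop instead of a generated decision table, so the code for this token stays small no matter how many keywords the rest of the grammar has.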
Future improvement for QSL
At this point, QSL covers every aspect of a questionnaire template. As a further improvement, it is reasonable to consider wrapping question text in double quotes ("). This would reduce the number of semantic predicates and make the grammar file look cleaner.
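To illustrate why quoting would help, here is a minimal sketch under the assumption that a text line is exactly one double-quoted string. The token names and helper methods are hypothetical, and escaping of quotes inside the text is deliberately left out.

```java
final class QuotedTextSketch {

    // With quoted text, the first character alone tells text apart from keywords.
    static String classify(String line) {
        String trimmed = line.trim();
        boolean quoted = trimmed.length() >= 2
                && trimmed.startsWith("\"")
                && trimmed.endsWith("\"");
        return quoted ? "TEXT" : "KEYWORD_OR_SYMBOL";
    }

    // Strips the surrounding quotes from a line already classified as TEXT.
    static String textOf(String line) {
        String trimmed = line.trim();
        return trimmed.substring(1, trimmed.length() - 1);
    }

    public static void main(String[] args) {
        String line = "\"END of the survey?\"";
        System.out.println(classify(line) + ": " + textOf(line));
    }
}
```

Since question text is HTML and HTML attributes often contain double quotes, a real design would also need an escaping rule for quotes inside the text; that choice is left open here.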