As explained in the ADP for MCFL page, the main change compared to ADP for CFL is the addition of rewriting functions.
This is how a nice syntax could look like
S = nil << empty | left << B S | pair << P S S >> \ (p1,p2) s1 s2 -> p1 s1 p2 s2 | knot << K K S S S S >> \ (k11,k12) (k21,k22) s1 s2 s3 s4 -> k11 s1 k21 s2 k12 s3 k22 s4 ... h B = base << 'a' | base << 'u' | base << 'c' | base << 'g' P = bpair << ('a','u') | bpair << ('u','a') | [...] K = knot1 << P K >> \ (p1,p2) (k1,k2) -> (k1 p1, p2 k2) | knot2 << P
Of course, rewriting functions should also be assignable to separate identifiers to allow reuse:
rewriteKnot1 (p1,p2) (k1,k2) = (k1 p1, p2 k2) K = knot1 << P K >> rewriteKnot1 | knot2 << P
This is how it actually looks (at the moment)
rewritePair, rewriteKnot :: Dim1 s = tabulated1 $ yieldSize1 (0,Nothing) $ nil <<< empty >>> id ||| left <<< b ~~~ s >>> id ||| pair <<< p ~~~ s ~~~ s >>> (\ [p1,p2,s1,s2] -> [p1,s1,p2,s2]) ||| knot <<< k ~~~ k ~~~ s ~~~ s ~~~ s ~~~ s >>> (\ [k11,k12,k21,k22,s1,s2,s3,s4] -> [k11,s1,k21,s2,k12,s3,k22,s4]) ... h b = tabulated1 $ base <<< 'a' >>> id ||| base <<< 'u' >>> id ||| base <<< 'c' >>> id ||| base <<< 'g' >>> id p = tabulated2 $ bpair <<< ('a','u') >>> id2 ||| bpair <<< ('u','a') >>> id2 ||| [...] rewriteKnot1 :: Dim2 rewriteKnot1 [p1,p2,k1,k2] = ([k1,p1],[p2,k2]) k = tabulated2 $ yieldSize2 (1,Nothing) (1,Nothing) $ knot1 <<< p ~~~ k >>> rewriteKnot1 ||| knot2 <<< p >>> id2
As the grammars use parser combinators, their non-terminals are functions. In Haskell, variable identifiers (like a function identifier) must start with a lower-case letter while constructor identifiers (like a data type) must start with an upper-case letter. (Haskell 98 Lexical Structure)
ADP compilers like GAP-C can do an automatic table design
analysis and then decide which non-terminals should be tabulated. As adp-multi
doesn't do any static analysis (except limited yield size analysis, see below)
the programmer explicitly has to state which tables should be tabulated. The
tabulated2 specify the table dimensions which means
tabulated1 will use a two dimensional table and
tabulated2 will use
a four dimensional table.
Note: At the moment, choosing the wrong version of
tabulated* doesn't yield to
compilation but runtime errors. This typing problem is explained further
below and should be solved in future versions.
| is reserved as a keyword, and
|| is defined as a function in Prelude
which could theoretically be used (by hiding and redefining it, causing more trouble).
||| is also defined in
Control.Arrow, the conflicts are limited to a minimum
which is why it is used.
<< would be possible but for reasons of symmetry to
is defined as
yieldSize2 (or: How does yield size analysis work?)
The original Haskell-ADP implementation has several combinators to connect
two non-terminals (
+~+,...). Those different
variants are necessary to tell the top-down parsing mechanism what the minimum
yield sizes are -- otherwise the subword ranges would never get smaller in a recursion,
leading to endless recursion.
In GAP-C this isn't necessary anymore because the compiler does a yield size analysis and then knows the minimum and possibly maximum yield sizes of each nonterminal.
In adp-multi, a restricted yield size analysis was built in. Restricted means that
it doesn't handle cycles because this would require a full blown abstract syntax tree
analysis of the grammar (as in GAP-C). Therefore the grammar writer has to tell the
parser where the cycles are and then explicitly cut them off by using
yieldSize2, respectively. Cutting off means that the minimum and maximum yield size
of a nonterminal is specified manually, therefore skipping the yield size analysis.
For symmetry reasons,
>>> was chosen for rewriting function.
>> optional vs required
In original Haskell-ADP, the combinators for connecting two nonterminals were themselves
responsible for creating the right subword indices. In adp-multi, this isn't possible
anymore due to the rewriting functions. Therefore, the operator
>>> is responsible
for creating all possible subword indices for a complete rule. In cases where no
rewriting is required, the
>>> must still be applied with the identity function
so that the subword indices are generated. It is not yet clear if this could be
done implicitly by using Haskell's type classes.
\ (p1,p2) s1 s2 vs
Having a rule like
val <<< p ~~~ p >>> rewrite and making it typesafe means
that the two nonterminal parsers have to be applied one after another both
rewrite. This seems rather complicated, considering that
each nonterminal parser can have a different dimension. I'm sure that this is
solvable with some advanced Haskell extensions like Type Families, GADTs, etc.
For now, to produce a working prototype, the type safety was loosened a bit and lists are used in all such cases. This means that related errors will only be detected at runtime.
If someone has hints on that, I would be very thankful!
Why can't terminal symbols be included in rewriting functions?
The formalism of MCFGs allows to write rewriting functions like the following:
Another (less usual) way is:
Only the last two ways can be used in adp-multi:
k = val <<< 'a' ~~~ 'b' ~~~ k >>> g g [x3,x4,x1,x2] = ([x1,x3],[x4,x2])
k = val <<< ('a','b') ~~~ k >>> g g [x3,x4,x1,x2] = ([x1,x3],[x4,x2])
This little restriction made the implementation a lot easier and shouldn't cause any real inconvenience. It might even help in more quickly recognizing all parts of a rule without looking at the rewriting functions.