EBNF and its not-so-usefulness
At 00:02 2000-07-31 +0100, Chris Angus wrote:
>
> Dear Chris and Matthew
>
> Perhaps I can put my oar in here as someone who has made significant use of
> both (E)BNF and XML.
>
> EBNF is a language used to express the formal grammar of (predominantly)
> computer processable languages such as programming languages. As such it is
> suited to defining a language such as KIF. (It is also, incidentally, the
> language in which XML was defined.) EBNF has nothing to say about the
> semantics of the language, it simply defines the sequences of characters
> that form grammatically legal sentences in the language.
I agree with some of these points but disagree with others.
Regarding "ENBF has nothing to say about the semantics of the language" and
assuming the "language" is the artificial language "KIF", then the first
statement is true.
Regarding XML being defined in EBNF, this is false. Likewise, standards for
programming languages are not defined in EBNF. People should pay very close
attention to the precise standards words. Various types of "extended" BNF
are used in these documents (I'll mention the "standard" EBNF below).
However, language standards are very careful *not* to use BNF because there
is little semantic weight (in this E-mail, I am referring the the sematanics
of interoperability, portability, or compatibiliy, as defined by the
information technology standard). For example, the following is the first
production rule of XML:
2.1 Well-Formed XML Documents
A textual object is a well-formed XML document if:
1. Taken as a whole, it matches the production labeled document.
2. It meets all the well-formedness constraints given in this
specification.
3. Each of the parsed entities which is referenced directly or
indirectly within the document is well-formed.
Document
[1] document ::= prolog element Misc*
Matching the document production implies that:
1. It contains one or more elements.
2. There is exactly one element, called the root, or document
element, no part of which appears in the content of any other element. For
all other elements, if the start-tag is in the content of another element,
the end-tag is in the content of the same element. More simply stated, the
elements, delimited by start- and end-tags, nest properly within each other.
As a consequence of this, for each non-root element C in the document,
there is one other element P in the document such that C is in the content
of P, but is not in the content of any other element that is in the content
of P. P is referred to as the parent of C, and C as a child of P.
So the XML spec contains the BNF production "document ::= prolog elemewnt
Misc*", but that is only *informative* wording. The *normative* wording of
the XML spec is its text (which supports my statement that BNF, if used, is
only a small percentage of the wording). In other words, EBNF is *not* used
in the formal specification of XML ... the syntax is inferred from the
natural language wording.
In programming language standards, the same is true: they use BNF as
informative wording, but not normative (or they use BNF for norative
wording, but have lots of supporting natural language text). Programming
language standards may include BNF grammars for automatic validation in
tools like YACC, however, parsing is only a small part of the problem when
compared to interpretation, e.g., it requires lots of supporting code to
make a YACC grammar useful.
Regarding XML itself, XML might be useful for a *coding* (an information
representation) of KIF. Just as one might have prefix and infix versions of
KIF (they're just different codings), the XML version would be a different
coding. These codings, along with APIs (application programming interfaces)
and protocols, are called "bindings". On 2000-05-31, I gave a summary of
these features of standardization in E-mail to this list. If anyone needs a
copy of that E-mail, I'll send it to them.
Regarding the EBNF standard, ISO/IEC 14977:1996 "Information Technology --
Syntactic Metalanguage -- Extneded BNF", I have mentioned that I was one of
the reviewers of that document in 1995. There are a good number of
ambiguities in that notation. As far as a I know, no one is using that
notation for programming languages. Considering that many reviewers didn't
notice that two pages were missing from the document ... it might suggest
that many people didn't take it seriously.
Other problems with 14977 were related to the lack of utility of such a
notation, e.g., what are the semantics of matching a BNF production? (This
is probably *the* most important feature of BNF.) Tools/notations like YACC
make the semantics very clear, but 14977 does not. To finish things off,
they introduced terms like "meta-identifier" rather than famliar terminology
like "non-terminal symbol" (gratuitous use of the prefix "meta" is a sure
giveaway to faulty thinking :-)), and add examples with "metaproduction
rules" and "hyper rules" rather than common two-phase lexer/parser
processing (add "hyper" to the list of faulty-thinking indicators :-)).
As you can see from my comments below, there are signifcant problems with
14977. These comments were the full text of my review. Some of the minor
comments have been fixed, including adding syntactic exceptions and my
complaint that a syntactic metalangauge should be powerful enough to
describe itself.
-FF
-----------------------------------------(cut here)-----------------
Date: Thu, 30 Nov 95 18:19:52 GMT
From: frank@farance.com (Frank Farance)
Subject: Review of ISO DIS 14977
PROCEDURAL ISSUES
The Fast Track procedures are being used for this document. Since there are competing standards (e.g., POSIX.2 YACC and LEX) and the document itself was distributed with missing pages, the normal procedures should be used.
GENERAL ISSUES
The POSIX.2 YACC and POSIX.2 LEX systems provide an equivalent, if not better, capability. Why do we need "yet another YACC"?
The document was printed in *very* tiny type. This made reading photocopies difficult. Also, the small type made it difficult to distinguish the special characters in use (e.g., see the note in subclause 4.6). The courier font should be used for actual program examples.
The document should be formatted for 8.5x11 paper with wide margins. This would assure that it could be copied on both 8.5x11 paper and A4 paper.
Use the word "shall" in assertions.
There are no semantics associated with the syntax. It isn't clear what happens when there is a match, a non-match, or an ambiguous match.
How are ambiguous or duplicate rules resolved?
SPECIFIC ISSUES
References: Need complete references to ISO standards, including the year, e.g., ISO 1539:19xx.
Clause 0: Should have acronym for language, e.g., YAY for "yet another YACC".
Subclause 0.4, item H: Remove reference of "left-to-right".
Subclause 4.7: There are no semantics associated with syntactic exceptions.
Clause 7: There is no mechanism for specifying a range of characters, e.g, "[A-Z]".
Subclause 7.5: ISO 10646 and other character set standards should be supported, not just UK 7-bit.
Clause 8: It is not clear why the syntactic metalanguage is multiply defined in several different definitions.
Clause 8: The syntactic metalanguage should be powerful enough to describe itself. In the examples, the syntactic metalanguage requires extensions to describe itself (see the use of "??" extensions).
Clause 8: The syntactic metalanguage should provide examples of its capabilities, by providing examples of known, difficult descriptions (e.g., Hollerith constants) or examples that the Wirth paper (referenced in the standard) describes, i.e., the problems this standard solves.
Appendix A: Should remove all references to "hyper-rules" (just as confusing as the shift, control, alt, meta, and hyper keys on keyboards) and "metaproduction rules".
Appendix A: The extension of the grammar rules isn't clear because the semantics associated with this grammar have not been defined. In the example:
metaproduction rule:
INTREAL = integer | real ;
hyper-rules:
INTREAL expression = INTREAL value,
{('+'|'-'|'*'|'/')},
INTREAL value ;
it isn't clear that this hyper-rule expands to two rules ("INTREAL" is a single iterator across a whole rule) or eight rules ("INTREAL" is iterated for each instance). Both interpretations might make sense.
-----------------------------------------(cut here)-----------------
-----------------------------------------------------------------------
Frank Farance, Farance Inc. T: +1 212 486 4700 F: +1 212 759 1605
mailto:frank@farance.com http://farance.com
Standards, products, services for the Global Information Infrastructure