Thread Links Date Links
Thread Prev Thread Next Thread Index Date Prev Date Next Date Index

EBNF and its not-so-usefulness




At 00:02 2000-07-31 +0100, Chris Angus wrote:
> 
> Dear Chris and Matthew
> 
> Perhaps I can put my oar in here as someone who has made significant use of
> both (E)BNF and XML.
> 
> EBNF is a language used to express the formal grammar of (predominantly)
> computer processable languages such as programming languages.  As such it is
> suited to defining a language such as KIF.  (It is also, incidentally, the
> language in which XML was defined.)  EBNF has nothing to say about the
> semantics of the language, it simply defines the sequences of characters
> that form grammatically legal sentences in the language.

I agree with some of these points but disagree with others.

Regarding "ENBF has nothing to say about the semantics of the language" and 
assuming the "language" is the artificial language "KIF", then the first 
statement is true.

Regarding XML being defined in EBNF, this is false.  Likewise, standards for 
programming languages are not defined in EBNF.  People should pay very close 
attention to the precise standards words.  Various types of "extended" BNF 
are used in these documents (I'll mention the "standard" EBNF below).  
However, language standards are very careful *not* to use BNF because there 
is little semantic weight (in this E-mail, I am referring the the sematanics 
of interoperability, portability, or compatibiliy, as defined by the 
information technology standard).  For example, the following is the first 
production rule of XML:


        2.1 Well-Formed XML Documents

        A textual object is a well-formed XML document if: 
        1. Taken as a whole, it matches the production labeled document. 
        2. It meets all the well-formedness constraints given in this 
specification. 
        3. Each of the parsed entities which is referenced directly or 
indirectly within the document is well-formed. 

                Document
                [1] document ::= prolog element Misc*

        Matching the document production implies that: 
        1. It contains one or more elements.
        2. There is exactly one element, called the root, or document 
element, no part of which appears in the content of any other element. For 
all other elements, if the start-tag is in the content of another element, 
the end-tag is in the content of the same element. More simply stated, the 
elements, delimited by start- and end-tags, nest properly within each other. 
As a consequence of this, for each non-root element C in the document, 
there is one other element P in the document such that C is in the content 
of P, but is not in the content of any other element that is in the content 
of P. P is referred to as the parent of C, and C as a child of P.

So the XML spec contains the BNF production "document ::= prolog elemewnt 
Misc*", but that is only *informative* wording.  The *normative* wording of 
the XML spec is its text (which supports my statement that BNF, if used, is 
only a small percentage of the wording).  In other words, EBNF is *not* used 
in the formal specification of XML ... the syntax is inferred from the 
natural language wording.

In programming language standards, the same is true: they use BNF as 
informative wording, but not normative (or they use BNF for norative 
wording, but have lots of supporting natural language text).  Programming 
language standards may include BNF grammars for automatic validation in 
tools like YACC, however, parsing is only a small part of the problem when 
compared to interpretation, e.g., it requires lots of supporting code to 
make a YACC grammar useful.

Regarding XML itself, XML might be useful for a *coding* (an information 
representation) of KIF.  Just as one might have prefix and infix versions of 
KIF (they're just different codings), the XML version would be a different 
coding.  These codings, along with APIs (application programming interfaces) 
and protocols, are called "bindings".  On 2000-05-31, I gave a summary of 
these features of standardization in E-mail to this list.  If anyone needs a 
copy of that E-mail, I'll send it to them.

Regarding the EBNF standard, ISO/IEC 14977:1996 "Information Technology -- 
Syntactic Metalanguage -- Extneded BNF",  I have mentioned that I was one of 
the reviewers of that document in 1995.  There are a good number of 
ambiguities in that notation. As far as a I know, no one is using that 
notation for programming languages. Considering that many reviewers didn't 
notice that two pages were missing from the document ... it might suggest 
that many people didn't take it seriously.

Other problems with 14977 were related to the lack of utility of such a 
notation, e.g., what are the semantics of matching a BNF production?  (This 
is probably *the* most important feature of BNF.)  Tools/notations like YACC 
make the semantics very clear, but 14977 does not.  To finish things off, 
they introduced terms like "meta-identifier" rather than famliar terminology 
like "non-terminal symbol" (gratuitous use of the prefix "meta" is a sure 
giveaway to faulty thinking :-)), and add examples with "metaproduction 
rules" and "hyper rules" rather than common two-phase lexer/parser 
processing (add "hyper" to the list of faulty-thinking indicators :-)).

As you can see from my comments below, there are signifcant problems with 
14977.  These comments were the full text of my review.  Some of the minor 
comments have been fixed, including adding syntactic exceptions and my 
complaint that a syntactic metalangauge should be powerful enough to 
describe itself.

-FF

-----------------------------------------(cut here)-----------------
Date: Thu, 30 Nov 95 18:19:52 GMT
From: frank@farance.com (Frank Farance)
Subject: Review of ISO DIS 14977

PROCEDURAL ISSUES

The Fast Track procedures are being used for this document. Since there are competing standards (e.g., POSIX.2 YACC and LEX) and the document itself was distributed with missing pages, the normal procedures should be used.

GENERAL ISSUES

The POSIX.2 YACC and POSIX.2 LEX systems provide an equivalent, if not better, capability.  Why do we need "yet another YACC"?

The document was printed in *very* tiny type.  This made reading photocopies difficult.  Also, the small type made it difficult to distinguish the special characters in use (e.g., see the note in subclause 4.6).  The courier font should be used for actual program examples.

The document should be formatted for 8.5x11 paper with wide margins.  This would assure that it could be copied on both 8.5x11 paper and A4 paper.

Use the word "shall" in assertions.

There are no semantics associated with the syntax.  It isn't clear what happens when there is a match, a non-match, or an ambiguous match.

How are ambiguous or duplicate rules resolved?

SPECIFIC ISSUES

References:  Need complete references to ISO standards, including the year, e.g., ISO 1539:19xx.

Clause 0: Should have acronym for language, e.g., YAY for "yet another YACC".

Subclause 0.4, item H: Remove reference of "left-to-right".

Subclause 4.7: There are no semantics associated with syntactic exceptions.

Clause 7: There is no mechanism for specifying a range of characters, e.g, "[A-Z]".

Subclause 7.5: ISO 10646 and other character set standards should be supported, not just UK 7-bit.

Clause 8: It is not clear why the syntactic metalanguage is multiply defined in several different definitions.

Clause 8: The syntactic metalanguage should be powerful enough to describe itself.  In the examples, the syntactic metalanguage requires extensions to describe itself (see the use of "??" extensions).

Clause 8: The syntactic metalanguage should provide examples of its capabilities, by providing examples of known, difficult descriptions (e.g., Hollerith constants) or examples that the Wirth paper (referenced in the standard) describes, i.e., the problems this standard solves.

Appendix A: Should remove all references to "hyper-rules" (just as confusing as the shift, control, alt, meta, and hyper keys on keyboards) and "metaproduction rules".

Appendix A: The extension of the grammar rules isn't clear because the semantics associated with this grammar have not been defined.  In the example:

        metaproduction rule:
                INTREAL = integer | real ;

        hyper-rules:
                INTREAL expression = INTREAL value,
                {('+'|'-'|'*'|'/')},
                INTREAL value ;

it isn't clear that this hyper-rule expands to two rules ("INTREAL" is a single iterator across a whole rule) or eight rules ("INTREAL" is iterated for each instance).  Both interpretations might make sense.

-----------------------------------------(cut here)-----------------

-----------------------------------------------------------------------
Frank Farance, Farance Inc.     T: +1 212 486 4700   F: +1 212 759 1605
mailto:frank@farance.com        http://farance.com
Standards, products, services for the Global Information Infrastructure