Re: CG: Semantics, XML, and XQuery
Tom,
Your points are well taken, if one accepts the design assumptions
for XQuery, which are based on other assumptions about the use of XML.
Although I have a high regard for many of the highly intelligent and
creative computer scientists working with the W3C, I would question
some of their design assumptions.
It's important to distinguish the various purposes for which XML has
been proposed and to evaluate its suitability for each of them:
1. Document markup: This was the original purpose of GML, SGML,
HTML, and XML, which have a 35-year history of success in tagging
documents for further processing, both syntactic and semantic.
2. Serialization: Another use for XML is to serve as a serialized or
linear representation for the internal data structures of various
computer software.
3. Intermediate language (IL): Many systems that have an external,
humanly readable language also have an intermediate language that
is less readable for humans, but more efficient for computer
processing.
4. Computational form: Languages that require complex or repeated
processing are usually translated to a more efficient internal form,
such as an IL or data structures consisting of lists, pointers,
arrays, or formats that are optimized for particular algorithms.
The articles I cited in my previous note did not distinguish these four
functions, but any evaluation of XML must consider which function is
intended:
1. Like its ancestors, XML is well suited to document markup. Today,
both Microsoft and OpenOffice use XML for tagging documents for
their office suites. I have been using the *ML family of markup
languages for all my word processing needs since the 1970s, and I
have never seen anything I like better for that purpose.
2. The line between markup and serialization is blurred because
Microsoft's old .doc files are serialized forms of the internal
data structures used by MS Word. Their new XML-based formats
replace the old serialization with an XML serialization. However,
the successful use of XML for serializing documents is closely
related to its use in document markup. That success does not imply
that XML is equally suited to serializing other kinds of data.
3. Intermediate languages were first developed for compilers and
interpreters, which often translate the external syntax to an IL
for further optimization and processing. Various ILs have been
successfully used for programming languages since the 1960s.
Examples include the ILs used in GNU gcc, the Java JVM, and
the .Net CLI. Other examples include the ILs used in the machine
translation of natural languages, and the metalanguage ML for
theorem provers, which evolved into a family of functional
programming languages, such as CAML and OCAML. Just as ML was an
IL that became a programming language, some programming languages
such as LISP, FORTH, and Prolog have a clean structure that enabled
them to be used as ILs for a wide range of specialized purposes.
Although XML can also be used as an IL for some purposes, it is
a poor substitute for these other ILs, which have a 40-year history
of outstanding research and development behind them.
4. As a computational form, XML is good for applications that finish
all the processing in a single pass over the source, such as the
formatting, printing, and display of documents. But the XML source
is a highly inefficient representation for computations that require
repetitive processing of all or part of a text, selective processing
of only some parts of a text, or selective processing and comparison
of many different parts of many different texts.
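As a rough illustration of point 4, here is a minimal Python sketch that
contrasts re-scanning the XML source for every query with making one pass
to build an index. The sample document, the element names, and the
function names are all invented for the example; it is not a benchmark
and not anything taken from the XQuery work.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; element names are invented for illustration.
XML_SOURCE = """
<library>
  <book><author>Codd</author><title>Relational Model</title></book>
  <book><author>Date</author><title>Database Systems</title></book>
  <book><author>Codd</author><title>Normal Forms</title></book>
</library>
"""

def titles_by_scan(xml_text, author):
    """Re-parse and walk the whole source for every single query."""
    root = ET.fromstring(xml_text)
    return [book.findtext("title")
            for book in root.iter("book")
            if book.findtext("author") == author]

def build_index(xml_text):
    """Make one pass over the source and build an author -> titles index."""
    index = defaultdict(list)
    for book in ET.fromstring(xml_text).iter("book"):
        index[book.findtext("author")].append(book.findtext("title"))
    return index

# Built once; later queries never touch the XML text again.
INDEX = build_index(XML_SOURCE)

def titles_by_index(author):
    """Answer a repeated query from the index alone."""
    return INDEX.get(author, [])
```

Both functions return the same answers, but the first pays the cost of
parsing and walking the source on every call, while the second pays it
once.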
I'm sure that different people working on XQuery started with different
assumptions, but as a point for discussion, I'd like to start with the
opening paragraph of the W3C's introduction to XQuery:
The mission of the XML Query project is to provide flexible query
facilities to extract data from real and virtual documents on the
World Wide Web, therefore finally providing the needed interaction
between the Web world and the database world. Ultimately,
collections of XML files will be accessed like databases. The
ambitious task of the XML Query (XQuery) Working Group is therefore
to develop the first world standard for querying web documents...
Source: http://www.w3.org/XML/Query
Two points to notice: first, XQuery is intended to provide "the needed
interaction between the Web world and the database world"; and second,
"collections of XML files will be accessed like databases". Point #1
implies that XQuery should be compatible with the database world, which
is dominated by relational databases accessed by SQL. That does not
mean XQuery must be identical to SQL in syntax, but it should be
interoperable with SQL at the semantic level. Point #2 implies that
the queries are expected to range over collections of documents, which
"will be accessed like databases" -- i.e., not by repeated search of
the original source files.
An even more serious concern is that the W3C contrasts the phrase "the
Web world" with "the database world". Haven't they noticed that almost
every server includes a relational database? For years, Sun used the
slogan "The network is the computer", and today the browser is the most
heavily used program on both desktop and laptop computers. The boundary
between the web, the database, and the application programs started to
disappear 10 years ago, and now it's nearly invisible.
All these points are strong arguments for building on the existing
relational technology used in databases while modifying the query
language to accommodate the XQuery data model; i.e., XQuery should be
a specialized front-end to SQL with a built-in ontology for the data
model. I agree with Chris Date and others that SQL has many flaws
that obscure its logical structure. Those flaws could be corrected
by designing XQuery as an alternate syntax for SQL that avoids the
most egregious flaws of SQL, such as NULLs.
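To make the idea of a front-end concrete, here is a small Python sketch
under heavy assumptions: it stores an XML document in a single SQLite
table and translates a simple child path such as /library/book/title
into SQL joins. The table name "node", the functions shred, path_to_sql,
and query, and the sample document are all hypothetical. It handles only
plain child steps and is nowhere near the real XQuery language.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are invented for illustration.
DOC = """
<library>
  <book><author>Codd</author><title>Relational Model</title></book>
  <book><author>Date</author><title>Database Systems</title></book>
</library>
"""

def shred(xml_text, conn):
    """Store each element as one row: (id, parent id, tag, text)."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, "
        "tag TEXT, text TEXT)")
    counter = 0

    def walk(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter  # ids are assigned in document order
        conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                     (node_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            walk(child, node_id)

    # The root gets parent 0, so the table needs no NULLs.
    walk(ET.fromstring(xml_text), 0)

def path_to_sql(path):
    """Translate a child-only path into a chain of self-joins on node."""
    steps = path.strip("/").split("/")
    joins = ["node n0"]
    conds = ["n0.parent = 0", "n0.tag = ?"]
    for i in range(1, len(steps)):
        joins.append(f"JOIN node n{i} ON n{i}.parent = n{i - 1}.id")
        conds.append(f"n{i}.tag = ?")
    last = len(steps) - 1
    sql = (f"SELECT n{last}.text FROM " + " ".join(joins) +
           " WHERE " + " AND ".join(conds) + f" ORDER BY n{last}.id")
    return sql, steps

def query(conn, path):
    """Run the translated path query and return the matching text values."""
    sql, params = path_to_sql(path)
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
shred(DOC, conn)
titles = query(conn, "/library/book/title")
```

The user writes only the path; the joins, and any indexing the database
chooses to do, stay out of sight.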
Some people have argued for different query languages for accessing
relational and object-oriented databases. But at the logical level,
there is absolutely no difference between an RDB and an OODB. Exactly
the same information can be represented in either one, and the only
distinction is the computational efficiency in the access methods for
different usage patterns. For a summary of how the graph structures
of OODBs can be mapped to and from the RDB relations or tables, see
the discussion of how either form can be represented by the other:
http://www.jfsowa.com/logic/math.htm#Representing
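A minimal Python sketch of the round trip follows. It is a simplified
illustration, not the construction given at that URL: it assumes every
node has at least one attribute and that attributes are single-valued,
and the sample graph and function names are invented.

```python
# Hypothetical object graph: node id -> {attribute: value}. A value that is
# itself a node id acts as a pointer to another object.
GRAPH = {
    "e1": {"type": "Employee", "name": "Ann", "manager": "e2"},
    "e2": {"type": "Employee", "name": "Bob", "dept": "d1"},
    "d1": {"type": "Department", "name": "Research"},
}

def graph_to_relation(graph):
    """Flatten the graph into one relation of (node, attribute, value) rows."""
    return {(node, attr, value)
            for node, attrs in graph.items()
            for attr, value in attrs.items()}

def relation_to_graph(rows):
    """Rebuild the graph from the rows of the relation."""
    graph = {}
    for node, attr, value in rows:
        graph.setdefault(node, {})[attr] = value
    return graph

ROWS = graph_to_relation(GRAPH)
```

Converting to rows and back yields the original graph, which is the
sense in which the two forms carry the same information.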
This theoretical mapping is amply supported by practice. Both Oracle
and IBM's DB2 support SQL and XPath as query languages for exactly
the same data; and Objectivity, which has produced OODB software that
supports some of the largest DBs in the world, also supports both
SQL and XPath for the same data (in fact, the Objectivity developers
have admitted that most of their users prefer SQL because it's more
familiar to their programmers).
The question of the internal data formats is an efficiency consideration
that has nothing to do with the logic of the query language or its human
interface. That point was argued and settled in the 1970s during the
Codd-Bachman debates. Today, there are 30 years of R & D on methods
for automatically optimizing internal storage formats based on usage
patterns. It is counterproductive -- even foolish -- to force the users
to pay attention to the storage structures when they ask a question.
Finally, I'm concerned that there seems to be no collaboration or even
communication between the XQuery designers and the RDF and OWL designers
working on the semantic web. The RDF/OWL group has done a lot of work
on logic and ontology, which is not reflected in any of the XQuery work.
Similarly, the RDF/OWL group shows no attempt to relate their work to
either SQL databases or XQuery databases. The semantic web just seems
to be a separate world instead of an integral part of the total system.
With that lengthy preamble, I'd like to comment on your comments:
TP> ... Experience has shown that there can be benefits to having
> an all-xml syntax for various languages (verbose though it may be
> at times), so there has been some interest in an xml-syntax
> version for xquery as well.
I seriously doubt that claim. Most processing of HTML and XML data
is performed with languages such as Java, Javascript, Perl, Python,
and PHP, whose syntax has no relationship whatever to XML. If you
asked any user or developer of those languages to "webify" them in
an XML-based notation, they would laugh in your face.
TP> ... With lots of data flying around in xml format, it can be very
> useful to query one or an aggregation of xml documents without
> converting the data into relational form, storing it in a database,
> and getting the results back into xml format.
Please see point #4 at the beginning of my preamble. If you have one
simple query executed against one document, you would probably use the
FIND button in your browser or word processor. But for repetitive
processing of any kind, the XML source is the worst possible format
for computation. All of the major search engines map the source
documents to highly optimized and indexed databases, and since modern
desktop and laptop computers now have many gigabytes of disk storage,
software vendors (and OS developers) are providing tools that scan
all the documents on your disk and refer to them in optimized and
indexed databases.
TP> Considering how I use xslt these days, I don't know who I would be
> managing if I had to code everything and push it through relational
> database tables and SQL queries. I'm delighted with xslt for my
> purposes.
I certainly agree with the need for good tools for supporting XML
development, use, and transformations. I wouldn't expect users to
map the XML to an RDB, but if XQuery were better integrated with
SQL, the tools would automatically take advantage of whatever RDB
happened to be available. Nobody should have to use SQL -- I have
recommended controlled natural languages for human consumption.
TP> I don't know if I will ever have much need for the additional
> capability that xquery will bring, along with its extra complexity.
> I might, though, since it is designed to handle collections of xml
> files, and that could be handy, I imagine.
Yes, a good query system for documents of all kinds would be nice to
have, and many groups are working on providing them. Documents tagged
with XML might be easier to classify than ones that aren't, but the
XML-based documents will be a tiny fraction of the total number of all
documents for a long time to come. So the XML-based stuff will just
be a small drop in a much larger ocean.
TP> But my purposes are not enterprise-scale employee databases, for
> example. I don't know why one would want to give up "relational"
> databases and SQL for that kind of application.
The point I was making is that there is no clear dividing line between
"the Web world", "the database world", and "the application world".
Microsoft provides an RDB for use with the office suite, and OpenOffice
is designed to support MySQL. Instead of designing yet another DB to
support XQuery, it should be designed to use whatever DB is already
available on the system.
John Sowa