
SUO: Automated or Semiautomated Ontology Development




Today, I happened to read two passages that reinforce one another
and point to a better way of doing ontology as well as many other
tasks in knowledge representation.  Neither of the topics is new,
and I'm sure that many, if not most, readers of these email lists
have come across them in one form or another.  But together, they
emphasize an important point.

The first is Jon Awbrey's summary of an Aristotelian insight:

JA> The suggestion is this:  If you find that you and your
 > colleagues having been arguing for an excessively long
 > time and without success about the nature of "what is",
 > then maybe it is time to back off from the fixation on
 > being, just a step or two, and start to look, with new
 > deliberation and due reflection, at the many different
 > categories of things that people say about "what is".

This suggestion is an excellent strategy for addressing the endless
arguments that arise on the SUO list about the placement of various
categories in the upper levels (or for that matter, at any level)
of an ontology.  Even more importantly, it can lead to a procedure
that can be automated or at least semi-automated.  (And by semi-
automated, I mean a procedure that is mostly done by the computer,
but with occasional assistance and verification by humans.)

The second topic is an article about the Google ranking strategy,
which most readers have probably encountered many times.  The point
that the author, Cory Doctorow, emphasizes, however, is the reason
why Google's algorithm is so much more effective than the earlier ones.
Following are some excerpts from the article:

CD> Google's near-magical ordering of the Internet is built around the
 > notion that computers are good at doing repetitive, uncreative things
 > -- fetishistically counting things, for example -- and rotten at
 > understanding why they're being asked to do these boring tasks. By
 > contrast, human beings are great at understanding why they're doing
 > something, but they're woefully deficient in the do-the-same-thing-
 > perfectly-and-forever department....

 > AltaVista tried to get computers to do both the repetitive parts
 > (capturing billions of documents) and the creative parts (figuring out
 > what the documents are about). This yielded the largest collection of
 > randomly organized documents in the world....

 > Yahoo tried just the opposite, getting human beings to manually
 > identify and describe all the documents comprising what was meant to
 > be an exhaustive index of all the worthwhile pages on the Web. There
 > were "scaling issues" involved in this laudable effort (for "scaling
 > issues" here, substitute "catastrophic failures"), and over time,
 > Yahoo's directory dwindled to an increasingly marginal sliver of the
 > Internet's vastness....

I like the comment that "scaling issue" is just a synonym for
"catastrophic failure".  That is my major objection to Cyc -- it
suffers from "scaling issues", which are abundantly clear from
the 500 person-years that have been sunk into the effort so far,
with no end in sight.  I will charitably avoid applying the term
"catastrophic failure" to Cyc or to the Cyc wannabe called SUMO.

 > Google bridges the divide between human-generated indexes and
 > machine-generated analysis.

 > Y'see, the Web is full of people like you and me, making links between
 > documents; human beings, making decisions about documents, voting with
 > their links. When I link to some arbitrary document, it's an
 > indication that I think that it's in some way authoritative. When you
 > link to a document I wrote, you're indicating that I'm in some way
 > authoritative. The Internet is already structured in a meaningful way,
 > but that structure is obscured. Google teases out the relationship
 > between the URLs, examining the webs of authority....

 > It's a best-of-both-worlds solution. The computers at Google are asked
 > to tirelessly count and re-count the number and destination of links
 > on every page that Scooter, the Googlebot, can lay its user-agent on.
 > Those links are made by human beings, doing what they do best, link by
 > link, drip by drip, layering a film of order over the Internet....

 > Nearly every document on the Web has a human decision associated with
 > it for Google to glom onto; that's because nearly every document on
 > the Web has a human author. Human authors don't just put documents
 > onto the Web; they put them into the Web, into the meshed hairball
 > of incoming and outgoing links, indicating not only what keywords the
 > document contains, but also who the document's author believes is
 > authoritative, and vice versa....

Source: 
http://www.oreillynet.com/lpt/a//network/2002/03/08/cory_google.html
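
The link-counting idea in these excerpts can be illustrated with a
minimal sketch.  It is not Google's actual algorithm, which also weighs
each vote by the authority of the page that casts it; it only tallies
inbound links as human "votes".  The page names and the function
count_inbound_links are hypothetical, chosen for illustration.

```python
from collections import Counter


def count_inbound_links(links):
    """Tally one 'vote' per distinct page that links to a target.

    links maps each page to the list of pages it links to.
    This is plain in-link counting, an assumption made for
    illustration, not the ranking method Google actually uses.
    """
    votes = Counter()
    for source, targets in links.items():
        for target in set(targets):      # a page votes once per target
            if target != source:         # self-links carry no vote
                votes[target] += 1
    return votes


# Hypothetical web of four pages; the humans made the links,
# the computer only does the counting.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "c.example"],
}

votes = count_inbound_links(links)
for page, n in votes.most_common():
    print(page, n)
```

The division of labor is the point: people decide what to link to, and
the machine does the tireless counting.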

The point I want to emphasize is the relationship between Aristotle's
insight and Google's successful implementation:  both of them rely on
the fact that people use language for some purpose, and the way they
express themselves inevitably reflects those purposes:

   1. To answer the question of "what something is", Aristotle focused
      on what people say about that thing.  That shift in focus led to
      a strategy for discovering what people think is meaningful about
      the things in question.

   2. To answer the question of "what documents are important", Google
      focused on what documents people referenced when they wanted to
      express something meaningful to other people.

Notice the word "meaningful" in both of these statements.  If you
walk up to somebody and ask him or her to define "meaning", you will
get one of two kinds of reactions, both of which are useless:

   1. A blank stare or a stuttering, disjointed series of syllables
      from people who have never been exposed to a serious study of
      philosophy, linguistics, logic, or lexicography.

   2. A lengthy discourse of theory-laden abstractions from professors
      in any of the above subjects.  Furthermore, any two professors
      you ask are likely to give incompatible answers that have almost
      nothing in common.

Both the Aristotelian approach and the Google approach avoid the
useless questions by asking something very specific:  how do people
actually talk when they are trying to communicate meaningful information
to other people?  This is an approach that Google automated by looking
at how people write web pages for other people to read.
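
As a toy illustration of the same idea applied to categories, the
following sketch tallies what people say a thing is and flags the
contested cases for a human to check.  It is not the method of any
paper or system mentioned here.  The sentences, the "X is a Y" pattern,
and the names tally_statements and needs_review are all assumptions
made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern: only the simplest "X is a/an Y" statements.
PATTERN = re.compile(r"\b(\w+) is an? (\w+)")


def tally_statements(sentences):
    """Count, for each subject, the categories people assign to it."""
    said_about = defaultdict(Counter)
    for sentence in sentences:
        for subject, category in PATTERN.findall(sentence.lower()):
            said_about[subject][category] += 1
    return said_about


def needs_review(said_about):
    """Subjects placed in more than one category go to a human."""
    return sorted(s for s, cats in said_about.items() if len(cats) > 1)


# Hypothetical sample of what people say.
sentences = [
    "A whale is a mammal.",
    "Everyone knows a whale is a mammal.",
    "My uncle insists that a whale is a fish.",
    "A bat is a mammal.",
]

said_about = tally_statements(sentences)
for subject, categories in said_about.items():
    print(subject, dict(categories))
print("for human review:", needs_review(said_about))
```

The computer does the counting, and a person is consulted only where
the statements disagree, which is the semi-automated split of work
described earlier.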

Last year, I presented a paper at an IJCAI workshop on the topic
of automating ontology development along the lines of the above
suggestions.  Following are the slides:

    http://www.jfsowa.com/pubs/autotalk.htm

The philosophical foundations are in my more recent paper on Signs,
Processes, and Language Games.  Section 7 of signproc.htm gives a more
elaborate explanation of the methods mentioned in autotalk.htm:

    http://www.jfsowa.com/pubs/signproc.htm

And my more recent talk about negotiation instead of legislation
discusses some technology that can aid in that kind of automation.

    http://www.jfsowa.com/talks/negotiat.htm

Bottom line:  Some people have considered me a "spoiler" because I
have raised embarrassing questions about their pet projects.  On the
contrary, I regard the automated approach as the salvation they
have been vainly searching for.

John Sowa