RE: SUO: RE: RE: Re: Missing Ingredients
I must admit to being skeptical about using WordNet. I've finally realized
why. Two reasons.
1. As I've explained before, I'll bet I can give you a dozen definitions of
"customer", each of them a real one, used by a real company, such that the
WordNet definition of "customer" fails to discriminate among any of them. So
interesting, tough semantic problems are less likely to arise from an
analysis of WordNet.
2. If we use WordNet, where will we find the discoveries that bring up
ambiguities, vaguenesses, homonymies, etc. that we must then cleanse WordNet
of in ways I've already talked about? These discoveries come out of
recognized needs. In business, such a need arises when a decision maker finds
that a screen or report, purportedly containing the information he needs,
really contains less, more, or different information than what he had in mind.
(Yes, non-Peirceans can talk about what people have in mind, can realize
that meaning isn't inherent in phonetic and orthographic tokens but rather
in the minds of the interpreters.)
An example is in a series of three articles I recently wrote at
datawarehouse.com. The subject was zip codes. A certain zipcode was split
into two by the post office. Call it zipcode 12345, split into a new 12345
and also 12346. (This is how the post office does it. Talking to them about
12345 now being a homonym isn't going to change their practice (and for good
reasons, too).)
The split happened a couple of months ago. The big boss decision maker says
he wants to see the history of sales for his division, by the zip code of
the customers to whom those sales were made. The report lands on his desk.
Sales for 12345 are shown to have dropped by 60% two months ago.
This is known, in business database management, as the "as was vs. as is"
problem. In this case, it will probably be pretty easy to figure out what
happened. But often the results are more difficult to discover. Suppose that
what the boss sees is a list in which there is some highly derived dollar
figure, and in which sales by customer zipcode by month is only a small
contributing factor to creating that derived amount. Now the boss starts to
see a downward trend in certain entries on that list, that started two
months ago, but that even a year from now shows no signs of correcting
itself.
Well, anyway, the issue here is what "... by zipcode" means. Does it mean
(a) by the definition of the zipcode as of some date in the past, or (b) by
the definition of the zipcode as it is currently? ("Definition of the zipcode"
really means "referent of the proper name 12345", of course.)
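
To make the two readings concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the sales rows, the customer ids, the dates, and the function name sales_by_zip are made up for illustration, and a real warehouse would handle this with effective-dated dimension tables rather than a dictionary.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales rows: (sale_date, customer_id, zipcode_at_sale, amount).
# zipcode_at_sale is the zipcode the customer had when the sale was made.
SALES = [
    (date(2003, 6, 15), "c1", "12345", 100),
    (date(2003, 6, 20), "c2", "12345", 150),  # before the split
    (date(2003, 9, 15), "c1", "12345", 100),
    (date(2003, 9, 20), "c2", "12346", 150),  # after the split
]

# Hypothetical lookup: the zipcode each customer has today, after the split.
CURRENT_ZIP = {"c1": "12345", "c2": "12346"}


def sales_by_zip(sales, as_is, current_zip=CURRENT_ZIP):
    """Total sales per (year, month) and zipcode.

    as_is=False: "as was", group by the zipcode recorded at the time of sale.
    as_is=True:  "as is", regroup all history by the customer's current zipcode.
    """
    totals = defaultdict(int)
    for sale_date, customer, zip_at_sale, amount in sales:
        zipcode = current_zip[customer] if as_is else zip_at_sale
        key = ((sale_date.year, sale_date.month), zipcode)
        totals[key] += amount
    return dict(totals)


as_was = sales_by_zip(SALES, as_is=False)
as_is = sales_by_zip(SALES, as_is=True)
```

With this made-up data, the "as was" report shows 12345 at 250 in June and 100 in September, the 60% drop the boss sees. The "as is" report shows 12345 at 100 in both months. Neither report is wrong; they answer different questions, and "by zipcode" does not say which one was asked.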
Point is: using examples from databases where the semantics vary
considerably from one database to another, and also vary to a degree for the
same database, over time, will give us some meat to sink our teeth into. I
don't see how just using WordNet will do that.
-----Original Message-----
From: owner-standard-upper-ontology@majordomo.ieee.org
[mailto:owner-standard-upper-ontology@majordomo.ieee.org]On Behalf Of
Richard Cooper
Sent: Wednesday, October 22, 2003 3:49 PM
To: Murray Altheim
Cc: Tom Johnston; Jon Awbrey; SUO
Subject: RE: SUO: RE: RE: Re: Missing Ingredients
Murray Altheim wrote:
> Richard & Tom,
>
> I perhaps wasn't clear in my previous message regarding use of
> WordNet as a resource for building reasoning engines.
>
> It's not that WordNet doesn't have categories, doesn't have any
> structure, it's that in order to build a system that uses natural
> language, you need to build into its architecture the notion that
> each of these words has essentially *no* meaning outside of a
> given context. Meaning only comes via interpretation, it's not
> bound somehow essentially within the word. So the specific string
> of characters "parrot" has no meaning whatsoever. The meaning you
> think it has is there because *you've* interpreted it. When you
> go to a thesaurus, how do you choose the "correct" word? From
> experience, you do so by looking at the context in which you plan
> to use it, and make a judgement. The reverse, from a machine POV,
> is not straightforward. I think this is what Jon alluded to when
> he says we don't have a clue about how to tackle it.
That is certainly true of natural language, but I'm using WordNet
as a "lexical resource", as it's called, not to interpret natural
language in all its complexity. The point is to select a project
scope that is within reach. WordNet is within reach, and fits
the requirement for defining the commonly held interpretations
people have for words. But that doesn't mean full natural
language in all its variations; it's just a vocabulary of words
that has defined meanings, is computer processable, and can be
applied to ontology development.
The alternative is to have everybody who designs an ontology
make up their own words. Since an authoritative dictionary
like WordNet exists, it's far better to use the definitions from
that source than to have an ontologist make up words.
<snip/>
> A child's understanding of "bird" is much different than a
> biologist's, and to my understanding, DNA evidence has completely
> mucked up zoological taxonomy, showing that there are no hard and
> fast boundaries between species, that "species" may be as flawed
> a concept as "race". So what does "bird" mean? Absent a specific
> context, a specific interpretation, nothing at all. Meaning doesn't
> exist on the printed page or in the words themselves -- it only
> exists in our heads.
Yes, but we have no instruments that can measure what goes on
in our heads. So a commonly used set of definitions is better
than no definition at all, or an unrefereed salad of words
chosen without regard to any natural language interpretations.
> Point is, there's a "pragmatic" and a Pragmatic way to go about
> solving this problem. What may seem like a pragmatic approach (to
> ignore the complexities of language, to jump nominalistically
> into the fray of WordNet) will likely in the end bite you in the
> ass. Or somebody else, maybe somebody named Osama.
Absolutely right. There is only so far we can take this project,
but that doesn't mean we can't get some useful results with some
reasonable amount of work. Just don't have such high expectations
that you think it will properly handle any utterance. Some is
better than none.
> Any reasoning
> engine that makes a mistaken assumption about the context of usage
> of a word, even if it guesses the base definition from a dictionary
> correctly, is going to create a flawed result. Am I talking about
> "bird" or "bird" or "bird" or "bird"? How would you know which one
> is which?
>
> I don't think natural language is yet at a point where we
> understand its complexities well enough, understand the vagaries
> of context and interpretation well enough, that we can perform
> machine-based reasoning on it, except in toy experiments. As I
> mentioned, one approach is to use a controlled vocabulary, or
> require that all machine-based reasoning operate upon known,
> agreed-upon identifiers for known, agreed-upon concepts, sort of
> a business agreement about the use of terms in a vocabulary.
>
> Murray
The latter - a controlled vocabulary - is what I want from WordNet,
but not a tiny toy vocabulary.
As for known, agreed-upon identifiers - WordNet has them. It shows
each interpretation as a synonym set (a synset) for every noun,
verb, adjective and adverb.
As for known, agreed-upon concepts - WordNet has those also. They
include the dictionary relations of hypernymy/hyponymy, synonymy,
and so on.
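
A rough sketch of what "identifiers plus relations" could look like in code, in Python. The sense identifiers, glosses, and hypernym links below are invented for illustration and are not copied from WordNet; the function names senses_of and hypernym_chain are likewise hypothetical.

```python
# Hypothetical sense inventory. Each key is a made-up sense identifier;
# "words" is the synonym set, "hypernym" points at a more general sense.
SENSES = {
    "bird.n.1": {
        "words": {"bird"},
        "gloss": "feathered vertebrate animal",
        "hypernym": "animal.n.1",
    },
    "bird.n.2": {
        "words": {"bird", "fowl"},
        "gloss": "flesh of a bird used as food",
        "hypernym": "food.n.1",
    },
    "animal.n.1": {
        "words": {"animal", "creature"},
        "gloss": "living organism that can move",
        "hypernym": None,
    },
    "food.n.1": {
        "words": {"food"},
        "gloss": "substance eaten for nourishment",
        "hypernym": None,
    },
}


def senses_of(word):
    """Return every sense identifier whose synonym set contains the word."""
    return sorted(sid for sid, sense in SENSES.items() if word in sense["words"])


def hypernym_chain(sense_id):
    """Follow hypernym links upward from a sense until there are none left."""
    chain = []
    current = SENSES[sense_id]["hypernym"]
    while current is not None:
        chain.append(current)
        current = SENSES[current]["hypernym"]
    return chain
```

Here senses_of("bird") returns two identifiers, which is Murray's "bird" or "bird" question in miniature: the string alone does not pick a sense. Once a sense identifier is chosen, though, hypernym_chain gives an unambiguous path upward, and that is the part a controlled vocabulary can supply.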
So it gets my vote for the controlled vocabulary. However, language
is time-varying also. So having a database that formalizes the words
means we have a perishable product, and the actual benefit is
short-lived without a continuing maintenance effort.
But at least this way we can get some useful results on a smaller
scale before trying to go any higher.
Rich