SUO: Re: article on the pitfalls of metadata
o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o
Frank, John, Matthew, Rich, et al.
Whatever we decide to call it next,
I think that the question of primary
importance about data has always been
a bit like this:
| Does this bunch of bits give us information about the real state
| of things in relation to the desired or intended state of things?
I think that this question goes to the heart of why we gather data and share it,
why we tell the stories that come out of our own experiences, why we go out of
our way to have the sorts of new experiences that we can tell stories about,
why we think that it will do any good to share those stories with people
who are bound to have more or less coincident or divergent experiences.
On the geek^3 side, all the classical definitions of information,
taken in their applied science contexts, make it "information_about",
that is, an "intentional", goal-related property of signs or signals.
Bits, data, signals, and so on are only properly called information to
the extent that they are "about" objective realities, and here the word
"objective" does a double duty, reminding us of real situations and the
goals that we have within them. More specifically, data is information
to the extent that it reduces our uncertainty about the way things are,
and the way to get to the way that we want things to be, and reducing
uncertainty is a subgoal on the way to making action intelligent.
So there is a kind of search involved in all this, but it is the search
that people carry out through the search-space of their realworld states.
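As a rough illustration of the "reduces our uncertainty" reading above, here is a minimal sketch that uses Shannon entropy as the measure of uncertainty. The four-state world and the particular signal are hypothetical, chosen only to make the arithmetic easy to follow; nothing in this thread commits to this measure.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gained(prior, posterior):
    """Drop in uncertainty (bits) when one signal moves us from prior to posterior."""
    return entropy(prior) - entropy(posterior)

# Hypothetical: four equally likely states of the world.
prior = [0.25, 0.25, 0.25, 0.25]
# Hypothetical: a signal that rules out two of the four states.
posterior = [0.5, 0.5, 0.0, 0.0]

print(information_gained(prior, posterior))  # 1.0 bit
print(information_gained(prior, prior))      # 0.0 bits: the signal told us nothing
```

On this reading, a bunch of bits that leaves the distribution over realworld states unchanged carries no information about them, however many bits it contains.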
I think that covers a lot of the underlying reasons why we go looking for
information in the first place, and why we take the trouble to experiment
for ourselves when we just can't get the needed information any other way.
So I think that one of the things that we need to think more decidedly about
is the function that information serves within these sorts of deliberate and
purposeful activities. That is the criterion that makes a decision possible
between "real" and "fake" data.
Old stories, I know, but sometimes it helps to remember them.
Jon Awbrey
o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o
Frank Farance wrote:
>
> At 08:07 2003-08-23 +0100, West, Matthew R SITI-ITPSIE wrote:
> >
> > Dear Rich,
> >
> > Well the article makes me realise that I am involved
> > in one of these initiatives in EPISTLE where we have
> > a data model and some 50,000 items of reference data.
> > So some comments from the coal face.
>
> Matthew-
>
> I agree with your points and I'd like to add some others. As you know,
> I'm involved in the ISO metadata standards committee (ISO/IEC JTC1 SC32 WG2).
> So here is my two cents' worth.
>
> In the referenced article:
>
> http://www.well.com/~doctorow/metacrap.htm
>
> I agree with *some* portions of what Cory Doctorow said, but I believe
> he said it imprecisely. Here is my summary of "what is wrong" (i.e.,
> the "crappy" things) ...
>
> - Everyone has a "search engine" mentality when they think about things on the web.
> All their illustrations are from search engines. A typical comment is "when I look for
> something I want to get 10 hits rather than 10,000". The web isn't just about searching.
>
> - Adding "descriptive data" (i.e., "metadata") has some cost associated with it.
> Based on that cost and the quality one desires, "descriptive data" is added at the
> appropriate "quality level" (the notion of "appropriate" depends upon one's notion
> and expectation of "quality"). Unfortunately, there is no agreement on cost and quality,
> so from everyone's perspective, the "descriptive data" is inconsistent.
>
> - For a given object, there isn't a singular set of "descriptive data".
> It should come as no surprise that "metadata is in the eye of the beholder".
>
> - Most web search engines are terrible ***from the perspective of a librarian or
> someone creating catalog entries***. For example, let's say I have a document that
> has the following unordered attributes:
>
> "attribute1=value1, attribute2=value2, attribute3=value3"
>
> There is no reliable way to get this descriptive information into the web content
> so that a user can say (with a different ordering of the attributes):
>
> "please find attribute3=value3, attribute2=value2, attribute1=value1"
>
> Search engines ignore "META" tags because people lie. So even if we had high
> quality metadata, it would be useless for web searches.
>
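The order-independent lookup described above can be sketched as follows, assuming the descriptive data were available as plain "name=value" pairs. The `parse` and `matches` helpers are hypothetical names for illustration, not part of any search engine's interface.

```python
def parse(text):
    """Parse 'a=b, c=d' into a dict; the order of the pairs does not matter."""
    return dict(pair.strip().split("=", 1) for pair in text.split(","))

def matches(document_attrs, query_attrs):
    """True if every attribute=value pair in the query appears in the document."""
    return all(document_attrs.get(name) == value
               for name, value in query_attrs.items())

doc = parse("attribute1=value1, attribute2=value2, attribute3=value3")
query = parse("attribute3=value3, attribute2=value2, attribute1=value1")

print(matches(doc, query))  # True: same pairs, different order
```

The matching itself is trivial. The hard part, as noted above, is getting trustworthy pairs into the web content in the first place.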
> ----
>
> One of the points in the paper is: "metadata is data about data". This is wrong --
> pretty much all data is about data, so that definition doesn't reveal the delimiting
> (or essential) characteristics of "metadata". Although it sounds catchy, it is an
> imprecise definition. A better definition is:
>
> metadata: data that is used for description
>
> So an *essential characteristic* of "metadata" is that it is descriptive.
> This also means that one cannot look at something *in isolation* and know
> that it is "metadata" -- "metadata" only exists in relation to something
> else (the object of description).
>
> So nothing is inherently "metadata".
>
> Here are the seven "insurmountable obstacles" of "meta-utopia":
>
> 1 People lie
> 2 People are lazy
> 3 People are stupid
> 4 Mission: Impossible -- know thyself
> 5 Schemas aren't neutral
> 6 Metrics influence results
> 7 There's more than one way to describe something
>
> If one considers the creation of "descriptive data" as a data collection process,
> then the usual features of measurement, observation, error, etc. apply (think of
> a survey for data collection). So problems #1, #3, and #4 are simply errors in
> the data (Cory's illustration of the Nielsen television ratings demonstrates
> this point). A statistician might characterize the kinds of errors differently.
> Statistical methodology is a prime tool for improving metadata quality.
>
> Statistical methodology might help #6 and #7: the results of a survey depend upon
> how you ask the question(s). However, aside from the error estimation of (descriptive)
> data collection, most systems that support the collection of metadata also support the
> collection of more than one taxon, characteristic, property, etc.
>
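As one small example of treating metadata creation as a measurement process, here is a sketch of estimating an error rate from a hand-audited sample. The audit figures are hypothetical, and the simple binomial standard error is only one of several estimates a statistician might choose.

```python
import math

def error_rate_estimate(errors_found, sample_size):
    """Return (estimated error rate, standard error) from an audited sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors_found / sample_size
    # Binomial standard error of a sample proportion.
    se = math.sqrt(p * (1 - p) / sample_size)
    return p, se

# Hypothetical audit: 100 descriptive records checked by hand, 10 found wrong.
rate, stderr = error_rate_estimate(10, 100)
print(f"error rate {rate:.2f} +/- {stderr:.2f}")
```

An estimate like this does not say which records are wrong, only how much to trust the collection as a whole.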
> Even if the hierarchies are different, if the "terminology" is the same (or a "population"
> and its "characteristic" are the same; or the "data element" definition is the same), then
> it is possible to have meaningful data interchange. To use the example from the paper
> (problem #5), one vendor organizes data with the following hierarchy:
>
> Energy consumption:
>   Water consumption:
>     Size:
>       Capacity:
>         Reliability
>
> while another organizes it with the following hierarchy:
>
> Color:
>   Size:
>     Programmability:
>       Reliability
>
> So as long as there is common agreement on what "reliability" is and how it is
> measured (e.g., a standard), then the incompatible hierarchies are not a problem --
> and it is possible that the properties (values) associated with the characteristic
> "reliability" are compatible.
>
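The point about incompatible hierarchies can be sketched as follows, under the loud assumption that an identical name stands in for an agreed definition and measurement method (which is what a standard would have to supply). The record values are hypothetical.

```python
# The two vendor hierarchies from the example, as ordered lists of characteristics.
VENDOR_A_ORDER = ["Energy consumption", "Water consumption", "Size",
                  "Capacity", "Reliability"]
VENDOR_B_ORDER = ["Color", "Size", "Programmability", "Reliability"]

def shared_characteristics(order_a, order_b):
    """Characteristics present in both hierarchies, listed in order_a's order."""
    in_b = set(order_b)
    return [name for name in order_a if name in in_b]

def project(record, characteristics):
    """Keep only the agreed characteristics of a record, for interchange."""
    return {name: record[name] for name in characteristics if name in record}

shared = shared_characteristics(VENDOR_A_ORDER, VENDOR_B_ORDER)
print(shared)  # ['Size', 'Reliability']

# Hypothetical records; the values are made up for illustration.
record_a = {"Energy consumption": "low", "Size": "60 cm", "Reliability": 0.98}
record_b = {"Color": "white", "Size": "60 cm", "Reliability": 0.95}
print(project(record_a, shared))
print(project(record_b, shared))
```

The position of "Reliability" in each hierarchy plays no part in the comparison. Only the shared definition does.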
> In conclusion, nothing is inherently "metadata"; it is merely data.
> Data is only metadata when there exists a descriptive relationship to
> some object. Since the establishment of a (descriptive) relationship to
> an object is not in question in Cory's paper, the only thing that is left
> is just the (descriptive) data. Thus, it is not surprising that techniques
> for improving data quality are similar to techniques for improving metadata
> quality: e.g., we need precise definitions of the meaning of the data (such
> as data semantics in ISO/IEC 11179) and we need a precise understanding of
> its measurement and error estimate <-- these apply to both data quality
> and metadata quality.
>
> I don't see these problems as insolvable -- I see them as manageable, like
> all other aspects of data, measurement, data collection, and data quality.
> I don't expect perfection. I don't write "perfect" code (whatever that is),
> I write "well-engineered" code. Likewise, I don't expect "perfect metadata",
> but I believe that "well-engineered metadata" is possible using the techniques
> of subject-matter methodologies, engineering methodologies, and engineering
> management.
>
> Not to mention, there are some ISO/IEC standards (11179) and technical reports (20943)
> on this topic of "metadata" that were developed by ISO/IEC JTC1 SC32.
>
> -FF
>
> ______________________________________________________________________
> Frank Farance, Farance Inc. T: +1 212 486 4700 F: +1 212 759 1605
> mailto:frank@farance.com http://farance.com
> Standards/Products/Services for Information/Communication Technologies
o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o~~~~~~~~~o