
Re: Software to mark short textual responses



Aneesha, Galia, Hamish, et al.,

That is still a major research problem:

A> Anybody know of any open source software
 > (preferably in Java) that will be able to
 > compare a student's text response with that
 > entered by a teacher and decide whether it
 > is correct or incorrect.

LSA is one approach, but it is a statistical
method that produces usable results only for
fairly large amounts of text (a paragraph or more).
For evaluating essays, LSA scores often correlate
highly with the grades given by experienced
teachers.  But it is useless for evaluating
single sentences or short phrases.

GA> But there are research attempts to apply LSA
 > in this way, comparing whether the two texts
 > contain the same words (in case this is the
 > decisive factor for "correctness").

The LSA method doesn't take into account the syntax,
and it throws away little words, such as "not",
which can make a very big difference in meaning.
It is also incapable of recognizing correct answers
that happen to use different words.
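
As a rough illustration of those two failure modes, here is a
minimal Python sketch.  It is not LSA itself (there is no SVD
step); it is only the bag-of-words comparison that LSA starts
from.  The stopword list, the function names, and the sample
sentences are all assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; note that it contains "not".
STOPWORDS = {"the", "a", "an", "is", "are", "not", "of", "to", "in"}

def bag_of_words(text):
    """Lowercase, split into word tokens, and drop the stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

teacher = bag_of_words("Water is a compound")
negated = bag_of_words("Water is not a compound")
reworded = bag_of_words("H2O is made of two elements")

# The negated answer looks identical to the teacher's answer,
# while the reworded answer shares no words with it at all.
print(cosine(teacher, negated))
print(cosine(teacher, reworded))
```

With these inputs the negated answer scores (up to rounding) 1
and the reworded answer scores 0, which is the opposite of what
a teacher would want.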

To a large extent, I agree with Hamish:

HH> If the answer to a question can be assessed by
 > a computer, and it isn't a multiple choice question
 > (which have a place if used with care), it isn't
 > worth asking, IMHO.

Unfortunately, the president of the US -- a prime
example of the failures of the US educational (not
to mention electoral) system -- has decreed that
every child be evaluated with uniform nationwide
examinations.

In any case, I was involved with the development of
an interesting prototype for evaluating the answers
in response to such exams.  (See below.)

John Sowa
______________________________________________________

For the standardized tests, each question would normally
be tried out on several classrooms of students. For each
question, there would be about 50 different answers:

  * Some are completely correct, but stated in different ways.

  * Some are partially correct, and the teacher would evaluate
    the answer by saying what was missing.

  * Others are wrong in many different ways, and the teacher
    would give some appropriate evaluation.

For each question, there would be about 50 answer-evaluation
pairs for typical correct and incorrect answers by the
students and appropriate evaluations by the teachers.

The VivoMind approach was to use a combination of the Intellitex
parser and the VivoMind Analogy Engine (VAE) to compare each
new answer to the 50 sample answers:

  1. Translate all 50 sample answers to conceptual graphs.

  2. Translate each new answer to a new CG.

  3. Use the analogy engine to compare the new answer CG to
     the CGs for the 50 sample answers.

  4. Use the weight of evidence to determine the best match.

  5. Print out the teacher's evaluation for that answer.
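
The overall shape of steps 1 to 5 can be sketched in a few lines
of Python.  This is only an outline of the retrieve-the-nearest-case
loop, not the VivoMind software: the conceptual-graph translation
and the analogy engine are replaced here by a crude word-overlap
(Jaccard) score, which has exactly the weaknesses described
earlier.  The sample pairs and all the names are hypothetical.

```python
import re

def tokens(text):
    """Stand-in for steps 1 and 2: reduce an answer to a set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def similarity(a, b):
    """Stand-in for step 3: Jaccard overlap instead of graph analogy."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Hypothetical answer-evaluation pairs written by a teacher.
CASES = [
    ("plants make food from sunlight", "Correct."),
    ("plants make food", "Partially correct: the energy source is missing."),
    ("plants eat soil", "Incorrect: plants do not eat soil."),
]

def best_match(answer, cases):
    """Step 4: return (weight, evaluation) for the closest sample answer."""
    scored = [(similarity(answer, sample), evaluation)
              for sample, evaluation in cases]
    return max(scored, key=lambda pair: pair[0])

# Step 5: print the teacher's evaluation attached to the best match.
weight, evaluation = best_match("plants make their food from sunlight", CASES)
print(f"weight={weight:.2f}  evaluation={evaluation}")
```

Swapping a better comparison into similarity() would leave the
rest of the loop unchanged.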

This approach, by the way, is very similar to the techniques
used in "case-based reasoning".

In a preliminary evaluation, this approach worked out very well.
VAE did a good job of matching new answers to one of the previous
answers.  Furthermore, it always produced an estimate (or "weight
of evidence") for any match.  In those cases for which the weight
of evidence was high, the response selected was correct.  In those
cases for which the weight was low, a teacher could be called to
write a new evaluation, which could be added to the list of
answer-evaluation pairs.
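
The fallback to a teacher for low-weight matches can be sketched
as follows.  The 0.8 cutoff, the function names, and the use of
Python's difflib.SequenceMatcher as the similarity measure are
assumptions for illustration only; the prototype used VAE's
weight of evidence, and no threshold value is given here.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical cutoff for a "high" weight of evidence

def similarity(a, b):
    """Stand-in score in [0, 1]; the prototype used VAE instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def evaluate(answer, cases, ask_teacher, threshold=THRESHOLD):
    """Reuse a stored evaluation if the match is strong enough;
    otherwise ask a teacher and remember the new pair."""
    best_weight, best_eval = 0.0, None
    for sample, evaluation in cases:
        weight = similarity(answer, sample)
        if weight > best_weight:
            best_weight, best_eval = weight, evaluation
    if best_eval is not None and best_weight >= threshold:
        return best_eval
    new_eval = ask_teacher(answer)      # low weight: call in a teacher
    cases.append((answer, new_eval))    # grow the list of pairs
    return new_eval
```

Here ask_teacher is any callable that takes the student's answer
and returns the teacher's written evaluation.  Because each new
pair is appended to cases, the same unusual answer should not
need a teacher the second time it appears.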

For the set of questions on which the prototype was tested,
all the answers were short sentences or phrases, and the LSA
approach failed to give any useful results.  VAE, however,
worked very well in comparison.

For more information about VAE, see our paper for ICCS'03:

    http://www.jfsowa.com/pubs/analog.htm
    Analogical Reasoning