Trouble with the TOEFL?

by F.

As described in this paper, a Latent Semantic Analysis algorithm does pretty well at reading. Developing Intelligence has a nice summary of the paper and gives the bottom line:

1. After training, LSA performed at 64.4% correct on a multiple choice test of synonymity taken from TOEFL. (For comparison, humans score around 64.5% on average on this test, which is frequently used as a college entrance examination of English proficiency in non-native speakers. By this metric, LSA would be admitted to many major universities!)

2. Calculations of the rate of word learning by 7th graders suggest that they acquire 0.15 words per 70-word text sample; analogous calculations of LSA’s rate of acquisition show that LSA acquires 0.1500 words per text sample read

3. The comprehension by college students of several versions of a text sample about heart function is precisely replicated by LSA, when comprehension is measured as the degree of semantic overlap between subsequent sentences;

4. Humans initially show facilitated processing of all meanings of a previously-presented word, but after 300 ms show priming only of context-appropriate meanings; LSA shows similar effects insofar as similarity is higher between a homograph and two words related to different meanings of the homograph than between a homograph and unrelated words, and in that LSA considers words related to the context-appropriate definition of a homograph as more related than words related to the context-inappropriate definition of the homograph;

5. Human reaction times in judgments of numerical magnitude suggest that the single digit numerals are represented along a “logarithmic mental number line”; LSA was able to replicate this effect in its ratings of similarity among the single digit numerals, which also conform to a logarithmic function
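To make the first point a bit more concrete, here is a minimal sketch of how a vector-based model could answer a multiple choice synonym question: pick the choice whose vector is closest to the test word. It assumes similarity is measured as the cosine between word vectors. The function names and the tiny 3-dimensional vectors are invented for illustration; they are not taken from the paper or from any trained model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_synonym(stem, choices, vectors):
    """Return the choice whose vector is most similar to the stem word's."""
    return max(choices, key=lambda word: cosine(vectors[stem], vectors[word]))

# Hypothetical word vectors, made up for this example only.
toy_vectors = {
    "enormous": [0.9, 0.1, 0.2],
    "huge":     [0.8, 0.2, 0.1],
    "tiny":     [-0.7, 0.3, 0.1],
    "rapid":    [0.1, 0.9, 0.0],
    "green":    [0.0, 0.1, 0.9],
}

answer = pick_synonym("enormous", ["tiny", "rapid", "huge", "green"], toy_vectors)
print(answer)
```

With these made-up vectors, "huge" wins because it points in nearly the same direction as "enormous". The percentage score on a real test would just be the fraction of questions answered this way that match the answer key.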

Here’s the abstract from the paper:

How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, Latent Semantic Analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.

By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to school-children.

LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
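The "extracting the right number of dimensions" step can be sketched with a truncated singular value decomposition of a word-by-context count matrix. The sketch below is a toy under stated assumptions: the five words, the counts, the helper names, and the choice of k=2 (instead of something like 300) are all hypothetical, and raw counts are used without any reweighting to keep it short.

```python
import numpy as np

# Hypothetical counts: rows are words, columns are text samples.
words = ["doctor", "nurse", "physician", "car", "engine"]
counts = np.array([
    [1, 0, 1, 0, 0],   # doctor
    [1, 1, 1, 0, 0],   # nurse
    [0, 1, 0, 0, 0],   # physician (never appears in a sample with "doctor")
    [0, 0, 0, 2, 1],   # car
    [0, 0, 0, 1, 1],   # engine
], dtype=float)

def lsa_vectors(matrix, k):
    """Keep the k largest singular dimensions; return one k-dim vector per row."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = lsa_vectors(counts, k=2)
doctor, physician, car = vecs[0], vecs[2], vecs[3]

# Raw rows share no text sample, so their cosine is zero.
print("raw doctor~physician:", cosine(counts[0], counts[2]))
# In the reduced space the two words can still end up close together.
print("LSA doctor~physician:", cosine(doctor, physician))
print("LSA doctor~car:      ", cosine(doctor, car))
```

In this toy, "doctor" and "physician" never occur in the same sample, but both occur with "nurse". After the reduction to two dimensions they land on the same axis, which is a small-scale picture of inducing similarity indirectly from local co-occurrence data.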