Monday, February 22, 2016

Measuring (machine) intelligence by MCQ?

I'm idly wondering: would Ms Google pass most multiple-choice question (MCQ) tests constructed by academics? If so, should we believe that she is intelligent? Cade Metz, in an article in Wired, gives a partial answer:
"... Clinicians were helping IBM train Watson for use in medical research. But as metaphors go, it wasn't a very good one. Three years later, our artificially intelligent machines can't even pass an eighth-grade science test, much less go to medical school.  So says Oren Etzioni, a professor of computer science at the University of Washington and the executive director of the Allen Institute for Artificial Intelligence, the AI think-tank funded by Microsoft co-founder Paul Allen. Etzioni and the non-for-profit Allen Institute recently ran a contest, inviting nearly 800 teams of researchers to build AI systems that could take an eighth grade science test, and today, the Institute released the results: The top performers successfully answered about 60 percent of the questions. In other words, they flunked..."
   Apparently, somewhere in the world, folks have formed the belief that 60% of the possible marks is a fail, no matter how the testing instrument is constructed. If this test of machine intelligence were run here, however, we'd be required - by University policy - to award a C+ pass!
   Metz quotes Doug Lenat: "... If you're talking about passing multiple choice science tests, I always felt that was not actually the test AI should be aiming to pass," he says. "The focus on natural language understanding - science tests, and so on - is something that should follow from a program being actually intelligent. Otherwise, you end up hitting the target but producing the veneer of understanding." What a pleasant surprise: I agree with Doug about something!
   It's an intriguing question to ask of any University: is it certifying only "the veneer of understanding" in its graduates, or do they have some "deep understanding"? More importantly, how might we reliably measure the depth of understanding in a MOOC, or in any semi-automated teaching environment that employs only MCQs, keyword matches, and machine-intelligent testing procedures?
[This post was adapted from an email by my colleague Clark Thomborson.]