The 2008 Loebner Prize at the University of Reading was the fourth Loebner contest for Artificial Intelligence held in the UK. This competition, staging the imitation game devised by 20th-century British mathematician and code-breaker Alan Turing, was first held in the UK in 2001 at London’s Science Museum. In 2003, the University of Surrey hosted the Prize; in 2006, the 16th Loebner contest was held at UCL’s VR theatre on the Torrington campus.
The previous four Loebner Prizes (2004, 2005, 2006 and 2007) staged twenty-plus-minute, unrestricted-conversation, parallel-paired comparisons of 'hidden' artificial conversational entities (ACE) with hidden humans: Loebner’s version of Turing’s imitation game (the ‘restricted conversation’ rule had been lifted in 1995). Four ACE competed in each of 2004, 2005 and 2006; three entries were submitted to Loebner 2007. The judges from 2004 to 2007 included AI specialists, computer scientists, journalists and philosophers: Dennis Shasha, John Barnden, Kevin Warwick, Russ Abbott, John Sundman, Duncan Graham-Rowe and Ned Block (see the Loebner Prize page). Professor Kevin Warwick is the only judge to have participated twice: in the 2001 jury-service, one-to-one imitation game, and in the 2006 parallel-paired contest format. He was therefore uniquely placed to assess any improvement in ACE performance between 2001 and 2006.
Presenting at ECAP 2007, we found some delegates unaware of the Loebner Prize. As reported at that conference, a downward trend was noted in the highest score awarded by any Loebner contest judge, from the 2004 Prize (highest score awarded to an ACE: 48) to the 2006 contest (28): ACE conversational ability appeared to be worsening, not improving. The lower scores were seen as a direct result of the change in contest format from the one-to-one, five-minute imitation game of 2003, when Pirner’s bronze-winning machine achieved “4 = probably a human” from Judge 4 (see the Loebner 2003 results here). In 2006, Loebner introduced a character-by-character communications protocol between the judges’ terminal and the hidden conversational partners. No scores were recorded for last year’s Prize. An approach was made for the University of Reading’s School of Systems Engineering to host the 2008 contest.
Considering the current state of technology, and feeling that machines were not yet ready for Loebner’s twenty-plus-minute parallel-paired ACE/human comparison, Warwick and Shah proposed five-minute, unrestricted-conversation, parallel-paired Turing Tests in the Loebner 2008 finals for the very first time. We note that Turing himself wrote “after five minutes” (1950), which we take to describe a first-impression imitation game. A message-by-message communications protocol was created especially for the 2008 contest to facilitate the five-minute Turing Tests. We next decided to open up the contest: in the preliminary phase only, developers could choose to submit web-based ACE, and a broader range of judges was included, to match Turing’s “average interrogator”. Sixteen developers expressed an interest in the 18th Prize, with thirteen submitting their creations: eleven via the web and two via disk. Thus, this year’s contest saw original ACE never before entered in any contest (Loebner or Chatterbox Challenge).
The preliminary phase, during June and July, involved over a hundred male and female judges, aged between 8 and 64, experts and non-experts, native and non-native English speakers (Cuban and Polish, for example), based as far apart as Australia and Belgium, India and Germany, France and the US, as well as in the UK. Between them, they selected six ACE to compete in the finals on Sunday 12th October 2008.
The preliminary phase showed us that programmes can, in some cases, only do what their developer programmed them to do: the Lovelace Objection, raised by Turing himself in his 1950 paper. One system directed you to ask it “Which is larger? An orange or the moon”; the judge preferred to ask it another “Which is larger?” question: “A house or a mouse”. The system, not being programmed for this interrogation, failed to answer correctly. (I’m not even going to consider its non-understanding here, as we’d then have to detour into a long discussion on the meaning of understanding, because it is not fully grasped how understanding occurs in humans. Indeed, a lecture at the University of Reading on October 29th, by Professor Douglas Saddy, will present recent EEG/ERP experiments on sentence processing and some of the issues faced in brain-imaging studies of cognitive processes, which show how time and timing in the brain play a central role in understanding language.)
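The failure mode described above can be illustrated with a minimal sketch of a scripted, rule-based ACE. This is a hypothetical toy, not the code of any actual 2008 entry: the rule table, function name and fallback line are all my own illustrative assumptions. The point is only that a lookup-based system answers the one comparison its developer scripted, and falls back to a canned deflection for any unanticipated variant.

```python
# Hypothetical sketch of a scripted, rule-based ACE (not any actual contest entry).
# The developer anticipated exactly one "Which is larger?" question.
RULES = {
    "which is larger? an orange or the moon": "The moon is larger.",
}

def reply(utterance: str) -> str:
    """Look the utterance up in the scripted rules; otherwise deflect."""
    key = utterance.strip().lower()
    return RULES.get(key, "That's an interesting question!")

# The scripted question is handled...
print(reply("Which is larger? An orange or the moon"))
# ...but the judge's novel variant is not: the Lovelace Objection in miniature.
print(reply("Which is larger? A house or a mouse"))
```

A judge who sticks to the suggested question sees apparent competence; one small variation exposes that no comparison of sizes is being performed at all.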
Press releases from the University succeeded in fostering interest among locals in taking part as judges or hidden humans in the finals, along with journalists, philosophers and computer scientists. Others were invited, including Turing's biographer Dr. Andrew Hodges. Esther Addley points out in her Guardian piece here that our sample size, 12, was small. A look at previous Loebner Prizes will show that this number of Turing Tests allocated to each finalist ACE is more than in the University of Surrey’s 2003 contest (sample size: 9) and three times more than the Turing Tests for each ACE in the Loebner contests of 2004-2007 (sample size: 4 in each of those four years). However, more resources and time would have allowed a much larger sample size.
One journalist was deceived by Eugene, the runner-up ACE, considering it human in its parallel-paired comparison with a non-native English speaker (who was deemed a machine). Turing did not state that human participants in the imitation game had to be native English speakers. Blay Whitby, in “The Turing Test: AI’s biggest blind alley” (in Eds Millican & Clark, 1996), wrote, “we feel more at ease in ascribing intelligence (and sometimes even the ability to think) to those entities with which we can have an interesting conversation than with radically different entities” (p. 61).
Disagreeing with one academic who suggests that the "untrained" or the "man in the street" be excluded from judging in a Turing Test, I feel it important that everyone and anyone interested should be given the opportunity not only to take part in the discussion of building intelligent machines but to interact with them in science contests. After all, we will most probably be sharing the planet with digito-mechatron companions, so why shouldn’t we all have a say in what we desire them to be and think like? Do we want all robots to be philosophers and computer scientists? Hell no, I want mine to umpire international cricket matches, with all the incorporated technology!
Lastly, the reason for writing this page is the criticism of “zero progress” in the field of building systems to pass Turing’s imitation game. This criticism cannot be levelled at the ‘chatbot hobbyists’ and AI enthusiasts who develop ACE, or at the sponsors of Turing Test competitions, for they receive no funding from research councils and the like. Any criticism rests solely with the academia that pontificates over Turing’s writings but fails to encourage any development towards building a system to pass his imitation game. You can’t have it both ways: deem the Turing Test meaningless, yet happily participate as a judge just to show how “poor” the systems are. Do something about it: encourage new and young engineers to work with great minds from multidisciplinary fields on this fascinating problem. As Wilkes wrote in 1953: if ever a machine is made to pass (Turing’s) test, it will be hailed as one of the crowning achievements of technical progress, and rightly so.
© Huma Shah 2008 (first posted 28/10/08)
Lay report/scores here. (Detailed analysis and evaluation of results from the preliminary and final phases of Loebner 2008 is underway and will be presented at conferences and submitted for journal publication.)
Update November 2009:
See 'Hidden Interlocutor Misidentification in Practical Turing Tests' (Shah & Warwick, 2009c), response to some Turing interrogators' inaccurate evaluation, here.