Sunday, January 18, 2009

2008 Loebner Prize: myths and misconceptions

The 2008 Loebner Prize at the University of Reading was the fourth Loebner contest for Artificial Intelligence held in the UK. This competition, staging 20th century British mathematician and code-breaker Alan Turing’s imitation game, was first held in the UK in 2001, at London’s Science Museum. In 2003, the University of Surrey hosted the Prize; in 2006, the 16th Loebner contest was held at UCL’s VR theatre, Torrington campus.

The previous four Loebner Prizes (2004, 2005, 2006 & 2007) staged twenty-plus-minute, unrestricted-conversation, parallel-paired comparisons of 'hidden' artificial conversational entities (ACE) with hidden humans, Loebner’s version of Turing’s imitation game (the ‘restricted conversation rule’ had been lifted in 1995). Four ACE competed in each of 2004, 2005 and 2006; three entries were submitted to Loebner 2007. The judges from 2004 to 2007 included AI specialists, computer scientists, journalists and philosophers: Dennis Shasha, John Barnden, Kevin Warwick, Russ Abbott, John Sundman, Duncan Graham-Rowe, Ned Block (see Loebner Prize page). Professor Kevin Warwick is the only judge to have participated twice: in the 2001 jury service, one-to-one imitation game, and in 2006, in the parallel-paired contest format. Therefore, he was uniquely placed to assess any improvement in ACE performance between 2001 and 2006.

Presenting at ECAP 2007, we found some delegates unaware of the Loebner Prize. As reported at that conference, a downward trend was noted in the highest score awarded by any Loebner contest judge, from the 2004 Prize (highest score awarded to an ACE: 48) to the 2006 contest (28): ACE conversational ability appeared to be worsening, not improving. The awarding of lower scores was seen as a direct result of the change in contest format from the one-to-one, five-minute imitation game of 2003, when Pirner’s bronze-winning machine achieved “4=probably a human” from Judge4 (see Loebner 2003 results here), to the twenty-plus-minute parallel-paired comparison. In 2006, Loebner introduced a character-by-character communications protocol between the judges’ terminal and the hidden conversational partners. No scores were recorded for last year’s Prize. An approach was made for the University of Reading’s School of Systems Engineering to host the 2008 contest.

Considering the current state of technology, and feeling that machines were not yet ready for Loebner’s twenty-plus-minute parallel-paired ACE/human comparison, Warwick and Shah proposed five-minute, unrestricted-conversation, parallel-paired Turing Tests for the Loebner 2008 finals, for the very first time. We remind readers that Turing himself wrote “after five minutes” (1950), which we take to be a first-impression imitation game. A message-by-message communications protocol was created especially for the 2008 contest, to facilitate the five-minute Turing Tests. We next took the decision to open up the contest by giving developers the choice, in the preliminary phase only, of submitting web-based ACE, and by including a broader range of judges, to match Turing’s “average interrogator”. Sixteen developers expressed an interest in the 18th Prize, with thirteen submitting their creations, eleven via web and two via disk. Thus, this year’s contest saw original ACE never before entered in any contest (Loebner or Chatterbox Challenge).

The preliminary phase, during June and July, involved over a hundred male and female judges, aged between 8 and 64, experts and non-experts, native and non-native English speakers (Cuban and Polish, for example), based as far apart as Australia and Belgium, India and Germany, France and the US, and in the UK. Between them, they selected six ACE to compete in the finals on Sunday 12th October 2008.

The preliminary phase showed us that programmes can, in some cases, only do what their developers have programmed them to do: the Lovelace Objection, considered by Turing himself in his 1950 paper. One system directed you to ask it “Which is larger? An orange or the moon”; the judge preferred to ask it another “Which is larger” question: “A house or a mouse”. The system, not being programmed for this interrogation, failed to answer correctly. (I’m not even going to consider its non-understanding here, as we’d then have to detour into a long discussion on the meaning of understanding, because it is not fully grasped how understanding occurs in humans. Indeed, a lecture at the University of Reading on October 29th, by Professor Douglas Saddy, will present recent EEG/ERP experiments on sentence processing and some of the issues faced in doing brain imaging studies of cognitive processes, which show how time and timing in the brain play a central role in understanding language.)

Press releases from the University succeeded in fostering interest among locals to take part as judges or hidden-humans in the finals, along with journalists, philosophers and computer scientists. Others were invited, including Turing's biographer Dr. Andrew Hodges. Esther Addley points out in her Guardian piece here that our sample size, 12, was small. A look at previous Loebner Prizes will show that this number of Turing Tests allocated to each finalist ACE is more than in the University of Surrey’s hosted 2003 contest (sample size: 9) and three times the number of Turing Tests for each ACE in Loebner contests 2004-2007 (sample size: 4 in each of those four years). However, more resources and time would have allowed a much larger sample size.

One journalist was deceived by Eugene: the runner-up ACE was considered human in its parallel-paired comparison with a non-native English speaker (who was deemed a machine). Turing did not state that human participants in the imitation game had to be native English speakers. Blay Whitby in his “The Turing Test: AI’s biggest blind alley” (in Millican & Clarke, Eds., 1996) wrote, “we feel more at ease in ascribing intelligence (and sometimes even the ability to think) to those entities with which we can have an interesting conversation than with radically different entities” (p.61).

Disagreeing with one academic, who suggests by analogy that the "untrained" or the "man in the street" be excluded from judging in a Turing Test, I feel it important that everyone and anyone interested should be given the opportunity not only to participate in the discussion of building intelligent machines but also to interact with them in science contests. After all, we most probably will be sharing the planet with digito-mechatron companions, so why shouldn’t we all have a say in what we desire them to be/think like? Do we want all robots to be philosophers and computer scientists? Hell no, I want mine to umpire, with all incorporated technology, in international cricket matches!

Lastly, the reason for writing this page is the criticism of “zero progress” in the field of building systems to pass Turing’s imitation game. The blame for this cannot be laid on the ‘chatbot hobbyists’ and AI enthusiasts who develop ACEs, or on sponsors of Turing Test competitions, for they get no funding from research councils, etc. Any criticism rests solely with academia, which pontificates over Turing’s writings but fails to encourage any development towards building a system to pass his imitation game. You can’t have it both ways: deeming the Turing Test meaningless but happily agreeing to participate as a judge just to show how “poor” systems are. Do something about it: encourage new and young engineers to work with great minds from multidisciplinary fields on this fascinating problem. As Wilkes wrote in 1953: If ever a machine is made to pass (Turing’s) Test it will be hailed as one of the crowning achievements of technical progress and rightly so.

© Huma Shah 2008 (first posted 28/10/08)


Lay report/scores here. (Detailed analysis and evaluation of results from the preliminary and final phases of Loebner 2008 is underway and will be presented at conferences and submitted for journal publication.)

Update November 2009:

See 'Hidden Interlocutor Misidentification in Practical Turing Tests' (Shah & Warwick, 2009c), response to some Turing interrogators' inaccurate evaluation, here.

8 comments:

Scott Jensen said...

Huma,

Nice piece.

Now for some suggestions from a marketer. :-)

1) Require all chatbots to be web-based and have the Loebner test be done at different locations around the world. Each testing arena has its own judges interact with both the chatbots and human conversationalists for set segments of time. Once one segment is done, the next location gets its chance to judge. As for where these testing arenas should be, I would suggest at minimum they be in London, New York City, Rio, Chicago, Los Angeles, Tokyo, Sydney, Shanghai, New Delhi, Cairo, Moscow, and Paris. By giving each location five minutes to judge with a ten-minute separation between the judging segments, it would only take three hours to do all the previously mentioned cities. And the reason to spread the testing arenas around the world is that by doing so you will get more media coverage of the contest. That, and it would make it appear to be more of a truly global competition.

2) Get universities around the world to participate in the contest. Send letters to all the chairs of computer science departments at major universities in major cities and ask them if they would like to participate in the contest. State what they would need to provide to be one of the testing arenas, such as X number of computers, X number of judges, X number of whatever. The Loebner would then have a testing arena on their campus, and that is how it gets locations around the world to be part of the contest.

3) Let the cable news networks and talk shows be one of the testing arenas. This way you can get the Loebner contest on the air live on these news networks and talk shows. The hosts and hostesses of these programs can interact with the chatbots and human conversationalists and let their audience watch/listen to them doing so. This will require flexibility on the Loebner side but it will be well worth the publicity.

4) Get celebrities to be the human conversationalists that the judges have to separate from the chatbots. If you can get Brad Pitt, Brett Favre, Michelle Kwan, Dolly Parton, and other celebrities to be the humans, the press will stampede the Loebner contest.

5) Loebner needs to reduce its failure image. Loebner is like SETI. All it keeps telling the public is: "Not yet." Unlike SETI, Loebner can counter that image by rewarding baby steps toward its ultimate goal. This way when a developer achieves one of these baby steps, Loebner can herald it as a major accomplishment. A success story. This part of the Loebner, though, needs to be separately judged by AI experts to have real meaning. Contestants would tell the Loebner organization that they feel their chatbot can pass one or more of the mini-challenges, and the Loebner organization would then slate it for evaluation on the mini-challenge(s) by AI experts. HOWEVER, if they win the mini-challenge, the contestant must release the program code so their achievement advances the field. Also, as the chatbots improve, Loebner must be flexible to allow additional mini-challenges to be made so there are always a number of mini-challenges for developers to work towards in addition to the ultimate goal. Here are some baby-step challenges that Loebner should offer prizes for.

--A) Correspondent File. This is the ability to be given information by the judge, asked about that information, giving the right answer, and remembering their answer for future questions about that information. This requires the chatbot to develop a file for each person it talks with. If this prize was in existence this year, Eugene might have won it. Being one of the preliminary judges, I asked all the chatbots (all eleven that were web-based) the following: "My car is red. What color is my car?" Three chatbots correctly answered the question, but only one (Eugene) remembered its answer when I later only asked "What color is my car?" What this task requires is data mining by the chatbot of the information given to it by the correspondent and putting that information into a file for later reference.

--B) Lie Detector. This is a more sophisticated version of the Correspondent File. This is the chatbot being told one fact, then later told a contradictory fact, and the chatbot being able to point out the contradiction. For example, if I initially said, "My car is red." and later said, "My car is blue.", the chatbot should reply, "I thought you said your car was red." This requires not just data mining of conversations but internal cross-checking of information as it takes it in.

--C) Know-It-All. A still more advanced version than the Lie Detector would be one where the chatbot has a store of facts and can correct a correspondent when they say something that is incorrect. For example, if I say, "Mars has earthquakes.", the Know-It-All chatbot would reply, "Sorry, but it doesn't. Mars's core is not molten but a cooled solid." Now this could be done with brute force (inputting all possible wrong facts and their correction replies), but it could also be done by having an encyclopedia on file that the chatbot cross-checks as it corresponds with a judge. The second, more sophisticated method would be what I, if I was a judge of this mini-challenge, would make the goal of the mini-challenge, and I would interrogate chatbots to see which actually achieves this goal.

--D) Time Delay. One of the things that tips people off that they're talking to a computer is instantaneous responses. You ask a chatbot a question and ... BAM! ... it gives you an answer. Loebner should have a prize for the first chatbot that can calculate how long a human would take to type their answer and then release its answer after that time period. A smart developer would randomize this time a bit so it would be even harder for judges to pick up on this release timing.

--E) Chatbot Detector. This is a spin on the Turing test. Have a prize for the chatbot that can itself determine if it is talking to a human or a chatbot. This can be a separate prize and done each year. And this should be two prizes. One for the chatbot that can most accurately detect which is a chatbot and which is a human, and the other prize for the chatbot that can fool the most chatbots into thinking it is human.

6) Loebner should hire a full-time publicist and give that publicist a decent PR budget. Yes, yes, I can already hear you say, "We don't have the money!" My reply would be, "You don't now, but you should try to get it." Go and hit up Loebner for the money. Hit up software companies for the money. Get corporate sponsors like Microsoft, IBM, Dell, Google, and so forth. Bringing a full-time publicist on board will get Loebner media coverage, which will get it more sponsors, which will give it more money for PR, which will get it more media coverage, and so an ever-growing feedback loop comes into existence. As for who to hire, I know of one marketer that would be interested in the job. ;-)
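
To make mini-challenges A (Correspondent File), B (Lie Detector) and D (Time Delay) from item 5 a little more concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions, not how Eugene or any other entrant works: the class `CorrespondentFile`, the function `typing_delay`, the two sentence patterns and the default typing speed of five characters per second are all made up for this example, and it only understands sentences of the form "My X is Y." and "What ... is my X?".

```python
import random
import re


class CorrespondentFile:
    """Toy per-judge memory for 'My X is Y.' facts (hypothetical design)."""

    # Assumed sentence shapes; a real chatbot would need far broader parsing.
    STATEMENT = re.compile(r"^my (\w+) is (\w+)[.!]?$", re.IGNORECASE)
    QUESTION = re.compile(r"^what \w+ is my (\w+)\?$", re.IGNORECASE)

    def __init__(self):
        self.facts = {}  # e.g. {"car": "red"}

    def reply(self, message):
        replies = []
        # Split after '.', '?' or '!' so one message can both state a fact
        # and ask about it.
        for sentence in re.split(r"(?<=[.?!])\s+", message.strip()):
            stated = self.STATEMENT.match(sentence)
            asked = self.QUESTION.match(sentence)
            if stated:
                thing, value = (g.lower() for g in stated.groups())
                known = self.facts.get(thing)
                if known is not None and known != value:
                    # Lie Detector: point out the contradiction, keep the old fact.
                    replies.append(f"I thought you said your {thing} was {known}.")
                else:
                    # Correspondent File: remember the fact for later questions.
                    self.facts[thing] = value
                    replies.append("Noted.")
            elif asked:
                thing = asked.group(1).lower()
                if thing in self.facts:
                    replies.append(f"Your {thing} is {self.facts[thing]}.")
                else:
                    replies.append(f"You haven't told me about your {thing}.")
        return " ".join(replies) or "I didn't understand."


def typing_delay(text, chars_per_second=5.0, jitter=0.2, rng=random):
    """Time Delay: seconds to wait before sending text, randomized a bit."""
    base = len(text) / chars_per_second
    return base * rng.uniform(1.0 - jitter, 1.0 + jitter)
```

With this sketch, `bot = CorrespondentFile()` followed by `bot.reply("My car is red. What color is my car?")` gives "Noted. Your car is red.", a later `bot.reply("What color is my car?")` gives "Your car is red.", and `bot.reply("My car is blue.")` gives "I thought you said your car was red." A caller would then wait `typing_delay(answer)` seconds before showing the answer to the judge.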

Huma Shah said...

Hello Scott :-)

Your ideas are great and grand.

By "We" in "We don't have the money!" you presume there is a "We" - the Loebner Prize is one person: Hugh Loebner, with contest prize money paid for from his company Crown Industries.

What makes you believe that Google, IBM, Microsoft, etc., haven't been approached for funds already? They have, and as I mentioned, so has Richard Branson of Virgin. All declined.

The 2009 Loebner contest is expected to return to the 2006 & 2007 Prize format, i.e. Hugh himself will select from the disk-submitted entries the four that will compete in the finals of next year's competition.

The scale of Turing Tests that you propose requires interested parties to devote not only their time but give their money. If only DARPA, as in their $1m driverless vehicle challenge, would consider the imitation game as worthy of their research funds. In credit crunch times, this may be very difficult. I do intend to return to this problem, but not for a year or so.

Huma

Scott Jensen said...

Without knowing what pitch was made, I'm firing into the dark. And just because they turned you down once, doesn't mean you cannot approach them again. You just have to change your pitch. If you want to send me the pitches you used, I might be able to suggest improvements.

As for the pitch, it might have been too modest of a proposal. To get corporate sponsorships, they're looking for what good massive publicity they can get in the process. The reality is that the more grand your project, the more likely it will get corporate sponsors. Think BIG to attract big corporate sponsors.

And there's stuff you can do to help you land those big corporate sponsors. For example, getting celebrities to agree to be the human conversationalists will be a major winning point in your pitch to corporations. "We have Clint Eastwood, Madonna, and Mariah Carey willing to be human conversationalists!" These you can also use to get universities to line up which you can then use to get more celebrities. People like jumping onto popular bandwagons. Corporations are actually the last ones to jump on board because they only want to jump on board the most popular ones.

And it is too bad to hear about Hugh Loebner selecting the final four next year. If any single person should select the finalists, it should be a recognized and respected AI expert. Did something bad happen this year with preliminary judges?

As for the scale of what I propose, you need to view it in baby steps. Contact celebrities and see who is willing to tentatively agree to be a judge. If one agrees, see if s/he can help you get other celebrities. The celebrity circle is a small community where everyone knows everyone. You get one to help you and you can very likely get them to get other celebrities to volunteer. Getting the first one is the hardest. If there is a celebrity who is an alumnus of your university, that might be the way to get your foot in their door and then get them to help you get other celebrities.

Another idea is to research what movies Hollywood will be putting out next year and find one that has an AI in it. You then approach them to see if they would like to promote Loebner in a way that would also promote their movie. If you are able to land that, try to get at least a two-year deal (movie release and DVD release) and possibly even a major contribution to the prize money pool. :-)

Huma Shah said...

Re

" ..it is too bad to hear about Hugh Loebner selecting the final four next year. If any single person should select the finalists, it should be a recognized and respected AI expert. Did something bad happen this year with preliminary judges?"

You're looking at this aspect the other way around: the 2008 Prize is unique. Hugh has selected the final four since 2004 (he is an expert and, as Sponsor, has the right to choose).

Also, you don't get it, I don't work for the Loebner Prize, I am not an employee of Crown Industries. Future Loebner Prize formats are not my concern. Perhaps you want to contact the Sponsor direct with your ideas - I did this, requesting that 2006 and 2008 be hosted in the UK, Hugh's an approachable chap.

Scott Jensen said...

Huma,

Check out page 96 of the current issue (Jan 09) of Popular Science. Based on what they think happened this year, they're betting on whether the Turing Test will be beaten next year. I would love to put real money on that bet as I'd bet everything that it won't happen.

Huma Shah said...

Thanks Scott.

Just seen the piece here:

http://ppx.popsci.com/security/view.php?symbol=TURING

I wouldn't put my money on it either. The Rules for the 2009 Loebner Prize are not the same as in 2008. According to Hugh's 2009 Loebner page, the judges won't simultaneously chat to both hidden entities, they'll chat to a pair one at a time for five minutes.

However, how the judges react and what hidden-humans are involved is not known. For instance, this year, a TIMES journalist acting as judge mistook both Elbot (Loebner 2008 bronze winner) and Eugene (runner-up) for human when these two ACE were simultaneously compared against a female non-native English speaker. So we can say that, for this judge, the artificial conversations were 'more human' than that particular female's.

I'm in the middle of writing two journal papers and a conference paper presenting different aspects of the 18th Loebner contest; fascinating results require detailed analysis and evaluation. I'd probably bet that in four or five years' time, if exactly the same Rules are used as in Loebner 2008, an ACE might achieve the 30% wrong identification rate.

Anonymous said...

Fascinating article, Huma. I wonder whether the decreasing points earned by the chatbots reflect more our increasing expectations of human-computer interaction and not worsening performance by the bots. By which I mean that developments in chatbot technology have not kept pace with our increasing familiarity with machines. I went to the Loebner Prize - this year held at Turing's spiritual home of Bletchley Park. My experiences and impressions of the chatbots can be found here: http://wp.me/p2cNtO-E. One thing I found particularly interesting was that the winner eschewed the usual tactic of building a fake character for his robot, conversing instead on its own terms – an approach that the judges found had more warmth and, ironically, humanity.

Huma Shah said...

@dangreenbooks

Thanks for your comment.

Chip Vivant was a 'skeleton system' back in Loebner 2008.

It wasn't good enough to make it to the finals then and has not improved much since.

These comments from Judge 4 in conversation with Chip Vivant at Loebner 2012 convey lack of interest, disappointment and frustration from the interrogator:

"You're a bum! (the human told me to say that"
"Surely you have some idea? You just asked me, after all"

and

"Stick it up your arse"

- probably as a result of Chip Vivant's unhumanlike response:
"I can't deal with that syntactic variant yet."

- its 'variety of responses' admitting it did not comprehend the input:
"I didn't understand."
"I hadn't thought of it that way. Which what?"
"I wouldn't know."
"I didn't understand what you just said."
"Sorry - I didn't understand."
"I didn't understand that."
"I can't answer that yet."
"I can't reveal that."
"I haven't learned how to respond to that yet."
"I couldn't tell you."
"I can't reveal that."

- and repeating utterances:
"It's very hot and dry. I like it that way, though. There's also this incessant whirring noise."

"Please rephrase this as a proper question. (Instead of ...., use ....")
"The only thing I like is chatting with people."


[from here: http://people.exeter.ac.uk/km314/loebner/index.php]

Huma