Huma Shah and Professor Kevin Warwick describe the day and discuss the results of the recent Loebner Prize contest, which was hosted by the University's School of Systems Engineering in October 2008.
Sponsored by scientist and philanthropist Dr. Hugh G. Loebner, this 'can a machine think?' competition is based on the controversial notion of measurable intelligence, originally posed by the 20th-century mathematician Alan Turing. Turing proposed that intelligence be measured by comparison and deception through text-based conversation. Under the official criteria of the Turing Test, a machine that fools more than 30% of the human interrogators is deemed to have passed the test.
The results from the contest are currently being evaluated (invited paper submitted to Kybernetes journal; paper accepted for presentation at ECAP 2009). Both of us have experience of judging in Turing tests: Kevin twice previously in a Loebner Prize, in 2001 at the London Science Museum and in 2006 at UCL; Huma judged in Chatterbox Challenge 2005 and co-organised the 2006 Loebner Prize. This helped us in the design of the unique, two-pronged format for the 18th consecutive Loebner competition. For the first time, web-based contestants were permitted into the preliminary phase.
The Loebner Prize is a conduit for mostly part-time developers, scientists and computer programmers who enjoy the experience of Turing's thought experiment. The 2008 rules encouraged thirteen Artificial Conversational Entities (ACE) to enter the initial phase of the competition (staged in June and July 2008). Previous Loebner winners and contestants Alice, Jabberwacky, Ultra Hal, Elbot and Eugene, along with newcomers Amanda, Bootie, Brother Jerome, Chip Vivant, LQ, Orion, Trane and Zeta, competed for a place in the finals. Another first for the contest was allowing volunteers, including staff and students from the University, to test the entries during the preliminary phase. Over one hundred volunteers participated as preliminary phase judges: male and female, expert and non-expert, children and adults, native and non-native English speakers, based as far apart as the US and India, Australia and Belgium, France, Canada, Germany, Sweden and the UK. All tests were conducted in English. By chatting to the systems one-to-one (known as the jury-service Turing test), the volunteers assessed each ACE for conversational ability and emotion content. The judges' combined verdict led to six ACE being invited to vie for the 'most human' award through parallel-paired Turing tests staged in the finals on October 12th.
Five systems (Brother Jerome, Elbot, Eugene, Jabberwacky and Ultra Hal), a fresh intake of 24 judges and 21 'hidden humans' completed the 96 Turing tests. This was facilitated by a specially commissioned message-by-message technical communications protocol (MATT), designed by Merrill Lynch computer programmer Marc Allen. Imagine two MSN chat windows allowing you to speak to two different friends at the same time: that is the sort of set-up the MATT protocol enables over a local network.
Press releases sent to local media encouraged volunteers to apply to be interrogators/judges, and also to be hidden-humans. Judges included communication experts, journalists, computer scientists, philosophers and post-graduate students; hidden-humans were mainly students.
The Turing tests stretched over two sessions, morning and afternoon, with 12 rounds in each session, and four tests in each round. Examined over five minutes of unrestricted conversation, each ACE finalist was compared against a hidden-human by a judge. The judges occupied one room, each sitting in front of a terminal marked A, B, C, or D. Each terminal had a split-screen computer monitor, with text in boxes on the left and the right. Behind the partition in another room, hidden-humans or ACE controlled what the judges were able to see on their screens.
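The schedule arithmetic can be checked with a short sketch. It is illustrative only: the session, round and terminal figures come from the description above, while the assumption that each of the four tests in a round ran at one of the terminals A to D, and all names in the code, are ours.

```python
# Illustrative sketch of the finals schedule described above.
# Assumption (not stated explicitly in the article): the four tests in a
# round correspond to the four judge terminals A, B, C and D.
SESSIONS = ("morning", "afternoon")
ROUNDS_PER_SESSION = 12
TERMINALS = ("A", "B", "C", "D")
MINUTES_PER_TEST = 5


def build_schedule():
    """Return one (session, round, terminal) slot per Turing test."""
    return [
        (session, round_number, terminal)
        for session in SESSIONS
        for round_number in range(1, ROUNDS_PER_SESSION + 1)
        for terminal in TERMINALS
    ]


schedule = build_schedule()
total_tests = len(schedule)

print(f"Total Turing tests: {total_tests}")
print(f"Conversation time per terminal: "
      f"{len(SESSIONS) * ROUNDS_PER_SESSION * MINUTES_PER_TEST} minutes")
```

Two sessions of 12 rounds with four tests per round gives the 96 tests reported for the finals.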
The judges had to use their own judgment on what constitutes 'human conversation' in order to expose the machine and identify the hidden human during a text-based conversation. After the conversation, judges were asked what or who they had been talking to on the left and right-hand sides of the screen. It should be noted that the hidden humans did not chat with the ACE during the Turing tests.
Months of planning and hard work behind the scenes did not detract from an exciting and fun science contest, which was attended by a diverse range of academics and members of the public.
Participants, judges and hidden-humans all seemed to enjoy their experience:
Daisy Johnson thought the experience as a judge "interesting from a language point of view". She has since noted from the transcripts that the ACE did not make spelling mistakes, in contrast to the hidden-humans; she feels this is a "fairly key signifier of human-ness", a feature that ACE developers could consider in the future.
Dr. Lucy Chappell correctly identified the gender of the hidden-humans, describing male hidden-humans as "less conversational (short answers, to the point)" and female hidden-humans as "quite happy to tell me a lot about themselves and their life".
Defeng, one of a few non-native English speaking participants in the finals, said he focused on two points when simultaneously testing each pair of hidden entities. Defeng concluded that 'speed of response' and 'length of utterance' determined how he rated a hidden interlocutor. If an entity replied very quickly with a long sentence, or asked a question in response to a question, it was more likely to be a machine; whereas if a discussion developed with a hidden entity answering his questions, then they were likely to be human.
Liz Dowthwaite "found the whole experience really enjoyable and fascinating".
Sixty of the 96 parallel-paired Turing tests featured the 5 ACE compared against humans; the other tests involved control groups (human-human, machine-machine). Though no machine passed the Turing test during the competition (none of them succeeded in convincing 30% of the judges), we can report that some judges were deceived. Deceptions (when the judges did not correctly identify the entity with which they had had the conversation) occurring within the competition can be split into three groups: the Eliza effect (where the ACE were considered human), the confederate effect (where the interrogators mistook hidden-humans for machines) and the gender-blurring effect (where some interrogators mistook male hidden-humans for female and vice versa). Several of those Loebner finals conversations will be examined by post-contest judges, to find out whether they agree with the contest judges.
| ACE | Deception rate in human comparison Turing tests | Average score for conversational ability (out of 100) |
|---|---|---|
| Elbot (winner) | 3 judges deceived in 12 human comparison Turing tests | 52.25 |
| Eugene | 1 judge deceived in 12 human comparison Turing tests | 49.92 |
| Ultra Hal | 1 judge deceived in 12 human comparison Turing tests | 32.42 |
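As a rough check of these figures against the 30% pass mark, the deception counts can be converted to rates. This is a minimal sketch, assuming each human comparison test counts as one judge verdict; the function and variable names are ours, and only the three systems listed in the table are included.

```python
# Illustrative sketch: deception rates for the systems listed in the table,
# compared with the 30% pass mark quoted in the article.
# Assumption: each human comparison test counts as one judge verdict.
PASS_THRESHOLD = 0.30

# (judges deceived, human comparison tests), taken from the table above.
results = {
    "Elbot": (3, 12),
    "Eugene": (1, 12),
    "Ultra Hal": (1, 12),
}


def deception_rate(deceived, tests):
    """Fraction of human comparison tests in which the judge was deceived."""
    return deceived / tests


def passes(rate, threshold=PASS_THRESHOLD):
    """A machine passes only if it fools more than the threshold share."""
    return rate > threshold


rates = {name: deception_rate(*counts) for name, counts in results.items()}

for name, rate in rates.items():
    verdict = "pass" if passes(rate) else "no pass"
    print(f"{name}: {rate:.1%} deceived ({verdict})")
```

On these counts Elbot deceived a quarter of its judges in the human comparison tests, which falls short of the 30% mark.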
Two parallel conversations from the Loebner 2008 finals are available here (pdf-24kb). Can you tell what or who each judge is speaking with on the left and the right? Are the entities machine or human? If human, are they female or male, child or adult, native or non-native English speakers? What gender are the judges? Send your answers to email@example.com
We aim to repeat the 'two Turing tests' Loebner 2008 paradigm in an ambitious project in 2012, the centenary of Turing's birth, at Bletchley Park, where Turing broke codes during World War II. We hope that interdisciplinary teams of academics will encourage their students and fresh designers to get involved in this exciting engineering project and so support progress in the field. We cannot guarantee that they will win, but we can promise that they will enjoy the ride.
Thanks to the event-day team: Dr. Ian Bland, Chris Chapman, Mark Allen and Tristan Matthews; behind the scenes: Alex Brannen, Dr. Lucy Chappell, Chris Rayner, Carole Leppard, Lucy Virtue, Nellie Round and Sandra Illet; and the University's security personnel.