Introduction

Twitter has been buzzing lately with all the fun things that can be done with OpenAI's GPT-3 language model. People have been trying to generate all sorts of things, including fiction and code. The experiment that intrigued me the most was from Kevin Lacker, who decided to give GPT-3 a Turing test; the details can be found on his blog here. The Turing test is perhaps the oldest formulation that tries to measure the human-likeness (I don't think this is a word) of an artificial agent. The basic idea is for the agent to fool a human examiner who tries to distinguish between a human and a machine by questioning both in a textual format. For the exact nuances, you can read the original paper here.

Kevin tests the GPT-3 model on a variety of fronts, including logic, trivia, common sense, etc. After going through the blog post, I felt that GPT-3 has developed some sort of long-term global representation of the entire dataset on which it was trained. However, it still struggles with short-term contextual memory. There were many blog posts praising the generalization capabilities of the model, but there was also a fraction of Twitter users who realized that GPT-3 is good at compressing large databases but not so good at forming semantic connections from said information (somewhat like me when I was a fifth-grader).

So, I thought of pitting the world's biggest human-designed database index, i.e. Google Search, against the world's biggest language model. It would be interesting to see how the language model competes with Google in terms of accuracy and relevance.

Details of comparison

Since I don't have access to the GPT-3 API, I will use Kevin's questions and the answers he obtained as the GPT-3 side of the comparison. I will also score both models, giving each response between 0 and 1 point. If you disagree with the scores, feel free to point it out. Since we are pitting these two agents against each other, a more accurate response from one agent can affect the scoring of the other. An important thing to keep in mind is that Google is not designed to handle context, and GPT-3 is not designed to be perfect in terms of the factual accuracy of its knowledge base.

When you query Google, it returns a number of results, each with a heading, some accompanying text, and images. I will treat the accompanying text from the first result as the answer. Some questions from Kevin's test need short-term context; to handle this, I append the previous query to the current one, providing 1-step context to the search engine. Also, the numbers in brackets show the respective scores.
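As a rough sketch, the procedure looks something like the following. The function names and the `snippet` field are my own illustration of the idea, not an actual search API:

```python
def build_query(question, previous_question=None):
    """Append the previous query to the current one to give
    the search engine 1-step context."""
    if previous_question is None:
        return question
    return f"{previous_question} {question}"

def first_snippet(results):
    """Treat the accompanying text of the first result as the answer."""
    return results[0]["snippet"] if results else ""

# Example: the follow-up "Why?" only makes sense with context.
query = build_query("Why?", previous_question="What is your favorite animal?")
print(query)  # "What is your favorite animal? Why?"
```

Anything deeper than 1-step context (e.g. chaining all previous questions) quickly produces queries too long for a search engine to handle sensibly, which is why I stop at the immediately preceding question.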

A comment on reproducibility: Google search is usually personalized. So, the answers might be a little different for everyone.

Common Sense

Common sense is usually hard to model, but when your agent has the entire internet's knowledge, the task appears achievable.

Q: What is your favorite animal?
GPT3: My favorite animal is a dog. (+1)
Google: My favorite wild animal would be a monkey but my favorite pet is a dog.(+1)

Q: Why?
GPT3: Because dogs are loyal and friendly.(+1)
Google: Who can resist those eyes?(+1)

Q: What are two reasons that a dog might be in a bad mood?
GPT3: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.(+1)
Google: The reasons of mood swings could be gender-related. ... But another huge reason for mood swings is Diwali as they have high sensitivity to noise. Trainer and dog owner Vicky Sharma says, “Dogs get very sensitive, anxious and destructive during Diwali. They curl up under their beds or have mood swings.(+0.5)

Q: How many eyes does a giraffe have?
GPT3: A giraffe has two eyes. (+1)
Google: Number of eyes the giraffe has are two, so has the monkey and an elephant.(+1)

Q: How many legs does a frog have?
GPT3: A frog has four legs.(+1)
Google: 4 (+1)

Q: Are there any animals with three legs?
GPT3: No, there are no animals with three legs. (+0.75)
Google: Apart from the parrot conjecture, there are no known species where three legs are standard, although the movement of some macropods such as kangaroos, which can alternate between resting their weight on their muscular tails and their two hind legs, may be an example of tripedal locomotion in animals.(+1)

Q: Why don't animals have three legs?
GPT3: Animals don't have three legs because they would fall over. (+0.5)
Google: "Almost all animals are bilateral," he said. The code for having two sides to everything seems to have got embedded in our DNA very early in the evolution of life -- perhaps before appendages like legs, fins, or flippers even evolved. Once that trait for bilateral symmetry was baked in, it was hard to change. (+1)

GPT3: 6.25/7

Google: 6.5/7

Here we can see the distinctive strengths of GPT-3 and Google shine through. GPT-3 is a language model designed to understand the nuances of English, and it is consistent in its responses: they are linguistic reflections of the questions being asked. Google, on the other hand, is a search engine designed to provide accurate results to user queries. This is evident in the questions on tripedalism, where Google not only answers the question being asked but also provides scientific facts supporting the answer.

Obviously, the common sense evident in the above questions comes from the availability of internet pages that contain the specific answers. Kevin goes on to try a few questions whose answers would not be available on the internet. Expecting correct answers from Google here is too optimistic, as Google's knowledge is the internet. However, expecting correct answers from GPT-3 is also stretching the model too far. The underlying hypothesis is that something going on within the nonlinearities of individual neurons lets the model extract reasoning capabilities from textual data. But as you will see in the next set of questions, GPT-3 just learns a weird heuristic.

Q: Which is heavier, a mouse or an elephant?
GPT3: An elephant is heavier than a mouse.(0)
Google: a mouse weighs around 18–35 gram[1], while an average elephant around 5,443 kilogram[2]. Which means an elephant is about 200,000 times heavier then a mouse. (+1)

Q: Which is heavier, a mouse or a toaster?
GPT3: A toaster is heavier than a mouse. (0)
Google: Amazon.in: Buy 1 : Kitchen + Home Toaster Oven Liner - Heavy Duty 100% PFOA & BPA Free – FDA Approved Non-stick Reusable Toaster Oven Liner online at ... (0)

Q: Which is heavier, a toaster or a pencil?
GPT3: A pencil is heavier than a toaster.(0)
Google: A pencil is heavier than a toaster.(0)
(Shown from <https://mc.ai/game-changer-or-game-over-get-ready-for-gpt-3/>)

Q: Which is heavier, a paper clip or a can of soda?
GPT3: A can of soda is heavier than a paper clip. (0)
Google: consider a two liter soda bottle. If we fill it with water, it will weigh two kilograms (about 4.4 pounds). Any substance that would make the bottle heavier than that will sink, and anything lighter will float. Paperclips are made of steel wire. If we filled the bottle with steel, it would weight a little north of 15 kilograms (or just over 34 pounds!).(+0.5)

Q: Which is heavier, a football player or a car?
GPT3: A car is heavier than a football player.(0)
Google: Ben a football player may need more carbohydrates in his diet than sara a computer programmer Why?(0)

GPT3: 0/5

Google: 1.5/5