OpenAI shipped GPT-4, its highly anticipated text-generating AI model, yesterday, and it’s a curious piece of work.
GPT-4 improves on its predecessor, GPT-3, in important ways, such as providing more factually accurate statements and allowing developers to more easily prescribe its style and behavior. It’s also multimodal, meaning it can understand images, allowing it to describe the contents of a photo and even explain them in detail.
But GPT-4 has serious shortcomings. Like GPT-3, the model “hallucinates” facts and makes basic reasoning errors. In an example on OpenAI’s own blog, GPT-4 describes Elvis Presley as the “son of an actor.” (Neither of his parents were actors.)
To get a better handle on GPT-4’s development cycle and its capabilities, as well as its limitations, TechCrunch spoke via video call Tuesday with Greg Brockman, one of OpenAI’s co-founders and its president.
Asked to compare GPT-4 to GPT-3, Brockman had one word: different.
“It’s just different,” he told TechCrunch. “There are still many problems and errors [the model] makes … but you can really see the jump in skill in things like calculus or law, where it went from really bad in certain domains to actually pretty good relative to people.”
Test results support his case. On the AP Calculus BC exam, GPT-4 scores a 4 out of 5, while GPT-3.5, the intermediate model between GPT-3 and GPT-4, scores a 1. And in a simulated bar exam, GPT-4 passes with a score around the top 10% of test takers; GPT-3.5’s score hovered around the bottom 10%.
Switching gears, one of the more intriguing aspects of GPT-4 is the aforementioned multimodality. Unlike GPT-3 and GPT-3.5, which could only accept text prompts (e.g. “Write an essay about giraffes”), GPT-4 can take a prompt of both images and text to perform some task (e.g. a picture of giraffes in the Serengeti with the prompt “How many giraffes are shown here?”).
That’s because GPT-4 was trained on image and text data, while its predecessors were trained on text only. OpenAI says the training data came from “a variety of licensed, created, and publicly available data sources, which may include publicly available personal information,” but Brockman demurred when I asked for details. (Training data has landed OpenAI in legal trouble before.)
GPT-4’s image understanding is quite impressive. For example, fed the prompt “What’s funny about this image? Describe it panel by panel” plus a three-panel image showing a fake VGA cable plugged into an iPhone, GPT-4 breaks down each panel and correctly explains the gag (“The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port”).
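To give a sense of what such a prompt looks like in practice, here is a hypothetical sketch of an image-plus-text request in a chat-style API, assuming the openai Python SDK; the model name, image URL and the availability of image input through the API are assumptions for illustration, not details confirmed in this piece.

```python
# Hypothetical sketch of a multimodal (image + text) prompt; model name,
# URL and API availability are assumptions, not confirmed details.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",  # placeholder; image input was limited to select partners at launch
    messages=[
        {
            "role": "user",
            # A multimodal prompt mixes text and image parts in a single message.
            "content": [
                {"type": "text", "text": "How many giraffes are shown here?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/giraffes-serengeti.jpg"}},
            ],
        }
    ],
)

print(response["choices"][0]["message"]["content"])
```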
Only one launch partner currently has access to GPT-4’s image analysis capabilities: Be My Eyes, an assistive app for people who are visually impaired. Brockman says the wider rollout, whenever it happens, will be “slow and deliberate” as OpenAI evaluates the risks and benefits.
“There are policy issues like facial recognition and how to deal with images of people that we need to address and resolve,” Brockman said. “We need to figure out where the danger zones are — where the red lines are — and then clarify that over time.”
OpenAI grappled with similar ethical dilemmas around DALL-E 2, its text-to-image system. After initially disabling the capability, OpenAI allowed customers to upload people’s faces for editing with the AI-powered image generation system. At the time, OpenAI claimed that upgrades to its safety system made the face-editing feature possible by “minimizing the potential of harm” from deepfakes and attempts to create sexual, political, and violent content.
Another perennial problem is preventing GPT-4 from being used in unintended ways that could cause harm – psychological, financial or otherwise. Hours after the model’s release, Israeli cybersecurity startup Adversa AI published a blog post demonstrating methods for bypassing OpenAI’s content filters and getting GPT-4 to generate phishing emails, offensive descriptions of gay people and other highly objectionable text.
It’s not a new phenomenon in the language model domain. Meta’s BlenderBot and OpenAI’s ChatGPT have also been prodded into saying wildly offensive things and even revealing sensitive details about their inner workings. But many, including this reporter, had hoped that GPT-4 might deliver significant improvements on the moderation front.
When asked about GPT-4’s robustness, Brockman emphasized that the model went through six months of safety training and that, in internal testing, it was 82% less likely to respond to requests for content disallowed by OpenAI’s usage policy and 40% more likely to produce “factual” responses than GPT-3.5.
“We’ve spent a lot of time trying to understand what GPT-4 is capable of,” Brockman said. “Sending it out into the world is how we learn. We’re constantly making updates, including a number of improvements, so that the model is much more steerable toward whatever personality or mode you want it to be in.”
The early real-world results are frankly not that promising. Aside from the Adversa AI tests, Bing Chat, Microsoft’s chatbot powered by GPT-4, has been shown to be highly susceptible to jailbreaking. Using carefully tailored input, users have been able to get the bot to profess love, threaten harm, defend the Holocaust, and fabricate conspiracy theories.
Brockman did not deny that GPT-4 falls short here. But he highlighted the model’s new controllability tools, including an API-level capability called “system” messages. System messages are essentially instructions that set the tone – and set boundaries – for GPT-4’s interactions. For example, a system message might read: “You are a tutor who always responds in the Socratic style. You never give the student the answer, but always try to ask just the right question to help them learn to think for themselves.”
The idea is that the system messages act as guardrails to prevent GPT-4 from going off course.
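For a concrete sense of the mechanics, here is a minimal sketch of passing a system message alongside a user message, assuming the openai Python SDK and API access to a GPT-4-class chat model; the tutoring instruction mirrors the example above, while the student’s question is invented for illustration.

```python
# Minimal sketch of steering GPT-4 with a system message, assuming the
# openai Python SDK and access to the "gpt-4" chat model.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The system message sets the tone and boundaries for the conversation.
        {"role": "system", "content": (
            "You are a tutor who always responds in the Socratic style. "
            "You never give the student the answer, but always try to ask "
            "just the right question to help them learn to think for themselves."
        )},
        # The user message is the actual request the model responds to.
        {"role": "user", "content": "How do I solve 3x + 7 = 22?"},
    ],
)

print(response["choices"][0]["message"]["content"])
```

The developer includes the system message at the top of the messages array on each request, so it keeps shaping responses across the whole conversation.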
“Really figuring out the tone, style and substance of GPT-4 has been a major focus for us,” said Brockman. “I think we’re starting to understand a little bit more about how to do the engineering, about how to have a repeatable process that gets you to predictable results that will be really helpful to people.”
Brockman also pointed to Evals, OpenAI’s newly open-sourced software framework for evaluating the performance of its AI models, as a sign of OpenAI’s commitment to making its models more robust. Evals lets users develop and run benchmarks for evaluating models like GPT-4 while inspecting their performance – a sort of crowdsourced approach to model testing.
“With Evals, we can see the [use cases] that users care about in a systematic form that we’re able to test against,” said Brockman. “Part of why we [open-sourced] it is because we’re moving from releasing a new model every three months – whatever it was before – to making constant improvements. What you don’t measure, you don’t make, right? As we make new versions [of the model], we can at least be aware of what those changes are.”
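For a flavor of what a benchmark looks like in that framework, here is a rough sketch of preparing a samples file for a simple exact-match eval; the question, answer and file path are illustrative, and registering the eval under a name in the repo’s YAML registry is assumed rather than shown.

```python
# Rough sketch of building a samples file for OpenAI's Evals framework, which
# expects JSONL records with an "input" chat prompt and an "ideal" answer.
# The example question and file path are illustrative.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Once the eval is registered in the framework's YAML registry, it can be run
# against a model with the repo's CLI, e.g.:  oaieval gpt-4 my-capitals-eval
```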
I asked Brockman if OpenAI would ever compensate people for testing its models with Evals. He wouldn’t commit to that, but he did note that OpenAI is – for a limited time – granting select Evals users early access to the GPT-4 API.
My conversation with Brockman also touched on GPT-4’s context window, which refers to the text the model can consider before generating additional text. OpenAI is testing a version of GPT-4 that can “remember” about 50 pages of content, or five times as much as the standard GPT-4 can hold in its “memory” and eight times as much as GPT-3.
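As a rough sanity check on that “50 pages” figure, assuming the expanded variant offers a 32,768-token window (the “32k” model OpenAI announced) and common rules of thumb of roughly 0.75 words per token and 500 words per page:

```python
# Back-of-the-envelope estimate; the token count and the words-per-token and
# words-per-page ratios are assumptions, not figures from the article.
context_tokens = 32_768
words = context_tokens * 0.75   # ~24,600 words
pages = words / 500             # ~49 pages
print(f"~{words:,.0f} words, roughly {pages:.0f} pages")
```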
Brockman believes the expanded context window will lead to new, previously unexplored applications, particularly in the enterprise. He envisions an AI chatbot built for a company that draws on context and knowledge from a variety of sources, including employees across departments, to answer questions in a highly informed yet conversational way.
That’s not a new concept. But Brockman claims that GPT-4’s answers will be much more useful than those provided by today’s chatbots and search engines.
“Before, the model didn’t know who you are, what you’re interested in, and so on,” Brockman said. “Having that kind of history [with the larger context window] will definitely make it better… It will boost what people can do.”