The moment man first picked up a stone or a branch to use as a tool, he altered irrevocably the balance between him and his environment — James Burke
As discussed in previous issues of Let’s Talk Text, despite the incredible performance of current state-of-the-art LLMs such as ChatGPT, LLMs still have a tendency to hallucinate, fabricate information, or overlook obvious truths. Just as humans reach for tools like a calculator, GPS, or Google Search to support the logical steps in their dialog and arguments, researchers at Meta have trained an AI to do the same.
LLMs making API calls
The premise of the technique is to allow a generative model, in this case GPT-J, to make API calls as part of its text generation. Using special tokens such as “<API>” and “→”, GPT-J can invoke different APIs. Let’s see how this works with the following example:
The Nile has an approximate length of <API> QA(What is the approximate length of the Nile?) →
Anytime the special token “→” is generated by the model, generation is paused while the appropriate API is called. In this case, the chosen API was QA, indicating that a question answering system was asked for the length of the Nile. The answer is inserted into the sequence as the next tokens, and GPT-J continues producing text with the additional context from the API:
6,853 km </API> 6,853 kilometers, the White Nile being its main source.
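Mechanically, this is just a decode-pause-resume loop. Here’s a minimal sketch in Python, where `generate` stands in for sampling from the model (assumed to stop after emitting “→” and to keep it in the output); the toy `qa_tool` is mine, not Meta’s:

```python
import re

def qa_tool(question: str) -> str:
    # Stand-in for a real question answering system.
    answers = {"What is the approximate length of the Nile?": "6,853 km"}
    return answers.get(question, "unknown")

TOOLS = {"QA": qa_tool}

# Matches a trailing call like: <API> QA(What is the approximate length of the Nile?) →
CALL_PATTERN = re.compile(r"<API>\s*(\w+)\((.*)\)\s*→\s*$")

def decode_with_tools(generate, prompt: str, max_calls: int = 5) -> str:
    text = prompt
    for _ in range(max_calls):
        text = generate(text, stop="→")  # pause whenever "→" is produced
        match = CALL_PATTERN.search(text)
        if match is None:
            return text  # no pending API call: generation finished normally
        tool_name, argument = match.groups()
        result = TOOLS[tool_name](argument)
        text += f" {result} </API>"  # splice in the result and resume decoding
    return text
```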
Here are a few more examples of API calls made by Toolformer, such as this one using the calculator:

Out of 1400 participants, 400 (or <API> Calculator(400 / 1400) → 0.29 </API> 29%) passed the test.
Tools at GPT’s disposal
The API that the language model uses could conceivably be anything, including a Python script or even another language model. In the paper, only a handful of tools were used: a question answering system, a calculator, a Wikipedia search engine, a machine translation system, and a calendar.
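Because each tool is just a function from a text query to a text response, almost anything fits behind this interface. A quick illustrative sketch (these implementations are stand-ins, not the paper’s code):

```python
from datetime import date

def calculator(expression: str) -> str:
    # The paper's calculator supports the four basic operations and
    # rounds results to two decimal places.
    result = eval(expression, {"__builtins__": {}})  # toy only: never eval untrusted text
    return f"{result:.2f}"

def calendar(_: str = "") -> str:
    # Returns the current date, giving the model temporal context.
    return date.today().strftime("Today is %A, %B %d, %Y.")

TOOLS = {"Calculator": calculator, "Calendar": calendar}
```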
So how do we train the model to rely on various APIs as tools?
Generating API training data
The coolest part of this research is the data generation process. Generating text with integrated API calls is very difficult to do manually: it would involve taking a piece of text and writing API calls that could improve it. Doing this by hand millions of times, which is what would be required to produce enough training data, is practically impossible.
Therefore, they used few-shot prompting of another LLM to generate the training data for each API. Here’s an example of a question answering training sample being generated:
Your task is to add calls to a Question Answering API to a piece of text. The questions should help you get information required to complete the text. You can call the API by writing "[QA(question)]" where "question" is the question you want to ask. Here are some examples of API calls:
Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.
Input: Coca-Cola, or Coke, is a carbonated soft drink manufactured by the Coca-Cola Company.
Output: Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.
Input: x
Output:
The final x could be any of the documents in the corpus.
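In code, the generation step is little more than a loop over the corpus. A rough sketch, where `lm_complete` is a hypothetical function that samples a completion from the annotating LLM and the prompt is abbreviated from the one above:

```python
QA_PROMPT = '''Your task is to add calls to a Question Answering API to a piece of text.
You can call the API by writing "[QA(question)]". Here are some examples of API calls:

Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Input: {document}
Output:'''

def generate_candidates(lm_complete, corpus, samples_per_doc=4):
    """Yield candidate API-annotated versions of every document in the corpus."""
    for document in corpus:
        prompt = QA_PROMPT.format(document=document)
        for _ in range(samples_per_doc):
            yield document, lm_complete(prompt)
```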
Filtering API calls
Of course, not all of the generated training samples can be used directly in the new corpus. We can measure how useful an API call is to the model by computing a weighted cross entropy loss over the tokens that follow it. Specifically, the loss when the call and its result are included in the context is compared against the loss with no call at all and the loss with the call but no result. If including the result doesn’t reduce the loss by at least a set threshold, we can assume the API call doesn’t help the model, and we safely discard that training sample.
Discarding samples this way means the model itself essentially decides which training samples make it into the new corpus, based on how much each sample helps it predict the subsequent tokens.
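A sketch of that filtering rule, where `loss(prefix, text, i)` stands in for the weighted cross entropy over the tokens after position `i` when `prefix` is prepended to the context (the helper and the threshold value are placeholders, not the paper’s implementation):

```python
def keep_api_call(loss, text, i, call, result, tau=1.0):
    # Loss when the model sees both the API call and its result.
    with_result = loss(f"{call} → {result}", text, i)
    # Baselines: no API call at all, or the call with an empty result.
    baseline = min(loss("", text, i), loss(call, text, i))
    # Keep the training sample only if the result cuts the loss by at least tau.
    return baseline - with_result >= tau
```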
Since an LLM is producing the training samples, it’s relatively cheap to create this data, resulting in a large training corpus even after rigorous filtering. Still, many training samples get filtered out. The researchers report that processing more than a million documents resulted in only a few thousand examples of useful calls to the calculator API.
Results
Ultimately, Toolformer considerably improves the zero-shot performance of the 6.7B-parameter GPT-J model, so much so that it even outperforms the much larger GPT-3 on a range of downstream tasks. Here are the results on three of them:
It also becomes apparent from the data that learning to make and use API calls is hard for small models, which see little benefit from the tools; a certain amount of model capacity seems to be required to make good use of them.
As LLMs continue to progress, I believe it will be increasingly important to supply LLMs with access to these kinds of tools. Currently, these models behave as “bullshit artists.” Allowing them to reason the same way humans do — relying on sources and external pieces of information — would empower them to become a better tool for us.