Chain-of-thought reasoning
Chain-of-thought reasoning is the machine learning task of generating intermediate reasoning steps as the model works toward a final answer. There are several reasons you might want a model such as ChatGPT or BLOOM to generate these intermediate steps:
First, it’s been shown that this improves performance on complex reasoning problems and reduces model hallucination. Read more about hallucination here.
Second, understanding how the model arrived at a final answer might be critical to the context of the problem you’re trying to solve.
In the past few years, there has been some progress in tasking language models to generate these intermediate “rationale” steps, in which they explain the reasoning behind the final answer. Take a look at the example below:
This is the kind of output you can get from an LLM right now through prompt engineering alone. The next step in this area of research, however, is multimodality: asking the model to combine multiple forms of information, such as text paired with an image.
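If you want to try this kind of prompt engineering yourself, here is a minimal sketch of zero-shot chain-of-thought prompting. It assumes the `openai` Python package’s ChatCompletion endpoint (as it shipped when the ChatGPT API launched) and uses a made-up arithmetic question; appending a cue like “Let’s think step by step” is one common way to elicit the intermediate rationale.

```python
# Minimal sketch of zero-shot chain-of-thought prompting.
# Assumes the openai package's ChatCompletion endpoint (the API available at
# the ChatGPT API launch) and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# A made-up example question; any multi-step problem works.
question = (
    "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have now?"
)

# The "Let's think step by step" cue nudges the model to write out its
# intermediate reasoning before giving the final answer.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question + "\nLet's think step by step."}],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])
```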
Chain-of-thought reasoning + images
Critical to our understanding of complex topics is the ability to draw on multiple forms of information: text, images, or video. Imagine reading an encyclopedia with no images or diagrams to provide context; it would likely be a struggle. Our ability to acquire and understand knowledge is greatly strengthened by combining different modalities. In a recent paper, researchers found a way to combine chain-of-thought reasoning with multimodality. Take a look at how this could work:
The architecture is surprisingly simple. As usual, the text is fed through a standard transformer encoder to produce a high-dimensional representation of the prompt. For the image, the authors use an off-the-shelf vision feature extractor called DETR. DETR, or DEtection TRansformer, is exactly what it sounds like: an encoder-decoder transformer trained for end-to-end object detection. A single-head attention network correlates the text tokens with the image patches, and the fused output is fed through a transformer decoder to generate the text answer.
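To make that fusion step concrete, here is a minimal PyTorch sketch of the idea rather than the paper’s released code: the text encoder’s hidden states attend over projected DETR patch features through a single-head attention layer, and a simple sigmoid gate (one plausible way to mix the two streams) blends the attended image information back into the text representation before it is passed to the decoder. The dimension sizes, module names, and the gate itself are illustrative assumptions.

```python
# A minimal sketch of text-image fusion for multimodal chain-of-thought.
# Not the paper's code: dimensions, names, and the gating choice are assumptions.
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=256):
        super().__init__()
        # Project DETR patch features into the text encoder's hidden size.
        self.image_proj = nn.Linear(image_dim, text_dim)
        # Single-head attention: text tokens query the image patches.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=text_dim, num_heads=1, batch_first=True
        )
        # Sigmoid gate controlling how much visual information is mixed in.
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_hidden, image_feats):
        # text_hidden: (batch, num_tokens, text_dim) from the text encoder
        # image_feats: (batch, num_patches, image_dim) from DETR
        img = self.image_proj(image_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        # Blend the original text states with the attended image information.
        lam = torch.sigmoid(self.gate(torch.cat([text_hidden, attended], dim=-1)))
        fused = (1 - lam) * text_hidden + lam * attended
        return fused  # handed to the transformer decoder as encoder output

# Random tensors stand in for real text-encoder and DETR outputs.
fusion = TextImageFusion()
text_hidden = torch.randn(2, 50, 768)    # e.g. encoder output for 50 tokens
image_feats = torch.randn(2, 100, 256)   # e.g. DETR features for 100 patches
print(fusion(text_hidden, image_feats).shape)  # torch.Size([2, 50, 768])
```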
I see multimodality becoming a critical part of LLM architecture in the future. Even though the ChatGPT API has been out for only a week, there are already many very cool projects combining the power of the model with other types of information. For example, check out ChatPDF, a project that lets you upload a PDF and ask questions about it as if you were chatting with a person.
I’m excited to hear your thoughts on multimodal LLMs! Reply directly to this email and let me know what you think, or if you know of a neat project using ChatGPT I’d love to hear about that too.