When you ask ChatGPT a question and see that little “thinking…” cue, it’s pausing intentionally: considering possibilities and refining the reply instead of offering the first guess.
This approach, called inference-time computation, lets the model spend extra compute when a query is tricky, trading milliseconds for better results. Now here's the bigger question: can we train a model to perform that kind of careful double-checking without special add-ons, and make it work across text, images, and more, using only plain unsupervised training?
In this piece, we’ll explore how that works and why Energy-Based Transformers (EBTs) may be the most exciting leap yet.
Imagine you type a question into your favorite chatbot.
A standard transformer whips through its layers once, streams out tokens, and is done in a few hundred milliseconds. With the new approach, the model still produces that first draft, but then it pauses to make one (or many) additional internal passes. During those passes, it may self-edit, re-rank alternative completions, or search its latent space. The final answer is shown once the compute budget is used up or the model's confidence is high enough.
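To make that refinement loop concrete, here is a minimal PyTorch sketch. Everything in it is illustrative rather than the actual EBT implementation: `ToyEnergyModel`, the step budget, the learning rate, and the `good_enough` threshold are hypothetical placeholders. It only shows the shape of the idea: keep nudging a draft toward lower "energy" (higher confidence) until the budget runs out or the score is good enough.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained energy model: it scores how well a
# candidate answer embedding fits the prompt embedding (lower = better).
class ToyEnergyModel(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, prompt: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        return self.scorer(torch.cat([prompt, candidate], dim=-1)).squeeze(-1)

def refine_answer(energy_model, prompt, first_draft,
                  max_steps: int = 16, lr: float = 0.1, good_enough: float = 0.05):
    """Nudge the first draft toward lower energy over several internal passes.

    Stops early once the energy drops below `good_enough`, mirroring the
    "budget is used up or confidence is high enough" behaviour described above.
    """
    candidate = first_draft.clone().requires_grad_(True)
    for _ in range(max_steps):
        energy = energy_model(prompt, candidate)
        if energy.item() < good_enough:      # confident enough: stop early
            break
        grad, = torch.autograd.grad(energy, candidate)
        with torch.no_grad():
            candidate -= lr * grad           # one extra "internal pass"
    return candidate.detach()

# Toy usage: random embeddings stand in for the real prompt and first draft.
model = ToyEnergyModel()
prompt = torch.randn(64)
draft = torch.randn(64)
refined = refine_answer(model, prompt, draft)
```

The design choice worth noticing is that the refinement happens in the model's own representation space, so "thinking longer" is just running more optimization steps on the same answer rather than bolting on an external search procedure.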