This post, from Andrej Karpathy, discusses the potential of a new OCR model, DeepSeek-OCR. He opens by noting that the model performs well on OCR tasks, but what draws his attention is not OCR itself; it is what the model suggests about how large language models (LLMs) should take their input.
The key points of the post:
Image input vs. text input:
Karpathy raises an intriguing hypothesis: could the input to LLMs be switched from text to images? He argues that images may be a better input representation than text: by rendering text into an image, the model receives not just the words but also styling, color, and arbitrary images themselves. Per the post, this brings better information compression (hence shorter context windows), a far more general input stream, and input that can be processed with bidirectional rather than autoregressive attention. A minimal rendering sketch follows.
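As a rough illustration of the "render the text, feed the pixels" idea (a minimal sketch, not DeepSeek-OCR's actual pipeline; the font, canvas size, and 16-pixel patch size are all assumptions):

```python
# Minimal sketch: render plain text onto a white canvas and count ViT-style
# patches vs. raw characters. Font, wrap width, and patch size are assumptions.
import math
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 896) -> Image.Image:
    """Render plain text roughly the way a page of a document might look."""
    font = ImageFont.load_default()            # stand-in for a real document font
    lines = textwrap.wrap(text, width=100)     # naive fixed-width line wrapping
    line_height = 20
    img = Image.new("RGB", (width, line_height * len(lines) + 16), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((8, 8 + i * line_height), line, fill="black", font=font)
    return img

text = "Maybe all inputs to LLMs should only ever be images. " * 40
img = render_text_to_image(text)

patch = 16                                     # assumed ViT-style patch size
n_patches = math.ceil(img.width / patch) * math.ceil(img.height / patch)
print(f"{len(text)} characters -> {n_patches} image patches")
```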
Dropping the tokenizer:
Karpathy voices his strong dislike of the traditional tokenizer. It introduces plenty of problems into text handling: two characters that look identical to the eye can be mapped to completely different tokens internally, and a smiling emoji becomes an opaque token rather than an actual face, which adds needless complexity. With image input, he argues, this intermediate layer is no longer needed; the model works directly on pixels.
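To make the quirk concrete, here is a small illustration using OpenAI's tiktoken as a stand-in tokenizer (the post does not name one): an emoji arrives as a few opaque byte-level token IDs, and two strings that look identical on screen can tokenize differently.

```python
# Illustration only; tiktoken's cl100k_base is used as a stand-in tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A smiling emoji becomes opaque byte-level token IDs, not an "actual face".
print(enc.encode("🙂"))

# Visually identical strings can tokenize differently:
# "é" as one precomposed codepoint vs. "e" plus a combining acute accent.
print(enc.encode("café"))         # U+00E9
print(enc.encode("cafe\u0301"))   # U+0065 U+0301
```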
OCR and other vision-to-text tasks:
He points out that OCR is only one kind of vision-to-text conversion; future models could handle many more tasks through vision-to-text. He hints that image input could become the standard, especially for LLMs, while text input may come to be seen as an inefficient, intermediate representation.
Open questions in the discussion: the user message could be images while the decoder (the assistant response) stays text, since it is far less obvious how one would realistically output pixels, or whether one would even want to.
Overall, Karpathy uses the post to challenge a conventional assumption: can images replace the traditional text tokens as the input to an LLM? It is both a technical proposal and a pointed take on where natural language processing may be heading.
Andrej Karpathy @karpathy 2025-10-20
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.
The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
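As a minimal sketch of the bidirectional-vs-autoregressive point in the third bullet (the sizes and the use of PyTorch are my assumptions, not anything from the paper):

```python
# Causal mask: position i may attend only to positions <= i (text decoding).
# Bidirectional mask: every position may attend to every other position,
# which is the natural default for a set of image patches at the input.
import torch

seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```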
OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.
So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.
Now I have to also fight the urge to side quest an image-input-only version of nanochat...
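For what it's worth, a rough sketch of the "images in, text out" chat setup floated above might look like the following; every module size here is made up, and this is not DeepSeek-OCR's or nanochat's actual architecture:

```python
# Sketch: the user turn is rendered to pixels, encoded into patch embeddings,
# and a causal text decoder produces the assistant reply. All sizes are made up.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Turn an image into a sequence of patch embeddings (ViT-style stem only)."""
    def __init__(self, patch: int = 16, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:   # (B, 3, H, W)
        x = self.proj(img)                                   # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)                  # (B, n_patches, dim)

class TextDecoder(nn.Module):
    """Causal text decoder that cross-attends to the image patches."""
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(self.embed(tokens), patches, tgt_mask=causal)
        return self.lm_head(h)                               # next-token logits

# One fake "user message rendered as an image" and a short assistant prefix.
user_image = torch.rand(1, 3, 224, 224)
assistant_tokens = torch.randint(0, 1000, (1, 8))

patches = PatchEncoder()(user_image)
logits = TextDecoder()(assistant_tokens, patches)
print(logits.shape)   # (1, 8, 1000)
```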