This post, from Andrej Karpathy, discusses the potential of a new OCR model, DeepSeek-OCR. He opens by noting that the model performs well on OCR tasks, but what draws his attention is not OCR itself; it is what the model suggests about how large language models (LLMs) should take their input.
The key points of the post:
Image input vs. text input:
Karpathy raises an intriguing hypothesis: could the input to LLMs be switched from text to images? He argues that images may be a better input representation than text: by rendering text into an image, the model receives not just the words but also styling, color, and arbitrary images themselves. Per the post, this brings better information compression (hence shorter context windows), a far more general input stream, and input that can be processed with bidirectional rather than autoregressive attention. A minimal rendering sketch follows.
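As a rough illustration of the "render the text, feed the pixels" idea (a minimal sketch, not DeepSeek-OCR's actual pipeline; the font, canvas size, and 16-pixel patch size are all assumptions):

```python
# Minimal sketch: render plain text onto a white canvas and count ViT-style
# patches vs. raw characters. Font, wrap width, and patch size are assumptions.
import math
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 896) -> Image.Image:
    """Render plain text roughly the way a page of a document might look."""
    font = ImageFont.load_default()            # stand-in for a real document font
    lines = textwrap.wrap(text, width=100)     # naive fixed-width line wrapping
    line_height = 20
    img = Image.new("RGB", (width, line_height * len(lines) + 16), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((8, 8 + i * line_height), line, fill="black", font=font)
    return img

text = "Maybe all inputs to LLMs should only ever be images. " * 40
img = render_text_to_image(text)

patch = 16                                     # assumed ViT-style patch size
n_patches = math.ceil(img.width / patch) * math.ceil(img.height / patch)
print(f"{len(text)} characters -> {n_patches} image patches")
```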
Dropping the tokenizer:
Karpathy voices his strong dislike of the traditional tokenizer. It introduces plenty of problems into text handling: two characters that look identical to the eye can be mapped to completely different tokens internally, and a smiling emoji becomes an opaque token rather than an actual face, which adds needless complexity. With image input, he argues, this intermediate layer is no longer needed; the model works directly on pixels.
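To make the quirk concrete, here is a small illustration using OpenAI's tiktoken as a stand-in tokenizer (the post does not name one): an emoji arrives as a few opaque byte-level token IDs, and two strings that look identical on screen can tokenize differently.

```python
# Illustration only; tiktoken's cl100k_base is used as a stand-in tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A smiling emoji becomes opaque byte-level token IDs, not an "actual face".
print(enc.encode("🙂"))

# Visually identical strings can tokenize differently:
# "é" as one precomposed codepoint vs. "e" plus a combining acute accent.
print(enc.encode("café"))         # U+00E9
print(enc.encode("cafe\u0301"))   # U+0065 U+0301
```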
OCR and other vision-to-text tasks:
He points out that OCR is only one kind of vision-to-text conversion; future models could handle many more tasks through vision-to-text. He hints that image input could become the standard, especially for LLMs, while text input may come to be seen as an inefficient, intermediate representation.
Open questions in the discussion: the user message could be images while the decoder (the assistant response) stays text, since it is far less obvious how one would realistically output pixels, or whether one would even want to.
Overall, Karpathy uses the post to challenge a conventional assumption: can images replace the traditional text tokens as the input to an LLM? It is both a technical proposal and a pointed take on where natural language processing may be heading.
Andrej Karpathy @karpathy 2025-10-20
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.
The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
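As a minimal sketch of the bidirectional-vs-autoregressive point in the third bullet (the sizes and the use of PyTorch are my assumptions, not anything from the paper):

```python
# Causal mask: position i may attend only to positions <= i (text decoding).
# Bidirectional mask: every position may attend to every other position,
# which is the natural default for a set of image patches at the input.
import torch

seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
print(bidirectional_mask.int())
```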
OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.
So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.
Now I have to also fight the urge to side quest an image-input-only version of nanochat...
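For what it's worth, a rough sketch of the "images in, text out" chat setup floated above might look like the following; every module size here is made up, and this is not DeepSeek-OCR's or nanochat's actual architecture:

```python
# Sketch: the user turn is rendered to pixels, encoded into patch embeddings,
# and a causal text decoder produces the assistant reply. All sizes are made up.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Turn an image into a sequence of patch embeddings (ViT-style stem only)."""
    def __init__(self, patch: int = 16, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:   # (B, 3, H, W)
        x = self.proj(img)                                   # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)                  # (B, n_patches, dim)

class TextDecoder(nn.Module):
    """Causal text decoder that cross-attends to the image patches."""
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(self.embed(tokens), patches, tgt_mask=causal)
        return self.lm_head(h)                               # next-token logits

# One fake "user message rendered as an image" and a short assistant prefix.
user_image = torch.rand(1, 3, 224, 224)
assistant_tokens = torch.randint(0, 1000, (1, 8))

patches = PatchEncoder()(user_image)
logits = TextDecoder()(assistant_tokens, patches)
print(logits.shape)   # (1, 8, 1000)
```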