
How Vision Transformers Work (and Why Patch Tokenization Is a Hack)

artifocial · April 10, 2026 · 11 min read

Understand the elegant trick that let transformers process images, why it surprisingly works, and why its fundamental limitations drive the next generation of vision architectures.


W15 Basic Tutorial 1 · Intermediate · April 2026

Research Area: Neural Architectures, Computer Vision

Companion Notebooks

#    Notebook                             Focus                                                                                              Compute
00   0000_ssm_vs_attention.ipynb          SSM (Mamba-like) vs. self-attention — sequence modeling, spatial reasoning, efficiency comparison   CPU only
01   0101_equivariant_vs_standard.ipynb   Equivariant vs. standard features — 3D point cloud classification, rotation generalization          CPU only

Vision Transformers (ViT) are elegant. A transformer was designed for sequences. Images are 2D spatial grids. The solution? Reshape the grid into a sequence, add position embeddings, and run the transformer. It works surprisingly well—so well that it became the foundation for modern vision AI.

But this simplicity masks a fundamental architectural mismatch. We're forcing a square peg (spatial structure) into a round hole (sequential attention). Understanding why ViT succeeds despite this mismatch, and where it breaks down, is essential for appreciating the post-transformer architectures taking over the field.

A note on framing. When we call patch tokenization a "hack," we are making a precise architectural claim — not diminishing ViT's impact. Dosovitskiy et al. (2020) fundamentally changed vision research. ViT proved that a general-purpose sequence model, with minimal vision-specific inductive bias, could match or beat decades of CNN engineering. That result reshaped the field and directly enabled the multimodal foundation models we rely on today. Our argument is narrower: the specific mechanism of flattening 2D spatial structure into a 1D sequence, while remarkably effective, introduces limitations that matter more as we move toward 3D world modeling and physical AI. Recognizing those limits is how the field progresses — standing on ViT's shoulders, not dismissing them.


The Transformer Was Built for Text

Let's start with first principles. In 2017, Vaswani et al. published Attention Is All You Need, introducing the transformer architecture. The goal was clear: build a neural network for sequence-to-sequence tasks like machine translation. English sentence → French sentence.

The transformer has two core operations:

  1. Self-attention: Each token attends to all other tokens, learning which are relevant. If you're translating "The cat sat," the model learns that "cat" should attend to "sat" and vice versa.
  2. Position encoding: Since there's no recurrence (no LSTM-like state), the model needs to know where each token is in the sequence. Vaswani et al. used sinusoidal position embeddings: position i gets a unique, fixed encoding based on sine and cosine functions. Without these embeddings, self-attention is permutation invariant—it treats its inputs as an unordered set, a "bag of tokens" with no notion of sequence position.
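The sinusoidal scheme is simple enough to sketch directly (a minimal pure-Python version of the Vaswani et al. formulation; the function name and `d_model` argument are our labels):

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_position_encoding(seq_len=8, d_model=16)
# Every position gets a unique, fixed vector; without adding these vectors
# to the token embeddings, attention cannot tell positions apart.
```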

These design choices are perfect for language. Text is a sequence. Words have a natural left-to-right order. Position matters, but in a 1D way.

Images, however, are 2D spatial grids. No natural sequence order. A pixel's neighbors aren't the pixels to its left—they're above, below, left, right, and diagonal. This is a fundamentally different structure.


The ViT Trick: Images as Sequences

In 2020, Dosovitskiy et al. published An Image Is Worth 16×16 Words. They asked a simple question: what if we just forced images into the transformer's sequence framework?

Here's the trick:

  1. Patch the image: Divide a 224×224 image into non-overlapping 16×16 patches. You get 14×14 = 196 patches.
  2. Flatten and project each patch: Each 16×16×3 patch is flattened into a 768-value vector, then passed through a learned linear projection to the model's embedding width (also 768 for ViT-Base).
  3. Add a class token: Prepend a special [CLS] token to the sequence (learnable embedding).
  4. Add position embeddings: Learn a 197×768 embedding table for the 196 patches + 1 class token.
  5. Run the transformer: Standard multi-head self-attention encoder, N layers.
  6. Classify: Take the final [CLS] token representation and feed it to a classifier head.

That's it. No convolution. No spatial inductive bias. Just patches → embeddings → self-attention → classification.

The pipeline looks like this (conceptually):


┌──────────────────────────┐
│   224×224 RGB Image      │
└──────────────────────────┘
           ↓
┌─────────────────────────────────────────┐
│  14×14 Non-Overlapping 16×16 Patches    │
│  (196 patches total)                    │
└─────────────────────────────────────────┘
           ↓
┌──────────────────────────────────────────┐
│ Flatten to 768-dim vectors (16×16×3)     │
│ [CLS] + 196 patches = 197 tokens         │
└──────────────────────────────────────────┘
           ↓
┌──────────────────────────────────────────┐
│ Add Learned Position Embeddings (197×768)│
└──────────────────────────────────────────┘
           ↓
┌──────────────────────────────────────────┐
│  Transformer Encoder (N × Multi-Head     │
│       Self-Attention + FFN)              │
└──────────────────────────────────────────┘
           ↓
┌──────────────────────────────────────────┐
│ Classification Head on [CLS] Token       │
└──────────────────────────────────────────┘
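The patching and flattening steps are just a reshape. A minimal NumPy sketch (shapes follow the ViT-Base numbers above; the learned linear projection and [CLS] token are omitted):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H×W×C image into non-overlapping patch×patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch                 # grid: 14×14 for 224/16
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)    # (196, 768)

img = np.random.rand(224, 224, 3)
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

Row-major order means token 0 is the top-left patch and token 195 the bottom-right one — a detail that matters in the serialization discussion below.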

Remarkably, this works. ViT-Base on ImageNet achieves 77% accuracy. ViT-Large pushes past 80%. It's competitive with, then surpasses, ResNets.

Why?


Why It Works (Despite Being a Hack)

Three reasons explain ViT's success:

1. Self-Attention Is a Universal Approximator

Self-attention, given enough capacity, can learn to compute an extremely broad class of functions over its input set. This is not just hand-waving: universal approximation results have been proven for transformers, and multi-head self-attention has been shown to be expressive enough to implement convolutional operations. With sufficient depth, width, and data, attention can approximate spatial filtering or any other transformation the task demands. The model doesn't need convolutional inductive bias built in; it can learn the behavior from data.
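For reference, single-head self-attention is only a few lines (a minimal NumPy sketch with random weights, no multi-head split or masking):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (L, d) tokens. Every token attends to every token: an L×L score matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (L, L) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ V                                   # (L, d) mixed tokens

rng = np.random.default_rng(0)
L, d = 197, 64                                           # 196 patches + [CLS]
X = rng.normal(size=(L, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # (197, 64)
```

Note that nothing in this computation knows about 2D space: any spatial behavior has to be learned into the weights.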

2. Position Embeddings Recover Spatial Structure

The position embedding table isn't just a black box. During training, the model learns to encode the 2D grid structure. Patches that are spatial neighbors in the image develop similar or nearby embeddings. The model learns that position 0 and position 14 are vertically adjacent in the 14×14 grid (one row down). This isn't encoded explicitly—the model discovers it from the structure of the data and the attention patterns it learns.

3. Scale Compensates for Lack of Inductive Bias

This is the key insight: ViT trades inductive bias for scale. A CNN has built-in spatial structure—convolutions are local and translation-equivariant by design, meaning a feature detected at one location is automatically detected at every other location. This is efficient. But it's also restrictive. ViT is more flexible: no built-in assumptions about space.

The tradeoff? ViT needs more data to match CNN performance on small datasets. But with enormous data, it surpasses CNNs. The famous ViT scaling curve shows this:

  • ImageNet (1.3M images): ResNet-50 ~77%, ViT-Base ~77%. Roughly tied. ViT needs more data.
  • JFT-300M (300M labeled images): ViT-Large surpasses best ResNets (88% vs 85%). ViT wins.

The pattern is clear: below a certain data scale, CNNs are more efficient. Above it, ViT's flexibility and attention's universality take over.


Why It's a Hack: Four Fundamental Limits

Despite its success, ViT's patch-tokenization approach has deep architectural flaws:

1. 2D→1D Serialization Breaks Spatial Proximity

An image is 2D. Neighbors are defined in 2D: up, down, left, right, diagonals. But patches are laid out in a 1D sequence, left-to-right, top-to-bottom.

Example: in a 14×14 patch grid with row-major ordering, patch (7, 0) and patch (7, 1) are horizontal neighbors, and they land at sequence positions 98 and 99. Adjacent.

But patch (7, 0) and patch (8, 0) are vertical neighbors, and they land at positions 98 and 112: fourteen positions apart.

Worse, patch (7, 13) and patch (8, 0) land at adjacent positions 111 and 112, yet they sit on opposite edges of the image, with no meaningful spatial relationship at all.

So row-major serialization preserves horizontal adjacency, but that's all it preserves. A patch's full 2D neighborhood is scattered across the sequence, and sequence adjacency can be spatially meaningless. Self-attention must learn, through training, which sequence positions correspond to spatial neighbors. It's learning something the architecture should encode.
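The mismatch is easy to verify: map (row, col) to its row-major sequence index and measure how far apart spatial neighbors land (plain Python, using the 14×14 grid from the text):

```python
GRID = 14  # 224/16 patches per side

def seq_pos(row, col):
    """Row-major serialization, as used when flattening ViT patches."""
    return row * GRID + col

# Horizontal neighbors stay adjacent in the sequence...
print(abs(seq_pos(7, 1) - seq_pos(7, 0)))   # 1
# ...but vertical neighbors land a full row apart...
print(abs(seq_pos(8, 0) - seq_pos(7, 0)))   # 14
# ...while sequence-adjacent patches can sit on opposite image edges.
print(abs(seq_pos(8, 0) - seq_pos(7, 13)))  # 1
```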

2. Position Embeddings Must Be Learned, Not Structural

A CNN knows by construction that a 3×3 kernel applied at position (i, j) will see pixels in a 3×3 neighborhood. The spatial structure is hardcoded.

ViT's position embeddings are learned parameters. The model must discover, from data, that positions 0 and 1 are neighbors, that position 16 is below position 0, and so on. This is inefficient and brittle. Transfer to higher resolutions? The learned position embeddings from 224×224 training don't naturally extend to 512×512. The model must relearn spatial structure.

3. Quadratic Attention Cost

Self-attention computes pairwise similarities between all tokens: O(L²) where L = sequence length.

For an image:

  • 224×224 at 16×16 patches: 196 patches → ~38K attention entries. Manageable.
  • 512×512 at 16×16 patches: 1024 patches → ~1M attention entries. Getting expensive.
  • 1024×1024 at 16×16 patches: 4096 patches → ~16M attention entries. Impractical.

Double the image resolution and the patch count quadruples, so attention cost multiplies by 16. This is why ViT struggles with high-resolution images.

Compare to alternatives:

  • Convolutional networks: Local operations, O(L) complexity.
  • Mamba/SSMs: Linear complexity, O(L).

Only attention is quadratic.
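The arithmetic is worth seeing directly: attention entries = L², where L is the patch count (the function name here is ours):

```python
def attention_entries(resolution, patch=16):
    """Pairwise attention score count for a square image at a given resolution."""
    n_patches = (resolution // patch) ** 2
    return n_patches, n_patches ** 2

base = attention_entries(224)[1]            # 38,416 entries at 224×224
for res in (224, 448, 512, 1024):
    n, e = attention_entries(res)
    print(f"{res}x{res}: {n} patches -> {e:,} entries ({e / base:.0f}x)")
# Doubling 224 -> 448 quadruples the patch count (196 -> 784)
# and multiplies the attention entries by 16 (38,416 -> 614,656).
```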

4. No Native 3D Structure

ViT flattens spatial structure into a 1D sequence and adds 2D position embeddings. But images are samples of 3D scenes. They contain depth, occlusion, surface normals, and geometric relationships that are fundamentally 3D.

A ViT has no built-in way to reason about 3D structure. It can learn it from 2D images, given enough data, but it's learning a representation of something the architecture should encode.

For physical AI—robots interacting with the 3D world—this is a critical gap.


The Efficiency Problem: Concrete Numbers

Let's ground the quadratic cost issue with real numbers:

Image Resolution   Patch Size   # Patches   Attention Entries   Relative Cost
224×224            16×16        196         ~38K                1×
384×384            16×16        576         ~332K               8.7×
512×512            16×16        1024        ~1M                 27×
1024×1024          16×16        4096        ~16.8M              440×

And this is just the attention computation. Memory scales similarly. Processing a 1024×1024 image requires 440× the attention cost of a 224×224 image.

Meanwhile, a convolutional network's cost grows linearly with resolution (assuming fixed receptive field). A Mamba-based model scales linearly as well.

This is why modern vision research is moving beyond attention for dense tasks like segmentation or scene understanding on high-resolution images. Quadratic cost is a fundamental constraint.


What Comes After ViT

ViT showed that transformers can do vision. But its limitations—serialization, learned position structure, quadratic cost, missing 3D reasoning—drive the next generation:

Mamba and State Space Models: Linear complexity without sacrificing quality. Vision Mamba achieves ViT-like performance with O(L) cost.
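The contrast with attention is visible in the update rule itself: an SSM processes tokens with a single linear-time recurrence instead of an L×L score matrix (a minimal diagonal-SSM sketch in NumPy; real Mamba adds input-dependent parameters and a hardware-efficient parallel scan):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """h_t = A*h_{t-1} + B*x_t;  y_t = C·h_t.  One pass: O(L) in sequence length."""
    h = np.zeros_like(A)
    ys = np.empty(len(x))
    for t in range(len(x)):     # single linear scan, no pairwise score matrix
        h = A * h + B * x[t]
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 0.99, size=8)   # stable decay per state channel
B = rng.normal(size=8)
C = rng.normal(size=8)
y = ssm_scan(rng.normal(size=196), A, B, C)  # 196 tokens, one hidden state
print(y.shape)  # (196,)
```

Doubling the token count here doubles the work, rather than quadrupling it as attention does.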

Hybrid Models: Attention for global context, SSMs or convolutions for local detail. Best of both worlds.

Equivariant Architectures: Native symmetries for rotations, translations, and reflections. Crucial for 3D and physical reasoning.

Multi-Scale and Hierarchical Design: Process images at multiple resolutions, combining coarse global structure with fine details.

Our companion tutorial on Beyond Attention — Post-Transformer Architectures for Physical AI dives deep into each of these approaches and why they matter for the next generation of vision models.


Companion Notebooks

We've prepared companion notebooks for this series:

#    Notebook                      Focus                                                              Compute
00   0000_ssm_vs_attention.ipynb   SSM vs. self-attention from scratch — efficiency and capability comparison   CPU only

Notebook 00 implements self-attention and SSMs from first principles, showing the efficiency difference directly and demonstrating how SSMs achieve similar representational capacity with linear complexity. A second notebook on equivariant vs. standard features for 3D classification drops later this week.


Summary: The Elegance and the Hack

Vision Transformers are an elegant idea: reshape an image into a sequence, add position embeddings, run a transformer. It works. It scales. It's become the foundation of modern computer vision.

But it's a hack. We're encoding 2D spatial structure into a 1D sequence and asking the model to rediscover it. We're paying quadratic cost for dense predictions. We're missing 3D reasoning that should be structural, not learned.

ViT succeeded because self-attention is powerful enough to overcome these architectural mismatches when given enough data and compute. But as we push toward physical AI, autonomous systems, and high-resolution real-time vision, these fundamental limits become constraints.

The next era of vision models will retain what ViT proved—that transformers can process spatial data—while fixing what it got wrong: incorporating true spatial structure, supporting efficient multi-scale reasoning, and grounding visual understanding in 3D geometry.

