Computers, Visually

How does compression shrink a file?

You can't shrink a file by magic, but most files are full of repetition and predictable patterns. Compression finds that waste and describes it more briefly.

Plate 77: Saying it shorter (run-length encoding, AAAAA → 5A)
Interactive plate: slide the data from noisy to repetitive and watch the byte count collapse.
Predict first: as you slide the data from messy to repetitive, how much will the file shrink?
Example at the "some runs" setting, storing each repeated run once:
  Before (24 bytes, one per character): A B B B B B C A A A B B B B B C C C C C A A A B
  After run-length encoding (16 bytes, one count + symbol per run): 1A 5B 1C 3A 5B 5C 3A 1B
  Result: 33% smaller, a compression ratio of 1.5:1 (24 → 16 bytes).
Slide toward repetitive and the runs grow: fewer tokens, smaller file. Noisy data barely shrinks.
Instead of writing AAAAA, you can write 5A — "five A's" — and rebuild it perfectly later. Compression hunts for repetition and stores it once. The more a file repeats itself, the smaller it gets. Random noise has nothing to reuse, so it barely shrinks at all.
Try with the plate
  • Slide the data to highly repetitive and watch the size collapse.
  • Make the data random and see that it barely shrinks at all.

Compression shrinks a file by describing its repetition and predictable patterns more briefly. Run-length encoding rewrites a run like AAAAA as 5A; Huffman coding gives common symbols short codes and rare ones long codes. Lossless methods rebuild the file byte-for-byte, while lossy ones like JPEG and MP3 discard fine detail to go smaller.

The short answer

If you had to write AAAAA, you could write 5A instead, meaning "five A's", and unpack it perfectly later. That's the whole idea behind shrinking a file: find anything that repeats and write it down once with a count. Files are full of repeats you can't see, so this saves a lot of room. Slide the simulator from messy to repetitive and watch the file shrink.

The common mix-up

A common belief is that compression must always lose something. In fact, lossless compression rebuilds a file byte-for-byte identical; it just describes, more briefly, the repetition that most files are full of, like rewriting AAAAA as 5A.

What's actually happening

A common worry is that compression must be losing something, that a smaller file has to be a worse file. For the kind of compression that zips your documents, that's simply not true: it shrinks the file and then rebuilds it byte-for-byte identical. The reason it can do this without losing anything is that most files are wasteful. They say the same things over and over, and you can describe that repetition far more briefly than spelling it out.

The simplest version is run-length encoding. A stretch of the same value, AAAAA, gets written as a count and the symbol, 5A, which you can expand back exactly. A black-and-white scan of a page is mostly long runs of white, so this alone can shrink it dramatically. The simulator demonstrates it live: as you make the data more repetitive, the runs grow, the encoded version collapses to a handful of count-symbol tokens, and the size meter drops, sometimes to a third or a quarter of the original. Make the data noisy and random instead, and there's nothing to reuse, so it barely shrinks at all. That's the deep rule: compression feeds on predictability, and truly random data can't be compressed.
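To make the counting concrete, here is a minimal sketch of run-length encoding in Python, run on the 24-character example from the plate. The function names rle_encode and rle_decode are made up for this illustration, and the size figure assumes each run costs exactly two bytes (one for the count, one for the symbol), which only holds while no run is longer than 255.

```python
def rle_encode(text):
    """Collapse each run of identical characters into a (count, symbol) pair."""
    runs = []
    for ch in text:
        if runs and runs[-1][1] == ch:
            runs[-1] = (runs[-1][0] + 1, ch)   # same symbol again: extend the run
        else:
            runs.append((1, ch))               # different symbol: start a new run
    return runs


def rle_decode(runs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * count for count, symbol in runs)


data = "ABBBBBCAAABBBBBCCCCCAAAB"   # the 24-character example from the plate
runs = rle_encode(data)
tokens = " ".join(f"{count}{symbol}" for count, symbol in runs)

# Assumption: one byte for the count plus one byte for the symbol per run.
encoded_size = 2 * len(runs)

print(tokens)                          # 1A 5B 1C 3A 5B 5C 3A 1B
print(len(data), "->", encoded_size)   # 24 -> 16
print(rle_decode(runs) == data)        # True: nothing was lost
```

Notice that the lone A, C and B each cost two bytes instead of one. Single symbols get bigger under this scheme, which is why noisy data barely shrinks, or even grows.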

Real tools go further than counting runs. Huffman coding looks at which symbols appear most and gives the common ones short codes and the rare ones long codes, the same instinct Morse code used by making E a single dot. There's a hard mathematical floor here, the data's entropy, below which no lossless method can go on average. The only way past that floor is to stop being lossless: JPEG photos and MP3 audio throw away fine detail your eyes and ears barely notice, trading perfect fidelity for much smaller files. So the real fork is lossless, which rebuilds the original exactly, versus lossy, which keeps something close enough and far smaller.
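Here is a rough sketch of the Huffman idea and the entropy floor side by side. It is a teaching toy, not how any particular tool implements it: the sample string is invented, the helper names are hypothetical, and the entropy is estimated from single-symbol frequencies only, which is a simplification.

```python
import heapq
import math
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:                    # edge case: only one distinct symbol
        return {next(iter(freq)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far})
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)   # the two least frequent subtrees
        c2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def entropy_bits_per_symbol(text):
    """Shannon entropy estimated from single-symbol frequencies."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


text = "AAAAAAAABBBBCCDD"   # invented sample: A is common, C and D are rare
lengths = huffman_code_lengths(text)
huffman_bits = sum(lengths[ch] for ch in text)
entropy_bits = entropy_bits_per_symbol(text) * len(text)
fixed_bits = 2 * len(text)  # four symbols would need 2 bits each at fixed width

print(lengths)              # A gets 1 bit, B gets 2, C and D get 3 each
print(fixed_bits, huffman_bits, entropy_bits)   # 32 28 28.0
```

In this hand-picked sample the frequencies are exact powers of one half, so the Huffman code lands right on the entropy floor. With messier frequencies it typically sits a little above it.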

Remember this

Compression feeds on predictability, so repetitive files shrink dramatically while truly random data cannot be compressed at all.

Try it at home: Compress a sentence by hand
  1. Write out a run like "AAAAABBBCCCCCCC" and count its characters: fifteen here.
  2. Now rewrite it as counts and symbols: 5A 3B 7C. Count those characters instead, far fewer, and you can rebuild the original exactly.
  3. Try it on random letters with no repeats. Your "compressed" version is the same length or longer, showing that compression only helps when there's a pattern to find.
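If you'd like to check your hand count, the steps above can be sketched in a few lines of Python. The helper rle_text is a made-up name, and "QWERTYUIOP" is just a stand-in for random letters with no repeats.

```python
from itertools import groupby


def rle_text(s):
    """Write each run as its count followed by the symbol, e.g. 'AAAAA' -> '5A'."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(s))


repetitive = "AAAAABBBCCCCCCC"
no_repeats = "QWERTYUIOP"   # stand-in for "random letters with no repeats"

for original in (repetitive, no_repeats):
    packed = rle_text(original)
    print(f"{original} ({len(original)} chars) -> {packed} ({len(packed)} chars)")
```

The first line goes from 15 characters to 6. The second goes from 10 to 20, because under this scheme every letter now carries its own count of 1.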

Common questions

Does compressing a file always lose something?

Not with lossless compression, the kind that zips your documents. It shrinks the file and then rebuilds it byte-for-byte identical, exploiting the fact that most files repeat the same things over and over.

Why won't random data compress?

Compression feeds on predictability. Truly random noise has no repeats or patterns to reuse, so no lossless method can make it smaller, a limit you can prove mathematically through the data's entropy.
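You can see this with a real compressor. The sketch below uses Python's standard zlib module, which implements DEFLATE (pattern matching plus Huffman coding), on two inputs of the same size. The sizes and the seed are arbitrary choices, and a seeded pseudo-random generator is only standing in for true noise here.

```python
import random
import zlib

n = 10_000
repetitive = b"A" * n                                 # one symbol, repeated
rng = random.Random(0)                                # fixed seed so runs are repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(n))   # pseudo-random stand-in for noise

packed_repetitive = zlib.compress(repetitive)
packed_noisy = zlib.compress(noisy)

print(len(repetitive), "->", len(packed_repetitive))
print(len(noisy), "->", len(packed_noisy))

# Lossless: both inputs come back byte-for-byte identical.
assert zlib.decompress(packed_repetitive) == repetitive
assert zlib.decompress(packed_noisy) == noisy
```

The repetitive input should collapse to a few dozen bytes, while the noisy one stays at roughly its original size, typically a touch larger once the format's own bookkeeping is added.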

How do JPEG and MP3 get so much smaller?

They go lossy, throwing away fine detail your eyes and ears barely register to drop below the entropy floor that bounds lossless methods. The original is gone, but you usually can't tell.

Built & checked by Nilesh Singh · how this is made · last updated June 2026