If you had to write AAAAA, you could write 5A instead (five A's) and unpack it perfectly later. That's the whole idea behind shrinking a file: find anything that repeats and write it down once with a count. Files are full of repeats you can't see, so this saves a lot of room. Slide the simulator from messy to repetitive and watch the file shrink.
Most people think compression must always lose something. In fact, lossless compression rebuilds a file byte-for-byte identical; it simply describes the repetition in most files more briefly, like rewriting AAAAA as 5A.
What's actually happening
A common worry is that compression must be losing something, that a smaller file has to be a worse file. For the kind of compression that zips your documents, that's simply not true: it shrinks the file and then rebuilds it byte-for-byte identical. The reason it can do this without losing anything is that most files are wasteful. They say the same things over and over, and you can describe that repetition far more briefly than spelling it out.
The simplest version is run-length encoding. A stretch of the same value, AAAAA, gets written as a count and the symbol, 5A, which you can expand back exactly. A black-and-white scan of a page is mostly long runs of white, so this alone can shrink it dramatically. The simulator demonstrates it live: as you make the data more repetitive, the runs grow, the encoded version collapses to a handful of count-symbol tokens, and the size meter drops, sometimes to a third or a quarter of the original. Make the data noisy and random instead, and there's nothing to reuse, so it barely shrinks at all. That's the deep rule: compression feeds on predictability, and truly random data can't be compressed.
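As a rough illustration of run-length encoding, here is a minimal Python sketch. The function names `rle_encode` and `rle_decode` and the sample `scan_row` are made up for this example; real formats pack counts and symbols into bytes rather than Python tuples.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (count, symbol) pair."""
    return [(len(list(run)), symbol) for symbol, run in groupby(text)]

def rle_decode(pairs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * count for count, symbol in pairs)

# Hypothetical row from a black-and-white scan: W = white pixel, B = black pixel.
scan_row = "W" * 40 + "B" * 3 + "W" * 57

pairs = rle_encode(scan_row)
print(pairs)                           # [(40, 'W'), (3, 'B'), (57, 'W')]
print(rle_decode(pairs) == scan_row)   # True: the round trip is exact
```

A hundred pixels collapse to three tokens, and decoding gives back exactly what went in, which is what "lossless" means here.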
Real tools go further than counting runs. Huffman coding looks at which symbols appear most and gives the common ones short codes and the rare ones long codes, the same instinct Morse code used by making E a single dot. There's a hard mathematical floor here, the data's entropy: the average number of bits per symbol that its unpredictability demands, below which no lossless method can go on average. The only way past that floor is to stop being lossless: JPEG photos and MP3 audio throw away fine detail your eyes and ears barely notice, trading perfect fidelity for much smaller files. So the real fork is lossless, which rebuilds the original exactly, versus lossy, which keeps something close enough and far smaller.
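To make the Huffman idea and the entropy floor concrete, here is a small Python sketch. It assumes each symbol is treated independently, with frequencies taken from the message itself. The helper names and the sample `message` are invented for illustration, and the sketch computes only code lengths, not the actual bit patterns.

```python
import heapq
from collections import Counter
from math import log2

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    counts = Counter(text)
    if len(counts) == 1:                      # a lone symbol still needs one bit
        return {symbol: 1 for symbol in counts}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)       # the two rarest groups...
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))  # ...merge, one level deeper
        tie += 1
    return heap[0][2]

def entropy_bits_per_symbol(text):
    """Empirical entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * log2(n / total) for n in counts.values())

message = "AAAAAAAABBBBCCDE"                  # 8 A, 4 B, 2 C, 1 D, 1 E
lengths = huffman_code_lengths(message)
counts = Counter(message)
average = sum(counts[s] * lengths[s] for s in counts) / len(message)

print(sorted(lengths.items()))  # [('A', 1), ('B', 2), ('C', 3), ('D', 4), ('E', 4)]
print(average)                            # 1.875 bits per symbol
print(entropy_bits_per_symbol(message))   # 1.875 bits per symbol
```

A fixed-width code for five symbols would need 3 bits each. Here the common A gets 1 bit and the rare D and E get 4, and the average lands exactly on the entropy because every frequency in this sample is a power of one half. With less tidy frequencies the Huffman average sits slightly above the entropy, never below it.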
Compression feeds on predictability, so repetitive files shrink dramatically while truly random data cannot be compressed at all.
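One way to check this on your own machine is with Python's standard `zlib` module, the DEFLATE method used in ZIP and gzip files. The sketch below is illustrative: the 100,000-byte size and the helper name `compressed_size` are arbitrary choices, and exact byte counts will vary from run to run for the random input.

```python
import os
import zlib

def compressed_size(data):
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

size = 100_000
repetitive = b"AB" * (size // 2)   # the same two bytes over and over
random_bytes = os.urandom(size)    # bytes from the operating system's random source

for label, data in [("repetitive", repetitive), ("random", random_bytes)]:
    # Lossless check: decompressing gives back exactly the original bytes.
    assert zlib.decompress(zlib.compress(data, 9)) == data
    print(f"{label}: {len(data)} -> {compressed_size(data)} bytes")
```

The repetitive input should collapse to a tiny fraction of its size, while the random input should come out about the same size or slightly larger, since the compressed format adds a little overhead of its own.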
1. Write out a string of runs like "AAAAABBBCCCCCCC" and count its characters: fifteen here.
2. Now rewrite it as counts and symbols: 5A 3B 7C. Count those characters instead, far fewer, and you can rebuild the original exactly.
3. Try it on random letters with no repeats. Your "compressed" version is the same length or longer, showing that compression only helps when there's a pattern to find.
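The steps above can also be tried in a few lines of Python. This sketch assumes the text contains no digits, since otherwise the counts would be ambiguous. The function name `rle_text` is made up for the example, and it writes a count even for single letters.

```python
import re

def rle_text(text):
    """Rewrite each run as its count followed by the symbol, e.g. 'AAAAA' -> '5A'."""
    return re.sub(r"(.)\1*", lambda m: f"{len(m.group(0))}{m.group(1)}", text)

for original in ["AAAAABBBCCCCCCC", "QWERTYUIOP"]:
    encoded = rle_text(original)
    print(original, len(original), "->", encoded, len(encoded))

# AAAAABBBCCCCCCC 15 -> 5A3B7C 6
# QWERTYUIOP 10 -> 1Q1W1E1R1T1Y1U1I1O1P 20
```

The patterned string drops from fifteen characters to six, while the string with no repeats doubles, because every letter now carries a count of 1. Leaving out the 1s would bring it back to the original length, but never below it.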
Common questions
Not with lossless compression, the kind that zips your documents. It shrinks the file and then rebuilds it byte-for-byte identical, exploiting the fact that most files repeat the same things over and over.
Compression feeds on predictability. Truly random noise has no repeats or patterns to reuse, so no lossless method can reliably make it smaller, a limit set by the data's entropy that can be proved mathematically.
They go lossy, throwing away fine detail your eyes and ears barely register to drop below the entropy floor that bounds lossless methods. The original is gone, but you usually can't tell.