See also How to cheat on video encoder comparisons

Absolutely the worst way to compare images is to convert one lossy format to another and conclude you “can't see the difference”.

Why is it bad? Save a photo as a couple of JPEGs at quality=98 and quality=92. It will be hard to tell them apart, but their file sizes will differ by nearly 40%! Does it prove that JPEG is 40% better than… itself? No, it shows that “quality appears the same, but the file is much smaller!” can easily be nonsense that proves nothing.

Nearly the same quality and 40% smaller! Obviously JPEG is so much better than JPEG…
[side-by-side: 65KB JPEG vs 39KB JPEG]
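If you want to reproduce this yourself, here's a rough sketch using Pillow (the filenames are placeholders, and the exact size difference will depend on the photo):

```python
# A rough sketch of the quality=98 vs quality=92 experiment using Pillow.
# "photo.png" is a placeholder for any lossless source photo.
from PIL import Image
import os

img = Image.open("photo.png").convert("RGB")

for quality in (98, 92):
    out = f"photo_q{quality}.jpg"
    img.save(out, "JPEG", quality=quality)
    print(f"quality={quality}: {os.path.getsize(out)} bytes")

# The two files will be hard to tell apart visually, but the size
# difference can be large; the exact ratio depends on the image.
```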

To make a fair comparison you really have to pay meticulous attention to encoder settings, normalize quality, and ensure that the compared images are in fact comparable.

It's really hard to make a fair comparison. It's like multithreaded programming: you think it's simple until you realize how many subtle things can ruin everything.

From now on I'm going to say “codec” instead of “image format”, because the efficiency of an image format depends on the tools used. Some tools compress images poorly, and that can be the fault of the tool, not the image format.

Comparing lossy formats (codecs)

To make a fair comparison:

Compare only one variable at a time.

Unless you find a Pareto improvement

If images differ in both quality and file size, it may be impossible to tell which one would win in a head-to-head comparison. Comparison of lossless formats (like PNG) with lossy ones is an extreme case of this problem.

Convert from a high-quality source

Converting from one lossy format to another creates an unfair situation: you tell the second codec to compress the distortions made by the previous codec in addition to compressing the image.

An obvious case of this is saving a photo as GIF and then as JPEG. Such a flawed test will make JPEG look worse than the GIF, even though we know JPEG is clearly much better for photos.

This applies to all lossy-to-lossy conversions. Even distortions that are invisible to the naked eye can bias the results.

Ensure tools' settings are as close as possible

If you publish your benchmark, it's good practice to also publish the encoders' settings, versions, and source images in a lossless format to let others verify your results (here I've used mozjpeg 3.0 + imagemagick 6.8 + this + this).

Do they save color at the same resolution? JPEG and some other formats have an option to save color at half resolution (chroma subsampling). It often makes sharp red lines blocky, but otherwise is hard to notice. If you don't correct for this, you may be telling one tool to compress twice as much data!
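How you control this depends on the tool. As a sketch, with Pillow's JPEG encoder you could pin the subsampling explicitly so both sides compress the same amount of color data (the filename is a placeholder):

```python
# Sketch: pin chroma subsampling explicitly so both tools compress the same
# amount of color data. With Pillow's JPEG encoder, subsampling=0 is 4:4:4
# (full-resolution color) and subsampling=2 is 4:2:0 (quarter-resolution color).
from PIL import Image

img = Image.open("photo.png").convert("RGB")
img.save("photo_444.jpg", "JPEG", quality=90, subsampling=0)  # no chroma subsampling
img.save("photo_420.jpg", "JPEG", quality=90, subsampling=2)  # 4:2:0 chroma subsampling
```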

Are all tools set to their “best” settings? Some encoders have default settings tuned for speed or compatibility.

mozjpeg's fast profile gives almost 20% larger files at the same quality setting (5% at the same SSIM)
[side-by-side: 20KB JPEG vs 17KB JPEG]

Compare at realistic quality

Compare images only at qualities you'd actually use. Codecs are optimized for real-world use cases and may perform very poorly outside a sensible quality range.

Choosing the lowest quality may seem like a clever idea to make differences obvious, but actually it makes the benchmark irrelevant. It's like running a Formula 1 race in a muddy field: it proves that tractors are faster than race cars.

JPEG at the lowest qualities completely falls apart. Reasonable quality at a lower resolution may be better.
[side-by-side: 1KB at Q=12, 320px vs 1KB at Q=46, 160px]
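Here's a rough sketch of that trade-off using Pillow; the filename, resolutions, and quality values are illustrative, and the byte counts you get will depend on the photo:

```python
# Sketch of the trade-off above: a heavily compressed full-resolution JPEG vs a
# moderately compressed half-resolution one, aiming for a similar byte budget.
import io
from PIL import Image

def jpeg_size(img, quality):
    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=quality)
    return buf.tell()

src = Image.open("photo.png").convert("RGB")
w320 = src.resize((320, 320 * src.height // src.width))
w160 = src.resize((160, 160 * src.height // src.width))

print("320px wide, Q=12:", jpeg_size(w320, 12), "bytes")
print("160px wide, Q=46:", jpeg_size(w160, 46), "bytes")
```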

The easy case: exactly the same file size

Adjust quality until the compared images have exactly the same file size. Pick the image that looks closer to the original.
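One way to do this is to binary-search the quality setting until you hit the target size. The sketch below uses Pillow's JPEG encoder as a stand-in for whichever codec you're testing, and assumes file size grows with the quality setting (which usually, but not always, holds):

```python
# Sketch: binary-search the quality setting to land as close as possible to a
# target file size, so two encoders can be compared at (nearly) equal bytes.
import io
from PIL import Image

def encoded_size(img, quality):
    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=quality)
    return buf.tell()

def quality_for_size(img, target_bytes, lo=1, hi=95):
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if encoded_size(img, mid) <= target_bytes:
            best = mid          # fits within the budget; try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1
    return best

img = Image.open("photo.png").convert("RGB")
q = quality_for_size(img, target_bytes=39 * 1024)
print("quality", q, "->", encoded_size(img, q), "bytes")
```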

Potential pitfalls:

The harder case: exactly the same quality

Compress to exactly the same quality, measured very precisely using an objective tool. Compare which file is smaller.
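As a rough sketch of what “measured using an objective tool” can look like, here's SSIM computed with scikit-image (filenames are placeholders; channel_axis needs scikit-image 0.19 or newer). The caveats below still apply: naive per-channel RGB SSIM is a starting point, not a gold standard.

```python
# Sketch: score two encodes against the lossless original with SSIM, using
# scikit-image's structural_similarity. This naive per-channel RGB SSIM still
# has the gamma and color caveats discussed below.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

original = np.asarray(Image.open("photo.png").convert("RGB"))
codec_a  = np.asarray(Image.open("photo_codec_a.jpg").convert("RGB"))
codec_b  = np.asarray(Image.open("photo_codec_b.jpg").convert("RGB"))

score_a = structural_similarity(original, codec_a, channel_axis=-1, data_range=255)
score_b = structural_similarity(original, codec_b, channel_axis=-1, data_range=255)
print(f"codec A: {score_a:.4f}  codec B: {score_b:.4f}  (closer to 1.0 = closer to the original)")
```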

Unfortunately, ensuring same quality is much harder than it seems:

Quality is not what you think it is

Quality “75” in one tool doesn't have to look like “75” in another tool. It's never comparable between image formats.

There's no objective mathematical definition of “quality”. It's an arbitrary made-up scale (theoretically it makes sense to measure difference on a scale from 0 to infinity, but for “quality %” you need to define what “0% quality” means). The quality setting often isn't even consistent between images or proportional to quality perceived by humans.

Your eyes are actually terrible at judging quality

It's funny, because our eyes are supposed to be the ultimate judge, but:

Subjective judgement is too “noisy” for the opinion of one person to matter. It's necessary to combine scores from hundreds of people to get statistically meaningful results.

If you don't have hundreds of people to test in a controlled environment, then you have to resort to an objective quality measurement tool. You won't be able to see the difference between JPEGs at quality 98 and 99, but a tool easily can.

Choose a good tool for objective measurement

There isn't an ideal measurement—they're all only approximations of human perception—but some are much better than others.

Don't use dumb pixel-by-pixel measurements like PSNR (peak signal to noise ratio) or MSE (mean square error) — it's been shown over and over again that they're easily fooled.

Better tools are based on the SSIM algorithm (and its extended versions such as MS-SSIM/IW-SSIM).

Beware of tools that don't apply gamma correction, as they will be biased towards dumb encoders that don't do gamma correction either.

Most tools deal with color badly. They either only analyze the grayscale version of the image (unfairly penalizing codecs that encode color very well) or measure distortions in the RGB color space, which isn't a good approximation of human perception.
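One partial mitigation is to convert sRGB to linear light and compute the metric on a luminance channel, as in this sketch; it removes the gamma bias but still ignores color errors:

```python
# Sketch: convert sRGB to linear light and compute SSIM on a linear-light
# luminance channel. This avoids the gamma bias, but throws away color
# information, so it's a partial fix, not a perceptual model.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def linear_luminance(path):
    srgb = np.asarray(Image.open(path).convert("RGB")).astype(np.float64) / 255.0
    # sRGB transfer function -> linear light
    linear = np.where(srgb <= 0.04045, srgb / 12.92, ((srgb + 0.055) / 1.055) ** 2.4)
    # Rec. 709 luminance weights, applied in linear light
    return linear @ np.array([0.2126, 0.7152, 0.0722])

score = structural_similarity(linear_luminance("photo.png"),
                              linear_luminance("photo_codec_a.jpg"),
                              data_range=1.0)
print(f"linear-light luminance SSIM: {score:.4f}")
```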

Test on many images

Benchmarks are like a game of Top Trumps: the features of the images you choose for the test are going to decide which codec will win.

All codecs have to make trade-offs. They are tuned for particular use cases and have their weak spots. Some codecs are great at preserving sharp lines, but inefficient at storing noise. Some codecs compress noise well, but lose all fine details at the same time. It's possible to make a codec that handles all these cases well, but at the cost of hideous complexity and high CPU requirements.

If you test on only one or a few images, your test may be skewed by luck and outliers. When you test on many images, be careful about the statistics: similarity scores are usually on a non-linear scale and file sizes vary by a few orders of magnitude, so naive sums or averages can give misleading results. For example, you can run into Simpson's Paradox, where one codec may be the best in almost all cases, but still get the worst score overall (due to scoring very, very badly on one image).
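As a sketch, one reasonable way to aggregate is to compute a per-image size ratio at equalized quality and summarize with a geometric mean (the numbers below are made up purely for illustration):

```python
# Sketch: summarize many per-image results with size ratios and a geometric
# mean (at equalized quality), so a handful of huge files or one disastrous
# outlier doesn't dominate a naive sum of bytes.
import math

# (codec_a_bytes, codec_b_bytes) per test image, at the same measured quality;
# the numbers here are made up for illustration.
results = [(65_000, 39_000), (120_000, 95_000), (18_000, 21_000)]

ratios = [b / a for a, b in results]
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"codec B needs {geo_mean:.2f}x the bytes of codec A (geometric mean over images)")
```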

To sum it up

Be very careful before you make sweeping judgements. It's too easy to make an unrealistic test that inaccurately measures the wrong thing.