Absolutely the worst way to compare images is to convert one lossy format to another and conclude you “can't see the difference”.
Why is it bad? Save a photo as a couple of JPEGs at quality=98 and quality=92. It will be hard to tell them apart, but their file sizes will differ by nearly 40%! Does it prove that JPEG is 40% better than… itself? No, it shows that “quality appears the same, but the file is much smaller!” can easily be nonsense that proves nothing.
To make a fair comparison you really have to pay meticulous attention to encoder settings, normalizing quality, and ensuring that compared images are in fact comparable.
It's really hard to make a fair comparison. It's like multithreaded programming: you think it's simple until you realize how many subtle things can ruin everything.
From now on I'm going to say “codec” instead of “image format”, because the efficiency of an image format depends on the tools used. Some tools compress images poorly, and that can be the fault of the tool, not the image format.
Comparing lossy formats (codecs)
To make a fair comparison:
Compare only one variable at a time.
Unless you find a Pareto improvement
If images differ in both quality and file size it may be impossible to tell which one would be better if they were in a head-to-head comparison. Comparison of lossless formats (like PNG) with lossy ones is an extreme case of this problem.
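Here's a minimal sketch of that check. The `Result` type and the numbers are made up for illustration, and `score` stands for whatever similarity-to-original measurement you trust (higher is better). A result only wins outright when it is no worse on both axes and strictly better on at least one.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    name: str
    size_bytes: int   # smaller is better
    score: float      # higher is better (hypothetical similarity score)


def pareto_winner(a: Result, b: Result) -> Optional[Result]:
    """Return the result that is no worse on both axes and strictly better
    on at least one, or None if the two are not directly comparable."""
    for x, y in ((a, b), (b, a)):
        no_worse = x.size_bytes <= y.size_bytes and x.score >= y.score
        strictly_better = x.size_bytes < y.size_bytes or x.score > y.score
        if no_worse and strictly_better:
            return x
    return None


# Made-up numbers, for illustration only.
small_and_sharp = Result("codec A", 40_000, 0.95)
big_and_blurry = Result("codec B", 50_000, 0.93)
small_but_blurry = Result("codec C", 30_000, 0.90)

print(pareto_winner(small_and_sharp, big_and_blurry))    # codec A wins outright
print(pareto_winner(small_and_sharp, small_but_blurry))  # None: it's a trade-off
```

When the function returns `None`, the two files differ in both variables in opposite directions, and you need to equalize one of them before you can say anything.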
Convert from high-quality source
Converting from one lossy format to another creates an unfair situation: you tell the second codec to preserve distortions made by the previous codec in addition to compressing the image.
An obvious case of this is saving a photo as GIF and then as JPEG. Such a flawed test will make JPEG look worse than the GIF, even though we know JPEG is clearly much better for photos.
This applies to all lossy-to-lossy conversions. Even distortions that are invisible to the naked eye can bias the results.
Ensure tools' settings are as close as possible
If you publish your benchmark it's a good custom to also publish encoders' settings, versions, and source images in a lossless format to let others verify your results (here I've used mozjpeg 3.0 + imagemagick 6.8 + this + this.)
Do they save color with the same resolution? JPEG and some other formats have an option to save color at half resolution (chroma subsampling). It often makes sharp red lines blocky, but otherwise is hard to notice. If you don't correct for this, you may be telling one tool to compress twice as much data!
Are all tools set to their “best” settings? Some encoders have default settings tuned for speed or compatibility.
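To see where “twice as much data” comes from in the chroma subsampling case, here's a back-of-the-envelope sketch. It assumes one brightness (luma) plane plus two color (chroma) planes, as in JPEG's usual YCbCr layout. The function name and the image size are mine, picked for illustration.

```python
def samples_per_image(width: int, height: int, h_factor: int, v_factor: int) -> int:
    """Count stored samples for one luma plane plus two chroma planes.

    h_factor and v_factor are the chroma downscaling factors:
    (1, 1) is full-resolution color (4:4:4), (2, 2) is color at half
    resolution in both directions (4:2:0).
    """
    luma = width * height
    # Round up, so odd dimensions still get a chroma sample at the edge.
    chroma_w = -(-width // h_factor)
    chroma_h = -(-height // v_factor)
    return luma + 2 * chroma_w * chroma_h


full = samples_per_image(640, 480, 1, 1)
half = samples_per_image(640, 480, 2, 2)
print(full, half, full / half)
```

For this image size the full-resolution encoder is handed exactly twice as many samples, before either codec has compressed anything.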
Compare at realistic quality
Compare images only at qualities you'd actually use. Codecs are optimized for real-world use cases and may perform very poorly outside a sensible quality range.
Choosing the lowest quality may seem like a clever way to make differences obvious, but it actually makes benchmarks irrelevant. It's like running a Formula 1 race in a muddy field: it proves that tractors are faster than race cars.
[Image comparison: two 1KB files, one saved at Q=12 and 320px, the other at Q=46 and 160px]
The easy case: exactly the same file size
Adjust quality until compared images have exactly the same file size. Pick the image that looks closer to the original.
- It's tempting to pick an image which “looks nicer”, but that's not the game codecs are playing. If the original is noisy, then the codec that preserves the noise better should be judged as better.
If a smoother version of a photo looks nicer to you, then make the benchmark fair by smoothing the photo in a photo processing tool first, and then test how that compresses. Image codecs are not Instagrams or Photoshops. They're supposed to save images with minimum distortion, not add distortions that look pretty.
- It may not be possible to achieve exact file sizes. To remove any doubt ensure that the winner also has the smallest file size. Otherwise it could be better only because it's slightly larger (lossy codecs can use as little as one bit per pixel, so even a few bytes may make a difference).
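One way to do the “adjust quality until sizes match” step without fiddling by hand is a binary search over the quality setting. This is only a sketch: `encode` is a placeholder for whatever call runs your encoder and returns the file's bytes, and the search assumes file size doesn't shrink when quality goes up, which is usually but not always true.

```python
from typing import Callable, Tuple


def best_quality_within(encode: Callable[[int], bytes], max_bytes: int,
                        lo: int = 1, hi: int = 100) -> Tuple[int, bytes]:
    """Binary-search the highest quality setting whose output fits in max_bytes.

    Assumes file size does not decrease as the quality setting increases.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        data = encode(mid)
        if len(data) <= max_bytes:
            best = (mid, data)
            lo = mid + 1   # fits, try a higher quality
        else:
            hi = mid - 1   # too big, try a lower quality
    if best is None:
        raise ValueError("even the lowest quality exceeds the size budget")
    return best


# Stand-in encoder for demonstration only: output grows with quality.
def fake_encode(quality: int) -> bytes:
    return bytes(100 * quality)


quality, data = best_quality_within(fake_encode, max_bytes=4250)
print(quality, len(data))
```

If you use the competitor's file size as `max_bytes`, the candidate can never be the larger file, so a win can't be explained away by a few extra bytes.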
Harder case: exactly the same quality
Compress to exactly the same quality, measured very precisely using an objective tool. Compare which file is smaller.
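Hitting one exact score with every codec is fiddly. A common workaround, which is my substitution rather than the only way to do it, is to encode at several settings, record (score, size) pairs, and interpolate the file size at the score you care about. The sketch below uses plain linear interpolation. Real size-versus-quality curves aren't straight lines, so this is only reasonable between closely spaced measurements. All numbers are made up.

```python
from bisect import bisect_left
from typing import List, Tuple


def size_at_score(points: List[Tuple[float, int]], target: float) -> float:
    """Linearly interpolate file size at a target quality score.

    points are (score, size_bytes) measurements for one codec on one image.
    The target must lie within the measured range; extrapolating would be
    guesswork.
    """
    pts = sorted(points)
    scores = [s for s, _ in pts]
    if not scores[0] <= target <= scores[-1]:
        raise ValueError("target score is outside the measured range")
    i = bisect_left(scores, target)
    if scores[i] == target:
        return float(pts[i][1])
    (s0, b0), (s1, b1) = pts[i - 1], pts[i]
    return b0 + (b1 - b0) * (target - s0) / (s1 - s0)


# Made-up measurements: (objective score, file size in bytes).
codec_a = [(0.90, 20_000), (0.94, 30_000), (0.98, 50_000)]
codec_b = [(0.91, 18_000), (0.95, 26_000), (0.99, 60_000)]

print(size_at_score(codec_a, 0.95), size_at_score(codec_b, 0.95))
```

The codec with the smaller interpolated size at the same score is the one to look at more closely.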
Unfortunately, ensuring same quality is much harder than it seems:
Quality is not what you think it is
Quality “75” in one tool doesn't have to look like “75” in another tool. It's never comparable between image formats.
There's no objective mathematical definition of “quality”. It's an arbitrary made-up scale (theoretically it makes sense to measure difference on a scale from 0 to infinity, but for “quality %” you need to define what “0% quality” means). Quality setting often isn't even consistent between images or proportional to quality perceived by humans.
Your eyes are actually terrible at judging quality
It's funny, because our eyes are supposed to be the ultimate judge, but:
- Quality is subjective. Is one big distortion worse than two smaller ones? Is a too-blurry image better than a too-noisy one? People routinely disagree on this and change their minds based on the images tested.
- Quality is hard to quantify. You can probably judge quality on a scale of 1 to 5, but if asked to judge precisely on a scale of 1 to 100 you'd be making things up like it was a wine tasting (“No, these pixels are too desaturated for 74/100. Definitely looks south of 73/100.”).
Subjective judgement is too “noisy” for opinion of one person to matter. It's necessary to combine scores from hundreds of people to get statistically meaningful results.
If you don't have hundreds of people to test in a controlled environment, then you have to resort to an objective quality measurement tool. You won't be able to see the difference between JPEGs at quality 98 and 99, but a tool easily can.
Choose a good tool for objective measurement
There isn't an ideal measurement—they're all only approximations of human perception—but some are much better than others.
Don't use dumb pixel-by-pixel measurements like PSNR (peak signal to noise ratio) or MSE (mean square error) — it's been shown over and over again that they're easily fooled.
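Here's a toy illustration of the problem, using an 8×8 patch I made up. One version is slightly brighter everywhere and the other has a single badly wrong pixel. A viewer would likely judge them differently, yet MSE and PSNR can't tell them apart.

```python
import math
from typing import Sequence


def mse(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean squared error between two equally sized lists of 8-bit samples."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of samples")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[int], b: Sequence[int]) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit samples."""
    error = mse(a, b)
    return math.inf if error == 0 else 10 * math.log10(255 ** 2 / error)


original = [128] * 64                   # flat gray 8x8 patch

slightly_brighter = [132] * 64          # every pixel off by 4
one_bad_pixel = [128] * 63 + [160]      # a single pixel off by 32

# Both lines print the same number.
print(psnr(original, slightly_brighter))
print(psnr(original, one_bad_pixel))
```

Both distortions have an MSE of exactly 16, because the measurement only adds up squared differences and has no idea where they are or how they look.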
Beware of tools that don't apply gamma correction as they will be biased towards dumb encoders that don't do correction either.
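To see why gamma matters, here's the standard sRGB decoding formula as a small sketch. It assumes the image is 8-bit sRGB, which is my assumption and not something a tool can take for granted. The same numeric error in stored pixel values corresponds to very different amounts of light depending on whether it falls in the shadows or the highlights.

```python
def srgb_to_linear(value: int) -> float:
    """Convert an 8-bit sRGB sample to linear light in the range 0.0 to 1.0."""
    c = value / 255
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


# The same 10-step error in stored values, in a dark and a bright area.
dark_error = srgb_to_linear(30) - srgb_to_linear(20)
bright_error = srgb_to_linear(240) - srgb_to_linear(230)
print(dark_error, bright_error)
```

The bright-area difference comes out many times larger in linear light, so a tool that does arithmetic on raw pixel values and a tool that converts first will disagree about which errors are big.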
Most tools deal with color badly. They either analyze only a grayscale version of the image (unfairly penalizing codecs that encode color very well) or measure distortions in the RGB color space, which isn't a good approximation of human perception.
Test on many images
Benchmarks are like a game of Top Trumps: features of images you choose for the test are going to decide which codec will win.
All codecs have to make trade-offs. They are tuned for particular use-cases and have their weak spots. Some codecs are great at preserving sharp lines, but inefficient at storing noise. Some codecs compress noise well, but lose all fine details at the same time. It's possible to make a codec that handles all these cases well, but at a cost of hideous complexity and high CPU requirements.
If you test on only one or a few images, your test may be skewed by luck and outliers. When you test on many images, be careful about statistics: similarity scores are usually on a non-linear scale and file sizes vary by a few orders of magnitude, so naive sums or averages can give misleading results. For example, you can run into Simpson's Paradox, where one codec may be the best in almost all cases but still get the worst score overall (due to scoring very badly on one image).
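Here's a small made-up example of how a naive sum can flip the conclusion. Codec A is 20% smaller on three small images and larger on one huge one. Summing bytes lets the huge image decide everything. Averaging per-image ratios with a geometric mean is one common alternative (not the only one) that gives every image an equal vote.

```python
import math
from typing import List


def geometric_mean(values: List[float]) -> float:
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up file sizes in bytes for the same images at matched quality.
codec_a = [1_000, 2_000, 4_000, 3_000_000]
codec_b = [1_250, 2_500, 5_000, 2_700_000]

# Naive total: dominated by the single huge image.
total_ratio = sum(codec_a) / sum(codec_b)

# Per-image ratios averaged geometrically: each image counts equally.
ratios = [a / b for a, b in zip(codec_a, codec_b)]
typical_ratio = geometric_mean(ratios)

print(total_ratio, typical_ratio)
```

The total says A is bigger and the per-image average says A is smaller. Neither is wrong. They answer different questions (total bandwidth versus a typical image), so decide which one you're asking before you pick a summary.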
To sum it up
Be very careful before you make sweeping judgements. It's too easy to make an unrealistic test that inaccurately measures the wrong thing.