Lossy compression of x-ray diffraction images

Why lossy compression?

In general, it is probably not advisable to mess with your raw diffraction image data in any way, but it does take up a lot of space! Personally, I highly recommend archiving your raw, uncompressed image data on some sort of stable, long-term medium, such as DVD-R or LTO tape, which currently have the lowest price per GB and are supposed to last 30+ years.

However, it would certainly be nice to have a way to keep some sort of "representation" of these raw data readily and rapidly available. Kind of like MP3 compressed music. Yes, there are experts who can hear the difference (and some will even tell you to use vinyl), but if your model is never going to agree with the structure factors to better than 20%, then how much could adding 1% noise hurt?

I still remember the days before MP3, when it was lamented that sampled audio files could never be compressed very well. Even today, bzip2 does not do well at compressing sampled audio (about 1.3:1), but MP3 files can be made at a compression ratio of 10:1 relative to CD-quality audio, and we all seem to still enjoy the music.

Lossless compression performance

It would be great if we could losslessly compress our data a tremendous amount, but how good could lossless compression of diffraction images possibly be? At one time, I ran an entropy calculation on the 44968 images on /data at the ALS 8.3.1 beamline, using a feature of Andy Hammersley's program FIT2D to compute the entropy. I don't pretend to completely understand the algorithm, but I do understand that the entropy of the image reflects the maximum possible compression ratio. For these images, the "theoretical maximum compression ratio" ranged from 1.2 to 4.8 with mean 2.7 and standard deviation 0.7. The values for Huffman encoding ranged from 0.95 to 4.7 with mean 2.4 and standard deviation 1.0. The correlation coefficient between the Huffman and "theoretical" compression ratios was 0.97. I had a look at a few of the outlier cases. As one might expect, the best compression ratios came from blank images (where all the high-order bits are zero). The #1 hardest-to-compress image had many overloads, clear protein diffraction and a bunch of ice rings.
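I don't know exactly which entropy formula FIT2D uses, but the zero-order version of the idea is easy to write down. The sketch below (plain Python/numpy, with a made-up function name and simulated Poisson "images" standing in for real detector frames) just histograms the pixel values, computes the Shannon entropy in bits per pixel, and divides it into the 16 bits each pixel is stored in:

    import numpy as np

    def max_compression_ratio(pixels, bits_per_pixel=16):
        """Zero-order Shannon entropy of the pixel values -> best possible lossless ratio."""
        values, counts = np.unique(pixels, return_counts=True)
        p = counts / counts.sum()
        entropy = -np.sum(p * np.log2(p))      # average bits actually needed per pixel
        return bits_per_pixel / max(float(entropy), 1e-9)

    rng = np.random.default_rng(0)
    blank_ish = rng.poisson(2, size=(1024, 1024)).astype(np.uint16)    # nearly featureless frame
    busy      = rng.poisson(500, size=(1024, 1024)).astype(np.uint16)  # high, noisy background
    print(round(max_compression_ratio(blank_ish), 1), round(max_compression_ratio(busy), 1))

In this toy, the nearly featureless image scores a much higher "theoretical" ratio than the busy one, consistent with the blank-image and ice-ring observations above.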

So, unless I am missing something, I think the best we are going to get with lossless compression is about 2.5:1.

A case for lossy compression?

Yes yes, I know it sounds like a horrible idea to use lossy compression on scientific data, because it would change the values of that most precious of numbers: Fobs. However, a very important question is:
How much would it change Fobs?
More practically: how much compression can you do before Fobs changes by more than the value of SIGFobs?

Diffraction patterns are inherently noisy. If you take the same image twice, then photon counting statistics make sure that no two images are exactly the same. So which one is "right"? If the changes in pixel values from a lossy compression algorithm are always smaller than that introduced by photon-counting noise, then is lossy compression really such a bad idea? The errors introduced could be small when compared to things like Rcryst and Rfree.
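As a back-of-the-envelope illustration (the numbers here are made up, not taken from any real data set): because independent errors add in quadrature, an added error half the size of the photon-counting noise only inflates the total noise by about 12%.

    import numpy as np

    counts = np.array([10, 100, 1000, 10000])   # photons recorded in a pixel
    poisson_sigma = np.sqrt(counts)             # noise that is already there
    added_error = 0.5 * poisson_sigma           # hypothetical compression error
    for n, s, e in zip(counts, poisson_sigma, added_error):
        total = np.sqrt(s**2 + e**2)            # independent errors add in quadrature
        print(f"{int(n):6d} photons: sigma {s:7.1f} -> {total:7.1f} ({100*(total/s - 1):.0f}% noisier)")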

I suppose the best "lossy compression" is the one that preserves the features of the image you want and throws out the stuff you don't care about. So, in a way, data-reduction programs are probably the best "lossy compression" we are going to get. However, these programs are very complicated, and there is, I think, a demand for a more brain-dead compression algorithm that works simply, reproducibly and deterministically on the pixels themselves.

The idea here is to have a massive number of data sets available "online", such as with the TARDIS effort. Typical MX beamlines collect data at a rate of 2 GB/hour, so the more efficiently we can use online storage, the more data sets we will have available for perusal and cursory searches for interesting features. If you do find one that is "interesting" or otherwise in need of careful attention, then you can ask whomever to go into their closet and get out the DVD or tape.

Example 1

The following is an example of a lossy compression on a test lysozyme dataset. This crystal was prepared in a similar way to the PDB ID 1h87, in that it was soaked with a Gd compound, but the data were collected at the Se edge where the Bijvoet ratio was 3.8%.

The compression ratio in this case is a factor of 34.

The original images were processed with Elves to obtain Forig and SIGForig (data to 1.4 A resolution). After compressing and then decompressing (cdc) the images and processing them in the exact same way, the structure factors obtained (Fcdc) were (on average) ~2.5% different from Forig. However, the overall Rmerge of both the original and cdc cases was much larger than this: 6.2%, and the average |Forig-Fcdc|/sigma(Forig) was 0.6. So, one could argue that the changes in F due to compression are "within the noise".
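For concreteness, here is roughly how one might compute those two numbers from a pair of merged data sets. The function name is mine, and I am guessing that the "~2.5%" figure is an R-factor-like sum, so treat this as a sketch rather than the exact recipe used above:

    import numpy as np

    def compression_damage(F_orig, SIGF_orig, F_cdc):
        """Overall fractional change in F, and mean change in units of sigma."""
        diff = np.abs(F_orig - F_cdc)
        r_like = diff.sum() / F_orig.sum()      # R-factor-style fractional difference
        z_mean = (diff / SIGF_orig).mean()      # average |Forig-Fcdc|/sigma(Forig)
        return r_like, z_mean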

Anomalous differences were remarkably well preserved: phased anomalous difference Fourier peak heights were 17.8 sigmas high in both cases.

Generally, all data processing stats were comparable, but perhaps most interestingly, the Rfree obtained at the end of a basic ARP/wARP run on Forig and Fcdc was 0.287 and 0.275, respectively.

Your mileage may vary.

Can you tell the difference?

So, I'm not going to tell you which one is which, but I flipped a coin, named the original and compressed-decompressed images "case A" and "case B", and put them here:

for the mildly curious:
first image only (9 MB)

and the full data sets (100 ADSC Q315r images, or 2 GB) here:
case A (571 MB)
case B (574 MB)
don't let the different file sizes fool you!

And the "package" containing all the information needed to "generate" the decompressed data set is here:
compressed image data (56 MB)

Performing this decompression requires the files in the download below.

I am interested in whether (and how) anyone can "tell" which one of these data sets was compressed (without just comparing to the decompression results).

Download

A "development snapshot" of the lossy diffraction image compression and decompression project is here:
development_snapshot.tar.gz (465 kB)
This includes two "C" programs and statically-linked Linux binary executables for each. A pair of scripts called "compress.com" and "decompress.com" is also provided, with obvious purposes. You will also need the ffmpeg package with the x264 video codec installed on your Linux system for the decompression script to work. This generally "comes with" the packaged install of ffmpeg (such as when you type "yum install ffmpeg").

How it works

Lossy compression is mathematically equivalent to adding a little bit of noise to the images and the extra noise makes them compress better. The trick is to minimize the impact of the added noise by making sure that it is less than or (at worst) comparable to the noise that is already there. The x264 codec makes this fairly easy to do.
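Here is a toy demonstration of that idea, using plain Python and bzip2 on a simulated flat background (not the actual x264 pipeline): "rounding" pixel values to steps of 5 counts changes no pixel by more than 2 counts, well under the ~10-count rms photon noise, yet it noticeably improves the compression ratio.

    import bz2
    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.poisson(100, size=(1024, 1024)).astype(np.uint16)   # flat background, ~10 counts rms noise

    def bzip2_ratio(a):
        raw = a.tobytes()
        return len(raw) / len(bz2.compress(raw))

    print("lossless bzip2 ratio:", round(bzip2_ratio(img), 2))

    # round every pixel to the nearest multiple of 5 counts: the error this
    # introduces is at most 2 counts, well under the ~10-count noise level
    step = 5
    rounded = (np.round(img / step) * step).astype(np.uint16)
    print("max change to any pixel:", int(np.abs(rounded.astype(int) - img.astype(int)).max()), "counts")
    print("bzip2 ratio after rounding:", round(bzip2_ratio(rounded), 2))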

It turns out that most of the entropy in diffraction images is not in the spots, but rather in the background between them. A shame really, as long runs of similar-valued data compress MUCH better if they are "rounded" to the same value. In fact, if you just extract the "spot regions" and set all the "background" pixels to zero, you can easily get compression ratios of 100-fold or more with gzip alone. The problem there, of course, is finding the spots (since you generally have not processed the data yet). The approach I took here is to "split" the images into "signal" and "noise" and then compress the "signal" losslessly and the "noise" with something lossy.
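To see how much of the entropy lives in the background, here is a toy version of the "keep only the spots" experiment: a simulated image, an arbitrary threshold, and gzip from the Python standard library. The real compression program described below is more careful than this, but the effect is the same.

    import gzip
    import numpy as np

    rng = np.random.default_rng(1)
    img = rng.poisson(50, size=(1024, 1024)).astype(np.uint16)     # smooth background
    ys, xs = rng.integers(0, 1024, size=(2, 300))                  # 300 fake "spot" pixels
    img[ys, xs] += rng.integers(500, 5000, size=300).astype(np.uint16)

    def gzip_ratio(a):
        raw = a.tobytes()
        return len(raw) / len(gzip.compress(raw))

    print("as recorded:       ", round(gzip_ratio(img), 1))

    # zero everything that is not well above the background level
    threshold = 50 + 4 * np.sqrt(50)
    spots_only = np.where(img > threshold, img, 0).astype(np.uint16)
    print("background zeroed: ", round(gzip_ratio(spots_only), 1))

With the smooth background in place, gzip manages only a modest ratio; with it zeroed, the ratio jumps enormously, which is the 100-fold-or-more effect described above.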

The procedure is (a rough code sketch of steps 1 through 4 follows the list):

  1. for a given pixel coordinate, find the median value across all images
  2. again for each pixel, determine the median absolute deviation from this median, and take this as the "noise level"
  3. now, for each image pixel, subtract the median and divide by the noise level, call this the "zlevel" image
  4. any zlevel values greater than a cutoff (4 by default, i.e. 4 noise levels above the median) are assigned to lossless compression. That is, these pixels are moved to "pristine pixel" images that are zero everywhere else, and these compress very well with bzip2.
  5. the median image and the noise level image also compress very well with bzip2, and there is only one of each per data set.
  6. the zlevel images are divided into regions smaller than 2048x2048 pixels to aid in parallelization and to overcome an image size limit in some distributions of the ffmpeg program.
  7. each zlevel region is then compressed as a video stream using the x264 codec. This codec allows the user to specify the maximum allowed change in any given pixel value. By default, this is set to ~1.7 times the noise level, which seems to give good results.
  8. to decompress, ffmpeg is used to convert the video streams back into images and the img_unsplit program reverses the above process.
Note that the median image produced by this procedure is a very nice way to remove all the spots from a data set and examine the smooth background directly.
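Here is a rough numpy sketch of steps 1 through 4. This is not the real "C" program, just the gist; the array shapes, the zcut default and the divide-by-zero guard are my own choices.

    import numpy as np

    def split_images(stack, zcut=4.0):
        """stack: (n_images, ny, nx) array of raw pixel values."""
        median = np.median(stack, axis=0)                  # step 1: per-pixel median
        noise = np.median(np.abs(stack - median), axis=0)  # step 2: median absolute deviation
        noise = np.maximum(noise, 1.0)                     # guard against zero noise level
        zlevel = (stack - median) / noise                  # step 3: "zlevel" images
        pristine = np.where(zlevel > zcut, stack, 0)       # step 4: outliers kept losslessly
        zlevel = np.where(zlevel > zcut, 0.0, zlevel)      # the rest goes to the lossy codec
        return median, noise, pristine, zlevel

    # median, noise and the sparse "pristine" images compress well with bzip2 (steps 4-5);
    # the zlevel images are tiled into <2048x2048 regions and fed to x264 (steps 6-7).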

Acknowledgements

I would like to thank Giles Mullen for very helpful discussions of video codecs and how they might be applied to x-ray diffraction data.

Questions? Comments? Think you know which dataset is compressed?

Please email James Holton and I will tell you if you got it right.