Lossy compression of x-ray diffraction images
Why lossy compression?
In general, it is probably not advisable to mess with your raw diffraction image data in any way, but it does take up a lot of space! Personally, I highly recommend archiving your raw, uncompressed image data on some sort of stable, long-term medium such as DVD-R or LTO tape, which currently have the lowest price/GB and are supposed to last 30+ years.
However, it would certainly be nice to have a way to keep some sort of "representation" of these raw data readily and rapidly available. Kind of like MP3 compressed music. Yes, there are experts who can hear the difference (and some will even tell you to use vinyl), but if your model is never going to agree with the structure factors to better than 20%, then how much could adding 1% noise hurt?
I still remember the days before MP3, when it was lamented that sampled audio
files could never be compressed very well. Even today, bzip2 does a poor
job of compressing sampled audio (about 1.3:1), but
MP3 files can be made at a compression ratio of 10:1 over CD-quality
audio and we all seem to still enjoy the music.
Lossless compression performance
It would be great if we could losslessly compress our data a tremendous amount, but
how good could lossless compression of diffraction images possibly be?
At one time, I ran an entropy calculation on the 44968 images on /data at the
ALS 8.3.1 beamline. I used a feature of Andy Hammersley's program
FIT2D
to compute the
entropy.
I don't pretend to completely
understand the algorithm, but I do understand that the entropy of the
image reflects the maximum possible compression ratio. For these
images, the "theoretical maximum compression ratio" ranged from 1.2 to
4.8 with mean 2.7 and standard deviation 0.7. The values for Huffman
encoding ranged from 0.95 to 4.7 with mean 2.4 and standard deviation
1.0. The correlation coefficient between the Huffman and "theoretical"
compression ratio was 0.97. I had a look at a few of the outlier
cases. As one might expect, the best compression ratios are from blank
images (where all the high-order bits are zero). The #1
hardest-to-compress image had many overloads, clear protein diffraction
and a bunch of ice rings.
So, unless I am missing something, I think the best we are going to get
with lossless compression is about 2.5:1.
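
To put a number like "theoretical maximum compression ratio" in concrete terms, here is a little
numpy sketch of the idea. It is not FIT2D's algorithm, just the zero-order Shannon entropy of the
pixel-value histogram, which sets a lower bound on the bits needed per pixel and hence an upper
bound on any lossless compression:

    import numpy as np

    def max_compression_ratio(pixels, bits_per_pixel=16):
        # zero-order entropy: bits of information per pixel, based only on
        # how often each pixel value occurs (spatial correlations ignored)
        values, counts = np.unique(pixels, return_counts=True)
        p = counts / counts.sum()
        entropy = -np.sum(p * np.log2(p))
        return bits_per_pixel / max(float(entropy), 1e-9)

    # a nearly blank image compresses far better than a noisy one
    blank = np.zeros((100, 100), dtype=np.uint16)
    blank[50, 50] = 40000                                   # one lonely "spot"
    noisy = np.random.poisson(200, (100, 100)).astype(np.uint16)
    print(max_compression_ratio(blank), max_compression_ratio(noisy))

The simulated Poisson image lands near the ~2.7 average quoted above; real "blank" images still
contain detector read-out noise, so they never reach the absurd ratio the synthetic blank gives here.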
A case for lossy compression?
Yes yes, I know it sounds like a
horrible idea to use lossy compression on scientific data, because it
would change the values of that most precious of numbers: Fobs.
However, a very important question is:
How much would it change Fobs?
More practically: how much compression can you do
before Fobs changes by more than the value of SIGFobs?
Diffraction
patterns are inherently noisy. If you take the same image twice, then
photon counting statistics make sure that no two images are exactly the
same. So which one is "right"? If the changes in pixel values from a
lossy compression algorithm are always smaller than those introduced by
photon-counting noise, then is lossy compression really such a bad
idea? The errors introduced could be small when compared to things like
Rcryst and Rfree.
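
For those who like to see the arithmetic, here is a toy numpy sketch of that comparison. It is not
the compression program itself, just two simulated "exposures" of the same pattern that differ only
by photon-counting noise, plus a hypothetical lossy step that never moves a pixel by more than half
its counting error:

    import numpy as np

    rng = np.random.default_rng(0)

    # a "true" pattern: 100-photon background with one 5000-photon spot
    truth = np.full((512, 512), 100.0)
    truth[256, 256] = 5000.0

    # two exposures of the same crystal differ only by counting noise
    exposure1 = rng.poisson(truth)
    exposure2 = rng.poisson(truth)
    photon_rms = np.std(exposure1 - exposure2) / np.sqrt(2)

    # a hypothetical lossy step that never moves a pixel by more than
    # half its photon-counting error (0.5*sqrt(N))
    lossy = exposure1 + rng.uniform(-0.5, 0.5, truth.shape) * np.sqrt(exposure1)
    compression_rms = np.std(lossy - exposure1)

    print("photon-counting rms:   %.2f" % photon_rms)
    print("compression-error rms: %.2f" % compression_rms)

The compression-error rms comes out a few times smaller than the photon-counting rms, which is the
sense in which such changes are "within the noise".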
I suppose the best "lossy compression" is the one that preserves the
features of the image you want and throws out the stuff you don't care
about. So, in a way, data-reduction programs are probably the best
"lossy compression" we are going to get. However, these programs
are very complicated, and there is, I think, a demand for a more
brain-dead compression algorithm that works simply, reproducibly and
deterministically on the pixels themselves.
The idea here is to have a massive number of data sets available "online" such as
with the
TARDIS effort. Typical MX beamlines collect data
at a rate of 2 GB/hour, so the more efficiently we can
use online storage, the more data sets we will have available for perusal and cursory
searches for interesting features. If you do find one that is "interesting" or
otherwise in need of careful attention,
then you can ask whomever to go into their closet and get out the DVD or tape.
Example 1
The following is an example of lossy compression applied to a test lysozyme data set. This crystal was prepared
in a similar way to PDB entry
1h87, in that it was
soaked with a Gd compound, but the data were collected at the Se edge, where the Bijvoet ratio was 3.8%.
The compression ratio in this case is a factor of 34.
The original images were processed with
Elves
to obtain Forig and SIGForig (data to 1.4 A resolution).
After compressing and then decompressing (cdc)
the images and processing
them in the exact same way, the structure factors obtained (Fcdc) were (on average) ~2.5% different
from Forig.
However, the overall Rmerge of both the
original and cdc cases was much larger than this: 6.2%, and the average
|Forig-Fcdc|/sigma(Forig) was 0.6.
So, one could argue that the changes in F due to compression are "within the noise".
Anomalous differences were remarkably well preserved: phased anomalous difference Fourier peak
heights were 17.8 sigmas high in both cases.
Generally, all data processing stats were comparable,
but perhaps most interestingly, the Rfree obtained at the end of a basic ARP/wARP run on
Forig and Fcdc was 0.287 and 0.275, respectively.
Your mileage may vary.
Can you tell the difference?
So, I'm not going to tell you which is which, but I flipped a coin and named the original
and compressed-decompressed data sets "case A" and "case B" and put them here:
for the mildly curious:
first image only (9 MB)
and the full data sets (100 ADSC Q315r images, or 2 GB) here:
case A (571 MB)
case B (574 MB)
Don't let the different file sizes fool you!
And the "package" containing all the information needed to "generate" the decompressed data set is here:
compressed image data (56 MB)
Performing this decompression requires the files in the download below.
I am interested in whether and how anyone can "tell" which one of these data sets was compressed (without just comparing to the decompression results).
Download
A "development snapshot" of the lossy diffraction image compression and decompression project is here:
development_snapshot.tar.gz (465 kB)
This includes two "C" programs and statically-linked Linux binary executables for each. A pair of
scripts called "compress.com" and "decompress.com" is also provided, with obvious purposes.
You will also need the
ffmpeg
package installed on your Linux system with the
x264 video codec
for the decompression script to work. This generally "comes with" the packaged install of ffmpeg
(such as when you type "yum install ffmpeg").
How it works
Lossy compression is mathematically equivalent to adding a little bit of noise to the images
and the extra noise makes them compress better. The trick is to minimize the impact of the
added noise by making sure that it is less than or (at worst) comparable to the noise that is
already there. The x264 codec makes this fairly easy to do.
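
A crude way to see the equivalence is simple rounding. x264 is of course far more sophisticated
than that, but the trade-off is the same: the coarser the rounding, the more "noise" you have added
and the better the file compresses. A minimal sketch, using bzip2 on a simulated Poisson background
(noise level about 10 counts):

    import bz2
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.poisson(100, (512, 512)).astype(np.int32)   # noise ~ sqrt(100) = 10

    def bz2_ratio(a):
        raw = a.tobytes()
        return len(raw) / len(bz2.compress(raw))

    for step in (1, 4, 8, 16):                # rounding step, in counts
        rounded = (image // step) * step      # lossy: pixels move by up to 'step'
        print(step, round(bz2_ratio(rounded), 2),
              round(float(np.std(rounded - image)), 2))

Once the rounding step approaches the noise level, the compression ratio climbs steeply while the
added error is still comparable to the noise that was already there.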
It turns out that most of the entropy in diffraction images is not in the spots, but rather
in the background between them. A shame really, as long runs of similar-valued data compress
MUCH better if they are "rounded" to the same value. In fact, if you just extract the
"spot regions" and set all the "background" pixels to zero, you can easily get compression
ratios of 100-fold or more with gzip alone. The problem there, of course, is finding the
spots (since you generally have not processed the data yet). The approach I took here is to
"split" the images into "signal" and "noise" and then compress the "signal" losslessly and the
"noise" with something lossy.
The procedure is as follows (a bare-bones sketch of the pixel-splitting arithmetic appears after the list):
- for a given pixel coordinate, find the median value across all images
- again for each pixel, determine the median absolute deviation from this median,
take this as the "noise level"
- now, for each image pixel, subtract the median and divide by the noise level,
call this the "zlevel" image
- any zlevel values greater than, say, 4 (the default), i.e. pixels more than
4 noise levels above the median, are assigned to lossless compression. That is,
these pixels are moved to "pristine pixel" images that are zero everywhere else,
and these compress very well with bzip2.
- the median image and the noise level image also compress very well with bzip2,
and there is only one of each per data set.
- the zlevel images are divided into regions smaller than 2048x2048 pixels
to aid in parallelization and to get around an image size limit in some distributions
of the ffmpeg program.
- each zlevel region is then compressed as a video stream using the x264 codec.
This codec allows the user to specify the maximum allowed change in any given
pixel value. By default, this is set to ~1.7 times the noise level, which seems
to give good results.
- to decompress, ffmpeg is used to convert the video streams back into images
and the img_unsplit program reverses the above process.
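
For concreteness, here is a bare-bones numpy sketch of the pixel-splitting arithmetic described in
the first few steps above. It is an illustrative re-implementation, not the distributed C code, and
it assumes the whole data set fits in memory as one (nimages x ny x nx) array:

    import numpy as np

    def split_signal_noise(stack, zcut=4.0):
        """Split a stack of images (nimages x ny x nx) into per-pixel
        background statistics, losslessly-kept "pristine" pixels, and
        normalized "zlevel" images, roughly following the recipe above."""
        median = np.median(stack, axis=0)                  # per-pixel background
        noise = np.median(np.abs(stack - median), axis=0)  # median absolute deviation
        noise = np.where(noise > 0, noise, 1.0)            # avoid dividing by zero

        zlevel = (stack - median) / noise                  # deviation in noise units

        # pixels far above background go to the lossless ("pristine") branch
        pristine = np.where(zlevel > zcut, stack, 0)
        # everything else stays in the zlevel images for the lossy (x264) branch
        zlevel = np.where(zlevel > zcut, 0.0, zlevel)

        return median, noise, pristine, zlevel

The median image, the noise-level image and the mostly-zero "pristine pixel" images are the parts
handed to bzip2; the zlevel images are what get fed to x264.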
Note that the median image produced by this procedure is a very nice way to remove
all the spots from a data set and examine the smooth background directly.
Acknowledgements
I would like to thank Giles Mullen for very helpful discussions of video codecs and
how they might be applied to x-ray diffraction data.
Questions? Comments? Think you know which data set is compressed?
Please email
James Holton
and I will tell you if you got it right.