The Micro-focus Data Processing Challenge


Small crystals burn up too fast.


The challenge:

I dare anyone to write an automation program for solving the structure by SAD phasing using only the data in xtal001 through xtal070 of these images: img files
The images are also available as fifteen 530 MB tarballs.
500 MB tarballs are also available for just image numbers 001, 002, and 003.

Don't let the unit cell fool you. This is not lysozyme. This is titin (1g1c), with the unit cell squeezed a bit so that two of the axes are exactly the same length. Although the space group is P212121, there will be an indexing ambiguity that you must resolve.

These data are a realistic simulation of the radiation damage situation faced with a lysozyme-sized protein growing ~5 micron crystals and shot with a 6 micron beam.

The exposure time was adjusted to get decent resolution on the first image, but unfortunately, you don't get very many shots before the crystal dies! And once you start trying to scale and merge the data, it gets even worse. The radiation damage creates "non-isomorphism" that rapidly becomes unmanageable, despite the fact that the damage model used here is actually a very simple equation (described by Holton & Frankel, 2010).

The real trick with this dataset, however, is the fact that the a and b axes are the same length, but the space group is P212121. This means that autoindexing will get a and b swapped for about half of the wedges and you will need to check each one of them and "flip" the ones that don't agree with the rest.
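
To make the "check and flip" step concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a solution to the challenge: every function name is hypothetical, each wedge is assumed to be already integrated into a list of ((h, k, l), intensity) pairs, Friedel differences are ignored when symmetry-equivalent observations are grouped, and a plain Pearson correlation on intensities stands in for whatever agreement statistic you prefer. The a/b swap is written as (h, k, l) -> (k, h, -l), with l negated so the axes stay right-handed.

```python
from math import sqrt


def flip_ab(hkl):
    """Alternative indexing: swap h and k, negate l to stay right-handed."""
    h, k, l = hkl
    return (k, h, -l)


def laue_key(hkl):
    """Map an index into the mmm asymmetric unit (Friedel mates merged)."""
    h, k, l = hkl
    return (abs(h), abs(k), abs(l))


def average_by_key(reflections, reindex=None):
    """Average the intensities of symmetry-equivalent observations."""
    sums = {}
    for hkl, intensity in reflections:
        if reindex is not None:
            hkl = reindex(hkl)
        key = laue_key(hkl)
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + intensity, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


def correlation(a, b):
    """Pearson correlation over reflections present in both dicts."""
    common = sorted(set(a) & set(b))
    if len(common) < 3:
        return None  # too few common reflections to say anything
    xs = [a[key] for key in common]
    ys = [b[key] for key in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def needs_flip(wedge, reference):
    """True if the wedge agrees better with the reference after the a/b swap.

    Returns None when the overlap is too small to decide either way.
    """
    ref = average_by_key(reference)
    cc_same = correlation(average_by_key(wedge), ref)
    cc_flip = correlation(average_by_key(wedge, flip_ab), ref)
    if cc_same is None or cc_flip is None:
        return None
    return cc_flip > cc_same
```

With no known answer to compare against, the "reference" has to come from the data itself, for example one wedge chosen arbitrarily and then grown as other wedges are assigned to it. Because each wedge here is small and badly damaged, the overlap between any two of them may be too thin for a single pairwise comparison to be trustworthy, which is a large part of what makes this hard.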

Is this a common problem when merging data from many crystals? Yes.
Is there a program for doing this automatically? No.

Good luck.


Why?

With 5 micron crystals and a 6 micron beam it is formally impossible to get a complete data set from a single crystal (Holton & Frankel, 2010). They burn up too fast. Solving structures from micro-focus beams therefore requires data processing software that can assemble data from multiple crystals that may or may not have ambiguous indexing, may have radiation damage, and are individually highly incomplete. Short of actually cheating (see below), can you figure this out?

If you can, I'll post your solution, and a link to your software here.


Where did these data come from?

They are from simulated diffraction patterns of titin (1g1c).


Do I have to download the raw images or can I get the XDS processing files?

Processing has many options, and XDS also changes over time. A recent "naive" processing run, where XDS was given no information about the data other than the image headers, is available as XDS_ASCII.HKL (51 MB) or INTEGRATE.HKL (70 MB).
An earlier version of XDS (2015) did things a little differently: INTEGRATE.HKL (34 MB) and XDS_ASCII.HKL (18 MB).


What if I'm not interested in resolving the indexing ambiguity?

If you just want to experiment with radiation damage correction and data merging strategy, the correctly-indexed and full-resolution INTEGRATE.HKL (74 MB) and XDS_ASCII.HKL (37 MB) files are available for you.


What constitutes "cheating"?

For the results of this challenge to be useful to practicing crystallographers (both those of us who use software and those who develop it), your "solution" to this problem must be a plausible "before you knew the right answer" scenario. The "right answer" is 1g1c. Anything else, at this point, I'd say is great!
So, is there anyone out there who can piece this data together? Anyone?

If you think you have found a way to solve this without "cheating" in any way, let me know! I will re-name the challenge after you.

James Holton <JMHolton@lbl.gov>