The SAD Twin Challenge

The challenge:

I dare anyone who considers themselves an expert macromolecular crystallographer to derive the structure in 3dko from these data.

Why?

Twinning has long been the kryptonite of anomalous phasing methods. Yes, there are a few examples out there, usually with a small number of sites, where a clever crystallographer (usually with surname "Dauter") was able to figure out the heavy atom partial structure despite the twinning. But, in general, heavy-atom finding programs get very confused by twinning. And a 50:50 "perfect twin" might be the most frustrating of all. You can see the anomalous differences. Even measure them very well. But you still can't make sense of them.

A major reason for this inefficacy in our software is that methods developers seldom get their hands on "interesting" twinned cases that have anomalous differences. Perhaps it is because the people who collected the data are too embarrassed to admit they had to find another crystal form? Also, with real crystals it is difficult to pick and choose what twin fraction the data have. 50:50 is generally considered impossible to solve, but what about 60:40, or 70:30?

Well, here you go:
twin_5050.mtz
twin_5149.mtz
twin_5248.mtz
twin_5347.mtz
twin_5446.mtz <-- impossible?
twin_5545.mtz <-- harder
twin_5644.mtz <-- hard
twin_5743.mtz <-- possible
twin_5842.mtz
twin_5941.mtz <-- Pavol Skubak Crank2 solution
twin_6040.mtz <-- Takanori Nakane SHELX[CDE] solution
twin_7030.mtz
twin_8020.mtz
twin_9010.mtz
twin_9901.mtz
All twin fractions in 1% steps can be downloaded at once in mtz or xds format.

The "right answer" here is the PDB entry 3dko modified slightly to have SeMet residues here.

The original structure is not from twinned data, but I selected 3dko because it belongs to a space group that CAN be twinned.

3dko has 12 Met residues, so there are 12 selenium sites to find. How hard could that be? Well, here is the success rate with shelxd:

So, in this case the anomalous signal is quite strong, and shelxd finds all 12 sites with twin fractions as high as 0.44, provided it runs for up to 100,000 trials. However, if you "cheat" and use the phases of the final, refined, correct model to compute a phased anomalous difference Fourier, then all 12 sites are clearly resolved above the tallest noise peak all the way out to a twin fraction of 0.5.
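One way to see why twin fractions close to 0.5 are so much harder than, say, 0.3 is the textbook relation for a hemihedral twin: each observed intensity is a weighted sum of two twin-related true intensities, and algebraic detwinning divides by (1 - 2*alpha). The sketch below illustrates only that relation. The function names are made up for this example, and it is not a description of what shelxd, Crank2, or any other program mentioned here does internally.

```python
def twin(i_a, i_b, alpha):
    """Observed intensities of a twin-related pair for twin fraction alpha."""
    obs_1 = (1.0 - alpha) * i_a + alpha * i_b
    obs_2 = alpha * i_a + (1.0 - alpha) * i_b
    return obs_1, obs_2


def detwin(obs_1, obs_2, alpha):
    """Invert twin() algebraically; undefined for a perfect twin (alpha = 0.5)."""
    denom = 1.0 - 2.0 * alpha
    if abs(denom) < 1e-12:
        # For a 50:50 twin, obs_1 == obs_2, so the pair carries no
        # information about how the total splits between i_a and i_b.
        raise ZeroDivisionError("perfect twin: intensities cannot be detwinned")
    i_a = ((1.0 - alpha) * obs_1 - alpha * obs_2) / denom
    i_b = ((1.0 - alpha) * obs_2 - alpha * obs_1) / denom
    return i_a, i_b


def error_gain(alpha):
    """Factor by which uncorrelated measurement noise grows on detwinning."""
    return ((1.0 - alpha) ** 2 + alpha ** 2) ** 0.5 / abs(1.0 - 2.0 * alpha)


if __name__ == "__main__":
    for alpha in (0.10, 0.30, 0.40, 0.44, 0.49):
        print(f"alpha = {alpha:.2f}  noise gain on detwinning = {error_gain(alpha):.1f}")
```

The noise gain grows without bound as alpha approaches 0.5, which is one plausible reason the small anomalous differences become so hard to use even when they are measured well.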

Where did these data come from?

They are actually from simulated diffraction patterns created for an educational workshop to demonstrate to novice crystallographers what twinning is. Specifically, there are two datasets: A and B, which are identical in every way except the crystal orientation. You can solve either one of them by SAD, no problem. However, if these two crystals were in the same beam at the same time, the diffraction pattern you'd get would be the pixel-by-pixel sum of the relevant images from the A and B datasets. You can generate this sum using my provided img_mix.com script. Just run it with no options to get online help. In this way, you can get any twin fraction you want. But, for starters, I'd say try your hand at the 80:20 case, and then see if you can get anywhere with 56:44.
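The pixel-by-pixel mixing described above can be sketched in a few lines. This is a minimal illustration that assumes the two images are already available as equally sized 2D lists of pixel values, and that the twin fraction is applied as a simple weight on each image. It is not the img_mix.com script, whose actual options and file handling are not reproduced here; `mix_images` is a hypothetical name.

```python
def mix_images(image_a, image_b, twin_fraction):
    """Weighted pixel-by-pixel sum of two images of identical shape.

    twin_fraction is the weight given to image_b; image_a gets the rest.
    """
    if not 0.0 <= twin_fraction <= 1.0:
        raise ValueError("twin_fraction must be between 0 and 1")
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    weight_a = 1.0 - twin_fraction
    weight_b = twin_fraction
    return [
        [weight_a * a + weight_b * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


if __name__ == "__main__":
    # Two tiny made-up "images": a spot at a different pixel in each.
    frame_a = [[100, 0], [0, 50]]
    frame_b = [[0, 200], [0, 50]]
    print(mix_images(frame_a, frame_b, 0.2))  # an 80:20 mixture
```

Real detector images also carry headers, offsets, and integer pixel formats, all of which this sketch ignores.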

Preliminary results

Beware R factors with twinning!

Note that all R/Rfree values on this graph come from refining the SAME model against the SAME data, just with different "true" twin fractions. So, if you see your R/Rfree drop when converting a symmetry operator to a twinning operator, that does NOT mean that you have twinning. It means you need to try the same swap, twinning operator in place of crystallographic symmetry operator, for every other operator in your space group, and see whether one of them does markedly better than the rest. If none does, you probably just have crystallographic symmetry and something else is wrong with your model.

Ever wondered how accurate all those "twin fraction estimates" you get from your favorite programs are? Well, here you go:

Here "pointless" is just running the standard L-test, which is known to underestimate high twin fractions. Interestingly, refmac's twin refinement, even given the right answer in the first place, tends to estimate a little high for low twin fractions. The maximum-likelihood twin fraction estimated by phenix.xtriage (the last one reported in the log file) seems to be the most accurate overall.
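For readers who have not met the L-test: it compares pairs of intensities through L = (I1 - I2)/(I1 + I2), and for ideal acentric data the mean of |L| is 1/2 when untwinned and 3/8 for a perfect twin. The simulation below is a rough sketch of that idea only. It assumes exponentially distributed (Wilson) acentric intensities and statistically independent pairs, and it is not how pointless, refmac, or phenix.xtriage compute their estimates; `mean_abs_l` is a hypothetical helper.

```python
import random


def mean_abs_l(twin_fraction, n_pairs=100_000, seed=0):
    """Simulated <|L|> for ideal acentric data with the given twin fraction."""
    rng = random.Random(seed)

    def observed_intensity():
        # Two independent Wilson-distributed true intensities, mixed by
        # the twin fraction into one observed intensity.
        i_a = rng.expovariate(1.0)
        i_b = rng.expovariate(1.0)
        return (1.0 - twin_fraction) * i_a + twin_fraction * i_b

    total = 0.0
    for _ in range(n_pairs):
        i_1 = observed_intensity()
        i_2 = observed_intensity()
        total += abs(i_1 - i_2) / (i_1 + i_2)
    return total / n_pairs


if __name__ == "__main__":
    for alpha in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        print(f"twin fraction {alpha:.1f}: <|L|> = {mean_abs_l(alpha):.3f}")
```

In this idealized model the whole range of twin fractions is squeezed into a change in <|L|> from 0.5 to 0.375, so small statistical or systematic shifts in <|L|> can translate into large shifts in the estimated twin fraction.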

What constitutes "cheating"?

For the results of this challenge to be useful to practicing crystallographers (both those of us who use software and those who develop it) your "solution" to the "impossible" problem must be a plausible "before you knew the right answer" scenario. For example, simply dropping in the "right answer" (3dko) and finding the sites with "MR-SAD" is definitely cheating.

Using the right sequence information is not cheating, since that is generally something you will know before you sit down to collect data.

However, this does raise another important question: can MR-SAD let you get away with a more distant homolog than regular MR alone? What about in the presence of twinning?

If you think you have found a way to crack the 54:46 twin case without "cheating" in any way, let me know! I will re-name the challenge after you.

James Holton <JMHolton@slac.stanford.edu>