(Biochem609 Molecular Replacement class; Bart.Hazes@Ualberta.ca; February 27, 2003)

Molecular Replacement



Introduction

Once you have collected your diffraction data you want to calculate an electron density map. However, you need phase information to do this. One solution to this problem originates from the following relationship:

figure 1
Figure 1: Formula to calculate the structure factors F(hkl) as the sum of the scattering contributions of j atoms (I have ignored temperature factors to keep it simple)

The important property of this formula is that you can calculate the structure factors (Fcalc) from a model's atomic coordinates x, y, and z. Therefore, by comparing Fcalc with Fobs you can identify a "correct" model from a large set of guesses. In the early 1900s people actually solved simple salt structures by educated guessing the positions of the atoms/ions and comparing the predicted and observed diffraction patterns. Unfortunately, this is beyond hope for even the tiniest protein.

For proteins there are just too many atoms to guess all their individual positions. However, if you already know the positions of atoms within one molecule then instead of placing atoms in the unit cell, you "only" have to place molecules in the unit cell [Molecular (Re)placement; here abbreviated as MR]. The prior knowledge of the protein structure, the "search model", can come from several sources:

The higher the similarity between the search model and the unknown protein the easier it will be to find a MR solution. In addition, the phases derived from the correctly placed search model will be more accurate. If you have a very poor search model then, even if MR succeeds, you may find that the phase information is too poor to successfully refine the structure due to serious model bias. So, unless you have high resolution data (better than 2.0 Å or so) or can use non-crystallographic symmetry averaging you may still need experimental phases. So if MR is very difficult because of a poor search model you may want to think about experimental phasing rather than fight the MR problem.

The text below presents MR in a very practical manner. You could also take a look at the MR web page from Randy Read that includes a more thorough discussion of the theory and some of the recent developments using maximum likelihood. Another web page worth reading is the one on MR from Ian Tickle. Finally, the proceedings from a recent CCP4 annual meeting on molecular replacement has been published as a special issue in Acta Cryst D57, 2001.


Basic principle

The goal in MR is to orient and position the search model such that it coincides with the position of the unknown protein in the crystal. This is depicted in Figure 2.

figure 2
Figure 2: Given a crystal with proteins "P" of unknown structure and the known structure of a related protein "P" (notice the different font in the figure); find out how to orient and position the known structure such that it corresponds with the unknown structure in the crystal.

From the figure it is clear that we have to determine both the orientation of the search model and its position. To define the orientation of a molecule you always need three parameters alpha, beta, and gamma. The position has to be found relative to the symmetry axes and need 0 (P1 space group), 2 (polar space groups), or 3 (all other space groups) parameters. So for most space groups we have to find 5 or 6 parameters per molecule, leading to billions of combinations to consider. With current computing power, this may not be entirely impossible to calculate, but there is a faster way to obtain the answer. The trick is to first find the three parameters defining the orientation of the molecule (rotation function) followed by a separate determination of the two or three positional parameters (translation function). To understand how we can separate these two aspects we have to look at the Patterson map.


Patterson map

P1 cell

The Patterson map is basically a Fourier map calculated with the square of the structure factor amplitude and a phase of zero. Patterson showed that this type of map is basically an interatomic vector map. Each peak in the map corresponds to a vector between atoms in the crystal and the intensity of the peak is the product of the electron densities of each atom. This is shown schematically in Figure 3.

figure 3
Figure 3: Unit cell with a 5-atom model and its corresponding Patterson map. The atoms of the 5-atom model have different colours. In the Patterson map, all Patterson vectors from one "originating atom" to all other atoms are drawn in the colour of the "originating atom". The complete Patterson is obtained by also considering the unit cell translations.

Figure 3 shows a number of properties of the Patterson map.

Since the Patterson map is independent of the position of the molecule, we only have to determine the three parameters that describe the orientation of the search molecule. One way of doing this is with the rotation function which compares the Patterson calculated from the search model with the Patterson from the observed data.

P2 cell

The example above was for a P1 cell (no symmetry). The figure below shows a unit cell with two molecules related by 2-fold symmetry around an axis running horizontally in the plane of the page.

figure 4
Figure 4: Unit cell with two 5-atom molecules related by 2-fold symmetry and the corresponding Patterson. Intramolecular vectors in the Patterson are coloured red or blue to indicate that they were derived from the red or blue molecule, respectively. The purple spheres are intermolecular Patterson vectors.

Figure 4 shows several more properties of the Patterson map.

So for any polar space group we have to determine the orientation as well as the position in the two dimensions perpendicular to the symmetry axis, so five parameters in total. However, because the intramolecular vectors are independent of position we can still use them to determine the orientation without knowing the position. Furthermore, as the intramolecular vectors are enriched near the origin, we can reduce the "contamination" by intermolecular vectors by restricting our analysis to a sphere surrounding the origin. After the orientation has been found with the rotation function, we can use the full Patterson to find the positional parameters by the translation function. Thus we have broken a 5-dimensional search into a more efficient 3- plus 2-dimensional search.

Higher symmetry cells

For space groups with two or more non-parallel symmetry axes it becomes too complicated to draw Patterson diagrams as I did above. The principles remain the same, however, in these cases you will have to find three positional parameters to uniquely define the position of the search molecule relative to all symmetry axes.

Factors affecting the observed Patterson map

As mentioned above, the Patterson map coefficients use the squared amplitudes of the structure factors. This means that strong reflections contribute disproportionally to the Patterson map. A dataset where strong reflections are systematically missing (for instance because they were overloading the detector) will have a distorted Patterson map. Similarly, outliers that are much too strong will adversely affect the Patterson map. Finally, data sets where a whole chunk of reciprocal space has not been measured, for instance data missing the cusp, will have artifacts in the Patterson map.

Factors affecting the calculated Patterson map / search model

The success of MR depends largely on the similarity between the search model and the protein of interest. There is a inverse correlation between the rms deviation of atomic positions and the percentage sequence identity. Clearly, the higher the sequence identity the easier the MR but it is hard to give an absolute % at which MR will no longer work. In addition to similarity, the extent to which the search model represents the molecule of interest is also important. If the search model is very similar, but only represents 10% of the complete protein under study it will be harder to find a solution. Another problem that has come up is that some proteins (for instance antibodies) have two or more domains that can adopt different relative orientations with respect to each other. If the search model does not have the same relative orientation then MR may fail. In this case, one solution would be to use the domains individually for MR.

It has been found that it often helps to remove parts of the search model that have a high likelihood of being different from the structure of interest. Therefore it is common practice to delete parts of the search model that are flexible or contain insertions/deletions. It is also better to remove water molecules. Finally, if the search model has individual B-values then you can first try to use them as is. Even if they are not highly reliable, they will down-weight (by having high B-values) the regions that are less well determined. If this does not give good results you can set one B-value for all atoms, corresponding to the fall-off of you data (an estimate is given by the CCP4 program TRUNCATE). Alternatively, you can add a constant to the existing B-values so that the average B-value reflects the fall-off of your data, whilst maintaining the relative differences in B-values between atoms.

When multiple search models are available, you can compare these models to find out what the conserved structural core is and use only that for MR. Alternatively, you can use all potential search models simultaneously. This will downweight the parts that are not conserved between the various models. This can also be applied to an ensemble of NMR structures.


Rotation function

The basic principle behind the rotation function is depicted below.

Rotation Function
Figure 5: Rotation function principle. The top half shows the search model and its corresponding Patterson vectors for a random orientation. The Patterson does not correspond to the observed Patterson shown at the bottom right. Rotating the search model to the correct orientation will give a calculated Patterson that matches the observed Patterson.

In order to quantify the overlap between the observed and calculated Pattersons we use a product function. In this procedure, the corresponding positions in the observed and calculated Patterson maps are multiplied. This gives a maximum when the two Pattersons overlap. In mathematical notation this is:

Rotation Function Equation
Figure "borrowed" from Randy Read's web page.

You can think conceptually about physically rotating your search model, calculating the Patterson and comparing it with the native Patterson. This has been implemented in various forms (CNS and XPLOR rotate a set of strong Patterson peaks rather than the search model itself). However, just as we can describe an electron density as a set of Fourier coefficients, Tony Crowther discovered that it is possible to describe the Patterson with spherical Bessel functions. The mathematical properties of these spherical Bessel functions allow you to calculate the rotation function scores for all possible orientations in one calculation (similar to a Fourier transform). I can't explain the mathematics to you, but the good news is that you can now calculate the rotation function in a matter of seconds or minutes.

As a user there are a few parameters that you control that will affect the sensitivity of the rotation function. Optimal settings differ for each case and if you don't get a clear hit with the default setting then you have to play with the parameters. I will go through them individually below.

Resolution

MR is sensitive to both the low and high resolution limits that you use during the analysis. The default high resolution limit is most often 4 Å. If there are significant differences between your search model and the protein of interest then the data beyond 4 Å is not going to be useful. If your search model is very similar to the protein of interest then you are going to get a good solution whatever resolution you use. However, in cases where your search model is very accurate but represents only a fraction of the contents of the unknown crystal unit cell, it may help to include higher resolution data as well. In contrast, if your search model is poor you may want to vary the high resolution limit between 4 - 6 Å.
The very low resolution data is also not normally useful. This is because in the real crystal the protein is embedded in solvent. In the search model the protein is basically in a vacuum. This gives systematic errors that are stronger at low resolution. In practice, the low resolution cutoff is normally about 8 Å, but you can try values in the range from 7 to 10 Å. Again, only worry about this if you don't get a good result with the default values.

B-value / E-values

Instead of applying resolution cutoff values, you can also apply a B-value to weaken (positive B) or strengthen (negative B) the high resolution terms. Some people also advocate the use of E-values. E-values, also called normalized structure factors, do not decrease with resolution (as if all atoms where non-vibrating point-atoms). This "sharpening" of the data places more emphasis on the high resolution terms, similar to applying a negative B-value. As described above for setting the high resolution cutoff, I would expect that sharpening your data is most useful in situations where your search model is relatively similar to the protein of interest.

Radius of integration

The radius of the sphere around the origin of the Patterson that is analysed (integrated) determines which fraction of the intramolecular vectors are going to be included and how much contamination of intermolecular vectors there is going to be. In any case you want the radius to be smaller than the smallest unit cell dimension in order to avoid integrating the origin peak corresponding to a full unit cell translation. For the same reason you don't normally want to include the volume very close to the origin so you set both a minimum and maximum radius of integration. The optimal setting depends on the size and shape of your molecule. None of the intramolecular vectors is going to be longer than the longest dimension of the protein so the radius of integration should never be larger than this. The rule of thumb is to use a radius of integration that is approximately 75% of the largest dimension. This will cover most of the intramolecular vectors without too much contamination by intermolecular vectors (see Figure 4). If your MR is not trivial you may want to try a few radii of integration to find the optimal setting.

Step size

You have to tell the program how fine/coarse to sample the rotation space. You want the angular step size to be small enough to make sure you don't mis the correct peak. The error in the relative position of two two atoms depends on the distance between those atoms. For instance, for two atoms separated by 20 Å an inaccuracy of 1 degree will give an error of 20*pi/180=0.35 Å. A step size of 2.5 degrees is normally ok (the worst case angular error will be 2.2 degrees). Smaller values won't hurt they just take longer to calculate and often don't help much either as the result is dominated by differences in the search model. You may want to use smaller values when your search model is good but partial or if you are using a rather large radius of integration which, as calculated above, will give larger positional errors.

Search model box size

You have to assign a unit cell for the search model. This can be any unit cell, but by choosing it cleverly you can get a better signal. The problem is that the Patterson calculated for the search model contains the intramolecular vectors you are interested in, but also intermolecular vectors between search models related by unit cell translations. You do not want the latter vectors to contaminate the sphere of integration of the search model (see Figure 6). This means that each unit cell axis should be larger than the dimension of the protein along that axis plus the radius of integration plus the high resolution limit. This is a minimal size. It doesn't hurt to make the box larger, it just makes the computer work harder.

Search model box
Figure 6: Choice of search model unit cell size. A too small unit cell (top half) causes intermolecular vectors between molecules related by a unit cell translation to infiltrate the sphere of integration (dotted circle). In the bottom example the cell is chosen to prevent any infiltration.

Effect of space group symmetry

For a unit cell with N asymmetric units, your search model, if complete, will represent only 1/Nth of the atoms in the unit cell. Therefore rotation function solution in high symmetry space groups tends to be more difficult.

Self rotation function

In some cases you will find that the rotation function gives more than one convincing solution. This happens when there are multiple copies of the search model present in the asymmetric unit of the crystal (this is also called non-crystallographic symmetry or NCS). For instance, when the crystal contains a dimer in the asymmetric unit, you should find two peaks, one corresponding to the orientation of each monomer. There is a special form of the rotation function where you compare the native Patterson with itself. Peaks in this so-called self-rotation function correspond to the rotation required to superimpose the multiple molecules in the asymmetric unit. It will give both the direction of the rotation axis and the angle of rotation. For dimers this should be close to 180 degrees, 120 degrees for trimers etc. In addition to the NCS symmetry peak, the highest peak will always be the 0 0 0 origin peak. Obviously, if you don't rotate, the molecule will superimpose on itself.

Before starting MR, or any other use of a new data set, I strongly recommend that you calculate a self-rotation function so that you are aware of non-crystallographic symmetry if you have it. You should also always calculate Mathews volume (Volume of asymmetric unit/molecular mass of your molecule) to find out if there can be multiple molecules in the asymmetric unit.

There are two exceptions where the self-rotation does not correctly show the non-crystallographic symmetry. The first case is when the NCS operation is parallel to a crystallographic symmetry operation of the same order; e.g. a 2-fold NCS parallel to a 2-, 4-, or 6-fold crystallographic axis, or a 3-fold NCS parallel to a crystallographic 3-, or 6-fold. In this case the orientation of the NCS-related molecules is identical to the orientation of molecules related by crystallographic symmetry and therefore cannot be distinguished. A second side-effect is that translational symmetry is generated, see Figure 7.

Parallel crystallographic and NCS symmetry
Figure 7: When a dimer, shown in red, is oriented such that its dimer axis (blue ellips) is parallel to crystallographic 2-fold symmetry (black ellips) then the NCS dimer and its crystallographically 2-fold related dimer (gray) are also related by a translation (corresponding to the blue vector). All atoms in one dimer are related by this vector to a corresponding atom in the other dimer (shown as black arrows for one monomer) and thus give a huge non-origin Patterson peak corresponding to this vector. The molecules shown as contours are related to the others by unit cell translation.

The presence of NCS translational symmetry can be detected in the Patterson map since each atom in molecule one will be related by the translation vector to the corresponding atom in molecule two. Therefore, whenever there is translational symmetry there will be a huge peak in the Patterson. The size of the peak will depend on the perfection of the translation. If the two "translated" molecules differ slightly in orientation then the peak hight will be reduced. When the orientation differs more significantly the peak may no longer be visible in a standard Patterson, but can often still be observed in a low resolution Patterson (e.g. using data to only 6, 8 or even 10 Å). Just like it is good practise to calculate a self rotation function for each new dataset it is also prudent to calculate a self Patterson map (calculated just from the squared Fobs).

The second situation where the self-rotation function may fool you is when your data shows merohedral twinning. In this case your crystal really consists of two or more crystals, oriented such that their diffraction patterns exactly overlap. This can occur when the symmetry of the unit cell shape is higher than that of its content. For instance, a trigonal unit cell shape has 6-fold symmetry, whereas its content has only 3-fold symmetry. When there is twinning, each twin makes an independent contribution to the structure factors and therefore to the peaks in the Patterson. You will therefore find a peak in the self-rotation function corresponding to the twinning operator. If you initially assume that this is NCS symmetry you will normally find that the resulting Matthews volume is impossibly low (unless your crystal has a very high solvent content to start with). This suggests that you may have a twinning problem. You can statistically detect twinning once you have collected your data and this is always a wise check to perform for data from unit cells that can twin. You can do this with the Merohedral Crystal Twinning Server.

The locked rotation function

If you have N-fold NCS in your unit cell, it means that a search model consisting of a monomer represents only 1/Nth of the total atoms in the asymmetric unit. Therefore it will be harder to find the rotation function solution. However, since the self-rotation function gave you the NCS operation you know that there should be N peaks in the rotation function related by the NCS symmetry. There is a special version of the rotation function that uses this extra knowledge to enhance the sensitivity. This is called the locked rotation function and it is implemented in the GLRF program.

Phased rotation function

In situations where some prior phase information is available, this can be used to help find the rotation function solution. If the phases are so good that you can interpret parts of the map, you could build a few well-defined parts of the model and then superimpose the search model onto these structural fragments. If the map is too poor to build a model, but you can see the solvent boundaries of the molecule, you can cut out a block of density from the map that contains the molecule of interest. Next you place this density in a large unit cell and calculate structure factors from the density. You can then use these structure factors as if they were your "observed" diffraction data. The Patterson corresponding to this "fake" crystal will only contain intramolecular vectors and therefore be much cleaner than the Patterson from the actual diffraction data. This would be similar to taking the blue molecule out of Figure 4 and getting a Patterson that would only show the red spheres.

Rotation angle definitions

There are two common ways to define rotations during MR. In one of these, the rotation is described as three consecutive rotations:

  1. Rotate around the Z axis by alpha degrees
  2. Rotate around the new position of the Y axis for beta degrees
  3. Rotate around the newest Z axis position for gamma degrees

This is depicted in Figure 8. This description of the rotation is hard on the brains but easy for the computer. It is normally used for the cross-rotation function. There are alternative definitions rotating around Z, X, and Z consecutively (and perhaps more) to make life more confusing, but the CCP4 package uses the ZYZ convention. A special situation arises when the beta angle is 0 or 180. In this case the first and third rotations are both around the original Z axis. Therefore the resulting rotation is alpha + gamma degrees around the Z axis. So the solutions 10 0 30 and 20 0 20 are identical as alpha + gamma is 40 degrees in both cases.

Euler rotation angles
Figure 8: Definition of the Euler rotation angles used in the CCP4 program package.

The second method defines the direction of a rotation axis and the magnitude of the angle to rotate. This is also referred to as polar angles. In this definition, phi and psi define the direction of the rotation axis, see Figure 9, and kappa defines the rotation angle. This definition is a bit easier on the brains and it is especially useful for the self-rotation function where you expect distinct rotation angles; e.g. 180 degrees for a dimer, 120 degrees for a trimer etc.

Polar rotation angles
Figure 9: Definition of the polar rotation angles used in the CCP4 program package.


Translation function

After the orientation of the search model has been found, we have to place it at the correct position in the unit cell. Figure 3 shows that for a P1 unit cell (no symmetry) you can place the molecule anywhere you want as both the intramolecular vectors and the vectors between molecules related by unit cell translations are independent of the position of the molecule in the unit cell.
Figure 4 shows that when there is just a single symmetry operation (polar space groups) then only the position perpendicular to the axis matters. So for a symmetry axis parallel to the C-axis we have to define x and y but can choose z arbitrarily. Finally, for non-polar space groups we have to define both x, y, and z relative to the symmetry operations.

The actual translation function again compares calculated and observed Pattersons. The information about the position is present in the intermolecular vectors so this time you use the full Patterson, minus the origin, and you again use a product function. In this case the product function can be evaluated by a Fourier transform using the correlation theorem. This is illustrated by the formulas below.

Translation Function Equation
Figure "borrowed" from Randy Read's web page.

Instead of comparing Patterson maps, you could in this situation also directly compare the calculated and observed structure factors since, if correctly positioned, the search model should generate calculated structure factors that correspond to the observed data. This has been implemented in for instance the program BRUTE.

Choice of origin

The correct position of the search model is defined relative to the symmetry axes in the crystal. However, there are always multiple symmetry axes running parallel in the crystal. For instance, figure 7 shows that if there is a 2-fold symmetry axis there are always at least 9 running parallel through the unit cell (one through the origin, 3 more related by unit cell translations, and 5 more in between the first 4 because a 2-fold plus a translation generates another 2-fold in the middle). If you find the correct position for your molecule relative to an axis, you still have to choose relative to which of the several axes you will place your model. This is an arbitrary decision. This also means that in the translation search you do not have to test all possible positions, just the unique set relative to the symmetry axes (for instance just one of the quadrants in the unit cell of figure7.

Screw axis ambiguity

For space groups with 3-, 4-, or 6-fold screw axes, the diffraction pattern does not allow you to determine whether they are 3(1)/3(2), 4(1)/4(3), 6(1)/6(5), or 6(2)/6(4) axes. However the type of screw does matter for the translation function. Therefore you have to try both possibilities. Only one should give a significant solution and that tells you what the true space group symmetry is.

Orientation refinement

The success of the translation function depends on how good your search model is and how accurate the rotation function solution is. For tricky problems you could consider to first do some rigid body refinement of the oriented search model in a P1 cell of the same size as your true unit cell. In the P1 cell, the position of the molecule doesn't matter. This unit cell with a single molecule will be a poor representation compared to the multiple models present in the real cell, but it is good enough to optimize the orientation. A better orientation will boost the signal in the translation function.

Phased translation function

Just like the rotation function, you can use prior phase information to increase the sensitivity of the translation function. However, this time you don't have to cut and paste electron density. You can consider the electron density in the unit cell (Rho[xtal])to be the convolution of the density of a monomer in the right orientation (Rho[model]) and a delta function having a peak at the right position (Rho[delta]).

Rho[xtal] = Rho[model] * Rho[delta]  {where * denotes the convolution}

The Fourier transform of this yields:

F[xtal] = F[model] x F[delta]   {where x denotes multiplication}

F[xtal] are your observed structure factors and the prior phases. F[model] are the calculated structure factors (amplitudes and phases) from your search model in the correct orientation. The only unknown is F[delta] which can be obtained by deconvolution. The resulting F[delta] corresponds to a map with a peak for the translation vector that needs to be applied to superimpose Rho[model] on Rho[xtal]. This property of the Fourier transform is exploited in the phased translation function which is a very sensitive way to find the solution to the translation function.


Multi domain/subunit placement

The unit cell may contain multiple molecules when, for instance, there is NCS, or if you have crystallized a complex of two or more distinct molecules. It is also possible that your protein contains multiple domains for which there are individual search models. Finally, for proteins with domains that can adopt different conformations relative to each other it is also often better to carry out MR with the individual domains. In all these cases there are a few extra things to consider.

It is best to start MR with the best search model; that means the model that is expected to be the most similar and/or most complete. For instance if you have crystallized a complex of a 40kDa and a 8kDa protein, you would start with the 40kDa protein as it represents the largest fraction of the scattering power.

For the rotation function you have to find the orientation of all fragments independently, unless there is NCS in which case you can use the locked rotation function if necessary. If you can find the orientation for some but not all parts then go ahead and use successful parts for the translation function. If you can solve a sufficiently large fraction of the total content of the asymmetric unit then phases from that partial model may allow you to recognize the missing part. You can then proceed with the missing part as if there is prior phase information

For the translation function you can build up your complete model in steps. First place the easiest fragment. Next you run the translation function again with the next fragment, but you also include the fragment whose position you have already found. The known part of the model will be kept fixed, whereas the position for the new fragment will be determined. You can repeat this if you have more than two fragments.

If there are ambiguities in the position due to the space group symmetry (P1 or polar) or origin choice, then you can arbitrarily make a choice for the first molecule that you place. However, the other molecules have to be placed correctly relative to the first molecule. Therefore, once the first molecule has been placed there are no longer any ambiguities. So you will have to search the full unit cell for their position.


New developments

Brute force searches

With increasing computer power there are now developments to do brute force 5 or 6 dimensional searches. The advantage is that you avoid the problem of overlap between intra- and intermolecular vectors and that all molecules in the unit cell are modeled in the search (compared to only a single model in the classical rotation function). Examples are: Queen of Spades (Monte Carlo minimization), SOMoRe (low resolution coarse search followed by refinement of most likely solutions), EPMR (evolutionary search), BRUTE (just plain crunching the N dimensions).

Beast

Beast is a rotation function that simultaneously models all symmetry related copies of the model. It works in Fourier space and compares Fcalc against Fobs using a maximum likelihood target. The problem is that you don't know the relative phase between the symmetry related copies when you don't know their relative positions. This uncertainty is incorporated in the likelihood probability distributions, assuming a SIM-like distribution where the symmetry copy with the strongest Fcalc defines the center of the distribution and the others define the variance. (Read, 2001, Acta D57, 1373-82)

Spherically averaged search model

MolRep now implements a new method where the translation function is solved first, so without yet knowing the orientation. The approach is to create a spherically averaged search model (reflecting the fact that the orientation is unknown) and use a phased translation function to find the optimal position. The orientation is then found by a phased rotation function. (Vagin & Isupov, 2001, Acta D57, 1451-6)