Problems in Protein Evolution

October 1, 2001, revised November 19

I do set My bow in the cloud

Some very serious problems in the evolution of proteins threaten the theory of evolution, and appear to disprove it. A demonstration of the seriousness of these problems therefore constitutes a disproof of the theory of evolution. In particular, the evolution of proteins having significantly different shapes (tertiary structures) than previously existing proteins appears to be impossible.

Although there are other arguments against the theory of evolution, the present argument differs in several ways. Some arguments against evolution involve the improbability of abiogenesis, that is, the origin of the first life forms. These arguments are convincing, but biologists will say that some as yet undiscovered mechanism resulted in the first life forms. The biologists' argument is hard to refute in a formal way. Other arguments point to the complexity of life, and the implausibility of such a complex system evolving. For example, Behe documents the complexity of flagella and argues that they are "irreducibly complex," meaning that the system cannot function unless many parts appear, and all these parts could not have arisen at once by evolutionary processes. This is also a convincing argument, but it is hard to formalize because it does not specify mathematical probabilities. Biologists will say that some as yet undiscovered mechanism resulted in the evolution of the flagellum and other structures that appear to be irreducibly complex. Other arguments involve the fact that information of the kind found in life forms does not appear by natural processes. However plausible this argument is, it does not have a formal mathematical justification. Similar comments apply to the lack of transitional forms in the fossil record, and many other arguments commonly used against the theory of evolution. The argument presented here is different, in that it involves existing genetic mechanisms, not hypothesized ones, and it involves the calculation of mathematical probabilities (or rather, improbabilities). This therefore appears to be the first argument that qualifies formally as a disproof of the theory of evolution.

This argument also involves redundancies in many of its aspects. That is, under several models, the change of shape of proteins is impossible, and each model has its own set of reasonable assumptions.

The difficulties presented here do not imply that proteins cannot acquire new functions by mutations; this can happen by mutations that do not significantly affect the shape of the protein. It is only the change of the shape of a protein that presents difficulties.

The difficulties in changing the shape of proteins by evolution involve both probabilities and laws of protein structure. The evolution of proteins of new shapes by point mutations is not possible, because the change of shape of a protein would require too many mutations. If the probability of a mutation is high enough to change the shape of the protein, then many other mutations will also occur that will essentially randomize the rest of the gene and cause the newly shaped protein to be harmful to the organism. One might argue that large scale changes to a gene could result in a protein of a new shape more readily than could point mutations. However, other arguments based on laws of protein structure prevent this. The kinds of amino acids that appear on the inside and the outside of a protein are different. There can only be a small number of insertions of a part of one protein into another that do not violate the distinction between the inside and outside of the protein, and the chance that any one of these will be beneficial, is very small.

The present argument is based on assumptions from the theory of evolution, according to which life began as a simple reproducing system that gradually developed into the life forms we see today. This system (or systems, if life developed multiple times) must have been very simple, because it had to originate without the benefit of evolution. In particular, it could only have had one or a very small number of proteins. From these few proteins, those in current life forms must have evolved.

Each protein is produced by one (or possibly several) genes. The evolution of new proteins must have occurred by mutations to these genes. Since genetic mechanisms in current life forms are strikingly uniform, with a few modifications, it is reasonable to assume that these mechanisms have been in operation for hundreds of millions or billions of years, in the accepted evolutionary scenario. Thus the evolution of many proteins from a few must have occurred by genetic mechanisms that are still in existence. Even if special mechanisms operated in the evolution of one-celled creatures, there are undoubtedly many proteins and shapes of proteins that only appear in multicellular organisms, and these must have evolved from others by currently existing genetic mechanisms. Furthermore, because all known one-celled organisms have similar genetic mechanisms, it is reasonable to assume that these mechanisms were operating for a considerable portion of the time that these one-celled organisms evolved from simpler organisms having many fewer proteins. During this time, proteins having new shapes must have appeared.

Each protein is composed of a sequence of amino acids that join together and are then called "residues." The "side chains" of the residues determine their chemical properties. Some side chains are hydrophobic (oily) and tend to cluster together inside the protein. Others are hydrophilic (water loving) and tend to occur on the outside of the protein. Thus a given shape of a protein will tend to be associated with a particular sequence of hydrophobic and non-hydrophobic side chains. Changing the shape of the protein requires changing this sequence of side chains.

A problem with the evolution of proteins having new shapes is that proteins are highly constrained, and producing a functional protein from a functional protein having a significantly different shape would typically require many mutations of the gene producing the protein. All the proteins produced during this transition would not be functional, that is, they would not be beneficial to the organism, or possibly they would still have their original function but not confer any advantage to the organism. It turns out that this scenario has severe mathematical problems that call the theory of evolution into question. Unless these problems can be overcome, the theory of evolution is in trouble.

The typical mechanism proposed to explain the evolution of new proteins is that an existing gene is duplicated, and one of the copies of the gene then begins a series of mutations that eventually result in a gene able to produce a new protein. If the mutations result in a change in the shape of the protein, then the protein will probably no longer have a function in the organism, because the function of a protein is closely related to its shape. The mutating duplicated gene is still able to produce a protein, but the protein has no function in the organism. We call such a gene "useless" to indicate that it does produce a protein but the protein has no function in the organism. This is distinct from "pseudogenes," which no longer produce proteins at all because mutations have corrupted a control region or something else necessary for the gene to function.

A difficulty with this scenario for protein evolution arises from the small number of genes. If many of these genes are useless, then the number of useful genes would be even smaller than the number of discovered genes, which seems highly unlikely. Therefore the average number of useless genes (as opposed to pseudogenes) in an organism is very small, reducing the probability that this kind of evolution can occur. Furthermore, a useless gene produces a protein that either fails to fold properly or has no useful function in the organism. Producing this protein requires extra energy without producing any benefit, and is therefore detrimental to the organism. In addition, misfolded proteins have to be removed from the cell, requiring extra energy. Misfolded or useless proteins are actually likely to have some harmful effect on the organism. This means that these useless genes are likely actually harmful to the organism, more harmful than pseudogenes. Harmful mutations tend to be eliminated from a population, making it even more unlikely that a useless gene would persist long in a population. Finally, it is likely that some mutation to a useless gene would render it nonfunctional, producing a pseudogene, which would be unlikely to result in the evolution of a new protein of benefit to the organism. For all of these reasons, the evolution of new protein shapes through neutral mutations is highly unlikely. However, it is necessary to estimate the number of mutations needed in order to bound the probability of such new shapes evolving.

In [Denton and Marshall 01], it is stated that there are probably only a small number of protein folds, perhaps not more than a thousand or so altogether:

Consideration of these 'constructional laws' suggests that the total number of permissible folds is bound to be restricted to a very small number -- about 4,000, according to one estimate. Confirmation that this is probably so is provided by a different type of estimate, based on the discovery rate of new folds. Using this method, Cyrus Chothia of Britain's Medical Research Council estimated that the total number of folds utilized by living organisms may not be more than 1,000. Subsequent estimates have given figures of between 500 and 1,000.
References given include [Chothia 92] and [Lindgard and Bohr 96]. This small number of folds is evidence that protein folds, and functional proteins, are highly constrained. Thus the probability that a random sequence of amino acids would produce a properly folding protein is very small, because a random sequence would be highly unlikely to have the proper sequence of hydrophobic and hydrophilic side chains. This in turn implies that it would typically require many mutations to produce a new protein shape from an existing one.

For purposes of this argument, it is reasonable to estimate the probability of producing a specific new shape of protein from existing ones. This is because the probability of producing such a new shape decreases rapidly with the number of mutations required. Therefore the probability of producing any new shape protein is dominated by the probability of producing the particular new shape that requires the smallest number of mutations. To see this, consider f(x) = 10^(-x) and note that the sum f(3) + f(5) + f(8) is almost identical to f(3), because the larger arguments hardly influence the sum at all. And among the shapes requiring a minimal number of mutations, only a small fraction will be beneficial to the organism. Even if all shapes were considered, it would probably influence the probability by a factor of at most 1000 or 10,000.
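The claim that the smallest exponent dominates the sum can be checked numerically. The following is a minimal Python sketch; the function name f follows the text, while the variable names total and ratio are illustrative only.

```python
def f(x):
    """Weight that falls off by a factor of ten for each unit increase in x."""
    return 10.0 ** (-x)

# Sum over three hypothetical shapes needing 3, 5 and 8 mutations.
total = f(3) + f(5) + f(8)

# How much larger the full sum is than its leading term alone.
ratio = total / f(3)

print(total, ratio)
```

The ratio exceeds 1 by only about one percent, which is the sense in which the term with the fewest required mutations dominates.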

Random sequences of amino acids are highly unlikely to fold properly into any shape of protein; rather, they will have many hydrophobic side chains exposed and will tend to stick together and also stick to other proteins that are in the process of folding. It is possible to estimate the probability that a random sequence of amino acids will fold properly. Assume that each protein fold has a particular sequence of hydrophobic and hydrophilic side chains, and the chance that a random amino acid will have the right properties for a particular position in the fold is 1/2. Each amino acid is specified by a codon of three nucleotides, each of which can be one of four bases, though the third nucleotide often does not make much of a difference. Therefore for a gene of 1000 base pairs, which is typical, there will be 333 amino acids, and the chance that a random sequence will have the right properties for a particular fold is 2^(-333), or about 10^(-100). Since there are about 10,000 folds, the chance that a random sequence could fold into any of these shapes would be at most 10^(-96). However, it is more reasonable to assume that some of the amino acids can deviate from the required properties without damaging the fold. If each fold permits 10 percent of the amino acids to be arbitrary, then one can show that the probability that a random sequence of amino acids will fold properly is about 10^(-50), under these assumptions. In reality, the probability should be much smaller, because it is not enough just to have hydrophobic or non-hydrophobic side chains at specified positions; other constraints must be satisfied, as well.
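These figures can be reproduced with a short calculation. The sketch below is one possible reading of the model, not necessarily the computation the author performed: it assumes that "10 percent arbitrary" means any 33 of the 333 positions may carry the wrong hydrophobic or hydrophilic character, and it works in base-10 logarithms. All names in the code are illustrative.

```python
import math

N_SITES = 333      # residues in a 1000-base-pair gene, as in the text
N_FOLDS = 10_000   # number of folds assumed in the text
TOLERANCE = 0.10   # assumed fraction of positions allowed to mismatch

def log10_match_probability(n_sites, max_mismatches):
    """log10 of P(at most max_mismatches positions are wrong) when each
    position independently has the right character with probability 1/2."""
    # Count the binary patterns within max_mismatches of the target pattern.
    favourable = sum(math.comb(n_sites, k) for k in range(max_mismatches + 1))
    return math.log10(favourable) - n_sites * math.log10(2)

# Exact match to one fold, then the bound over all assumed folds.
strict = log10_match_probability(N_SITES, 0)
any_fold_strict = strict + math.log10(N_FOLDS)

# Allowing up to 10 percent of positions to deviate.
tolerant = log10_match_probability(N_SITES, int(TOLERANCE * N_SITES))
any_fold_tolerant = tolerant + math.log10(N_FOLDS)

print(f"strict, one fold:    10^{strict:.1f}")
print(f"strict, any fold:    10^{any_fold_strict:.1f}")
print(f"tolerant, any fold:  10^{any_fold_tolerant:.1f}")
```

Multiplying by the number of folds gives an upper bound (a union bound), which is why the text says "at most."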

An estimate of the number of mutations needed to produce a protein of a new shape is provided by [Baker and Sali 01] in which it is stated that prediction of the structure of a protein is difficult if there is less than a 30% amino acid sequence identity. If there is 30% or more agreement in the amino acid sequences, the structure of the proteins will be similar. This implies that proteins of significantly different structure will differ by more than 70% of their amino acids. Also, from [Service 97]:

pairs of natural proteins differing in up to 70% of their amino acid sequences virtually always fold up in to the same general 3D structure.
It follows that obtaining a different structure requires more than a 70 percent difference in amino acid sequence, almost always. If a protein has 1000 coding base pairs, or 333 amino acids, then at least 233 of these must differ in order to have a 70% difference and a new shape. Each such difference will require one, and possibly two, point mutations, for probably well over 300 point mutations in all. There was a competition ([Service 97]) to reduce this 70 percent figure. The winner succeeded in obtaining a new protein fold by changing only 40 percent of the amino acids! This suggests that reducing this 40 percent figure will be difficult.

One can also get an estimate by noting that protein structures tend to bury hydrophobic side chains. Thus each protein fold will correspond to a particular pattern of hydrophobic and non-hydrophobic side chains. It seems that on the average one would have to change about half of the side chains from hydrophobic to non-hydrophobic and vice versa to get a new fold. Half of 333 amino acids would be 166 amino acids. Each such change would require one or two mutations, for well over 200 point mutations to change from one protein fold to another.

Another way to justify the fact that many mutations are needed to change the shape of a protein is found in [Cordes et al 99]:

Mutagenesis experiments show that limited changes in sequence can have large effects on stability and activity, but generally do not lead to large shifts in structure. For example, highly disruptive mutations such as insertions in elements of regular secondary structure or hydrophobic-to-charged substitutions at core positions lead to only minor structural differences in bacteriophage T4 lysozyme and staphylococcal nuclease, pointing to a strong drive to preserve the basic native fold.
This implies that many mutations are needed to produce a new fold. It also implies that in the transition between folds, a protein passes through a region of instability. Since natural proteins tend to be stable, it must be that instability is detrimental to the organism and (under evolutionary assumptions) is eliminated from the population. Therefore proteins tend to remain in regions of stability, and many mutations are required to change their shape. Thus mutations along the path of change would be harmful to the organism and would tend to be eliminated from the population.

Another figure is given in [Chen et al 96]. In order to change the function of a protein, making a slight change in the secondary structure of one small loop, leaving most of the tertiary structure of the protein unchanged, it was necessary to replace one sequence of 7 amino acids by another sequence of 13 amino acids and change 4 other amino acids. Inserting 6 amino acids requires 18 insertions and changing 7+4 more requires between 11 and 22 more substitutions, for a total of between 29 and 40 mutations. If the insertions are done carefully, these bounds can be reduced by 3 and 6 substitutions, respectively.

To get another estimate of the number of mutations needed, consider two additional references. In [Cordes et al 99], the authors present a mutation that exchanges one fold of a protein for another. This occurs by exchanging the functions of two residues in each strand of a beta sheet, and each ASP-LEU mutation requires at least four substitutions, for a total of at least eight point mutations. But these appear to happen in both monomers of the protein, which are coded for by the same gene, so only four point mutations would be needed. The new fold leaves most of the protein structure intact, so it is not clear that this qualifies as a change of the shape of the protein either. What happens is that a small portion of the two ends of the protein changes its configuration, but the middle of the protein does not. It would seem easier to change the configuration of the ends of a protein without modifying the overall structure than to change the middle. Changing the shape of a fold in the middle of a polypeptide chain would be unlikely to leave the ends of the fold in the same position as before, implying that additional changes would be necessary further on down the protein. Furthermore, the ends of the protein are on the outside of the protein in this case. For portions of a fold on the inside of the protein, changing the shape of the fold would also require changing the geometry of the surrounding portions of the protein, undoubtedly requiring many more mutations. For the small portion of the fold that is changed, 1/3 of the amino acids had to be changed. Larger folds and folds in the middle of the protein would require a larger proportion of amino acid changes and many more changes altogether.

In [Cordes et al 00], the authors discuss what happens when only half of these mutations occur. If only two of the four substitutions occur, the protein will rapidly oscillate between the two configurations, preserving the overall shape of the protein. This does not qualify as a change in the overall shape of the protein, but suggests that there are regions of instability between different folds.

We will suppose for concreteness that eight point mutations can produce a new shape of a protein that folds properly, with not too many hydrophobic side chains on the outside, et cetera, so that the protein could conceivably be a functional protein. However, it is still unlikely that the new protein has a function in the organism, because almost all mutations are harmful. Therefore it probably would require at least several such sets of mutations to produce a new protein shape that folds properly and has a useful function in the organism, or, 24 point mutations in all. In addition, a number of mutations to one or more active sites or other places in the protein involved in chemical interactions would probably be necessary, which we will estimate as at least 10 point mutations, for 34 in all. In reality, this is being very generous, and considerably more would be needed. Estimates given above were in the range 200-300 or more. If one assumes that a protein fold can tolerate disruptive mutations to ten percent of its amino acids, then at least 34 mutations would be needed to change the shape of the protein. Thus the figure of 34 point mutations is highly conservative.

Now, one problem with producing proteins of new shapes by evolution is that the mutations of a useless gene would be neutral mutations, because they would confer no benefit to the organism. From population genetics considerations it follows that almost all neutral mutations are eliminated from a population, and most such mutations do not last very long. According to the theory of neutral evolution, neutral mutations are eliminated from a population on the average in 2(Ne/N)ln(2N) generations (if I understand the matter correctly), where Ne is the effective population size, N is the population size, and ln is natural logarithm. Note that Ne/N is at most 1. For a population of a billion, this would be at most about 43 generations. For a population of a trillion, it would be at most about 57 generations. The chance to accumulate a significant number (even 2) of neutral mutations to a gene within 43 to 57 generations is negligible. A typical mutation rate for a gene is one mutation for every 10^5 generations, that is, a mutation somewhere in the gene typically occurs about this frequently. In order to get 34 mutations, the gene would have to persist in the population for about 34*10^5 generations, which is highly improbable because neutral mutations are quickly lost, as a rule. However, the effect of this is difficult to assess: even though most neutral mutations are quickly lost, those that remain may spread to a larger number of individuals. Also, there are "frequency-based selection" models under which rare alleles have an elevated chance of being retained in the population, which would tend to negate this loss of neutral mutations.
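The persistence expression is easy to evaluate directly. The sketch below simply plugs the two population sizes from the text into 2(Ne/N)ln(2N), taking Ne/N = 1 as the largest allowed value, and compares the result with the time needed to accumulate 34 mutations at the quoted rate. The function and constant names are illustrative, and the formula is taken from the text as stated rather than independently verified.

```python
import math

def mean_generations_to_loss(population, ne_over_n=1.0):
    """Evaluate 2 * (Ne/N) * ln(2N), the expression quoted in the text."""
    return 2.0 * ne_over_n * math.log(2.0 * population)

billion = mean_generations_to_loss(1e9)
trillion = mean_generations_to_loss(1e12)

MUTATIONS_NEEDED = 34            # figure adopted in the text
GENERATIONS_PER_MUTATION = 1e5   # one mutation per gene per 10^5 generations
generations_required = MUTATIONS_NEEDED * GENERATIONS_PER_MUTATION

print(f"N = 1e9:  about {billion:.1f} generations")
print(f"N = 1e12: about {trillion:.1f} generations")
print(f"needed for 34 mutations: {generations_required:.0f} generations")
```

Because the expression grows only logarithmically with N, even a thousandfold larger population adds only about 14 generations.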

It is necessary to estimate the total population size in order to bound the probability of new protein shapes evolving. A recent article ([Whitman et al 98]) estimated that there are now about 5 * 10^30 prokaryotes in existence and that on the average they reproduce about once every three years. This seems to give a good bound for the total population size. Each bacterium may reproduce about once every 15 minutes, which leads to about 30,000 generations per year if much food is available, and about 10^14 generations in 3 billion years. Thus the total number of individuals ever existing can be bounded by about 10^45 and may well be as small as 10^40. Each gene experiences a mutation about once every 10^5 generations, implying the total number of alleles generated is about 10^40. In order to generate a protein of a new shape, the probability of doing so for each allele cannot be much smaller than 10^(-40). If an allele is beneficial to the organism, it will tend to spread to many individuals, reducing the total number of alleles in the population. Therefore the probability of producing a protein of a new shape is maximized by assuming that all alleles are neutral.
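The population bounds follow from a few multiplications, sketched below. The calculation assumes, as the text implicitly does, that the present prokaryote population is representative of the whole 3 billion years; the constant names are illustrative.

```python
import math

PROKARYOTES_NOW = 5e30           # standing population, from the text
YEARS = 3e9                      # time span assumed in the text
MINUTES_PER_GENERATION = 15      # fast-growth case
YEARS_PER_GENERATION_SLOW = 3    # average turnover quoted in the text
GENERATIONS_PER_MUTATION = 1e5   # one mutation per gene per 10^5 generations

# Generations per year when dividing every 15 minutes.
gens_per_year_fast = 365 * 24 * 60 / MINUTES_PER_GENERATION

# Upper bound: every cell divides at the fast rate for the whole period.
upper_individuals = PROKARYOTES_NOW * gens_per_year_fast * YEARS

# Lower figure: cells turn over once every three years on average.
lower_individuals = PROKARYOTES_NOW * YEARS / YEARS_PER_GENERATION_SLOW

# Distinct alleles of one gene, at one mutation per 10^5 generations.
alleles_upper = upper_individuals / GENERATIONS_PER_MUTATION

for label, value in [("upper individuals", upper_individuals),
                     ("lower individuals", lower_individuals),
                     ("alleles (upper)", alleles_upper)]:
    print(f"{label}: about 10^{math.log10(value):.1f}")
```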

To get a better estimate on the probabilities, 33 of these 34 mutations needed to change the shape of the protein would be neutral or harmful; the last one might be beneficial and more likely to be retained in the population. Assume that a particular set of 34 mutations is necessary to produce a beneficial protein having a new shape. Here we are being a little bit too severe on the theory of evolution, because there may be more than one such set of mutations. A typical gene has about 1000 base pairs of coding DNA (not counting introns, etc.), though this number can vary a lot. Suppose that the probability is p that a mutation will occur at any site in a gene. Suppose that if this mutation occurs to a site outside the specified set of 34 sites, the mutation has a probability of one half of permitting the specified protein fold to form and for the protein to satisfy the constraints necessary for a functional protein (the right hydrogen bonding occurs, etc.) and be beneficial to the organism. If the protein is not beneficial to the organism, the gene will eventually be eliminated from the population. (Our computation is not sensitive to this value of 1/2. Especially if 34 is replaced by a larger number, 3/4 or 7/8 would likely work as well.) However, every one of the specified 34 sites must mutate for the specified protein fold to form and for the constraints to be satisfied. Then the probability that this fold can form and the constraints can be satisfied and the protein can be beneficial is (1 - p/2)^(1000 - 34)*p^34.

Suppose p is about 1/30, meaning that the expected number of mutations is about 33. Then the expression becomes (1 - 1/60)^(1000 - 34)*(1/30)^34 or (8.9 * 10^(-8))*1/(1.7 * 10^50), or 5.2 * 10^(-58), a little better than the previous computation but still highly improbable. If p is somewhat larger, the probability rises at first, but it falls again once p exceeds about 1/15. For p = 1/15, one obtains a probability of about 5.963*10^(-55). For p = 1/10, one obtains 2.876*10^(-56). For p = 1/5, one obtains 9.717*10^(-69). Therefore even a larger probability of mutations does not help much.
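The dependence on p can be explored with a few lines of code. This sketch evaluates the expression from the preceding paragraph, with 1000 - 34 non-specified sites and 34 specified ones, in base-10 logarithms, and also locates its maximum. The names are illustrative; best_p comes from setting the derivative of the logarithm of the expression to zero, a step that does not appear in the text.

```python
import math

GENE_SITES = 1000      # coding base pairs in a typical gene, from the text
REQUIRED_SITES = 34    # sites that must all mutate

def log10_success_probability(p):
    """log10 of (1 - p/2)^(GENE_SITES - REQUIRED_SITES) * p^REQUIRED_SITES."""
    other_sites = GENE_SITES - REQUIRED_SITES
    return (other_sites * math.log10(1.0 - p / 2.0)
            + REQUIRED_SITES * math.log10(p))

for denom in (30, 15, 10, 5):
    value = log10_success_probability(1 / denom)
    print(f"p = 1/{denom}: about 10^{value:.1f}")

# Setting d/dp of the log-probability to zero gives 34*(2 - p) = 966*p,
# that is, p = 2 * REQUIRED_SITES / GENE_SITES.
best_p = 2 * REQUIRED_SITES / GENE_SITES
print(f"maximum at p = {best_p}: about 10^{log10_success_probability(best_p):.1f}")
```

The expression peaks at p = 0.068, close to 1/15, and falls off on either side, so no choice of p lifts the probability much above the value obtained for p = 1/15.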

Is it reasonable to assume that mutations must occur at a specified set of sites in order to change a protein fold? From [Cordes et al 99],

Burial of hydrophobic groups is widely acknowledged as a principal source of protein-folding stability, whereas burial of polar groups inevitably decreases stability.
Therefore it is reasonable to assume that the pattern of polar and hydrophobic side chains of residues in a protein must be changed in a specific way in order to obtain a new protein fold. This implies that mutations must occur to a specific subset of the amino acids in the protein, which justifies our assumption (except for the fact that some amino acids can be specified by more than one codon).

Is it reasonable to assume that a single mutation has a probability of one half of causing the protein to misfold or fail to be beneficial to the organism? Kimura (cited in ReMine, The Biotic Message, page 246) estimates that a mutation which alters an amino acid is ten times more likely to be harmful than neutral or beneficial. From [Crow 97], "most mutations if they have effects large enough to be detected phenotypically are deleterious. ... the evidence is strong that the great majority of mutations are partially dominant, so that heterozygotes show some decrease in fitness." This appears to imply that these mutations are harmful because they add some deleterious function to a protein. Thus one can expect most mutations that change an amino acid to be deleterious due to adding a harmful function to the protein, so it is not unreasonable to assign a probability of 1/2 to this possibility, even if these mutations do not change the fold of the protein. Such harmful mutations would tend to cause the mutating useless gene to be eliminated from the population even before the desired new protein fold is achieved. This would happen because single mutations are likely to occur before all of the mutations that cause the specified change of shape of the protein, and these mutations will tend to cause the gene to be eliminated from the population before the shape change occurs. Even the mutations contributing to the shape change are also likely to introduce deleterious additional functions to the protein.

A problem with the above analysis is that harmful mutations will tend to be eliminated from the population. Mutations that are harmful and partially dominant will be eliminated and will not appear in the final protein. Therefore p must be taken as the probability that the final protein will fail to fold properly, or will fail to be beneficial to the organism. How can such a probability be estimated?

An analysis of the properties of the various side chains leads to such an estimate. The fact that 9/10 of the mutations that change an amino acid, appear to be harmful, and many of these mutations are even partially dominant, implies that the different amino acids have very different properties, and cannot easily substitute for one another. To perform an analysis, it seems reasonable to divide the side chains into five classes: hydrophobic (Ala, Val, Ile, Leu, Phe, Tyr, Trp), polar (hydrophilic) (Ser, Thr, Cys, Met, Asn, Gln), positively charged (hydrophilic) (Arg, His, Lys), negatively charged (hydrophilic) (Asp, Glu), and conformationally important amino acids (Gly and Pro). Glycine is special because it has virtually no side chain (a single hydrogen), so that it is small in size, flexible, and also lacks the properties of the other amino acids. Proline is very rigid. Some amino acids are amphiphilic (partially hydrophobic and partially hydrophilic), including Lys, Arg, and Tyr, but these all appear in other classes. Amphiphilic side chains tend to occur at positions that are partly buried and partly exposed [deGrado 97]. Thus there are (at least) five classes of amino acids having significantly different properties. Mutations that replace an amino acid by one from a different class are likely to disrupt a protein's structure, so that only a small number can be permitted without destroying the fold.

Many mutations that change the class of amino acid may be harmful and may be eliminated from the population. Therefore it is reasonable to let p be the probability that a mutation changes the class of amino acid among the specified five classes, and remains in the population a long time (say, 100,000 generations). If u is the mutation rate, then p is actually a function of u, that is, p(u). Since the 34 specified mutations are needed to change the shape of the protein, it is reasonable to assume that they all replace an amino acid by another amino acid in a different class. If most such mutations are eliminated from the population, then it will take many generations in order to obtain a large value of p, and during this time the useless gene is more likely to be eliminated. Furthermore, if p is large enough to have a reasonable probability of changing the shape of the protein, then many other mutations to the protein will also accumulate, and these will all replace amino acids by others having significantly different properties. Only a small number of such mutations can be tolerated without destroying the fold.

Suppose that each mutation at one of the specified 34 sites that changes the class of amino acid has a probability of 1/2 of producing a residue having the right properties for the specified fold (because there are so many classes). Then the chance that all 34 sites will have a mutation permitting the right fold, if p = 1/10, is 10^(-34)*2^(-34) or about 10^(-44). This is already too small, being less than 10^(-40), not even considering other factors. If p < 1/10, the probability that all 34 sites will mutate as necessary is even smaller. If p = 1/5, the probability is 10^(-34), which could permit the new fold to form. However, if p = 1/5, then about 40 percent of the amino acids in the protein would be replaced by others from a different class, because each of the first two codon positions would have a 4/5 chance of not having such a disruptive mutation, and (4/5)^2 is 64 percent, leaving 36 percent, or roughly 40 percent, with at least one such mutation. It is intuitively clear that this is a major disruption of the protein. Of the 333 amino acids in the protein, this means that 133 would be so replaced, on the average. The probability that at most ten percent, or 33, amino acids would be so replaced turns out to be about 4 * 10^(-34). Thus the probability of obtaining the specified fold is about 10^(-67), which is much too small. If p is higher, the probability becomes even much smaller because the structure of the protein is randomized even more. This analysis does not even consider the many mutations replacing amino acids by others in the same class.
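The figures for p = 1/5 can be checked with a binomial tail sum. The sketch below assumes that each of the 333 residues is replaced independently with probability 0.4 (the rounded figure used in the text) and sums the probability of at most 33 replacements. The names are illustrative, and the exact tail value depends on this rounding, so only the order of magnitude should be compared with the text.

```python
import math

def binomial_tail(n, q, k_max):
    """P(X <= k_max) for X ~ Binomial(n, q), summed term by term."""
    return sum(math.comb(n, k) * q**k * (1.0 - q)**(n - k)
               for k in range(k_max + 1))

RESIDUES = 333
p = 1 / 5

# Chance that at least one of the first two codon positions is hit.
replaced_fraction = 1 - (1 - p) ** 2

# All 34 specified sites mutate, each with a 1/2 chance of the right class.
required = (p / 2) ** 34

# At most ten percent of residues disrupted, using the rounded 40 percent.
tail = binomial_tail(RESIDUES, 0.4, 33)

combined = required * tail

print(f"replaced fraction: {replaced_fraction:.2f}")
print(f"all 34 sites correct: about 10^{math.log10(required):.0f}")
print(f"at most 33 disrupted: about 10^{math.log10(tail):.1f}")
print(f"combined: about 10^{math.log10(combined):.1f}")
```

Using the unrounded 36 percent instead of 40 percent makes the tail considerably larger, so the rounding is not a neutral choice in this estimate.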

Is it reasonable to assume that at most ten percent of the amino acids can be replaced by others in a different class without disrupting the fold? If one assumes that a protein can tolerate more such replacements without changing the fold, then the specified 34 mutations will not change the fold, either, since these will change at most 34 amino acids and perhaps fewer. If it is necessary to replace more than ten percent of the amino acids by others in a different class to get a different fold, then the figure of 34 should be replaced by a significantly higher figure, but this will make the probability of such a shape change even much less, since p will have to be even higher.

Another possible mechanism for obtaining new protein shapes is that material is transposed from one gene into another by "transposable elements," producing a considerable stretch of genetic material at once and possibly more quickly producing a new protein. This possibility should also be considered.

In an article in the Sacramento Bee from March 19, 2001, "Much DNA just 'junk' -- or is it? Human Genome Project spurs new look at mystery material," it is stated:

For example, the body seems to have crafted 50 genes out of junk sequences known as transposons, so named because they are transposable, moving around the genome like text copied and duplicated in a computer file.
implying that this mechanism cannot explain most of the genes. Also, from a priori probability considerations, there is no reason why material transposed into a gene would be more likely to lead to a useful protein than random mutations. In addition, material inserted by a transposable element will consist of adjacent sites in the gene, meaning that contiguous sites will all be inserted at the same time. This results in a greater change than having isolated point mutations, and therefore would tend to decrease the probability of the evolution of new proteins. Finally, transposable elements newly appearing in genes are likely to render the genes useless. Thus such new appearances of transposable elements will likely be harmful, and therefore must be very rare, further reducing their probability. We will assume that this happens in a gene only about once in 10^5 generations on the average.

In order to compute the probability for transposable elements, we assume that on the average, each time one of the specified 34 sites is introduced by a transposable element, 10 other base pairs are introduced as well. Then when the 34 sites have all been introduced, 340 other sites will have been introduced as well. Each such new base pair has a probability of one-half of prohibiting the specified fold, for a total probability of at most (1/2)^340 that this fold can form. This probability is about 1/(2.2 * 10^102), which is impractically small. Under the second model, with five classes of amino acids, the result is the same as in the previous paragraphs, and it is not necessary to assume that 340 other sites have been introduced. Letting p be the probability that a transposable element will replace a particular amino acid by one in a different class, the computation is as before, with the added problem that insertions of any amino acid are likely to disrupt a protein's structure.
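The bound for transposable elements can be checked with a few lines of Python. This is a minimal sketch of the first model only, using the paragraph's assumptions (10 accompanying base pairs per specified site, each with a chance of one-half of still permitting the fold); the constant names are hypothetical.

```python
from fractions import Fraction
from math import log10

SPECIFIED_SITES = 34            # from the text
EXTRA_PER_SITE = 10             # assumed average number of accompanying base pairs
P_COMPATIBLE = Fraction(1, 2)   # assumed chance an extra base pair still permits the fold

# Total number of extra base pairs carried in alongside the specified sites.
extra_sites = SPECIFIED_SITES * EXTRA_PER_SITE

# Exact rational value of the bound, (1/2)^340.
bound = P_COMPATIBLE ** extra_sites

# Order of magnitude of the bound, computed without floating-point underflow.
log10_bound = -extra_sites * log10(2)

print(f"extra base pairs introduced: {extra_sites}")
print(f"log10 of the bound: {log10_bound:.2f}")
```

Working with the logarithm keeps the magnitude readable, while the Fraction keeps the bound itself exact.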

Is it reasonable to assume that the genetic material inserted by a transposon (or retrotransposon) is random, as we have done? There are a number of cases to consider. Transposons often create "direct repeat" sequences on their ends and may have one or more genes in the middle. If the transposon operates by "cut and paste," then it will eventually leave the place it entered and leave behind only the direct repeats. These patterns are too regular to generate new protein structures. If the transposon operates by a "copy and paste" mode, then the main body of the transposon will be left behind. If this part contains no genes, then it will tend to be randomized by point mutations over time, justifying our assumption of randomness. If the body of the transposon consists of simple repetitive sequences, then it does not have enough variety to generate new protein shapes. If the body of the transposon contains genes, then these genes will contain sequences that will probably cause the original gene to lose its functionality. If the transposon causes a frame shift in either its own DNA or the DNA beyond itself, this will tend to randomize the gene. If the (retro)transposon contains a pseudogene, it is a LINE, and these seem to have a mechanism for avoiding insertion into functional genes. Genes can also be inserted by viruses or passed from one bacterium to another. In this case the inserted gene would be functional and it does not seem possible to have two functional overlapping genes, so the original gene would lose its function. All in all, it does not seem that there is any way for transposons or retrotransposons to contribute non-random DNA sequences to the evolution of proteins except for simple sequences that could not explain the evolution of proteins. Therefore the assumption that material inserted by transposable elements is random, appears to be correct.

Even if transposable elements or other mutations did insert DNA from one gene into another, it would not help. Even proteins with similar shapes may differ in as many as 70% of their amino acids at corresponding positions in the protein. One would expect, therefore, that proteins of different shapes would differ in considerably more than 70% of their amino acids at corresponding positions in the protein. This shows that the amino acid sequences of proteins having different shapes are significantly different throughout their whole extent. This means that there is little to be gained in protein evolution by concatenating portions of existing protein sequences to generate proteins having new shapes. Therefore genetic material from another gene, inserted in the middle of a gene, would behave essentially the same as random DNA, and the previous analysis would apply to it.

Proteins are often composed of "domains" that fold independently, and the same or similar domains can occur in different proteins. Therefore two proteins sharing a domain might share a subsequence that is largely similar. The question then becomes how domains with new shapes could evolve. However, even similar domains in different proteins are likely to have different parts buried and exposed, so their amino acid sequences are likely to be significantly different.

One might say that existing protein families do not provide intermediates between one protein shape and another, but perhaps other protein folds existed in the past that served as intermediates, so that fewer mutations were required to pass between one fold and another. Or perhaps different protein folds existing in the past shared long common subsequences of amino acids, permitting new protein folds to be created by concatenating together long subsequences of existing folds. However, the references quoted above imply that there are only a small number of protein folds possible in principle, because of the laws of physics. Therefore many additional folds not only did not exist, they could not exist in principle. If many such intermediates did exist, then one wonders why they no longer do. The great majority of all functional proteins that ever existed must have been such intermediates, if they ever existed, and one would expect that many of them would still be found in existing organisms. If indeed such intermediates can be constructed, then this is evidence that life did not evolve, but was designed, because if life evolved, these intermediates should still be found in living organisms.

Not only this, but in order for a new fold to form, it has to be beneficial to the organism, or at least not harmful. Thus the new protein must interact in a useful way with other existing proteins in the organism. This constrains the possible protein folds even more. Furthermore, two proteins A and B of different shapes are not likely to share long common subsequences S of amino acids. The reason for this is that predicting the structure of a protein from its amino acid sequence is a very hard problem. This in turn implies that the shape in which the subsequence S folds is determined not only by S but by the other amino acids in the protein. Therefore the sequence S is likely to fold into a different shape in proteins A and B, and it is also likely to have different parts buried and exposed in proteins A and B. If protein A is stable, this means that the pattern of hydrophobic and non-hydrophobic side chains in the sequence S is suited for the protein shape A. Since the shape of B (and of S in B) is so much different, and different parts of S are buried in B than in A, the pattern of side chains in S will probably be unsuited to shape B, and B will be unstable. This further justifies the assumption that material inserted into a gene behaves like random DNA for purposes of evolving new protein shapes.

If one assumes that inserted sequences of DNA have random properties, the evolution of new protein shapes is highly improbable, as already shown. But there are additional difficulties.

The sheer number of subsequences of DNA is one such problem. Suppose an organism has at least 200 genes, and each gene has about 1000 base pairs of coding DNA. Thus there would be at least 200,000 base pairs in all. Assuming an insertion of 10,000 or fewer base pairs, the portion inserted could begin at about 200,000 sites and have 10,000 lengths, for about 2*10^9 such subsequences altogether, and there are not likely to be many repetitions among them except for repetitive segments of DNA that do not have enough diversity to generate many new protein shapes. Thus, if a protein can be constructed by such insertions, it cannot be done in many different ways, so the number of subsequences limits the probability of this happening. Each such subsequence could be inserted in about 1000 sites in a gene. The probability of getting the right site is thus about 10^-3. The chance of getting the right sequence in the right site is then about 10^-12. Suppose the right subsequences exist somewhere in the genome to generate a new protein shape (which is highly unlikely) and five such insertions are needed to get a new stable beneficial protein shape. The chance of this is about 10^-60, much too small, even ignoring many other factors such as the fact that most neutral mutations are soon lost from a population, the need for mutations to the active sites on the protein, and the fact that such insertions would be rare. Thus one cannot expect the insertion of DNA from one gene into another to facilitate the evolution of new protein shapes, independent of prohibitions on long common subsequences in different proteins.
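The counting in this paragraph can be laid out step by step in Python. The sketch below simply multiplies the paragraph's own figures, treating the five insertions as independent and each subsequence and site as equally likely; the variable names are hypothetical. Without rounding, the chance per insertion is 5 * 10^-13 rather than 10^-12, so the unrounded product comes out somewhat below the 10^-60 quoted above.

```python
GENES = 200               # lower bound on the number of genes, from the text
BASES_PER_GENE = 1_000    # coding base pairs per gene, from the text
MAX_INSERT = 10_000       # longest insertion considered, from the text
SITES_PER_GENE = 1_000    # candidate insertion sites in the target gene
INSERTIONS_NEEDED = 5     # insertions assumed necessary for a new shape

start_positions = GENES * BASES_PER_GENE        # where an inserted piece could begin
subsequences = start_positions * MAX_INSERT     # start position times length

p_sequence = 1 / subsequences                   # chance of picking the right subsequence
p_site = 1 / SITES_PER_GENE                     # chance of landing at the right site
p_insertion = p_sequence * p_site               # one correct insertion
p_all = p_insertion ** INSERTIONS_NEEDED        # all insertions correct

print(f"candidate subsequences: {subsequences:.0e}")
print(f"chance per insertion:   {p_insertion:.1e}")
print(f"chance of all {INSERTIONS_NEEDED}:        {p_all:.1e}")
```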

Two or three insertions to obtain a protein having a new function appear to be the upper limit on what is feasible. The only way to obtain a functional protein with two or three insertions from another protein is if the inserted sequences are domains. In general, if the insertion of a long subsequence S from protein A to protein B, creating protein B', could produce a stable protein, then S would most likely be a domain of A and B. It is reasonable to assume that a domain can fold in only one way. Therefore the surrounding domains of S in A and B' would have to have the same configuration for S to fit into B'. Then B, without S, would have a hole in it lined with hydrophobic side chains. This would destabilize B and would also likely cause B to have undesirable reactions with other proteins. Therefore the gene producing B would be harmful and would be eliminated from the population. If there were two or more insertions of domains from A to B the problem would be even more severe, because there would be even more exposed hydrophobic side chains in B. Thus even independent of probability considerations, transferring a domain from one protein into another by an insertion is not plausible.

Since typical proteins have few hydrophobic side chains on the outside, it must be (in evolutionary terms) that this configuration confers some advantage on the organism. Therefore if B did have many hydrophobic side chains on the outside, one would expect B to gradually mutate to replace these side chains with hydrophilic side chains. Therefore when the insertion of A occurred, the "hole" in B would have many hydrophilic side chains lining it. These would be buried inside B', destabilizing it. Since typical proteins have few hydrophilic side chains on the inside, proteins with such side chains must be harmful to the organism, and thus B' would be eliminated from the population.

As if the above-mentioned problems with obtaining new shapes of proteins by insertions were not enough, there are many additional ones. An insertion of a domain S into a protein B, producing B', requires tight restrictions on the geometry of the ends of the polypeptide chains in B and S. Together with the fact that the interface between B' and S would have to coincide with the part of S that was buried before, there would only be a small number of such insertions possible in an organism. The chance that any of these would be beneficial to the organism is very small. If one considers the reactions involving existing active sites on these proteins, the total number of possibilities will be very small, and it will be very unlikely that any of these new proteins will be beneficial. According to the theory of evolution, proteins of new shapes evolved many times, so the probability of this would have to be very nearly one. If one considers adding a new active site, this will require many mutations to the new shape protein, and possibly to another protein as well, and even if these occur, it only guarantees that a new reaction can occur. The probability that the reaction will be beneficial will be very small. Multiplying all these improbabilities together yields a mathematical impossibility.

For remaining scenarios, some terminology is necessary. The protein produced by a single gene is called a "polypeptide chain." These can adhere together in some cases and act as a single unit, also called a protein, in which the polypeptide chains are called "subunits." Each polypeptide chain will tend to fold so that hydrophobic side chains are on the inside and non-hydrophobic side chains are on the outside.

Another problem with insertions of a domain S into a protein is that the geometry of the domains surrounding S would have to be essentially identical in A and B'. This is unlikely, and cannot explain the evolution of proteins in which a domain appears in a different surrounding geometry. The only way this could happen is if S and B join together as subunits to produce a protein essentially like B', even before the insertion. But in this case, no new functionality is being produced in the organism, so this cannot explain evolution of new protein functions. It also cannot explain where domains of new shapes come from. Furthermore, the interface between subunits S and B would have largely non-hydrophobic side chains; when they were joined, these would be buried inside the protein, destabilizing it.

There are mechanisms that can explain how domains, once existing, can join together. For example, two domains A and B existing separately could mutate to have side chains on their surfaces which could cause them to stick together (as subunits), while retaining their shape. This might give them a new function in the organism. Then if the geometry of the ends of A and B is just right, an insertion can cause a single gene to produce a protein with both A and B in it. If the geometry is not just right, it can be modified by point mutations to enable such an insertion. Maybe the joined protein would be more stable than A and B produced by separate genes. However, mechanisms for joining proteins into larger structures require modularity of protein structure, so that A and B fold much the same when joined as they do separately. Such modularity does not appear to exist below the domain level, for otherwise predicting protein structure would not be such a hard problem. Therefore these mechanisms cannot explain where the domains came from.

Even at the domain level, the joining of polypeptide chains into the same gene has a number of problems. Suppose polypeptide chains A and B react (stick together) as subunits to produce protein C having a new function. For this, A and B would need to mutate to have side chains on the surface so they would stick together. This would produce A' and B' that could react (join together). However, these new side chains would probably be harmful to the existing functions of A and B. Therefore there would have to be two useless genes, one for A and one for B, that could mutate to A' and B'. It is also unlikely that just joining two polypeptide chains would serve a new function in the organism. Probably several chains would have to join, reducing the probability. Furthermore, protein-protein reactions are very specific, and typically require close agreement between 10-15 side chains for two proteins to interact. To get this many side chains to have the right properties for two proteins to interact would probably require 7 or 8 amino acid changes, at least, for probably 12 substitutions on each protein, or 24 substitutions in all. But probably just two domains joining would still not benefit the organism, so at least 48 substitutions would be necessary just to get three proteins to join together, not to mention mutations to other active sites. Many of these mutations would have to change an amino acid to an amino acid in a different class. As shown above, this is a mathematical impossibility, and even more so when one considers the added improbability of the right insertion taking place to join these two subunits into one polypeptide chain. In addition, many non-hydrophobic side chains would be buried when A' and B' become part of the same polypeptide chain, which would destabilize it.

It also seems peculiar that subunits A and B would ever become parts of the same gene due to an insertion if they were already fulfilling a new function as separate genes. The joining would be highly improbable because it would require just the right insertion, and it would probably actually hinder the operation of AB. It would also require tight restrictions on the geometry of the ends of A and B. Therefore this scenario does not explain why there are so many proteins with multiple domains coded by the same gene.

Even though protein structure below the domain level is not modular, it may sometimes be true that combining folds A and B produces a combined fold largely preserving the structure of A and B. This might be true often enough to permit the evolution of new domain folds. However, the problems with combining domains apply even more strongly to combining sequences below the domain level.

The exchange of domains between genes can happen by different mechanisms, but has problems as well. Suppose a protein is composed of two parts A and B, and two more subunits C and D coded by separate genes are joined together in one protein. Then the interface between A and B would consist largely of hydrophobic side chains. The interface between C and D would consist largely of non-hydrophobic side chains. If B and D have similar geometries on their ends, and B is deleted from A, A and D might join together and have a new function in the organism. Deletions are more probable than insertions because they require less information to specify. But it seems highly unlikely that AD would have a beneficial new function without many point mutations, which as shown above is a mathematical impossibility. It is also unlikely that A and D would adhere to one another, because protein-protein reactions are very specific. The fact that A would have hydrophobic side chains and D would have non-hydrophobic side chains at their interface would also prohibit their adhering to one another. This mechanism also cannot explain how large proteins (large genes) are formed from small ones.

Some mutations can exchange one part B of a gene AB for a part D of another gene CD, producing AD. This could also exchange domains between one gene and another, and preserve the parts that are buried and exposed, avoiding problems mentioned above. However, this does not explain where the domains came from, or how they joined together in the first place. It also only permits insertions in which the interface between A and B is about the same as that between C and D, and cannot explain how a domain can appear in different proteins with substantially different parts buried and exposed. It also requires very tight restrictions on the locations of the ends of the polypeptide chains in A and D, reducing the probability that this could occur. Altogether there would be only a small number of ways that such an exchange could occur in an organism, because of so many restrictions. And of course, the chance that any of these exchanges would produce a protein with a new, beneficial function in the organism is very small, whether one considers reactions involving existing active sites or the generation of new active sites by further mutations. Since the theory of evolution requires that the probability of generating proteins of new shapes be nearly one, exchanges of domains cannot explain protein evolution.

There are other problems with evolution of proteins by insertion of domains from one protein to another, or an exchange of domains between proteins. Even for point mutations, at most about one in a thousand substitutions is beneficial to an organism, and the number may be much smaller. In a highly organized system such as life, the chance that a random change will be beneficial decreases rapidly with the magnitude of the change. Therefore the chance that a large change such as adding one or more domains to a protein will produce a benefit to the organism is very small, and possibly zero. In addition, each domain will have evolved to be adapted to the protein in which it already exists. Moving it to a new protein will place it in a role for which it is not adapted, which will almost certainly result in harm to the organism. Finally, there are many constraints on protein folding in addition to the requirements that hydrophobic side chains be in the interior and non-hydrophobic side chains be on the exterior. The chance that a domain, moving from one protein to another, will satisfy these additional constraints, is very small.

As evidence of this, most mutations that change the amino acid are harmful. This implies that even exchanging one hydrophobic side chain on the inside of a protein for another, is likely to be harmful. Also, even the same domain (folding the same way) in two different organisms or proteins is likely to have many amino acids different, sometimes almost all of them, including the hydrophobic side chains on the inside. Therefore, taking a domain from one protein and putting it in another is going to result in hydrophobic side chains at the interface that do not mesh with each other. The chance that the resulting protein will be stable and fold correctly is very small.

Another problem is that the kinds of mutations that can move a domain from one protein to another are either very improbable or almost certainly harmful to the organism.

Transfer of genes from one organism to a different one is not likely to help in the generation of new protein shapes, because such a transferred protein would have to be beneficial in both organisms, adding considerably more constraints. This could happen if the protein only involved biological mechanisms that were common among many organisms, but then the probabilities are about the same as if all these organisms belonged to the same population.

Unless the evolution of proteins of new shapes is possible, evolution is blocked. All scenarios for protein evolution have been shown to be mathematically impossible, under reasonable assumptions.


Baker, D. and Sali, A., Protein Structure Prediction and Structural Genomics, Science, Vol. 294, 5 October 2001, pp. 93-96.
Chen R, Greer A, and Dean AM, Redesigning secondary structure to invert coenzyme specificity in isopropylmalate dehydrogenase, Proc Natl Acad Sci U S A 1996 Oct 29;93(22):12171-6.
Chothia, C. One thousand families for the molecular biologist, Nature 357, 543-544 (1992)
Cordes MH, Burton RE, Walsh NP, McKnight CJ, Sauer RT, An evolutionary bridge to a new protein fold, Nat Struct Biol 2000 Dec;7(12):1129-1132.
Cordes, M., Walsh, N., McKnight, C.J, and Sauer, R., Evolution of a protein fold in vitro, Science 1999 Apr 9;284(5412):325-328.
Crow, J., The high spontaneous mutation rate: Is it a health risk?, PNAS Vol. 94, pp. 8380-8386, August 1997.
deGrado, Proteins from Scratch, Science 278:3 (3 October 1997) 80-81.
Denton, M. and Marshall, C., Laws of Form Revisited, posted April 4, 2001 on the Creation Science Resource Bulletin Board.
Lindgard, P. and Bohr, H. How many protein fold classes are to be found? in Protein Folds (eds Bohr, H. & Brunak, S.) 98-102 (CRC Press, New York, 1996).
ReMine, The Biotic Message, page 246.
Service, R., Amino Acid Alchemy Transmutes Sheets to Coils, Science Vol 277 11 July 1997 p. 179.
Whitman, W., Coleman, D., and Wiebe, W., Prokaryotes: The unseen majority, PNAS 95:12, June 9, 1998, pp. 6578-6583.
