Synthetic gene design


Why do it

The goal of designing a synthetic gene is to create a gene with features optimized for the intended research. Instead of cloning an existing gene and living with whatever DNA sequence it has, you can exploit the redundancy of the genetic code to engineer desired properties into a synthetic gene. Design goals can include optimizing the codon usage for the expression host of choice, inserting convenient restriction sites, removing inverted repeats, prepending or appending affinity tags, etc. As an additional advantage, you can skip the cloning step since all you need is the amino acid sequence. Of course there are also disadvantages: you have to do the design work, you have to pay for the DNA synthesis, and what looked good on paper may not work in practice. There are several reports in the literature on using oligo synthesis to create genes. We have no experience with this ourselves yet, but these pages will document our attempts to get it working.

!!! Use this information at your own risk !!!

How to do it

Our goal is to create a synthetic gene for truncated pilin proteins from Pseudomonas aeruginosa. These proteins are 120-130 amino acids in length and therefore require approx. 400 bases to make up the gene. The design goal is a gene that allows very high levels of protein expression, either by in vitro coupled transcription/translation or by traditional expression in E. coli. In addition, we wish to introduce unique restriction enzyme cleavage sites at convenient places to facilitate future mutagenesis studies. The design consisted of the following steps:

reverse translation
given the amino acid sequence generate the corresponding DNA sequence using the preferred codon usage of E. coli.
restriction site design
find all positions where a unique restriction site can be included without changing the amino acid sequence
prevent inverted repeats
detect and, if necessary, remove inverted repeats, which can form stem-loops in the mRNA and lower translation efficiency

All programs used here come from the EMBOSS package, which is distributed at no cost. The programs can be compiled for most UNIX workstations, including Linux, which I am using. You can find documentation for the programs in directory /eagle/share/seqanal/EMBOSS-1.8.0/doc/programs/text.

Several programs need or produce sequence files. I have been using the FASTA format. FASTA files are simple and contain one comment line (starting with a ">" symbol) followed by the sequence data (either amino acid or DNA). Here is an example:

>pao gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain PAO)
ARSEGASALATINPLKTTVEESLSRGIAGSKIKIGTTASTATETYVGVEPDANKLGVIAVAIEDSGAGDI
TFTFQTGTSSPKNATKVITLNRTADGVWACKSTQDPMFTPKGCDN
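
If you want to handle such files in your own scripts, the format is easy to read. The following is a minimal Python sketch, not part of EMBOSS; the function name read_fasta is my own invention. It treats every line starting with ">" as the start of a new record and joins the following lines into one sequence.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>pao gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain PAO)
ARSEGASALATINPLKTTVEESLSRGIAGSKIKIGTTASTATETYVGVEPDANKLGVIAVAIEDSGAGDI
TFTFQTGTSSPKNATKVITLNRTADGVWACKSTQDPMFTPKGCDN
"""

header, protein = read_fasta(example)[0]
print(header.split()[0], len(protein))
```

The first word of the header ("pao" here) is what the EMBOSS programs report as the sequence name.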

Reverse translation

For this step we use the backtranseq program. This program reads in an amino acid sequence and produces a DNA sequence. The user can tell the program for which expression host to optimize the codon usage. The command to do this is:

backtranseq -seq test.fa -out test.dna -cfile Eecoli.cut
-seq test.fa
Tell the program to read the amino acid sequence from the file named test.fa.
-out test.dna
Tell the program to write the DNA sequence to the file named test.dna.
-cfile Eecoli.cut
Tell the program to optimize the codon usage for E. coli. To see a list of supported organisms type: embossdata -showall
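
To make the idea concrete, here is a minimal Python sketch of reverse translation that picks, for every residue, the codon with the highest fraction. This is my reading of what the step does, not the backtranseq source code. The fractions are copied from the partial E. coli table shown further down this page, so only the amino acids in that excerpt are covered; the names CODON_FRACTION and reverse_translate are made up for this example.

```python
# Codon fractions for E. coli, copied from the partial codon usage table
# shown further down this page. Amino acids missing from that excerpt are
# missing here too, so this is NOT a complete table.
CODON_FRACTION = {
    "A": {"GCT": 0.20, "GCA": 0.22, "GCC": 0.23, "GCG": 0.35},
    "C": {"TGT": 0.42, "TGC": 0.58},
    "D": {"GAC": 0.46, "GAT": 0.54},
    "E": {"GAG": 0.29, "GAA": 0.71},
    "F": {"TTT": 0.44, "TTC": 0.56},
    "G": {"GGA": 0.06, "GGG": 0.10, "GGC": 0.41, "GGT": 0.43},
    "T": {"ACA": 0.09, "ACG": 0.21, "ACT": 0.21, "ACC": 0.49},
    "V": {"GTC": 0.18, "GTA": 0.18, "GTT": 0.32, "GTG": 0.33},
    "W": {"TGG": 1.00},
    "Y": {"TAT": 0.48, "TAC": 0.52},
}


def reverse_translate(protein):
    """Pick the most frequent codon for each residue (assumed strategy)."""
    codons = []
    for residue in protein:
        if residue not in CODON_FRACTION:
            raise KeyError(f"no codon data for residue {residue!r} in this partial table")
        choices = CODON_FRACTION[residue]
        codons.append(max(choices, key=choices.get))
    return "".join(codons)


print(reverse_translate("TTVE"))
```

For the peptide TTVE this gives ACCACCGTGGAA, the same bases that encode these four residues (positions 49-60) in the remap output below.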

Look at the test.dna output file. It should contain a DNA sequence. To get a fancier and more informative representation, run the remap program, which lists the forward and reverse DNA strands, the amino acid sequences of the six reading frames (the one on top corresponds to your protein sequence), and the restriction sites. At the end of the file you will find a table of all enzymes that cut your DNA sequence, together with the number of times each one cuts. Finally, there is a table of all enzymes that do not cut your DNA sequence. The remap command has a number of options to select which restriction enzymes should be displayed. You can, for instance, ask it to show only enzymes that produce sticky ends, are commercially available, and recognize at least 6 bases. For more options, see the documentation. An example of a remap command is:

remap -seq test.dna -out test.remap -enzymes ALL -sitelen 6 -sticky -commercial

An example of part of the output is:

try7
gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain
PAO)

                          BstXI
                          =======>====
                      BbeI
                      ====>=
                      KasI                            BbsI
                      >=====                          ======.>
          BsePI       Mly113I                       Eco57I
          >=====      =>====                        ======............
          GCGCGCAGCGAAGGCGCCAGCGCCCTGGCCACCATCAACCCGCTGAAGACCACCGTGGAA
                   10        20        30        40        50        60
          ----:----|----:----|----:----|----:----|----:----|----:----|
          CGCGCGTCGCTTCCGCGGTCGCGGGACCGGTGGTAGTTGGGCGACTTCTGGTGGCACCTT
          =====<      ====<=                        ======............
          BsePI       Mly113I                       Eco57I
                      =====<                          ======......<
                      KasI                            BbsI
                      =<====
                      BbeI
                          ====<=======
                          BstXI

          A  R  S  E  G  A  S  A  L  A  T  I  N  P  L  K  T  T  V  E
           R  A  A  K  A  P  A  P  W  P  P  S  T  R  *  R  P  P  W  K
            A  Q  R  R  R  Q  R  P  G  H  H  Q  P  A  E  D  H  R  G  R
          ----:----|----:----|----:----|----:----|----:----|----:----|
            R  A  A  F  A  G  A  G  Q  G  G  D  V  R  Q  L  G  G  H  F
           A  R  L  S  P  A  L  A  R  A  V  M  L  G  S  F  V  V  T  S
          X  A  C  R  L  R  W  R  G  P  W  W  *  G  A  S  S  W  R  P

               Bpu10I
               =>=====
          Eco57I     Cfr42I                                  AsuNHI
          ...>       ===>==                                  >=====
          GAAAGCCTGAGCCGCGGTATCGCGGGTAGCAAAATCAAAATCGGCACCACCGCTAGCACC
                   70        80        90        100       110       120
          ----:----|----:----|----:----|----:----|----:----|----:----|
          CTTTCGGACTCGGCGCCATAGCGCCCATCGTTTTAGTTTTAGCCGTGGTGGCGATCGTGG
          ..<        ==<===                                  =====<
          Eco57I     Cfr42I                                  AsuNHI
               =====<=
               Bpu10I

          E  S  L  S  R  G  I  A  G  S  K  I  K  I  G  T  T  A  S  T
           K  A  *  A  A  V  S  R  V  A  K  S  K  S  A  P  P  L  A  P
            K  P  E  P  R  Y  R  G  *  Q  N  Q  N  R  H  H  R  *  H  R
          ----:----|----:----|----:----|----:----|----:----|----:----|
            F  A  Q  A  A  T  D  R  T  A  F  D  F  D  A  G  G  S  A  G
           S  L  R  L  R  P  I  A  P  L  L  I  L  I  P  V  V  A  L  V
          L  F  G  S  G  R  Y  R  P  Y  C  F  *  F  R  C  W  R  *  C

etc...

Enzymes that cut

      AarI  1
    Acc36I  1
    Acc65I  1
      AgeI  1
      AloI  1
      AspI  1
    AsuNHI  1
     BamHI  1
      BbeI  1
      BbsI  1
      BbuI  1
      BcgI  1
      BclI  1
    Bpu10I  1
     BsePI  1
      BsgI  1
     BsiWI  1
     BspCI  1
     BstXI  1
      BtsI  1
    Cfr42I  1
      EciI  1
    Eco57I  1
   HindIII  1
      KasI  1
      KpnI  1
   Mly113I  1

      
    
Enzymes that do not cut

AatII     AauI      AccB7I    AccIII    AclI      AdeI      AflII     AhdI
AhlI      Alw44I    AlwNI     AocI      ApaI      ApaLI     AscI      AseI
AsiAI     AsnI      Asp718I   AspEI     AsuII     AvrII     AxyI      BanIII
BbvCI     BciVI     BcuI      BfiI      BfrI      BfuI      BglI      BglII
BlnI      BlpI      BmrI      BpiI      BplI      BpmI      Bpu1102I  Bpu14I
BpuAI     Bsa29I    BsaI      BsaMI     BscCI     BscI      Bse21I    Bse3DI
BseAI     BseCI     BseMI     BseRI     BseX3I    BshTI     BsiCI     BsiMI
BsiQI     BsiXI     BsmBI     BsmI      Bso31I    Bsp106I   Bsp119I   Bsp120I
Bsp13I    Bsp1407I  Bsp1720I  Bsp19I    BspA2I    BspDI     BspEI     BspHI
BspLU11I  BspMI     BspTI     BspXI     BsrDI     BsrGI     BssHII    BssSI
Bst2BI    Bst98I    BstAPI    BstBI     BstEII    BstENI    BstPI     BstV2I
BstZI     Bsu15I    Bsu36I    Bsu6I     CaiI      CciNI     CelII     Cfr9I
ClaI      Csp45I    CvnI      DraIII    DrdI      DseDI     EagI      Eam1104I
Eam1105I  EarI      EclHKI    EclXI     Eco31I    Eco52I    Eco81I    Eco91I
EcoNI     EcoO65I   EcoRI     EcoT22I   Esp3I     FauNDI    FbaI      FseI
GsuI      Kpn2I     Ksp22I    Ksp632I   KspI      LspI      MfeI      MluI
Mph1103I  MroI      MroNI     MspCI     MunI      Mva1269I  NarI      NcoI
NdeI      NgoAIV    NgoMIV    NheI      NotI      NruGI     NsiI      NspV
PacI      PaeI      PaeR7I    PagI      PauI      PciI      PctI      Pfl23II
PflFI     PflMI     PinAI     Ple19I    PpiI      Ppu10I    PshBI     Psp124BI
Psp1406I  PspAI     PspEI     PspLI     PspOMI    PstI      PsyI      PvuI
RcaI      SacI      SacII     SalI      SapI      SbfI      SdaI      SfiI
Sfr274I   Sfr303I   SfuI      SgfI      SpeI      SphI      Sse8387I  SspBI
SstI      SstII     SunI      TliI      Tth111I   Van91I    Vha464I   VneI
VspI      XagI      XbaI      XcmI      XhoI      XmaCI     XmaI      XmaIII
XmaJI     Zsp2I

Restriction site design

For this we need a program that predicts base changes that would create new restriction sites without changing the encoded amino acid sequence. That is done with the program silent (which makes "silent" base changes). The program needs an input DNA sequence that corresponds exactly to the amino acid sequence that should be encoded (e.g. the output of backtranseq). The command to do this is:

silent -seq test.dna -out test.silent -enzymes ALL
-seq test.dna
Tell the program to read the DNA sequence from the file named test.dna. In most cases this will be the file produced by backtranseq.
-out test.silent
Tell the program to write its report of possible silent mutations to the file named test.silent.
-enzymes ALL
Here you can specify a comma-separated list of enzymes the program should consider. Normally you'll use ALL to cover them all.

The next thing to do is to look for interesting restriction sites in the output of silent. An interesting enzyme is one whose site is unique within your gene and within the vector that you want to use. I therefore ignore any enzymes that recognize fewer than 6 defined bases. Here is an example of part of a silent output file:


Results for pao:

KEY:
        Enzyme          Enzyme name
        RS-Pattern      Restriction enzyme recognition site pattern
        Match-Posn      Position of the first base of RS pattern in sequence
        AA              Amino acid. Original sequence(.)After mutation
        Base-Posn       Position of base to be mutated in sequence
        Mutation        The base mutation to perform

Silent mutations

Enzyme      RS-Pattern  Match-Posn   AA  Base-Posn Mutation
XspI        CTAG           17        A.A    18       G->T
PstNHI      GCTAGC         16        A.A    18       G->T
BstXI       CCANNNNNNTGG   17        A.A    18       G->C
HaeII       RGCGCY         19        A.A    24       G->Y
FunI        AGCGCT         19        A.A    24       G->T
AbeI        CCTCAGC        66        L.L    69       G->C
BseMII      CTCAG          67        L.L    69       G->C
...

As you can see, the program produces a long list of potential restriction sites. The meaning of the 6 columns is listed at the top of the file. The column to look at is column 2, the recognition site. Four-cutters, like XspI, are not interesting since they almost certainly will not be unique in your gene and vector. Similarly, five-cutters (e.g. BseMII) and six-cutters with ambiguous bases in their site (e.g. HaeII, RGCGCY) are less interesting. I therefore looked for six-cutters without ambiguous bases (PstNHI and FunI in the list above) or enzymes with even higher specificity (e.g. AbeI, a seven-cutter).
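
Going through a long silent report by eye gets tedious, so the selection can be scripted. The Python sketch below is only an illustration under my own assumptions: it expects the six whitespace-separated columns shown above, and it keeps a site only if it is at least 6 bases long and consists purely of A, C, G and T. The function names are invented for this example. Note that this rule is stricter than my manual one: it also drops BstXI, whose site has 6 defined bases separated by N's.

```python
SILENT_ROWS = """\
Enzyme      RS-Pattern  Match-Posn   AA  Base-Posn Mutation
XspI        CTAG           17        A.A    18       G->T
PstNHI      GCTAGC         16        A.A    18       G->T
BstXI       CCANNNNNNTGG   17        A.A    18       G->C
HaeII       RGCGCY         19        A.A    24       G->Y
FunI        AGCGCT         19        A.A    24       G->T
AbeI        CCTCAGC        66        L.L    69       G->C
BseMII      CTCAG          67        L.L    69       G->C
"""


def parse_silent_rows(text):
    """Turn the table part of a silent report into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # not a table row
        enzyme, pattern, match_pos, aa, base_pos, mutation = fields
        if not (match_pos.isdigit() and base_pos.isdigit()):
            continue  # skips the column header line
        rows.append({
            "enzyme": enzyme,
            "pattern": pattern,
            "match_pos": int(match_pos),
            "aa": aa,
            "base_pos": int(base_pos),
            "mutation": mutation,
        })
    return rows


def is_specific(pattern, min_len=6):
    """True for sites of at least min_len bases without ambiguity codes."""
    return len(pattern) >= min_len and set(pattern) <= set("ACGT")


candidates = [r for r in parse_silent_rows(SILENT_ROWS) if is_specific(r["pattern"])]
for row in candidates:
    print(row["enzyme"], row["pattern"], row["base_pos"], row["mutation"])
```

On the excerpt above this leaves PstNHI, FunI and AbeI, the same three enzymes I picked by hand.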

For each restriction site of interest I then checked the frequency of the codon needed to create the restriction site. If the required codon is very rare, it may be better to look for another potential restriction site nearby. Tables of codon usage in various organisms can be found in directory /home/share/seqanal/EMBOSS-1.8.0/emboss/data/CODONS. The columns are the codon, the amino acid, the fraction of that amino acid encoded by the codon, the frequency per 1000 codons, and the number of occurrences. For E. coli the table looks like this:

TAG     *            0.070      0.200   0
TGA     *            0.210      0.600   0
TAA     *            0.710      2.000   2
GCT     A            0.200     20.000   20
GCA     A            0.220     21.300   21
GCC     A            0.230     23.000   23
GCG     A            0.350     34.510   34
TGT     C            0.420      4.000   4
TGC     C            0.580      5.500   5
GAC     D            0.460     25.110   25
GAT     D            0.540     29.710   29
GAG     E            0.290     19.000   19
GAA     E            0.710     47.410   47
TTT     F            0.440     15.600   15
TTC     F            0.560     20.100   20
GGA     G            0.060      4.700   4
GGG     G            0.100      7.600   7
GGC     G            0.410     31.210   31
GGT     G            0.430     32.510   32
...
ACA     T            0.090      4.800   4
ACG     T            0.210     10.900   10
ACT     T            0.210     11.000   11
ACC     T            0.490     25.210   25
GTC     V            0.180     13.000   13
GTA     V            0.180     13.100   13
GTT     V            0.320     23.500   23
GTG     V            0.330     24.300   24
TGG     W            1.000     10.700   10
TAT     Y            0.480     13.400   13
TAC     Y            0.520     14.700   14

I do not know to what extent a rare codon will slow down protein synthesis, especially if you have just a few of them. It is really a judgment call, weighing the lower frequency of the codon against the benefits of introducing the restriction site. In my experience so far I have rarely, if ever, had to go below a frequency of 0.2 (the third column). So in the list above I would avoid the GGA, GGG, and ACA codons, and I would think twice about the GAG, GTC, and GTA codons.
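
The codon check can also be scripted. The Python sketch below works out which codon a single base change produces and looks up its fraction. It assumes the gene starts at base 1 in reading frame 1, and it uses a fixed cut-off of 0.2 taken from the paragraph above, which is a simplification of what remains a judgment call. The fractions come from the table excerpt above; the 24-base test sequence is the start of the remap example, and the C->T change at base 18 is purely illustrative. All names are invented for this example.

```python
# Fractions copied from the partial E. coli table above (third column).
FRACTION = {
    "GCT": 0.20, "GCA": 0.22, "GCC": 0.23, "GCG": 0.35,  # Ala
    "GGA": 0.06, "GGG": 0.10, "GGC": 0.41, "GGT": 0.43,  # Gly
    "ACA": 0.09, "ACG": 0.21, "ACT": 0.21, "ACC": 0.49,  # Thr
    "GTC": 0.18, "GTA": 0.18, "GTT": 0.32, "GTG": 0.33,  # Val
    "GAG": 0.29, "GAA": 0.71,                            # Glu
}


def codon_change(dna, base_pos, new_base):
    """Return (old_codon, new_codon) for a 1-based position in frame 1."""
    start = (base_pos - 1) // 3 * 3   # index of the first base of the codon
    offset = (base_pos - 1) % 3       # position of the changed base in the codon
    old = dna[start:start + 3]
    new = old[:offset] + new_base + old[offset + 1:]
    return old, new


def acceptable(codon, min_fraction=0.2):
    """True if the codon fraction is not below the chosen cut-off."""
    return FRACTION[codon] >= min_fraction


dna = "GCGCGCAGCGAAGGCGCCAGCGCC"  # first 24 bases of the remap example
old, new = codon_change(dna, 18, "T")
print(old, FRACTION[old], "->", new, FRACTION[new], acceptable(new))
```

Here the alanine codon GCC (0.23) becomes GCT (0.20), which is right at the limit I would still accept.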

Finally, the enzyme also has to be commercially available and it would be nice if it produced a sticky end. You can find this out with the program redata. For instance for the BstXI enzyme you would type:

redata -enzyme BstXI -out BstXI.info

Which would yield:

BstXI

Recognition site is CCANNNNNNTGG leaving sticky ends
  Cut positions 5':8 3':4
Organism: Bacillus stearothermophilus X1
Source: ATCC 49821

Isoschizomers:
   BscJI       BssGI       BstHZ55I    BstTI       

Suppliers:
Amersham Pharmacia Biotech (11/00)
Life Technologies Inc. (1/98)
HYBAID GmbH (12/00)
Stratagene (11/00)
Fermentas AB (11/00)
Appligene Oncor (10/97)
American Allied Biochemical, Inc. (10/98)
Nippon Gene Co., Ltd. (6/00)
Takara Shuzo Co. Ltd. (7/00)
Kramel Biotech (7/98)
Roche Molecular Biochemicals (3/99)
New England BioLabs (12/00)
Toyobo Biochemicals (11/98)
CHIMERx (10/97)
Promega Corporation (6/99)

References:
Langdale, J.A., Myers, P.A., Roberts, R.J., Unpublished observations.
Nwankwo, D., Unpublished observations.
Wise, R., Schildkraut, I., Unpublished observations.

So this enzyme produces sticky ends (it cuts the forward strand 8 bases after the start of the recognition sequence and the reverse strand 4 bases after the start of the recognition sequence). It has 4 isoschizomers (enzymes with the same specificity and cleavage site), and there is a whole host of suppliers to choose from. Unfortunately, you only get the suppliers for the enzyme you specified, and not those for the isoschizomers. For instance for PstNHI the output is:

PstNHI

Recognition site is GCTAGC leaving sticky ends
  Cut positions 5':1 3':5
Organism: Pseudomonas stutzeri NH
Source: S.K. Degtyarev

Isoschizomers:
   NheI        AceII       AsuNHI      

References:
Myakisheva, T.V., Dedkov, V.S., Kileva, E.V., Abdurashitov, M.A., Shevchenko, A.V., Degtyarev, S.K., Unpublished observations.

So it seems there are no commercial suppliers. However, if you do the search for the isoschizomer NheI you will find that it is for sale from most suppliers (try it yourself). So if you don't find a supplier for an enzyme, you'll have to go through all its isoschizomers to see if any of those are commercially available.
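
If you have many enzymes to check, you can let a script read the redata output and tell you which isoschizomers still need to be looked up. The Python sketch below is a rough illustration that assumes the output layout shown in the two examples above; parse_redata is a made-up name, and I have not checked it against other versions of redata. The overhang length is computed as the difference between the two cut positions, both counted from the start of the recognition sequence as described above.

```python
import re

PSTNHI_INFO = """PstNHI

Recognition site is GCTAGC leaving sticky ends
  Cut positions 5':1 3':5
Organism: Pseudomonas stutzeri NH
Source: S.K. Degtyarev

Isoschizomers:
   NheI        AceII       AsuNHI

References:
Myakisheva, T.V. et al., Unpublished observations.
"""


def parse_redata(text):
    """Extract site, cut positions, isoschizomers and suppliers."""
    info = {"site": None, "cuts": None, "isoschizomers": [], "suppliers": []}
    m = re.search(r"Recognition site is (\S+) leaving", text)
    if m:
        info["site"] = m.group(1)
    m = re.search(r"Cut positions 5':(-?\d+) 3':(-?\d+)", text)
    if m:
        info["cuts"] = (int(m.group(1)), int(m.group(2)))
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ("Isoschizomers:", "Suppliers:", "References:"):
            section = stripped[:-1]
        elif not stripped:
            section = None  # a blank line ends the current section
        elif section == "Isoschizomers":
            info["isoschizomers"].extend(stripped.split())
        elif section == "Suppliers":
            info["suppliers"].append(stripped)
    return info


info = parse_redata(PSTNHI_INFO)
overhang = abs(info["cuts"][0] - info["cuts"][1])
print(info["site"], "overhang of", overhang, "bases")
if not info["suppliers"]:
    print("no suppliers listed; check isoschizomers:", ", ".join(info["isoschizomers"]))
```

For PstNHI this reports that no suppliers are listed and names NheI, AceII and AsuNHI as the enzymes to run through redata next.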

Once you have decided that a restriction site suggested by silent is worth including in your design, you can make the actual modification in your DNA sequence file (the one you used as input for silent). I recommend making each modification in a copy of the previous file, so that if you mess up you can go back to the last copy that worked well. After you have made the modification you can rerun the remap program to see the new site.
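
Editing bases by hand is error-prone, so it helps to check that a change really is silent. The Python sketch below applies a single base change, refuses it if the translation changes, and counts how often a site occurs afterwards. It is an illustration only: the function names are my own, the standard genetic code is assumed, and the C->T change at base 18 of the first 60 bases of the remap example is made up to show how a GCTAGC site can appear. The site count only looks at the given strand, which is enough for palindromic sites such as GCTAGC.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate reading frame 1, ignoring any incomplete last codon."""
    end = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, end, 3))


def mutate(dna, base_pos, old, new):
    """Change one base (1-based position) and insist the change is silent."""
    if dna[base_pos - 1] != old:
        raise ValueError(f"base {base_pos} is {dna[base_pos - 1]}, not {old}")
    mutated = dna[:base_pos - 1] + new + dna[base_pos:]
    if translate(mutated) != translate(dna):
        raise ValueError("mutation changes the amino acid sequence")
    return mutated


def count_sites(dna, site):
    """Count (possibly overlapping) occurrences of site on this strand."""
    count, start = 0, dna.find(site)
    while start != -1:
        count += 1
        start = dna.find(site, start + 1)
    return count


gene = "GCGCGCAGCGAAGGCGCCAGCGCCCTGGCCACCATCAACCCGCTGAAGACCACCGTGGAA"
new_gene = mutate(gene, 18, "C", "T")
print(translate(new_gene))
print(count_sites(gene, "GCTAGC"), "->", count_sites(new_gene, "GCTAGC"))
```

Keeping the old and new sequence in separate variables mirrors the advice above to work on a copy of the previous file.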

In addition to creating new restriction sites, it can also be worthwhile to remove some. For instance, if you want to use a particular restriction enzyme to insert your gene into the vector and there is a site for that enzyme inside your gene, you may have to remove it. (Perhaps you don't have to, because you can design the synthetic gene with overhangs that are compatible with the sticky ends of your vector, without having to cut the gene itself.) Another situation, which actually happened to me, is that there are two sites for an enzyme in your gene and none in the vector. A silent mutation that removes one of the two sites then leaves you with a unique site.


Prevent inverted repeats

When there is an inverted repeat in your sequence, there is a chance that it will create a stem-loop structure in your mRNA, which will lower translation efficiency. You can find such inverted repeats with the program einverted. This program reads a DNA sequence and finds potential inverted repeats. If there is a significant inverted repeat, you can use silent mutations to break its symmetry. To run the program type:

einverted -seq test.dna -out test.repeat -gap 12 -match 3 -mismatch -4 -threshold 50
-gap 12
Subtract 12 from the score for each gap that is introduced
-match 3
Add 3 to the score for each base pair match in the inverted repeat
-mismatch -4
Subtract 4 from the score for each base pair mismatch
-threshold 50
List a possible stem-loop if its score is higher than 50

The values listed above are the defaults. In my case no inverted repeats were found at these settings. Changing the values to 6, 4, -3, and 40, respectively, did yield predictions of inverted repeats. Since the program does not seem to have been developed for stem-loop prediction in RNA, I don't know how to interpret the results, e.g. what settings would be needed to predict stem-loops that are significant at 37 degrees centigrade. I just tweaked a few areas with long unbroken repeats. This is clearly an area that could be improved.
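
To get a feeling for what "long unbroken repeats" means, here is a much simpler search than the one einverted performs. This Python sketch only finds perfect inverted repeats: a stretch of bases whose exact reverse complement follows within a limited loop distance. It has no scoring, no mismatches and no gaps, so it is not a substitute for einverted. The stem and loop limits are arbitrary values I picked for illustration, and the test sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(dna, stem=6, min_loop=3, max_loop=20):
    """Find perfect inverted repeats with a fixed stem length.

    Returns tuples (left_start, left_seq, right_start, right_seq) with
    1-based start positions. The right arm must be the exact reverse
    complement of the left arm and lie min_loop to max_loop bases after it.
    """
    hits = []
    for i in range(len(dna) - 2 * stem - min_loop + 1):
        left = dna[i:i + stem]
        target = revcomp(left)
        first = i + stem + min_loop
        last = min(i + stem + max_loop, len(dna) - stem)
        for j in range(first, last + 1):
            if dna[j:j + stem] == target:
                hits.append((i + 1, left, j + 1, target))
    return hits


# Made-up test sequence: GCTGAC, a 4-base loop, then its reverse complement.
test_seq = "AAGCTGACTTTTGTCAGCAA"
for hit in find_inverted_repeats(test_seq):
    print(hit)
```

A hit from this search marks a place where a silent mutation in one of the two arms would break the perfect pairing.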


For questions or suggestions please send e-mail to: Bart.Hazes@Ualberta.ca