Synthetic gene design
The goal of designing a synthetic gene is to create a gene with features optimized for the intended research. Instead of cloning an existing gene and living with whatever DNA sequence it has, a synthetic gene lets you exploit the redundancy of the genetic code to engineer desired properties into the sequence. Design goals can be to optimize the codon usage for an expression host of choice, insert convenient restriction sites, remove inverted repeats, (pre/ap)pend affinity tags, etc. As an additional advantage, it allows you to skip the cloning step, since all you need is the amino acid sequence. Of course there are also disadvantages. You have to do the design work, you have to pay for the DNA synthesis, and what looked good on paper may not work in practice. There are several reports in the literature on using oligo synthesis to create genes. We ourselves have no experience with this yet, but these pages will document our attempts to get this working.
!!! Use this information at your own risk !!!
All programs used here come from the EMBOSS package, which is distributed at no cost. The programs can be compiled for most UNIX workstations, including Linux, which I am using. You can find documentation for the programs in directory /eagle/share/seqanal/EMBOSS-1.8.0/doc/programs/text.
Several programs need or produce sequence files. I have been using the FASTA format. FASTA files are simple and contain one comment line (starting with a ">" symbol) followed by the sequence data (either amino acid or DNA). Here is an example:
>pao gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain PAO)
ARSEGASALATINPLKTTVEESLSRGIAGSKIKIGTTASTATETYVGVEPDANKLGVIAVAIEDSGAGDI
TFTFQTGTSSPKNATKVITLNRTADGVWACKSTQDPMFTPKGCDN
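If you want to check a FASTA file before feeding it to the EMBOSS programs, a few lines of Python are enough. This is only a minimal sketch: the function name parse_fasta is my own, and it assumes the simple layout described above (one ">" comment line followed by sequence lines).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (comment, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>pao gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain PAO)
ARSEGASALATINPLKTTVEESLSRGIAGSKIKIGTTASTATETYVGVEPDANKLGVIAVAIEDSGAGDI
TFTFQTGTSSPKNATKVITLNRTADGVWACKSTQDPMFTPKGCDN
"""

records = parse_fasta(example)
name, seq = records[0]
print(name.split()[0], len(seq))
```

The sequence lines are joined into a single string, so the line wrapping in the file does not matter.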
To convert the amino acid sequence into a DNA sequence we use the backtranseq program. This program reads in an amino acid sequence and produces a DNA sequence. You can tell the program for which expression host to optimize the codon usage by giving it a codon usage table (here Eecoli.cut, the table for E. coli). The command to do this is:
backtranseq -seq test.fa -out test.dna -cfile Eecoli.cut
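To illustrate the idea behind back-translation, here is a minimal Python sketch that replaces every amino acid with its most frequent E. coli codon. It is not the backtranseq implementation. The name back_translate and the PREFERRED_CODON dictionary are my own, and the dictionary only covers the amino acids whose rows appear in the codon usage excerpt further down this page.

```python
# Most frequent E. coli codon per amino acid, read off the "fraction"
# column of the codon usage excerpt shown later on this page.
# Amino acids missing from that excerpt are deliberately left out.
PREFERRED_CODON = {
    "A": "GCG", "C": "TGC", "D": "GAT", "E": "GAA", "F": "TTC",
    "G": "GGT", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def back_translate(protein, preferred=PREFERRED_CODON):
    """Return a DNA string that encodes `protein` using the preferred codons."""
    missing = sorted(set(protein) - set(preferred))
    if missing:
        raise KeyError("no codon preference for: " + ", ".join(missing))
    return "".join(preferred[aa] for aa in protein)


print(back_translate("AEGT"))  # GCGGAAGGTACC
```

A complete version would need a preferred codon for all 20 amino acids, which you can take from the full codon usage table of your expression host.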
Look at the test.dna output file. It should contain a DNA sequence. To get a fancier and more informative representation, run the remap program, which lists the forward and reverse DNA strands, the amino acid sequences of the six reading frames (the one on top corresponds to your protein sequence), and the restriction sites. At the end of the file you will find a table of all enzymes that cut your DNA sequence, with the number of times each one cuts. Finally, there is a table of all enzymes that do not cut your DNA sequence. The remap command has a number of options to select which restriction enzymes are displayed. You can, for instance, ask it to show only enzymes that produce sticky ends, are commercially available, and recognize at least 6 bases. For more options, see the documentation. An example of a remap command is:
remap -seq test.dna -out test.remap -enzymes ALL -sitelen 6 -sticky -commercial
An example of part of the output is:
try7 gi|77635|pir||A25023 fimbrial protein - Pseudomonas aeruginosa (strain PAO)

                BstXI
                =======>====
            BbeI
            ====>=
            KasI                            BbsI
            >=====                          ======.>
BsePI       Mly113I                       Eco57I
>=====      =>====                        ======............
GCGCGCAGCGAAGGCGCCAGCGCCCTGGCCACCATCAACCCGCTGAAGACCACCGTGGAA
        10        20        30        40        50        60
----:----|----:----|----:----|----:----|----:----|----:----|
CGCGCGTCGCTTCCGCGGTCGCGGGACCGGTGGTAGTTGGGCGACTTCTGGTGGCACCTT
=====<      ====<=                        ======............
BsePI       Mly113I                       Eco57I
            =====<                          ======......<
            KasI                            BbsI
            =<====
            BbeI
                ====<=======
                BstXI
A  R  S  E  G  A  S  A  L  A  T  I  N  P  L  K  T  T  V  E
 R  A  A  K  A  P  A  P  W  P  P  S  T  R  *  R  P  P  W  K
  A  Q  R  R  R  Q  R  P  G  H  H  Q  P  A  E  D  H  R  G  R
----:----|----:----|----:----|----:----|----:----|----:----|
  R  A  A  F  A  G  A  G  Q  G  G  D  V  R  Q  L  G  G  H  F
 A  R  L  S  P  A  L  A  R  A  V  M  L  G  S  F  V  V  T  S
X  A  C  R  L  R  W  R  G  P  W  W  *  G  A  S  S  W  R  P

     Bpu10I
     =>=====
Eco57I     Cfr42I                                  AsuNHI
...>       ===>==                                  >=====
GAAAGCCTGAGCCGCGGTATCGCGGGTAGCAAAATCAAAATCGGCACCACCGCTAGCACC
        70        80        90       100       110       120
----:----|----:----|----:----|----:----|----:----|----:----|
CTTTCGGACTCGGCGCCATAGCGCCCATCGTTTTAGTTTTAGCCGTGGTGGCGATCGTGG
..<        ==<===                                  =====<
Eco57I     Cfr42I                                  AsuNHI
     =====<=
     Bpu10I
E  S  L  S  R  G  I  A  G  S  K  I  K  I  G  T  T  A  S  T
 K  A  *  A  A  V  S  R  V  A  K  S  K  S  A  P  P  L  A  P
  K  P  E  P  R  Y  R  G  *  Q  N  Q  N  R  H  H  R  *  H  R
----:----|----:----|----:----|----:----|----:----|----:----|
  F  A  Q  A  A  T  D  R  T  A  F  D  F  D  A  G  G  S  A  G
 S  L  R  L  R  P  I  A  P  L  L  I  L  I  P  V  V  A  L  V
L  F  G  S  G  R  Y  R  P  Y  C  F  *  F  R  C  W  R  *  C

etc...
Enzymes that cut

AarI      1
Acc36I    1
Acc65I    1
AgeI      1
AloI      1
AspI      1
AsuNHI    1
BamHI     1
BbeI      1
BbsI      1
BbuI      1
BcgI      1
BclI      1
Bpu10I    1
BsePI     1
BsgI      1
BsiWI     1
BspCI     1
BstXI     1
BtsI      1
Cfr42I    1
EciI      1
Eco57I    1
HindIII   1
KasI      1
KpnI      1
Mly113I   1

Enzymes that do not cut

AatII AauI AccB7I AccIII AclI AdeI AflII AhdI AhlI Alw44I
AlwNI AocI ApaI ApaLI AscI AseI AsiAI AsnI Asp718I AspEI
AsuII AvrII AxyI BanIII BbvCI BciVI BcuI BfiI BfrI BfuI
BglI BglII BlnI BlpI BmrI BpiI BplI BpmI Bpu1102I Bpu14I
BpuAI Bsa29I BsaI BsaMI BscCI BscI Bse21I Bse3DI BseAI BseCI
BseMI BseRI BseX3I BshTI BsiCI BsiMI BsiQI BsiXI BsmBI BsmI
Bso31I Bsp106I Bsp119I Bsp120I Bsp13I Bsp1407I Bsp1720I Bsp19I BspA2I BspDI
BspEI BspHI BspLU11I BspMI BspTI BspXI BsrDI BsrGI BssHII BssSI
Bst2BI Bst98I BstAPI BstBI BstEII BstENI BstPI BstV2I BstZI Bsu15I
Bsu36I Bsu6I CaiI CciNI CelII Cfr9I ClaI Csp45I CvnI DraIII
DrdI DseDI EagI Eam1104I Eam1105I EarI EclHKI EclXI Eco31I Eco52I
Eco81I Eco91I EcoNI EcoO65I EcoRI EcoT22I Esp3I FauNDI FbaI FseI
GsuI Kpn2I Ksp22I Ksp632I KspI LspI MfeI MluI Mph1103I MroI
MroNI MspCI MunI Mva1269I NarI NcoI NdeI NgoAIV NgoMIV NheI
NotI NruGI NsiI NspV PacI PaeI PaeR7I PagI PauI PciI
PctI Pfl23II PflFI PflMI PinAI Ple19I PpiI Ppu10I PshBI Psp124BI
Psp1406I PspAI PspEI PspLI PspOMI PstI PsyI PvuI RcaI SacI
SacII SalI SapI SbfI SdaI SfiI Sfr274I Sfr303I SfuI SgfI
SpeI SphI Sse8387I SspBI SstI SstII SunI TliI Tth111I Van91I
Vha464I VneI VspI XagI XbaI XcmI XhoI XmaCI XmaI XmaIII
XmaJI Zsp2I
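If you only want to know where one particular recognition sequence occurs, you can also search for it directly. The Python sketch below is an illustration, not a description of how remap works. The names IUPAC and find_sites are my own, the sequence is the first 120 bases of the remap excerpt above, and only the strand as given is scanned, which is sufficient for palindromic sites such as GGCGCC (KasI) and GCTAGC (AsuNHI).

```python
import re

# IUPAC codes used by the recognition patterns on this page.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def find_sites(dna, pattern):
    """Return the 1-based start positions of `pattern` on the given strand."""
    regex = "".join(IUPAC[base] for base in pattern)
    # The lookahead lets overlapping occurrences be reported as well.
    return [m.start() + 1 for m in re.finditer("(?=" + regex + ")", dna)]


# First 120 bases of the forward strand from the remap excerpt above.
dna = ("GCGCGCAGCGAAGGCGCCAGCGCCCTGGCCACCATCAACCCGCTGAAGACCACCGTGGAA"
       "GAAAGCCTGAGCCGCGGTATCGCGGGTAGCAAAATCAAAATCGGCACCACCGCTAGCACC")

for name, site in [("KasI", "GGCGCC"), ("BstXI", "CCANNNNNNTGG"),
                   ("AsuNHI", "GCTAGC")]:
    print(name, site, find_sites(dna, site))
```

For non-palindromic sites you would also have to scan the reverse complement of the sequence.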
To add restriction sites we need a program that predicts base changes that would create new restriction sites without changing the encoded amino acid sequence. That is done with the program silent (which makes "silent" base changes). The program needs an input DNA sequence that corresponds exactly to the amino acid sequence that should be encoded (e.g. the output of backtranseq). The command to do this is:
silent -seq test.dna -out test.silent -enzymes ALL
The next thing to do is to look for interesting restriction sites in the output of silent. An interesting enzyme is one whose site is unique within your gene and within the vector that you want to use. Therefore I ignore any enzyme that recognizes fewer than 6 unique bases. Here is an example bit of a silent output file:
Results for pao:
KEY:
    Enzyme      Enzyme name
    RS-Pattern  Restriction enzyme recognition site pattern
    Match-Posn  Position of the first base of RS pattern in sequence
    AA          Amino acid. Original sequence(.)After mutation
    Base-Posn   Position of base to be mutated in sequence
    Mutation    The base mutation to perform

Silent mutations
Enzyme   RS-Pattern    Match-Posn  AA   Base-Posn  Mutation
XspI     CTAG          17          A.A  18         G->T
PstNHI   GCTAGC        16          A.A  18         G->T
BstXI    CCANNNNNNTGG  17          A.A  18         G->C
HaeII    RGCGCY        19          A.A  24         G->Y
FunI     AGCGCT        19          A.A  24         G->T
AbeI     CCTCAGC       66          L.L  69         G->C
BseMII   CTCAG         67          L.L  69         G->C
...
As you can see, the program produces a long list of potential restriction sites. The meaning of the 6 columns is listed at the top of the file. The thing to look for is column 2. Four-cutters, like XspI, are not interesting since they almost certainly will not be unique in your gene and vector. Similarly, five-cutters (e.g. BseMII) and redundant six-cutters (e.g. HaeII) are less interesting. I therefore looked for non-redundant six-cutters (PstNHI and FunI in the list above) or enzymes with even higher specificity (e.g. AbeI, a seven-cutter).
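Before editing your sequence file it is worth double-checking that a proposed change really is silent and really creates the site. Here is a minimal Python sketch of such a check. The helper names (translate, apply_mutation, is_silent) are my own, the standard genetic code is assumed, and the six-base fragment is a made-up Ala-Ser example rather than part of the pao gene.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate frame 1 of `dna`; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def apply_mutation(dna, position, old, new):
    """Change the base at 1-based `position` from `old` to `new`."""
    if dna[position - 1] != old:
        raise ValueError(f"expected {old} at position {position}")
    return dna[:position - 1] + new + dna[position:]


def is_silent(original, mutated):
    """True if both sequences encode the same protein."""
    return translate(original) == translate(mutated)


fragment = "GCGAGC"                      # hypothetical fragment: Ala-Ser
mutated = apply_mutation(fragment, 3, "G", "T")
print(mutated, is_silent(fragment, mutated), "GCTAGC" in mutated)
```

In this toy example the G->T change in the third base of the alanine codon creates a GCTAGC site while still encoding Ala-Ser.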
For each restriction site of interest I then checked the frequency of the codon needed to create the restriction site. If the required codon is very rare then it may be better to look for another potential restriction site nearby. Tables of codon usage in various organisms can be found in directory /home/share/seqanal/EMBOSS-1.8.0/emboss/data/CODONS. For E. coli this looks like:
TAG  *  0.070   0.200   0
TGA  *  0.210   0.600   0
TAA  *  0.710   2.000   2
GCT  A  0.200  20.000  20
GCA  A  0.220  21.300  21
GCC  A  0.230  23.000  23
GCG  A  0.350  34.510  34
TGT  C  0.420   4.000   4
TGC  C  0.580   5.500   5
GAC  D  0.460  25.110  25
GAT  D  0.540  29.710  29
GAG  E  0.290  19.000  19
GAA  E  0.710  47.410  47
TTT  F  0.440  15.600  15
TTC  F  0.560  20.100  20
GGA  G  0.060   4.700   4
GGG  G  0.100   7.600   7
GGC  G  0.410  31.210  31
GGT  G  0.430  32.510  32
...
ACA  T  0.090   4.800   4
ACG  T  0.210  10.900  10
ACT  T  0.210  11.000  11
ACC  T  0.490  25.210  25
GTC  V  0.180  13.000  13
GTA  V  0.180  13.100  13
GTT  V  0.320  23.500  23
GTG  V  0.330  24.300  24
TGG  W  1.000  10.700  10
TAT  Y  0.480  13.400  13
TAC  Y  0.520  14.700  14
I do not know to what extent a rare codon will slow down protein synthesis, especially if you have just a few of them. It is really a judgment call, weighing the lower frequency of the codon against the benefits of introducing the restriction site. In my experience so far I have rarely if ever had to go below a frequency (the third column in the table) of 0.2. So in the list above I would avoid the GGA, GGG, and ACA codons, and I would think twice about the GAG, GTC, and GTA codons.
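This check is easy to automate once you have modified your sequence. The Python sketch below flags codons whose fraction falls below a chosen threshold. The names FRACTION and rare_codons are my own, the values are copied from the third column of the table above for glycine, threonine, valine, and glutamate only, and codons that are not in the dictionary are simply skipped.

```python
# Third-column values for a few E. coli codons, copied from the table above.
FRACTION = {
    "GGA": 0.06, "GGG": 0.10, "GGC": 0.41, "GGT": 0.43,  # Gly
    "ACA": 0.09, "ACG": 0.21, "ACT": 0.21, "ACC": 0.49,  # Thr
    "GTC": 0.18, "GTA": 0.18, "GTT": 0.32, "GTG": 0.33,  # Val
    "GAG": 0.29, "GAA": 0.71,                            # Glu
}


def rare_codons(dna, threshold=0.2, fraction=FRACTION):
    """List (1-based position, codon, fraction) for frame-1 codons below `threshold`."""
    hits = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        value = fraction.get(codon)  # None for codons not in the dictionary
        if value is not None and value < threshold:
            hits.append((i + 1, codon, value))
    return hits


print(rare_codons("GGAACCGTC"))  # [(1, 'GGA', 0.06), (7, 'GTC', 0.18)]
```

The threshold of 0.2 is just the rule of thumb mentioned above; pick whatever cut-off you are comfortable with.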
Finally, the enzyme also has to be commercially available and it would be nice if it produced a sticky end. You can find this out with the program redata. For instance for the BstXI enzyme you would type:
redata -enzyme BstXI -out BstXI.info
Which would yield:
BstXI
Recognition site is CCANNNNNNTGG leaving sticky ends
Cut positions 5':8 3':4
Organism: Bacillus stearothermophilus X1
Source: ATCC 49821
Isoschizomers:
    BscJI BssGI BstHZ55I BstTI
Suppliers:
    Amersham Pharmacia Biotech (11/00)
    Life Technologies Inc. (1/98)
    HYBAID GmbH (12/00)
    Stratagene (11/00)
    Fermentas AB (11/00)
    Appligene Oncor (10/97)
    American Allied Biochemical, Inc. (10/98)
    Nippon Gene Co., Ltd. (6/00)
    Takara Shuzo Co. Ltd. (7/00)
    Kramel Biotech (7/98)
    Roche Molecular Biochemicals (3/99)
    New England BioLabs (12/00)
    Toyobo Biochemicals (11/98)
    CHIMERx (10/97)
    Promega Corporation (6/99)
References:
    Langdale, J.A., Myers, P.A., Roberts, R.J., Unpublished observations.
    Nwankwo, D., Unpublished observations.
    Wise, R., Schildkraut, I., Unpublished observations.
So this enzyme produces sticky ends (it cuts the forward strand 8 bases after the start of the recognition sequence and the reverse strand 4 bases after the start of the recognition sequence). It has 4 isoschizomers (enzymes with the same specificity and cleavage site), and there is a whole host of suppliers to choose from. Unfortunately, you only get the suppliers for the enzyme you specified, and not those for the isoschizomers. For instance for PstNHI the output is:
PstNHI
Recognition site is GCTAGC leaving sticky ends
Cut positions 5':1 3':5
Organism: Pseudomonas stutzeri NH
Source: S.K. Degtyarev
Isoschizomers:
    NheI AceII AsuNHI
References:
    Myakisheva, T.V., Dedkov, V.S., Kileva, E.V., Abdurashitov, M.A., Shevchenko, A.V., Degtyarev, S.K., Unpublished observations.
So it seems there are no commercial suppliers. However, if you run the search for the isoschizomer NheI you will find that it is for sale from most suppliers (try it yourself). So if you don't find a supplier for an enzyme, you'll have to go through all of its isoschizomers to see if any of those is commercially available.
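The "Cut positions" line in the redata output also tells you what kind of end the enzyme leaves. As a small Python sketch (the function name describe_ends is my own), assuming both numbers are counted from the start of the recognition sequence as in the BstXI explanation above:

```python
def describe_ends(cut_5, cut_3):
    """Describe the end left by an enzyme from its two cut positions.

    cut_5: bases after the start of the site where the forward strand is cut.
    cut_3: bases after the start of the site where the reverse strand is cut.
    """
    length = abs(cut_5 - cut_3)
    if length == 0:
        return "blunt"
    # If the forward strand is cut further downstream than the reverse
    # strand, the single-stranded tail ends in a 3' terminus.
    kind = "3'" if cut_5 > cut_3 else "5'"
    return f"{length}-base {kind} overhang"


print("BstXI :", describe_ends(8, 4))  # 4-base 3' overhang
print("PstNHI:", describe_ends(1, 5))  # 4-base 5' overhang
```

Both enzymes leave four-base sticky ends, but with opposite polarity, which matters when you design overhangs that have to match a vector.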
Once you have decided that a restriction site suggested by silent is worth including in your design, you can make the actual modification in your DNA sequence file (the one you used as input for silent). I recommend making each modification in a copy of the previous file so that if you mess up you can go back to the last copy that worked well. After you have made the modification you can rerun the remap program to see the new site.
In addition to creating new restriction sites it can also be worthwhile to remove some sites. For instance, if you want to use a particular restriction enzyme to insert your gene in the vector and there is a restriction site for that enzyme inside your gene you may have to remove it. (Perhaps you don't have to because you can design in the right overhangs that will be compatible with the sticky ends of your vector without having to cut your synthetic gene). Another situation, which actually happened to me, is that there are two restriction sites for an enzyme in your gene and none in the vector. By making a silent mutation that removes one of the restriction sites you create a unique site.
When there is an inverted repeat in your sequence there is a chance that it will create a stem-loop structure in your mRNA which will degrade translation efficiency. You can find such inverted repeats with the program einverted. This program reads a DNA sequence and finds potential inverted repeats. If there is a significant inverted repeat you can use silent mutations to break down the inversion symmetry. To run the program type:
einverted -seq test.dna -out test.repeat -gap 12 -match 3 -mismatch -4 -threshold 50
The values listed above are the defaults. In my case no inverted repeats were found at these settings. Changing the values to 6, 4, -3, and 40 did yield predictions of inverted repeats. Since the program does not seem to be developed for stem-loop prediction in RNA I don't know how to interpret the results, e.g. what settings would be needed to predict stem loops that are significant at 37 degrees centigrade. I just tweaked a few areas with long unbroken repeats. This is clearly an area that could be improved.
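To get a feeling for what a perfect inverted repeat looks like, here is a naive Python sketch that reports exact stems separated by a short loop. This is not the einverted algorithm, which scores alignments with matches, mismatches, and gaps, and it says nothing about stability at a given temperature. The function names and the parameters stem and max_loop are my own, and the test sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_perfect_inverted_repeats(dna, stem=6, max_loop=12):
    """Return (left_start, right_start, left_arm) for every exact stem.

    Positions are 1-based. The right arm must be the reverse complement of
    the left arm and start at most `max_loop` bases after the left arm ends.
    """
    hits = []
    for i in range(len(dna) - 2 * stem + 1):
        left = dna[i:i + stem]
        target = reverse_complement(left)
        for loop in range(max_loop + 1):
            j = i + stem + loop
            if j + stem > len(dna):
                break  # right arm would run off the end of the sequence
            if dna[j:j + stem] == target:
                hits.append((i + 1, j + 1, left))
    return hits


# Made-up test sequence: stem GGCCTC, loop TTTT, stem GAGGCC.
test_seq = "AAGGCCTCTTTTGAGGCCAA"
hits = find_perfect_inverted_repeats(test_seq)
print(hits)
```

Any hit reported this way is a candidate for a silent mutation in one of the two arms, which breaks the symmetry.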
For questions or suggestions please send e-mail to: Bart.Hazes@Ualberta.ca