Synthesizing non-natural parts from natural genomic template
© Dhar et al. 2009
Received: 10 November 2008
Accepted: 03 February 2009
Published: 03 February 2009
The current knowledge of genes and proteins comes from 'naturally designed' coding and non-coding regions. It would be interesting to move beyond natural boundaries and make user-defined parts. To explore this possibility we made six non-natural proteins in E. coli. We also studied their potential tertiary structure and phenotypic outcomes.
The chosen intergenic sequences were amplified and expressed using the pBAD 202/D-TOPO vector. All six proteins showed very low similarity to known proteins in the NCBI protein database. Protein expression was confirmed by Western blot. The endogenous expression of one of the proteins inhibited cell growth. The growth inhibition was completely rescued by culturing cells in inducer-free medium. Computational structure prediction suggests a globular tertiary structure for two of the six non-natural proteins synthesized.
To the best of our knowledge, this is the first study that demonstrates artificial synthesis of non-natural proteins from an existing genomic template, together with their potential tertiary structure and phenotypic outcome. The work presented in this paper opens up a new avenue for investigating fundamental biology. Our approach can also be used to synthesize large numbers of non-natural RNA and protein parts for useful applications.
Organisms use Nature's inventory of materials and designs for living. The raw material mostly comes in the form of DNA, RNA and protein. DNA, a repository for long-term storage of genetic instructions, comprises genes and intergenic regions. While genic regions have been thoroughly investigated in the past, intergenic regions have received increased attention only recently [1–4]. It would be interesting to mine intergenic regions for unidentified genes and also to use them for making novel proteins.
Here we present a simple and scalable approach for making non-natural proteins from the 'not-coding' intergenic regions. The term 'not-coding' is used here to mean 'not naturally designed for making proteins'. In contrast to previously described approaches (chemically synthesizing randomized protein sequences [5, 6], adding tolerated point mutations to natural proteins, generating polypeptide sequences by combinatorial shuffling, and improving protein functions through directed evolution), we used the existing genomic template of E. coli to make non-natural proteins.
Description of eka sequences
Start – End
% vector contribution
70,283 – 70,386
3,651,282 – 3,651,704
348,779 – 349,210
49,681 – 49,785
57,173 – 57,313
70,285 – 70,380
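The step from a genomic Start – End interval to a candidate protein sequence can be illustrated with a minimal sketch. It assumes 1-based, inclusive coordinates, translation in the first reading frame of the forward strand, and the standard genetic code; the function names and the toy sequence are hypothetical, and the sketch ignores any residues contributed by the expression vector.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_region(genome: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive slice genome[start..end] in upper case."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the genome")
    return genome[start - 1:end].upper()


def translate(dna: str, stop_at_stop: bool = True) -> str:
    """Translate DNA in frame 0; codons with unknown bases become 'X'."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for pos in range(0, usable, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequence for illustration only (not E. coli sequence).
toy_genome = "CCCATGGCTAAATGACCC"
region = extract_region(toy_genome, 4, 15)  # "ATGGCTAAATGA"
peptide = translate(region)                 # "MAK"
```

With inclusive coordinates, an interval spans end - start + 1 nucleotides, so the expressed protein length also depends on the reading frame chosen and on any vector-derived residues.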
Results and discussion
In EKA5 (Fig 4b), weak similarity to the beta propeller fold of viral neuraminidases was suggested by the PDB-Basic tool from the 3D-Jury authors. When EKA5 is modelled onto the two predicted templates (PDB: 2htv and 2ht5) using Modeller, it becomes apparent that the aligned portion of EKA5 covers only three of the four beta strands that would normally form a beta sheet representing one of the six blades of the overall propeller structure. Correct folding depends on proper stacking of the blades, including hydrophobic contacts, which indicates that a single blade alone, as predicted for EKA5, would not form a stable structure. An interesting question is whether single EKA5 blades could homo-polymerize into a full propeller when over-expressed. However, such questions can only be answered through further experimental structural studies.
Most of the currently known protein functions require folding of a protein sequence into a globular structure. Hence, we wanted to investigate whether the sequences could in principle adopt a known fold. While pure ab initio structure prediction is still in its infancy, the most successful current methods rely strongly on sequence similarity for fold recognition. However, we are dealing with new, unknown sequences that are not expected to have clear homologues. Our method of choice was to try several possibilities and take a consensus of the predictions. Notably, threading methods gave more consistent results (more template hits were similar to each other), which makes sense since threading methods place more emphasis on how well a sequence satisfies the biophysical requirements of fitting into a structure than on similarity to sequences of known structures. A globular tertiary structure was predicted only for the longest of the six sequence inserts, EKA3 (143 amino acids). EKA5 (47 amino acids) also showed similarity to a known fold, but only to one of its substructures, which is not known to form a stable structure on its own. Similarly, the other four, EKA1 (33 amino acids), EKA2 (46 amino acids), EKA4 (35 amino acids) and EKA6 (32 amino acids), appear too short to form complex tertiary structures on their own. At best we find similarities to no more than a single helix. Furthermore, low complexity predictions over large parts of their sequences are an additional indicator of the absence of globular structure. On the other hand, the proposed structure for EKA3 is consistent among the models derived from the four top-ranked hits, adding support to the prediction, and the inter-model variability (Fig 4a) allows an estimate of the maximal accuracy that can be expected in this case. However, structure prediction in the absence of sequence similarity remains notoriously difficult, and experimental validation is needed to confirm the validity of our models.
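The idea behind a low complexity prediction can be sketched with a sliding-window Shannon entropy over the amino acid sequence. This is a simplified stand-in, not the segmentation algorithm cited in the text; the window length and entropy threshold below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_fraction(seq: str, window: int = 12,
                            threshold: float = 2.2) -> float:
    """Fraction of residues covered by at least one low-entropy window.

    window and threshold are illustrative assumptions, not fitted values.
    """
    if len(seq) < window:
        raise ValueError("sequence shorter than the window")
    flagged = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            for j in range(i, i + window):
                flagged[j] = True
    return sum(flagged) / len(seq)


# Hypothetical sequences for illustration only.
repetitive = "A" * 20
diverse = "ACDEFGHIKLMNPQRSTVWY"
frac_repetitive = low_complexity_fraction(repetitive)  # every window flagged
frac_diverse = low_complexity_fraction(diverse)        # no window flagged
```

A high flagged fraction would be read only as a hint that a sequence is compositionally biased; it does not by itself establish the absence of a fold.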
We do not disregard the possibility that some of the EKA genes may turn out to be real in some organisms or may represent evolutionary remnants of what was once a functional sequence. In fact, in one previous study, expression of 4052 coding transcripts and 1102 additional transcripts in the intergenic regions of the E. coli genome was identified using a whole genome array. However, to the best of our knowledge, intentional conversion of these sequences to synthesize non-natural proteins is a novel attempt. One could ask why Nature did not sample these genomic regions. And if Nature did sample these regions, were these proteins discarded? If yes, why? To answer such questions one must synthesize more non-natural proteins and study their impact on cell physiology. It would be interesting to sample conserved intergenic regions, subsets of introns, overlapping regions, and so on, and to study the impact of making novel RNA and protein parts. Given that about 98.5% of the human genome is non-protein-coding (introns and intergenic regions), it would be useful to mine this enormous resource to make non-natural parts for useful applications. It has not escaped our attention that our approach can be extended to make non-natural RNA parts, both coding and non-coding.
Interestingly, several studies point to the evolution-driven conversion of not-coding regions to coding regions [20–24]. In contrast, our work demonstrates a user-defined conversion leading to the synthesis of non-natural parts. It is relevant to ask how one can evaluate the functions of genes 'not naturally needed for survival'. The traditional approaches of gene knockout and down-regulation of expression are unattractive, since organisms do not need these parts by default. In our opinion, expressing such sequences under the control of a strong promoter, followed by microarray analysis, could help identify the interactions and pathways through which such non-natural parts act.
Furthermore, non-natural proteins that are stably expressed can be systematically tested to see whether they adopt new folds or functions of any kind. Besides looking at these non-natural proteins in isolation, different effects could possibly be obtained by combining them with known domains. In theory, it could be possible to derive novel synthetic multi-domain proteins in a combinatorial fashion. Given that our analyzed examples indicate both non-coding intergenic and coding but out-of-frame segments as suitable candidates for producing variants of new proteins, the imaginable number of potential new RNA and protein parts and their combinations is enormous.
It is worth noting that our work does not describe an approach for rationally designing RNA and protein parts based on higher-level parameters. Our approach resembles a semi-random strategy of synthesizing non-natural parts, followed by functional analysis. Although we were successful in expressing all six sequences, we do not know the boundary conditions of this approach, if any. To answer this question, one should make proteins from genome regions of different lengths, origins and features.
To the best of our knowledge this is the first report that describes artificial synthesis of protein parts from genomic regions not naturally utilized to make proteins. It would be interesting to extend this study to synthesize and characterize non-natural RNA parts. The cell-free synthesis of non-natural parts can be used in situations where their intracellular synthesis results in cell death. The other important issue, addressed through this work, is the prediction of potential tertiary structures of non-natural proteins. Though initial computational analysis indicates several potential structures, experimental study is needed to confirm these predictions. In future, an extensive study is required to uncover existence of novel structures 'possibly embedded' in the genome. Finally, our approach can be used to make novel enzymes, transcription factors, receptor proteins and so on.
Positive and negative controls were used to validate expression of the EKA proteins: the pBAD202/D/lacZ vector (Invitrogen) served as the positive control, and the pBAD202/D-TOPO vector without an eka sequence served as the negative control. Cell growth was monitored automatically every 10 minutes for 10 hours using an automated multiplate reader (Tecan Plate Reader, Magellan 200) at 37°C. The growth inhibitory effect of EKA1 was rescued by removing the inducer, i.e., by washing and re-culturing cells in arabinose (-) medium (Fig 3).
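Readings taken every 10 minutes for 10 hours give 61 time points per well if the reading at time zero is included. As a minimal sketch of one way such a growth curve could be summarized (the text does not state that this analysis was performed), the maximum specific growth rate can be estimated as the steepest slope of ln(OD) against time over a sliding window. The window size, function names and synthetic data below are hypothetical.

```python
import math


def time_points(interval_min: int = 10, duration_h: int = 10) -> list:
    """Measurement times in minutes, including the reading at t = 0."""
    n_intervals = duration_h * 60 // interval_min
    return [i * interval_min for i in range(n_intervals + 1)]


def slope(xs, ys) -> float:
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def max_growth_rate(times_min, od, window: int = 6) -> float:
    """Steepest slope of ln(OD) versus time (per hour) over sliding windows."""
    hours = [t / 60 for t in times_min]
    log_od = [math.log(v) for v in od]
    return max(
        slope(hours[i:i + window], log_od[i:i + window])
        for i in range(len(od) - window + 1)
    )


# Synthetic, hypothetical data: pure exponential growth, doubling every 30 min.
times = time_points()
od_synthetic = [0.01 * 2 ** (t / 30) for t in times]
rate = max_growth_rate(times, od_synthetic)  # close to ln(2) / 0.5 h
```

Comparing this rate between induced and inducer-free cultures would be one simple way to quantify growth inhibition and its rescue.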
To investigate the possibility of EKA proteins folding into globular structures, all six protein sequences were submitted to the consensus structure prediction method 3D-Jury. The algorithm identifies consensus structural units shared among templates suggested by a wide range of established structure prediction servers. In the case of EKA3, all four top-ranked hits came from predictions by the threading method mGenThreader. Since these hits are structurally related, we used each of them as a separate template for modelling with the software Modeller (version 9.1) to gauge the structural variability of similar predictions. Figures of the structures were generated using the Yasara application.
PKD, CST, KT, SB acknowledge RIKEN's funding support for this project. PKD would also like to convey his sincere thanks to Professor Alessandro Giuliani, Dr. Y. Sakaki, Dr. S. Onami, Dr. Todd Taylor, Dr. M. Matsui, Dr. Y. Kondou, and Dr. Ch. Mohan Rao for their kind support and helpful comments. The E. coli MG1655 strain was kindly provided by the National Institute of Genetics (NIG, Japan). We sincerely thank the anonymous reviewers for critically reviewing the paper and helping us bring out the key message more clearly.
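The consensus idea can be sketched as follows: each candidate model is scored by how strongly it agrees with the other candidate models, so that models supported by several independent predictions rank highest. This is a simplified illustration in the spirit of consensus scoring, not the server's actual implementation; the similarity values, the threshold of 40 and the normalization by the number of models are assumptions made for this example.

```python
def consensus_scores(similarity, threshold: float = 40.0) -> list:
    """Score each model by its summed similarity to the other models.

    similarity is a symmetric square matrix of pairwise structural
    similarity values; pairs at or below threshold are ignored.
    Scores are divided by the number of models (assumed normalization).
    """
    n = len(similarity)
    scores = []
    for i in range(n):
        total = sum(
            similarity[i][j]
            for j in range(n)
            if j != i and similarity[i][j] > threshold
        )
        scores.append(total / n)
    return scores


# Hypothetical pairwise similarities: models 0-2 agree, model 3 is an outlier.
pairwise = [
    [0, 60, 55, 10],
    [60, 0, 70, 12],
    [55, 70, 0, 8],
    [10, 12, 8, 0],
]
scores = consensus_scores(pairwise)
best_model = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the three mutually similar models receive non-zero scores and the outlier scores zero, which mirrors the reasoning that structurally related top-ranked hits lend support to a prediction.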
- Cook PR: Nongenic transcription, gene regulation and action at a distance. J Cell Sci 2003, 116:4483–91.
- Bejerano G, Haussler D, Blanchette M: Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics 2004, 20:i40–8.
- Shabalina SA, Spiridonov NA: The mammalian transcriptome and the function of non-coding DNA sequences. Genome Biol 2004, 5:105.
- Taft RJ, Pheasant M, Mattick JS: The relationship between non-protein-coding DNA and eukaryotic complexity. Bioessays 2007, 29:288–99.
- Dawson PE, Muir TW, Clark-Lewis I, Kent SB: Synthesis of proteins by native chemical ligation. Science 1994, 266:776–9.
- Nilsson BL, Soellner MB, Raines RT: Chemical synthesis of proteins. Annu Rev Biophys Biomol Struct 2005, 34:91–118.
- Kuhlman B, Baker D: Exploring folding free energy landscapes using computational protein design. Curr Opin Struct Biol 2004, 14:89–95.
- Riechmann L, Winter G: Novel folded protein domains generated by combinatorial shuffling of polypeptide segments. Proc Natl Acad Sci USA 2000, 97:10068–73.
- Sen S, Venkata Dasu V, Mandal B: Developments in directed evolution for improving enzyme functions. Appl Biochem Biotechnol 2007, 143:212–23.
- Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, et al.: The complete genome sequence of Escherichia coli K-12. Science 1997, 277:1453–74.
- Lokanath NK, Shiromizu I, Ohshima N, Nodake Y, Sugahara M, Yokoyama S, Kuramitsu S, Miyano M, Kunishima N: Structure of aldolase from Thermus thermophilus HB8 showing the contribution of oligomeric state to thermostability. Acta Crystallogr D Biol Crystallogr 2004, 60(Pt 10):1816–23.
- Jackson CJ, Carr PD, Liu JW, Watt SJ, Beck JL, Ollis DL: The structure and function of a novel glycerophosphodiesterase from Enterobacter aerogenes. J Mol Biol 2007, 367:1047–62.
- Shenoy AR, Capuder M, Draskovic P, Lamba D, Visweswariah SS, Podobnik M: Structural and biochemical analysis of the Rv0805 cyclic nucleotide phosphodiesterase from Mycobacterium tuberculosis. J Mol Biol 2007, 365:211–25.
- Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 2003, 19:1015–8.
- Russell RJ, Haire LF, Stevens DJ, Collins PJ, Lin YP, Blackburn GM, Hay AJ, Gamblin SJ, Skehel JJ: The structure of H5N1 avian influenza neuraminidase suggests new opportunities for drug design. Nature 2006, 443:45–9.
- Eswar N, Eramian D, Webb B, Shen MY, Sali A: Protein structure modeling with MODELLER. Methods Mol Biol 2008, 426:145–159.
- Kaján L, Rychlewski L: Evaluation of 3D-Jury on CASP7 models. BMC Bioinformatics 2007, 8:304.
- Wootton JC: Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 1994, 18:269–85.
- Tjaden B, Saxena RM, Stolyar S, Haynor DR, Kolker E, Rosenow C: Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays. Nucleic Acids Res 2002, 30:3732–8.
- Nurminsky DI, Nurminskaya MV, De Aguiar D, Hartl DL: Selective sweep of a newly evolved sperm-specific gene in Drosophila. Nature 1998, 396:572–575.
- Cai J, Zhao R, Huifeng J, Wang W: De Novo Origination of a New Protein-Coding Gene in Saccharomyces cerevisiae. Genetics 2008, 179:487–496.
- Giacomelli MG, Hancock AS, Masel J: The conversion of 3'UTRs into coding regions. Mol Biol Evol 2007, 24:457.
- Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ: Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA 2006, 103:9935–9939.
- Long M, Betran E, Thornton K, Wang W: The origin of new genes: glimpses from the young and old. Nat Rev Genet 2003, 4:865–875.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zheng Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389–3402.
- Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K: Current Protocols in Molecular Biology. In Green/Wiley-Interscience. New York; 1990.
- McGuffin LJ, Jones DT: Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 2003, 19:874–81.
- Krieger E, Koraimann G, Vriend G: Increasing the precision of comparative models with YASARA NOVA – a self-parameterizing force field. Proteins 2002, 47:393–402.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.