Computational identification of adaptive mutants using the VERT system
© Winkler and Kao; licensee BioMed Central Ltd. 2012
Received: 7 October 2011
Accepted: 2 April 2012
Published: 1 December 2012
Evolutionary dynamics of microbial organisms can now be visualized using the Visualizing Evolution in Real Time (VERT) system, in which several isogenic strains expressing different fluorescent proteins compete during adaptive evolution and are tracked using fluorescent cell sorting to construct a population history over time. Mutations conferring enhanced growth rates can be detected by observing changes in the fluorescent population proportions.
Using data obtained from several VERT experiments, we construct a hidden Markov-derived model to detect these adaptive events in VERT experiments without external intervention beyond initial training. Analysis of annotated data revealed that the model achieves consensus with human annotation for 85-93% of the data points when detecting adaptive events. A method to determine the optimal time point to isolate adaptive mutants is also introduced.
The developed model offers a new way to monitor adaptive evolution experiments without the need for external intervention, thereby simplifying adaptive evolution efforts relying on population tracking. Future efforts to construct a fully automated system to isolate adaptive mutants may find the algorithm a useful tool.
Keywords: Adaptive evolution; Hidden Markov model; Visualizing evolution in real time; Population history
Strain development to improve the utility of microbial strains has been a focus of industry for decades. Numerous methods to improve strain characteristics have been developed, such as random mutagenesis [1, 2], genetic recombination [1, 3–5], serial transfers in the presence of various inhibitors, and others [7–12]. A novel method to identify the occurrence and expansion of adaptive mutants within an evolving population was recently described by Kao and Sherlock, where the population dynamics of strains expressing different fluorescent proteins competing for the limiting carbon source in a chemostat system were monitored using fluorescence-activated cell sorting (FACS). This approach (VERT, Visualizing Evolution in Real Time) has been used successfully to elucidate the population dynamics of Candida albicans in the presence of an antifungal agent and to generate Escherichia coli mutants tolerant of n-butanol (Reyes and Kao, manuscript in revision). The use of fluorescent labels improves the ability of the user to track various subpopulations in a quasi-real-time fashion compared to microarrays or quantitative PCR, and therefore makes the VERT method well suited to identifying adaptive events more quickly than other strain development techniques.
A key aspect of the VERT system and other types of population tracking methods involves analysis of observed population dynamics to accurately detect adaptive events, which are subpopulation expansions triggered by novel adaptive mutants with growth-enhancing mutations. For example, if a growth-enhancing mutation (such as one that confers drug resistance or more efficient nutrient uptake) arises in a labeled subpopulation, that specific subpopulation will experience an adaptive event due to an increase in population size. An algorithmic way of analyzing population history data is preferable to human inference, as the former will be more consistent and reliable in most circumstances. A simple yet robust method that can identify adaptive episodes automatically is the hidden Markov model (HMM) [17, 18], which involves the computation of the unknown state sequence that is most likely to produce the observed output (emissions) from the process in question. This technique can be applied to determine whether each subpopulation is undergoing an adaptive expansion by examining the visible population proportions, and then computing the probability of an adaptive event based on the model training data. An HMM-based approach will also be sufficiently flexible to accommodate variations between experiments arising from species-specific dynamics, data quality issues, and other factors.
In this work, we introduce a population state model (PSM) that employs a hidden Markov model to identify likely adaptive events for several types of chemostat evolution experiments that employed the VERT tracking system. After showing that the PSM predictions are comparable to those obtained from human annotation, properties of several VERT experiments for different species are quantified. Several utilities have also been developed that allow the PSM to quickly analyze raw data and generate predictions concerning experimental evolutionary dynamics. Finally, the ability of the PSM to process other types of evolutionary experiments is discussed.
Results and discussion
The first step in developing a model to analyze VERT population history is to examine the population data and develop a method that can determine whether the observed population proportion for population j at time point i represents a statistically significant change compared to point i-1. A simple statistical classifier based on data obtained from neutrality experiments (i.e. experiments with no adaptive events) is developed to answer this question. This classifier is then used to determine emission sequences that represent the statistical significance of population proportion changes for the entire set of VERT data. A hidden Markov-based model, trained with human-annotated data, is then applied to determine whether or not a subpopulation is undergoing an adaptive event based on these emissions. Finally, the error rate, behavior, and possible alternative applications of the model are considered.
Statistical classification of population dynamics data
The actual time derivative can be used in place of R_ij if continuous measurements are available, as it contains much more information about the process dynamics and allows more accurate detection of adaptive events.
Estimates of the mean of the slope measurements r_pe,ij collected for one subpopulation (subsequently μ_r) and of their standard deviation (σ_r) for metastable populations are needed to infer which fluctuations in population proportions are significant. Calibration data from neutrality experiments, where adaptive events are unlikely to occur, can be used to obtain these estimates. In an ideal case, with a perfectly accurate FACS device and populations with exactly equal fitness, μ_r = σ_r = 0 over the entire dataset; the population proportions would be fixed. In reality, fluctuations affecting both parameters arise from jackpot mutations, stochastic variation in the populations, or technical issues that generate noise in the data. The neutrality datasets are therefore used to calculate the slope mean and variance. The values obtained were μ_r ∈ [-0.005, 0.004] and σ_r = 0.018 for 64 neutral measurements. The parameter μ_r also serves as an indicator of population stability and is, as expected, indistinguishable from zero at a 95% confidence level.
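The calibration step can be illustrated with a minimal sketch. The slope values below are hypothetical, not data from the neutrality experiments, and the normal-approximation confidence interval (z = 1.96) is an assumption about how the 95% check might be carried out.

```python
import math
import statistics

def neutrality_parameters(slopes, z=1.96):
    """Return (mean, sample std, approximate 95% CI of the mean) for neutral slopes."""
    mu = statistics.fmean(slopes)
    sigma = statistics.stdev(slopes)
    half_width = z * sigma / math.sqrt(len(slopes))
    return mu, sigma, (mu - half_width, mu + half_width)

# Hypothetical slopes: change in population proportion per sampling interval.
neutral_slopes = [0.012, -0.020, 0.005, -0.003, 0.018, -0.015, 0.001, -0.006]

mu_r, sigma_r, ci = neutrality_parameters(neutral_slopes)
zero_in_ci = ci[0] <= 0.0 <= ci[1]  # True means mu_r is indistinguishable from zero
print(f"mu_r={mu_r:.4f} sigma_r={sigma_r:.4f} CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

Under the same approximation, σ_r = 0.018 over 64 measurements gives a half-width of about 1.96 × 0.018 / √64 ≈ 0.0044, which is consistent with the width of the reported interval for μ_r.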
Generally, μ_r will be approximately zero for fluorophores that have no fitness effect on their host strains. Some fluorescent proteins, such as tdTomato, have been observed to decrease strain fitness (data not shown), resulting in negative values of μ_r. The parameter values used here may therefore be unique to specific experimental equipment and fluorophores and should be recomputed for each physically distinct setup.
Each subpopulation of a VERT experiment is analyzed to determine when to reject the null hypothesis in order to classify the data. For slopes that are unlikely to be explained by the null hypothesis (p < α), the sign of the slope determines whether that point is identified as a population size increase (positive slope, P) or a contraction (negative slope, N). Slopes that fail to meet the significance threshold, in either direction, are recorded as zero (Z) slopes. A significance threshold of α = 0.10, selected empirically on the basis of model performance, was used unless otherwise stated. These slope classifications are subsequently used in the population state model described below.
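The classification rule can be sketched as follows. Two assumptions are made for illustration: the slope is taken to be the simple difference between consecutive population proportions, and the null hypothesis is taken to be a normal distribution with the neutrality parameters (μ_r ≈ 0, σ_r = 0.018) tested two-sided. The function names and the proportion series are hypothetical.

```python
from statistics import NormalDist

def classify_slope(slope, mu_r=0.0, sigma_r=0.018, alpha=0.10):
    """Return 'P', 'N', or 'Z' for one slope under an assumed normal null."""
    z = (slope - mu_r) / sigma_r
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided test
    if p_value >= alpha:
        return "Z"                      # not significant in either direction
    return "P" if slope > 0 else "N"    # significant: use the sign of the slope

def emission_sequence(proportions, **kwargs):
    """Classify the change between each pair of consecutive proportions."""
    slopes = [b - a for a, b in zip(proportions, proportions[1:])]
    return "".join(classify_slope(s, **kwargs) for s in slopes)

# Hypothetical proportion history for one labeled subpopulation.
history = [0.33, 0.34, 0.40, 0.47, 0.46, 0.41]
print(emission_sequence(history))
```

With these settings a change must exceed roughly 1.645 × 0.018 ≈ 0.03 in magnitude to be recorded as P or N; smaller changes are recorded as Z.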
Definition of the population state model
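Table 1. Population state model parameters
State transition probabilities (nominal): P_AN° = 0.154, P_NA° = 0.079
Emission probabilities, adaptive state: P_N = 0.102, P_Z = 0.150, P_P = 0.748
Emission probabilities, non-adaptive state: P_N = 0.434, P_Z = 0.337, P_P = 0.229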
where P_AN° and P_NA° represent the nominal value of each state transition probability. Accordingly, P_NN = 1 - P_NA and P_AA = 1 - P_AN. These contiguous counts are reset to zero when symbols outside the considered set (i.e. Z, N for C_P) are encountered in the data. This modification does represent a divergence from the traditional formulation of a hidden Markov model, where the state at position i depends only on position i-1. We use this approach to represent the fact that adaptive events, once they occur and survive initial drift, temporarily expand in a non-random fashion. The exponential decay function represents the decreasing probability of transitioning out of an ongoing change in population proportion (i.e. a long adaptive expansion or continual decline); many possible forms for this function exist, but the exponential function seems to correlate well with the observed population dynamics. This formulation allows for the explicit consideration of the current population state in the chemostat and dramatically improves the accuracy of the model.
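For orientation, the sketch below decodes an emission string with a plain two-state HMM using the Viterbi algorithm and the nominal parameter values. It is a simplified illustration rather than the PSM itself: the count-dependent exponential adjustment of the transition probabilities is omitted, the uniform starting distribution is an assumption, and the emission row containing P_P = 0.748 is assumed to belong to the adaptive state.

```python
import math

STATES = ("N", "A")  # N = non-adaptive, A = adaptive
START = {"N": 0.5, "A": 0.5}  # assumed uniform start
TRANS = {
    "N": {"N": 1 - 0.079, "A": 0.079},   # P_NN, P_NA (nominal)
    "A": {"N": 0.154, "A": 1 - 0.154},   # P_AN, P_AA (nominal)
}
EMIT = {  # keys of the inner dicts are slope symbols N, Z, P
    "N": {"N": 0.434, "Z": 0.337, "P": 0.229},
    "A": {"N": 0.102, "Z": 0.150, "P": 0.748},
}

def viterbi(emissions):
    """Return the most likely state string for a string of N/Z/P symbols."""
    if not emissions:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][emissions[0]])
             for s in STATES}
    back = []
    for symbol in emissions[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][symbol]))
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("NNNPPPPNNN"))
```

In a plain HMM of this kind, the nominal transition values correspond to expected dwell times of about 1/0.154 ≈ 6.5 sampling intervals in the adaptive state and 1/0.079 ≈ 12.7 in the non-adaptive state.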
The use of a supervised learning approach, though allowing for relatively straightforward development and training of the PSM, does introduce bias into what is considered an adaptive event, which in turn affects the model parameters computed from the annotated training set. An alternative approach to HMM training involves the use of unsupervised learning, where the estimated state transition and emission probabilities are computed automatically using algorithms such as Baum-Welch. In essence, this type of HMM training computes the expected number of state transitions and the emission probabilities (in each state) that best fit the provided emission symbols, and then updates the model parameters accordingly. This iterative process continues until the change in HMM performance falls below a user-specified threshold. This type of training will be explored in future versions of the population state model.
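A compact sketch of Baum-Welch re-estimation for a two-state, three-symbol HMM is given below for illustration. It is a generic textbook formulation with scaled forward-backward recursions, not part of the PSM; the training string is hypothetical, and the starting values reuse the nominal parameters with an assumed uniform start and an assumed assignment of emission rows to states.

```python
import math

SYMBOLS = {"N": 0, "Z": 1, "P": 2}

def forward_backward(obs, start, trans, emit):
    """Scaled forward and backward variables plus per-step scale factors."""
    k = len(start)
    alpha, scales = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [start[i] * emit[i][o] for i in range(k)]
        else:
            row = [sum(alpha[-1][i] * trans[i][j] for i in range(k)) * emit[j][o]
                   for j in range(k)]
        c = sum(row)
        scales.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * k for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        o = obs[t + 1]
        beta[t] = [sum(trans[i][j] * emit[j][o] * beta[t + 1][j] for j in range(k))
                   / scales[t + 1] for i in range(k)]
    return alpha, beta, scales

def baum_welch_step(obs, start, trans, emit):
    """One EM update; also returns the log-likelihood of the input parameters."""
    k, m = len(start), len(emit[0])
    alpha, beta, scales = forward_backward(obs, start, trans, emit)
    gamma = [[a * b for a, b in zip(ar, br)] for ar, br in zip(alpha, beta)]
    new_trans = []
    for i in range(k):
        # Expected number of i -> j transitions over the sequence.
        num = [sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]]
                   * beta[t + 1][j] / scales[t + 1]
                   for t in range(len(obs) - 1)) for j in range(k)]
        total = sum(num)
        new_trans.append([x / total for x in num])
    new_emit = []
    for i in range(k):
        # Expected number of times state i emits each symbol.
        weights = [sum(g[i] for g, o in zip(gamma, obs) if o == s)
                   for s in range(m)]
        total = sum(weights)
        new_emit.append([w / total for w in weights])
    loglik = sum(math.log(c) for c in scales)
    return gamma[0], new_trans, new_emit, loglik

def fit(emissions, start, trans, emit, iterations=20):
    """Run a fixed number of EM updates and record the log-likelihood trace."""
    obs = [SYMBOLS[c] for c in emissions]
    trace = []
    for _ in range(iterations):
        start, trans, emit, loglik = baum_welch_step(obs, start, trans, emit)
        trace.append(loglik)
    return start, trans, emit, trace

# State 0 = non-adaptive, state 1 = adaptive (assumed row assignment).
start0 = [0.5, 0.5]
trans0 = [[0.921, 0.079], [0.154, 0.846]]
emit0 = [[0.434, 0.337, 0.229], [0.102, 0.150, 0.748]]
start, trans, emit, logliks = fit("ZZNZPPPPZPNNZZNZPPPZPPNZNN",
                                  start0, trans0, emit0)
print(trans, emit, logliks[-1])
```

A fixed iteration count is used here for brevity; stopping when the change in log-likelihood falls below a chosen threshold would match the description above more closely. On a single short sequence the estimates can drift towards degenerate values, so several emission sequences would normally be pooled.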
Properties of the population state model
Using the procedure outlined previously, the PSM is trained using an annotated dataset from S. cerevisiae glucose-limited chemostats. Different estimates of the Markov parameters given in Table 1 may be obtained depending on the dataset used for model training, reflecting the species, the length of the evolution experiments, and the conditions (mutagenic versus non-mutagenic); however, the calculated probabilities seem reasonable in light of the experimental population dynamics. Non-adaptive events typically have slopes that are close to zero (p > 0.10), with the remaining events split evenly between positive and negative slopes (p < 0.10). Adaptive events are predominantly weighted towards producing measurements with positive slopes, as expected. The behavior of the PSM is most affected overall by the state transition probabilities P_AN° and P_NA°, as these parameters control how quickly the model responds to changes in chemostat dynamics.
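Supervised estimation of HMM parameters from annotated data generally reduces to counting, as the following sketch shows. The annotated pair below is hypothetical and the function names are illustrative; no smoothing is applied, so a symbol never observed in a state receives probability zero.

```python
from collections import Counter

STATES = "NA"    # N = non-adaptive, A = adaptive
SYMBOLS = "NZP"  # negative, zero, positive slope

def row_probabilities(counts, row, columns):
    """Normalise the counts for one state into a probability row."""
    total = sum(counts[(row, c)] for c in columns)
    if total == 0:
        raise ValueError(f"no counts available for state {row!r}")
    return {c: counts[(row, c)] / total for c in columns}

def train_supervised(annotated):
    """annotated: iterable of (state string, emission string) pairs."""
    transitions, emissions = Counter(), Counter()
    for states, symbols in annotated:
        if len(states) != len(symbols):
            raise ValueError("each annotation must label every emission")
        emissions.update(zip(states, symbols))        # (state, symbol) counts
        transitions.update(zip(states, states[1:]))   # (from, to) counts
    trans = {s: row_probabilities(transitions, s, STATES) for s in STATES}
    emit = {s: row_probabilities(emissions, s, SYMBOLS) for s in STATES}
    return trans, emit

# Hypothetical human annotation aligned with its emission sequence.
trans, emit = train_supervised([("NNNAAAANNN", "NZNPPPZNZN")])
print(trans["N"]["A"], trans["A"]["N"], emit["A"])
```

Because zero probabilities break log-based decoding, a small pseudocount per cell is a common safeguard when the annotated set is small.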
Population state model error analysis
Analysis of population dynamics
Example application: analysis of a yeast chemostat
Distribution of adaptive events
Application to other evolution systems
Although the VERT system and its data were used to develop the PSM, the PSM has no explicit dependence on VERT data. Any method that can generate similar population histories over time (e.g. microarray or qPCR methods) can also be used with the PSM. The only requirement is that comparable neutrality experiments and annotated experimental data must be generated using the proposed alternative so that the PSM can estimate the required HMM parameters. The current implementation of the PSM will automatically calculate all of the necessary parameters except for μ_r and σ_r for the new type of measurements, both of which must be determined by the end-user as described previously. After this calibration procedure, the PSM should be able to analyze population histories obtained from alternative methods.
Another potential application of the PSM is the construction of a mostly automated system (e.g. autoVERT) for the observation and isolation of adaptive mutants. Unlike serial transfer (batch) evolution systems that require periodic transfers of culture to fresh medium, the continuous culture system used to generate the VERT population histories can be adapted to minimize required external intervention to adjust the nominal media composition. The second part of an automated system is identifying when adaptive events occur so that samples of the population can be saved (on solid media or as frozen stocks) for later manual analysis. Given that the PSM has been shown to be effective in accomplishing this task, it may be possible to adapt this model to construct such a system. Additional work is needed to optimize the PSM for this type of data forecasting, as the model was primarily constructed for retrospective analysis of VERT experiments.
The population state model offers the ability to automatically detect adaptive events within fluorescent microbial populations easily and without the need for user intervention. A variety of VERT experimental properties may also be determined, enabling a quantitative comparison between the evolutionary dynamics of different VERT experiments involving various inhibitors or species of interest. Comparison to human analysis of VERT experiments revealed that the PSM produced highly accurate predictions for adaptive events and sampling time points. This algorithm represents an important new tool for the analysis of population dynamics over time and will be integral in any VERT system capable of automatic identification of adaptive mutants.
The specific experimental procedures for the VERT experiments used in this study are detailed elsewhere [13, 14]. The first requirement is that strains with chromosomally integrated fluorescent proteins (e.g. RFP, GFP, YFP) be constructed. The labeled strains must then be assayed to ensure fluorescent protein expression has a neutral effect on strain growth rates. Once label neutrality has been established, equal proportions of each strain are inoculated into a continuous culture system (chemostats) or batch flasks and sampled daily using a FACS machine to determine the size of each labeled subpopulation. The complete series of FACS measurements for a VERT experiment (see Figure 1) can be interpreted as a quantitative measurement of population dynamics. These data form the basis of the population state model developed in this work.
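As a small illustration of how FACS measurements become a population history, the sketch below converts per-sample event counts into proportions for each label. The counts and label names are hypothetical.

```python
def population_proportions(counts_by_sample):
    """Convert per-sample FACS event counts into per-label proportions."""
    history = []
    for counts in counts_by_sample:
        total = sum(counts.values())
        if total == 0:
            raise ValueError("sample contains no sorted events")
        history.append({label: n / total for label, n in counts.items()})
    return history

# Hypothetical daily event counts for three labeled subpopulations.
facs_counts = [
    {"RFP": 3400, "GFP": 3300, "YFP": 3300},
    {"RFP": 3900, "GFP": 3100, "YFP": 3000},
    {"RFP": 4600, "GFP": 2800, "YFP": 2600},
]
history = population_proportions(facs_counts)
rfp_series = [sample["RFP"] for sample in history]  # one subpopulation over time
print(rfp_series)
```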
Description of PSM submodules
- Generates data, tables, and figures for this work
- Compares state annotations to state predictions
- Generates optimal sampling predictions
- Converts FACS data to emission sequences
- Analyzes statistics of interest (e.g. AE/gen-color)
- Generates the distribution of adaptive events for a dataset
- Converts emission sequences to state predictions
We gratefully acknowledge the partial financial support of the NSF Graduate Research Fellowship program, NSF MCB-1054276, and the Texas Engineering Experimental Station. The authors would like to thank Dr. Cornelis J. Potgieter for his suggestions and comments.
1. Adrio J, Demain A: Genetic improvement of processes yielding microbial products. FEMS Microbiol Rev 2006, 30(2):187-214. doi:10.1111/j.1574-6976.2005.00009.x
2. Klein-Marcuschamer D, Stephanopoulos G: Method for designing and optimizing random-search libraries for strain improvement. Appl Environ Microbiol 2010, 76(16):5541. doi:10.1128/AEM.00828-10
3. Patnaik R, Louie S, Gavrilovic V, Perry K, Stemmer W, Ryan C, del Cardayré S: Genome shuffling of Lactobacillus for improved acid tolerance. Nat Biotechnol 2002, 20(7):707-712. doi:10.1038/nbt0702-707
4. Chen X, Wei P, Fan L, Yang D, Zhu X, Shen W, Xu Z, Cen P: Generation of high-yield rapamycin-producing strains through protoplasts-related techniques. Appl Microbiol Biotechnol 2009, 83(3):507-512. doi:10.1007/s00253-009-1918-7
5. Bajwa P, Pinel D, Martin V, Trevors J, Lee H: Strain improvement of the pentose-fermenting yeast Pichia stipitis by genome shuffling. J Microbiol Methods 2010, 81(2):179-186. doi:10.1016/j.mimet.2010.03.009
6. Atsumi S, Hanai T, Liao J: Non-fermentative pathways for synthesis of branched-chain higher alcohols as biofuels. Nature 2008, 451(7174):86-89. doi:10.1038/nature06450
7. Stephanopoulos G, Alper H, Moxley J: Exploiting biological complexity for strain improvement through systems biology. Nat Biotechnol 2004, 22(10):1261-1267. doi:10.1038/nbt1016
8. Lee S, Lee D, Kim T: Systems biotechnology for strain improvement. Trends Biotechnol 2005, 23(7):349-358. doi:10.1016/j.tibtech.2005.05.003
9. Alper H, Stephanopoulos G: Global transcription machinery engineering: a new approach for improving cellular phenotype. Metab Eng 2007, 9(3):258-267. doi:10.1016/j.ymben.2006.12.002
10. Klein-Marcuschamer D, Santos C, Yu H, Stephanopoulos G: Mutagenesis of the bacterial RNA polymerase alpha subunit for improvement of complex phenotypes. Appl Environ Microbiol 2009, 75(9):2705. doi:10.1128/AEM.01888-08
11. Warner J, Patnaik R, Gill R: Genomics enabled approaches in strain engineering. Curr Opin Microbiol 2009, 12(3):223-230. doi:10.1016/j.mib.2009.04.005
12. Chung B, Selvarasu S, Andrea C, Ryu J, Lee H, Ahn J, Lee H, Lee D: Genome-scale metabolic reconstruction and in silico analysis of methylotrophic yeast Pichia pastoris for strain improvement. Microb Cell Fact 2010, 9:50-50. doi:10.1186/1475-2859-9-50
13. Kao K, Sherlock G: Molecular characterization of clonal interference during adaptive evolution in asexual populations of Saccharomyces cerevisiae. Nat Genet 2008, 40(12):1499-1504. doi:10.1038/ng.280
14. Huang M, McClellan M, Berman J, Kao K: Evolutionary dynamics of Candida albicans during in vitro evolution. Eukaryotic Cell 2011, 10(11):1413-1421. doi:10.1128/EC.05168-11
15. Brodie E, DeSantis T, Joyner D, Baek S, Larsen J, Andersen G, Hazen T, Richardson P, Herman D, Tokunaga T, et al.: Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl Environ Microbiol 2006, 72(9):6288. doi:10.1128/AEM.00246-06
16. Watanabe K, Yamamoto S, Hino S, Harayama S: Population dynamics of phenol-degrading bacteria in activated sludge determined by gyrB-targeted quantitative PCR. Appl Environ Microbiol 1998, 64(4):1203.
17. Rabiner L, Juang B: An introduction to hidden Markov models. ASSP Magazine, IEEE 1986, 3:4-16.
18. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989, 77(2):257-286. doi:10.1109/5.18626
19. Bilmes J: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Int Comput Sci Inst 1998, 4:126.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.