The BURST algorithm (Based Upon Related Sequence Types)
Version 1.00 (Feb 2001)
Edward Feil & Man-Suen Chan
Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
For questions, suggestions or comments contact: firstname.lastname@example.org
BURST is a web-implemented clustering algorithm designed for use on multilocus sequence typing (MLST) data sets from bacterial pathogens, although in principle other multi-locus data could also be used. The approach specifically examines the relationships between very closely related genotypes within clonal complexes. The relationships between different clonal complexes (i.e. between more distantly related isolates) are ignored. This can be justified if recombination has occurred sufficiently frequently, or over a sufficiently long time span, within the population to overwhelm any deep-rooted phylogenetic signal. There is evidence that this is indeed the picture for a number of important bacterial pathogens [1].
This raises an apparent paradox. How can clonal complexes be maintained within a freely recombining population? Maynard Smith et al. [2] offered a solution they called the 'epidemic' model of population structure. In this model, clonal expansion results from the rise in frequency of a single highly adaptive genotype. This ancestral genotype subsequently diversifies through recombination or mutation to produce minor clonal variants, and hence a 'complex' of closely related strains.
BURST is based on such a model, and the key part of the analysis is in the identification of the most likely extant 'ancestral' genotype of each clonal complex, from which the clonal variants have descended. This ancestral genotype is also called the 'consensus clone'. The most illustrative way of representing the relationships between the ancestral genotype and subsequent clonal variants is not by a bifurcating tree, but by a series of circles, reflecting radial spread from the ancestral core.
The program proceeds through the following main steps:
- The sub-division of the data into 'clonal complexes'.
- A clonal complex is defined as a group of multi-locus genotypes in which every genotype shares at least 5 loci in common with at least one other member of the group. Clonal complexes are thus mutually exclusive. This is a somewhat pragmatic definition; strictly, a clonal complex is the set of direct descendants of a particular adaptive ancestral type. However, for MLST data based on 7 loci, the pragmatic definition appears to provide the most appropriate level of resolution for approximating the strict one. In other words, the cut-off point of 5 identical loci maximises the inclusion of strains belonging to a single clonal complex whilst excluding those that do not. For example, in a recent MLST study of carried meningococcal strains in the Czech Republic [3], clonal complexes were identified both by using BURST and by delineating clusters using SPLITS decomposition analysis. Apart from the exclusion by BURST of a small number of strains differing by three loci from other members of the groups defined by SPLITS, the two approaches defined the same groups. Although the program also allows user-defined cut-off points for inclusion within a single clonal complex, the default option of 5 is therefore highly recommended for MLST data based on 7 loci.
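The grouping rule above amounts to finding connected components in a graph whose edges join genotypes sharing at least 5 of 7 loci. The following is a minimal sketch of that idea, not the BURST source code; the function names, the union-find approach and the toy profiles are all assumptions made for illustration.

```python
from itertools import combinations


def shared_loci(a, b):
    """Number of loci at which two allelic profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b))


def clonal_complexes(profiles, min_shared=5):
    """Group STs linked by chains of pairs sharing >= min_shared loci.

    profiles: dict mapping ST -> tuple of allele numbers (hypothetical layout).
    Returns (complexes, singletons).
    """
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent links to the representative, compressing the path.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for a, b in combinations(profiles, 2):
        if shared_loci(profiles[a], profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), []).append(st)
    complexes = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return complexes, singletons


# Toy data (invented): STs 1-4 form a chain, ST 5 shares nothing with them.
profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),  # differs from ST 1 at one locus
    3: (1, 1, 1, 1, 1, 3, 2),  # differs from ST 2 at one locus
    4: (1, 1, 1, 1, 4, 3, 4),  # shares 5 loci with ST 3 only
    5: (9, 9, 9, 9, 9, 9, 9),
}
complexes, singletons = clonal_complexes(profiles)
print(complexes, singletons)  # [[1, 2, 3, 4]] [5]
```

Note that ST 4 joins the complex through ST 3 alone, even though it differs from STs 1 and 2 at three loci; raising `min_shared` to 6 would split it off as a singleton.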
- The identification of ancestral genotypes
- The identification of the most likely ancestral genotypes is the key step in this analysis. Each genotype within a clonal complex is compared in turn with all other genotypes within the clonal complex. The ancestral genotype (or 'consensus clone') is defined as the genotype within the clonal complex that differs from the highest number of other genotypes in the clonal complex at only one locus out of seven. To put it another way, the consensus clone is the genotype defining the highest number of single-locus variants, or SLVs. Single-locus variants are identical to the ancestral genotype at 6 loci, but differ at the seventh.
- In some cases a second or third genotype may define SLVs that have not been previously assigned to the ancestral genotype. For example, an SLV may itself define 2 or more other SLVs, corresponding to DLVs (double-locus variants) of the ancestral genotype. The assignment of likely ancestral genotypes is thus repeated until all genotypes that define 2 or more SLVs (which have not been previously assigned) have been identified. In cases where two or more genotypes define the same number of SLVs, the number of DLVs is taken into account, and in the rare cases where two genotypes have the same number of SLVs and DLVs, their frequency is taken into consideration. Ancestral genotypes (and their associated SLVs) are defined in rank order according to the number of SLVs they define. For example, imagine a case where genotype X defines 5 SLVs, and genotype Y defines 2 SLVs. If genotype Z corresponds to an SLV of both X and Y, BURST will preferentially associate Z with X, as X defines more SLVs in total than Y.
- Typically, the ancestral genotypes thus defined correspond to numerically dominant genotypes (i.e. those represented by the largest number of strains in the complex). This trend has been observed for three of the MLST data sets generated as of April 2000, for the species N. meningitidis [3], S. pneumoniae [1] and S. aureus. This provides independent support for the assignments. In some clonal complexes (such as those containing only two different genotypes), it will not be possible to assign ancestral genotypes, and BURST will return a message to this effect. Single genotypes that do not belong to any clonal complex (singletons) are identified but excluded from subsequent analysis.
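The ranking described in this step can be sketched as follows: count SLVs and DLVs for every genotype in a complex, then sort by SLVs, DLVs and frequency in that order. This is an illustrative sketch only. It picks just the primary ancestral genotype and does not attempt the repeated assignment of secondary ancestors. The names, the `min_slvs=2` threshold for declining to assign an ancestor, and the final tie-break on ST number are assumptions rather than documented BURST behaviour.

```python
def n_diff(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_candidates(profiles, freq):
    """Rank the STs of one clonal complex by (SLVs, DLVs, frequency)."""
    ranked = []
    for st, prof in profiles.items():
        diffs = [n_diff(prof, other) for o, other in profiles.items() if o != st]
        ranked.append({
            "st": st,
            "slv": diffs.count(1),             # single-locus variants
            "dlv": diffs.count(2),             # double-locus variants
            "sat": sum(d > 2 for d in diffs),  # everything further away
            "freq": freq.get(st, 1),
        })
    # Highest SLV count first; ties fall through to DLVs, then frequency.
    # The last key (lowest ST number) is an arbitrary assumption.
    ranked.sort(key=lambda r: (-r["slv"], -r["dlv"], -r["freq"], r["st"]))
    return ranked


def ancestral_genotype(profiles, freq, min_slvs=2):
    """Return the top-ranked ST, or None if no ST defines enough SLVs."""
    ranked = rank_candidates(profiles, freq)
    if not ranked or ranked[0]["slv"] < min_slvs:
        return None
    return ranked[0]["st"]


# Invented complex: ST 10 has three SLVs (11, 12, 13) and one DLV (14).
profiles = {
    10: (1, 1, 1, 1, 1, 1, 1),
    11: (2, 1, 1, 1, 1, 1, 1),
    12: (1, 3, 1, 1, 1, 1, 1),
    13: (1, 1, 4, 1, 1, 1, 1),
    14: (2, 1, 1, 1, 1, 1, 5),
}
freq = {10: 6, 11: 2, 12: 1, 13: 1, 14: 1}
print(ancestral_genotype(profiles, freq))  # 10
```

For a complex of only two genotypes, each defines a single SLV, so under the assumed threshold the function returns `None`, mirroring the case where BURST reports that no ancestral genotype can be assigned.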
- Inferring likely patterns of descent within each clonal complex.
- Once ancestral genotypes have been assigned, other strains are assigned if possible according to their relationships with the ancestral genotype(s) of the clonal complex. All SLVs and DLVs are associated with corresponding ancestral genotypes. Ancestral genotypes are ranked according to the number of SLVs (or DLVs) which they define, so if a single strain corresponds to an SLV of more than one ancestral type, it is preferentially associated with the ancestral type which defines the highest number of SLVs (or, in the event of a tie, DLVs).
- Each strain within a single clonal complex must, by definition, share at least 5 loci in common with at least one other strain in the complex. Only relationships between strains which share at least 5 loci in common are considered, hence ancestral genotypes are only directly compared with their SLVs and DLVs; other relationships are shown only through intermediate strains. Strains differing by more than 2 loci from the ancestral genotype ('satellite strains') are defined on the basis of their relationship to SLVs, DLVs or other satellite strains. Each relationship involving a satellite strain is defined both with respect to the distance (i.e. number of steps: where SLV = 1 step and DLV = two steps) from the highest ranking possible ancestral genotype (i.e. the one corresponding to the highest number of SLVs), and whether the relationship involves a single-locus difference, or a double-locus difference. Once any particular strain has been assigned, it cannot be re-assigned according to a different relationship, so there are no overlaps between the descent groups defined by particular ancestral genotypes.
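One way to picture the descent rules above is as a shortest-path search from the ancestral genotype, where only links of one or two differing loci are allowed and a link costs as many steps as it has differing loci. The sketch below uses Dijkstra's algorithm for this. It is an interpretation under stated assumptions, not the BURST implementation: it handles a single ancestral genotype, and the function name and the returned `(steps, parent, link type)` layout are hypothetical.

```python
import heapq


def n_diff(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def descent_from(ancestor, profiles, max_diff=2):
    """Assign every reachable ST a parent, a link type and a step count.

    Links exist only between STs differing at 1..max_diff loci; an SLV
    link costs 1 step and a DLV link costs 2. Returns a dict mapping
    ST -> (steps, parent ST, "SLV" or "DLV").
    """
    best = {ancestor: (0, None, None)}
    heap = [(0, ancestor)]
    done = set()
    while heap:
        steps, st = heapq.heappop(heap)
        if st in done:
            continue
        done.add(st)  # once settled, an ST is never re-assigned
        for other, prof in profiles.items():
            if other in done:
                continue
            d = n_diff(profiles[st], prof)
            if 1 <= d <= max_diff:
                cand = steps + d
                # Strict '<' keeps the first (closest to ancestor) parent.
                if other not in best or cand < best[other][0]:
                    best[other] = (cand, st, "SLV" if d == 1 else "DLV")
                    heapq.heappush(heap, (cand, other))
    return best


# Invented complex: ST 4 differs from the ancestor (ST 1) at three loci,
# so it is a satellite reached through ST 3.
profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),
    3: (1, 1, 1, 1, 1, 3, 2),
    4: (1, 1, 1, 1, 4, 3, 4),
}
for st, (steps, parent, link) in sorted(descent_from(1, profiles).items()):
    print(st, steps, parent, link)
```

In this toy case ST 3 is attached directly to the ancestor as a DLV rather than through ST 2, which matches the statement that ancestral genotypes are compared directly with their SLVs and DLVs.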
- Data Input
- Input is by tab-, comma- or space-delimited text files containing columns of integers (see example data). Each row corresponds to a single strain. The first column is the ST (sequence type, the number assigned to the unique allelic profile) of the strain, and the remaining columns represent the allelic profile over the seven loci. The program accepts multiple strains with the same ST and will check that the assignment of STs is consistent. If two strains are detected with the same ST but with differing allelic profiles, or with the same allelic profile but differing STs, an error message is returned. To use the program, simply copy and paste the numbers into the main window of the program.
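The input format and consistency checks described above could be reproduced offline along the following lines. This is a hedged sketch: `parse_mlst` and its error messages are hypothetical and are not part of the BURST program.

```python
import re


def parse_mlst(text, n_loci=7):
    """Parse rows of 'ST allele1 ... alleleN' split on tabs, commas or spaces.

    Returns (profiles, freq): ST -> allelic profile, and ST -> number of
    strains. Raises ValueError if ST assignment is inconsistent.
    """
    profiles = {}  # ST -> profile
    st_of = {}     # profile -> ST
    freq = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_loci + 1:
            raise ValueError(
                f"line {lineno}: expected {n_loci + 1} integers, got {len(fields)}"
            )
        st, *alleles = (int(f) for f in fields)
        profile = tuple(alleles)
        # Same ST must always carry the same allelic profile ...
        if profiles.setdefault(st, profile) != profile:
            raise ValueError(f"line {lineno}: ST {st} has conflicting profiles")
        # ... and the same profile must always carry the same ST.
        if st_of.setdefault(profile, st) != st:
            raise ValueError(
                f"line {lineno}: profile of ST {st} already assigned to ST {st_of[profile]}"
            )
        freq[st] = freq.get(st, 0) + 1
    return profiles, freq


# Two strains of ST 1 (different delimiters) and one strain of ST 2.
text = "1\t1\t1\t1\t1\t1\t1\t1\n1,1,1,1,1,1,1,1\n2 1 1 1 1 1 1 2\n"
profiles, freq = parse_mlst(text)
print(freq)  # {1: 2, 2: 1}
```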
- Group Table
- The output is given in tabular format, as shown below:
ST    FREQ  SLV  DLV  SAT
5*    9     7    2    2
6     2     0    2    9
59    1     0    1    10
61    1     0    2    9
62    1     4    3    4
63    2     4    4    3
64    1     4    3    4
65    1     4    3    4
66    1     2    5    4
67    1     2    5    4
68    1     0    1    10
108   1     1    7    3
- Each group is given a group number (group 2 in the example above). This number is then entered in the box marked group number. The STs belonging to the group are listed in the first column. The second column ('FREQ') is the frequency of the ST (i.e. the number of strains corresponding to the ST). The remaining columns give the number of SLVs, DLVs and SATellite strains defined by each ST. The program assigns the most likely ancestral genotype as the ST defining the largest number of SLVs and marks this ST with an asterisk (in this case ST 5, which defines 7 SLVs). Note that ST 5 is also the ST with the highest frequency (9) of all the STs in the group. As discussed above, this is commonly observed within the groups of N. meningitidis, S. aureus and S. pneumoniae, and is good evidence that the assignment of the ancestral genotype is correct. Where these two independent criteria clearly do not agree, caution is required in assuming that the assignment of the ancestral genotype is valid, although it is possible that such discrepancies may result from sampling bias. This output file also lists all the 'singletons': those STs that have not been assigned as members of a clonal complex (i.e. they differ at 3 or more loci from every other strain in the dataset).
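The agreement between the two criteria (most SLVs and highest frequency) can be checked mechanically from a group table. The sketch below uses the first six rows of the example table; the `cross_check` helper is hypothetical and simply compares the two rankings.

```python
# (ST, FREQ, SLV, DLV, SAT), copied from the first six rows of the example table.
rows = [
    (5, 9, 7, 2, 2),
    (6, 2, 0, 2, 9),
    (59, 1, 0, 1, 10),
    (61, 1, 0, 2, 9),
    (62, 1, 4, 3, 4),
    (63, 2, 4, 4, 3),
]


def cross_check(rows):
    """Compare the ST defining most SLVs with the most frequent ST."""
    # Ties on SLVs fall through to DLVs, then frequency.
    by_slv = max(rows, key=lambda r: (r[2], r[3], r[1]))
    by_freq = max(rows, key=lambda r: r[1])
    return by_slv[0], by_freq[0], by_slv[0] == by_freq[0]


print(cross_check(rows))  # (5, 5, True)

# Each ST is classed as SLV, DLV or satellite relative to every other ST in
# the 12-member group, so the three counts in a row should sum to 11.
print(all(slv + dlv + sat == 11 for _, _, slv, dlv, sat in rows))  # True
```

When the third element of the result is `False`, the two criteria disagree and the assignment deserves the caution described above.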
- Graphical Representation
- The relationships within each clonal complex are output as a GIF, PDF or PostScript file. The assigned ancestral genotype is within the central ring. For all other circles and lines, faint indicates a double-locus difference, whereas bold indicates a single-locus difference. A larger bold circle encloses the SLVs of the ancestral genotype. A faint line encloses the DLVs of the ancestral genotype. The relationships between the satellite strains are also given, with faint or bold lines indicating DLVs or SLVs respectively.
- At present, the program can resolve up to nine ancestral genotypes within a single clonal complex along with their associated SLVs, DLVs and satellite strains.
- The direct relationships between the strains are simply shown as either a single-locus difference (bold line) or a double-locus difference (faint line). Distances and angles of the lines do not bear any relevance to levels of relatedness.
Validation and Applications
This approach is useful for the analysis of specific clonal complexes but does not provide any information regarding more deep-rooted relationships within the population, for which more orthodox algorithms should be used if appropriate. The validity of the approach depends on the correct assignment of ancestral genotypes. Two independent lines of evidence can be used to support the assignment of the ancestral genotype. Firstly, as discussed above, the ancestral genotype is often the numerically dominant genotype within the clonal complex, suggesting that it may predate other genotypes. Secondly, variant alleles within SLVs are more commonly novel than the corresponding allele within the ancestral clone, suggesting these alleles have arisen recently by de novo mutation. Once the ancestral genotype has been identified, the algorithm presents the most parsimonious patterns of descent within the clonal complex. This allows epidemiological or phenotypic traits, such as antibiotic resistance or virulence, to be mapped onto the clonal complex, and allows the user to assess whether the groupings are likely to make biological sense.
For example, an application of this approach to an MLST study of S. aureus [5] revealed the ancestral genotype of the clonal complex to be a major carried MSSA (methicillin sensitive S. aureus) clone. An SLV of this clone corresponds to a major UK clone of MRSA (methicillin resistant S. aureus) and this in turn defines two SLVs that are also MRSA. Thus we can infer that the major MRSA clone descended from an already successful MSSA clone and subsequently gave rise to minor variant MRSA genotypes. There are similar examples concerning penicillin resistance in the MLST data for S. pneumoniae (i.e. the penR group 1a is likely to be descended from the penS group 1b) [4].
Mapping other traits such as virulence onto clonal complexes may also reveal insights into the origin of these traits. For example, an analysis of the S. aureus data revealed that virulent strains are preferentially associated with the ancestral genotypes, and it appears that virulence potential decreases as strains diversify from their respective ancestral genotypes.
A second application concerns the estimation of the relative contributions of recombination and mutation to clonal diversification. A method has been described [4,6] for estimating recombinational parameters based on comparisons between the sequences of variant alleles within SLVs and the corresponding alleles in ancestral genotypes. BURST provides an objective means by which ancestral genotypes and SLVs can be defined, and so facilitates these comparisons.
A third application is the provision of a simple working definition of a clonal complex. The subdivision of the data into 'independent' clonal complexes provides a solution to the problem of representing very large data sets using dendrograms, which treat the datasets as a whole (constructing smaller dendrograms from separate fractions of the data is one solution, but this will only be valid if the sub-division is carried out objectively with prior knowledge of likely clusters).
BURST ver 1.00 is also available as part of the START package [7], written by Keith Jolley.
1. Feil, E.J. et al. (2001) Recombination within natural populations of pathogenic bacteria: Short-term empirical estimates and long-term phylogenetic consequences. PNAS 98: 182-187
2. Maynard Smith, J. et al. (1993) How Clonal are Bacteria? PNAS 90: 4384-4388
3. Jolley, K. et al. (2000) Carried meningococci in the Czech Republic: a diverse recombining population. J. Clin. Micro. 38: 4492-4498
4. Feil, E.J. et al. (2000) Estimating Recombinational Parameters in Streptococcus pneumoniae From Multilocus Sequence Typing Data. Genetics 154: 1439-1450
5. Enright, M.C. et al. (2000) Multilocus Sequence Typing for Characterization of Methicillin-Resistant and Methicillin-Susceptible Clones of Staphylococcus aureus. J. Clin. Micro. 38: 1008-1015
6. Feil, E.J. et al. (1999) The Relative Contributions of Recombination and Mutation to the Divergence of Clones of Neisseria meningitidis. Mol. Biol. Evol. 16(11): 1496-1502
7. Jolley, K.A., Feil, E.J. and Chan, M.-S. (2001) Sequence Type Analysis and Recombinational Tests (START). Bioinformatics 17: 1230-1231