START2 Example datafiles
MLST Schemes
START2 uses an XML file format to describe its MLST schemes. The file lists the loci that define each scheme, giving the name, ORF, standard length and file location (local or a network/web address) of each locus, along with the location of the profile definition file (again, local or a web address). The file can be generated, imported or exported using the 'Define -> MLST scheme' dialog. It can also be imported and exported from the 'File' menu.
It's generally not necessary to fill in details of the standard length or ORF as these are determined automatically when either the sequences are first loaded or when an analysis is performed that requires the ORF to be known. The program will handle either forward (1-3) or reverse (4-6) reading frames. There is no limit to the number of schemes that can be defined in one file.
- schemes.xml - sample schemes file containing details of the Neisseria and Campylobacter MLST schemes.
Profile definitions
The profile definitions file is optional. It is a tab-delimited text file containing the ST/allelic profile definitions. The MLST scheme can point to a web address for this file, so that the definitions are always up to date and in sync with the web databases. Loading this file automatically creates a dataset containing a single representative of each ST. Additionally, if you load another dataset for the same species, the autofill option can fill in gaps in your records using the definitions in this file: if a record has an ST entry but not a full profile, the profile can be added for you, and if a record has a valid profile but no ST, the ST value can be filled in automatically.
The file should have a header line containing 'ST' and the locus names. Underscores in the locus names are ignored. Columns that do not form part of the defined MLST scheme are ignored.
- neisseria.txt - Neisseria profile definitions. Up-to-date definition files for all MLST schemes hosted on pubmlst.org can be found at http://pubmlst.org/data/.
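The header rules above can be illustrated with a short Python sketch. This is not START2's own code: the function names, the sample data and the choice to compare names after stripping underscores on both sides are assumptions made for illustration only.

```python
import csv
import io


def normalise(name):
    """Drop underscores, since they are ignored in locus names."""
    return name.strip().replace("_", "")


def read_profiles(handle, loci):
    """Return {ST: tuple of allele numbers} for the loci of one scheme.

    `loci` is the ordered list of locus names in the scheme (hypothetical
    input; START2 takes this from the schemes file).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [normalise(col) for col in next(reader)]
    st_col = header.index("ST")
    # Only scheme loci are looked up, so any other column is ignored.
    locus_cols = [header.index(normalise(locus)) for locus in loci]
    profiles = {}
    for row in reader:
        if not row or not row[st_col].strip():
            continue  # skip blank lines
        profiles[int(row[st_col])] = tuple(int(row[i]) for i in locus_cols)
    return profiles


# Made-up sample data: one locus written with an underscore, one extra column.
sample = (
    "ST\tabc_Z\tadk\tclonal_complex\n"
    "1\t1\t3\tST-1 complex\n"
    "2\t1\t4\t\n"
)
profiles = read_profiles(io.StringIO(sample), ["abcZ", "adk"])
print(profiles)
```

In this sketch the 'clonal_complex' column is skipped because it is not one of the scheme loci, and 'abc_Z' matches the locus 'abcZ'.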
Allele sequences
Allele sequence files should be in standard FASTA format. The sequences must be aligned and of equal length, although gaps are allowed. Unlike the old version of START, the program reads the sequence names from the FASTA file, so the sequences do not need to be ordered sequentially. To determine the allele number, the program removes the name of the locus and then reads digits backwards from the end of the name until it encounters a non-digit character, e.g. abcZ-10, abcZ10 and ANYSTRING_10 will all be recognised as allele 10. Because the locus name is removed first, a digit in the locus name does not affect the result, e.g. for the locus ICL1, both ICL110 and ICL1-10 will be recognised as allele 10.
- abcZ.tfa - Neisseria abcZ alleles file. Up-to-date allele files for all MLST schemes hosted on pubmlst.org can be found at http://pubmlst.org/data/.
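The allele-numbering rule described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the program's actual code: the function name is hypothetical, and it assumes the locus name is only removed when it appears at the start of the sequence name.

```python
import string


def allele_number(sequence_name, locus):
    """Return the allele number encoded at the end of a FASTA sequence name."""
    # Assumption: the locus name is stripped only when it is a leading prefix.
    if sequence_name.startswith(locus):
        rest = sequence_name[len(locus):]
    else:
        rest = sequence_name
    # Read digits backwards from the end until a non-digit is found.
    digits = ""
    for char in reversed(rest):
        if char not in string.digits:
            break
        digits = char + digits
    if not digits:
        raise ValueError(f"no allele number found in {sequence_name!r}")
    return int(digits)


for name, locus in [
    ("abcZ-10", "abcZ"),
    ("abcZ10", "abcZ"),
    ("ANYSTRING_10", "abcZ"),
    ("ICL110", "ICL1"),
    ("ICL1-10", "ICL1"),
]:
    print(name, allele_number(name, locus))
```

All five names from the examples above yield allele 10 in this sketch. Without the prefix-stripping step, ICL110 would be read as allele 110.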
Unlike the old version of START, analyses that only require single-locus data can be performed by loading a single sequence file without defining a dataset.
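Before loading a single allele file, it can be useful to confirm that it meets the equal-length requirement. The following Python sketch is a minimal FASTA reader with such a check. The function names and the sample sequences are made up for illustration, and the sketch says nothing about how START2 itself validates files.

```python
import io


def read_fasta(handle):
    """Return {sequence name: sequence} from FASTA text, joining wrapped lines."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def is_aligned(sequences):
    """True if all sequences share one length (gap characters count)."""
    return len({len(seq) for seq in sequences.values()}) <= 1


# Made-up sample: the second record is wrapped over two lines.
sample = ">abcZ-1\nACGT-A\n>abcZ-2\nACG\nTTA\n"
alleles = read_fasta(io.StringIO(sample))
print(alleles, is_aligned(alleles))
```

Gap characters are counted as part of the length here, consistent with gaps being allowed in aligned sequences.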
Datasets
The dataset file is a tab-delimited text file containing the allelic profile and/or ST definition along with an isolate or record identifier. The file should have a header line containing the identifier, the locus names and, optionally, 'ST'. The identifier column can be called 'id', 'strain' or 'isolate'. Underscores in the locus names are ignored. Other columns that do not form part of the defined MLST scheme are ignored. Incomplete profiles can be loaded.
- dataset.txt - Sample isolate dataset containing profiles for 156 Neisseria isolates.
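As a rough illustration of the dataset format, and of the autofill behaviour described under Profile definitions, the Python sketch below reads a small dataset and fills in missing values from a table of profile definitions. Everything here is an assumption for illustration: the function names, the record layout, the sample data, and the requirement that every scheme locus has a column in the header (with empty cells for missing alleles).

```python
import csv
import io

ID_COLUMNS = ("id", "strain", "isolate")


def read_dataset(handle, loci):
    """Return a list of records with 'id', 'ST' and 'profile' entries."""
    reader = csv.reader(handle, delimiter="\t")
    header = [col.strip().replace("_", "") for col in next(reader)]
    id_col = next(i for i, col in enumerate(header) if col in ID_COLUMNS)
    st_col = header.index("ST") if "ST" in header else None
    locus_cols = [header.index(locus.replace("_", "")) for locus in loci]

    def cell(row, index):
        # Empty or missing cells become None, so incomplete profiles load.
        if index is None or index >= len(row):
            return None
        text = row[index].strip()
        return int(text) if text else None

    records = []
    for row in reader:
        if not row:
            continue
        records.append({
            "id": row[id_col],
            "ST": cell(row, st_col),
            "profile": [cell(row, i) for i in locus_cols],
        })
    return records


def autofill(records, profiles):
    """Fill a missing profile from the ST, or a missing ST from a full profile."""
    by_profile = {alleles: st for st, alleles in profiles.items()}
    for record in records:
        complete = None not in record["profile"]
        if record["ST"] is not None and not complete:
            if record["ST"] in profiles:
                record["profile"] = list(profiles[record["ST"]])
        elif record["ST"] is None and complete:
            record["ST"] = by_profile.get(tuple(record["profile"]))
    return records


# Made-up sample data and definitions ({ST: alleles for abcZ, adk}).
definitions = {1: (1, 3), 2: (1, 4)}
sample = (
    "isolate\tST\tabcZ\tadk\tcountry\n"
    "A1\t1\t\t\tUK\n"
    "A2\t\t1\t4\tUSA\n"
    "A3\t\t9\t\tUK\n"
)
records = autofill(read_dataset(io.StringIO(sample), ["abcZ", "adk"]), definitions)
for record in records:
    print(record)
```

In this sketch, record A1 gains its profile from ST 1, record A2 gains ST 2 from its complete profile, and record A3 is left unchanged because its profile is incomplete and it has no ST.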