Curators ensure that data added to the PubMLST databases are consistent and correct. If you are interested in becoming a curator, the following should provide a guide as to what the role entails.
For each species or genus site, there are usually two databases:
- a sequence definition database, which defines alleles and the allelic profiles linked to sequence types (STs).
- an isolate database that contains isolate provenance data, allele designations and potentially genome sequences.
Most databases now use an automated submission system which performs some validation and routes submission notifications to the appropriate curator(s).
For most databases, the workload required of a curator is fairly minimal. Ideally, submissions made to the databases should be handled in a reasonable time frame as submitters are relying on the provided designations. Under normal circumstances, a turn-around time of about a week is reasonable. Where the workload is significant, or if a curator is likely to be frequently unavailable, multiple curators can be assigned permissions to handle submissions.
Curators can seek advice at any time if they are unsure about any aspect of the role or about how to process a particular submission.
Alleles
Allele sequences may be generated using a number of different sequencing technologies. The sequences should already be trimmed to the appropriate start/end sites of the locus by the submitter, and it is the role of the curator to ensure that this has been done correctly. You can check this by comparing the submission against existing alleles, using the links provided within the curator's interface. Depending on the sequencing technology, you may also need to check the quality of the data. Any allele that has not been trimmed correctly should be rejected. The curator's guide shows an overview of allele curation.
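As a rough illustration of the trimming check described above, the following Python sketch flags a submitted sequence whose length or ends differ from alleles already defined for the locus. It is not part of the curator's interface: the function name `trimming_flags`, the `flank` parameter and the example sequences are all hypothetical. A flag is only a prompt for manual review, since a genuinely new allele can differ from existing ones near its ends.

```python
from collections import Counter


def trimming_flags(submitted: str, existing: list[str], flank: int = 10) -> list[str]:
    """Return reasons why a submitted allele may not be trimmed correctly.

    `existing` holds sequences already defined for the same locus
    (hypothetical input; in practice these come from the database).
    """
    if not existing:
        raise ValueError("at least one existing allele is needed for comparison")

    seq = submitted.upper()
    flags = []

    # Anything other than A, C, G or T suggests an ambiguous or corrupt sequence.
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        flags.append(f"non-ACGT characters: {''.join(sorted(unexpected))}")

    # Compare against the most common length among existing alleles.
    typical_length = Counter(len(a) for a in existing).most_common(1)[0][0]
    if len(seq) != typical_length:
        flags.append(f"length {len(seq)} differs from typical length {typical_length}")

    # Compare the first and last `flank` bases with those of existing alleles.
    known_starts = {a[:flank].upper() for a in existing}
    known_ends = {a[-flank:].upper() for a in existing}
    if seq[:flank] not in known_starts:
        flags.append("start does not match any existing allele")
    if seq[-flank:] not in known_ends:
        flags.append("end does not match any existing allele")

    return flags


# Made-up sequences: the second submission carries two extra leading bases.
existing_alleles = ["ATGCCGTTAGGCTAA", "ATGCCGTTCGGCTAA"]
print(trimming_flags("ATGCCGTTTGGCTAA", existing_alleles, flank=5))
print(trimming_flags("GGATGCCGTTAGGCTAA", existing_alleles, flank=5))
```

An empty list means none of these simple checks found anything to query; it does not replace inspection of the alignment against existing alleles.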
Sanger sequencing
Sequences determined by Sanger/dideoxy terminator sequencing require that forward and reverse trace files (files ending with .ab1/.scf) are submitted. The curator is required to assemble these and check that the submitted sequence is supported by the trace files. The sequence query database tools (linked directly from the submission interface) can be used to identify which nucleotides vary from the most similar existing allele, so check specifically that both traces are unambiguous at these positions. If trace files are of poor quality or do not cover the full length of the allele, then the sequence should be rejected.
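To make the idea of "positions that vary from the most similar existing allele" concrete, here is a minimal Python sketch. It assumes the submitted sequence and the existing alleles have the same length and differ only by substitutions, so a simple position-by-position comparison is enough; the database's own sequence query tools are not reproduced here. The function names and sequences are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def closest_allele(submitted: str, alleles: dict[str, str]) -> str:
    """Return the id of the same-length allele with the fewest mismatches."""
    seq = submitted.upper()
    candidates = {
        allele_id: s.upper() for allele_id, s in alleles.items() if len(s) == len(seq)
    }
    if not candidates:
        raise ValueError("no existing allele has the same length as the submission")
    return min(candidates, key=lambda allele_id: hamming(seq, candidates[allele_id]))


def variant_positions(submitted: str, reference: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, submitted base) for each mismatch."""
    if len(submitted) != len(reference):
        raise ValueError("sequences must be the same length for this comparison")
    pairs = zip(reference.upper(), submitted.upper())
    return [(pos, ref, sub) for pos, (ref, sub) in enumerate(pairs, start=1) if ref != sub]


# Made-up alleles for a single locus.
alleles = {"1": "ACGTACGT", "2": "ACGTTCGT"}
submitted = "ACGTTCGA"

nearest = closest_allele(submitted, alleles)
for pos, ref, sub in variant_positions(submitted, alleles[nearest]):
    print(f"allele {nearest}, position {pos}: {ref} -> {sub}")
```

The positions reported are the ones to inspect in both the forward and reverse traces.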
Whole genome sequencing
We are increasingly seeing submission of new allele sequences determined by whole genome sequencing methods, such as Illumina. These are easier to curate as we currently accept these sequences without additional evidence, provided that the allele is trimmed to the correct start and stop positions of the locus.
Allelic profiles/STs
Submitters may send new combinations of existing alleles for ST definition. Most databases require that every new ST assigned has representative isolate data available, so a separate isolate submission should be made to the isolate database at the same time. The curator should ensure that these isolate data have been submitted before defining the new ST. No other checks are usually required. The curator's guide shows an overview of profile curation.
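The logic of a profile submission can be sketched in a few lines of Python. This is an illustration only, with hypothetical names (`profile_problems`, `locusA`, `locusB`) and made-up data; the submission system may already perform some of these checks itself. It confirms that each allele in the profile is already defined, that the combination is not already assigned to an ST, and that an isolate submission accompanies it.

```python
def profile_problems(
    profile: dict[str, int],
    known_alleles: dict[str, set[int]],
    existing_profiles: dict[int, dict[str, int]],
    has_isolate_submission: bool,
) -> list[str]:
    """Return reasons why a submitted profile should not yet be given a new ST."""
    problems = []

    # Every locus in the scheme needs an allele that is already defined.
    for locus, allele_ids in known_alleles.items():
        if locus not in profile:
            problems.append(f"{locus}: no allele given")
        elif profile[locus] not in allele_ids:
            problems.append(f"{locus}: allele {profile[locus]} is not defined")

    # The combination must be new, otherwise an ST already exists for it.
    for st, existing in existing_profiles.items():
        if existing == profile:
            problems.append(f"profile is already defined as ST {st}")

    # Representative isolate data should be available before defining the ST.
    if not has_isolate_submission:
        problems.append("no accompanying isolate submission found")

    return problems


# Made-up two-locus scheme.
known_alleles = {"locusA": {1, 2}, "locusB": {1, 2, 3}}
existing_profiles = {1: {"locusA": 1, "locusB": 1}}

print(profile_problems({"locusA": 2, "locusB": 3}, known_alleles, existing_profiles, True))
print(profile_problems({"locusA": 1, "locusB": 1}, known_alleles, existing_profiles, False))
```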
Isolates/genomes
The submission system will perform basic checks on submitted isolate data to ensure that the field values respect any constraints set by the database. Curators are usually experts in the species/genus in question and should ensure that data make sense from a biological perspective. Uploading new isolate and genome data is straightforward and the curator's guide provides an overview. Curators should also scan new isolates with genome assemblies to identify and designate new MLST alleles and STs. See the short video below for guidance: