Annotation pipeline
Technical overview annotation pipeline
This section provides a high-level overview of the processing steps performed during import and annotation of metaproteomics data. It outlines how the data are aggregated and quantified. Several decisions on missing-data handling and format restrictions were made to manage potential identification and quantification biases in the datasets. The sections below loosely correspond to functions executed in the dashboard backend. The specific processing steps executed during import depend on the user-specified annotation options and on the sources included in the dashboard (merge DB search files into one sample, perform Unipept annotation of peptide sequences, etc.).
Annotation start: Load dataset and set annotation options
- Check that the datasets required for annotation are present (see scenarios).
- Load the current sample table, then check and filter the DB search data in the input dataset and the current sample table. If sample names are duplicated, add a suffix to the newly imported name to keep them distinct.
- Perform peptide annotation to process the data and expand the sample table.
- Concatenate the MetaPepTable object for the new sample with the current sample table.
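The duplicate-name handling in the second step can be sketched as follows. This is only an illustration: the function name `make_unique_sample_name` and the `_1`, `_2` suffix format are assumptions, not the names or format used by the dashboard backend.

```python
def make_unique_sample_name(new_name, existing_names):
    """Return new_name, or new_name plus a numeric suffix if it is taken.

    Hypothetical helper: the suffix format ("_1", "_2", ...) is an assumption.
    """
    taken = set(existing_names)
    if new_name not in taken:
        return new_name
    counter = 1
    # Increase the counter until the suffixed name is free.
    while f"{new_name}_{counter}" in taken:
        counter += 1
    return f"{new_name}_{counter}"


current_samples = ["soil_A", "soil_B", "soil_A_1"]
print(make_unique_sample_name("soil_A", current_samples))  # soil_A_2
print(make_unique_sample_name("soil_C", current_samples))  # soil_C
```

Only the newly imported name is changed, so names already present in the sample table stay stable.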
Annotate peptides
- Import taxonomy db (if GTDB chosen, do not perform Unipept search)
- Import taxonomy map file (if uploaded) (perform LCA if protein ID present multiple times)
- Import function map file (if uploaded)
- Load cRAP dataset if specified by user
- Import de novo data file(s) into MetaPepDeNovo format
    - Create a dictionary that maps: {raw_spectral_name: MetaPepDeNovo_obj}
    - When importing and processing de novo data, filter out cRAP peptides if specified
    - Add de novo data only if its raw_spectral_name is not yet present in the dict. This ensures that every spectrum file has only one de novo file within a single sample (between samples, duplicate files are allowed).
- Import database search file(s) if present
    - If Merge DB search is True (all DB search files are one sample):
        - Load all DB search files into MetaPepDbSearch format (see Load metapep DB search) and store them in a list; filter out cRAP peptides for all files if specified.
        - Concatenate the MetaPepDbSearch tables (one per DB search file).
        - Check during concatenation that source files (raw spectrum files) are unique across objects; otherwise, peptides may be counted twice.
        - Create a new sample dataset for the concatenated MetaPepDbSearch file.
    - If Merge DB search is False (each DB search file is its own sample):
        - Loop through the DB search files:
            - Load a DB search file and process it into MetaPepDbSearch (see Load metapep DB search); filter out cRAP peptides if specified.
            - Create a new sample dataset for the single MetaPepDbSearch file.
            - Append the MetaPepTable to a list and iterate to the next DB search file.
        - Concatenate all MetaPepTable objects into a single object.
- If no DB search files are supplied, only de novo files: build a MetaPepTable (see build_metapep_table() level) for a sample with only de novo data and taxonomy mapping. Append the sample to the existing MetaPepTable and return it.
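Two of the bookkeeping rules above, one de novo file per spectrum file and unique source files when merging DB search files, can be sketched as follows. The classes `DeNovoData` and `DbSearchData` and both function names are hypothetical stand-ins for illustration; they are not the actual MetaPepDeNovo or MetaPepDbSearch classes.

```python
from dataclasses import dataclass, field


@dataclass
class DeNovoData:
    """Hypothetical stand-in for a MetaPepDeNovo object."""
    file_name: str
    raw_spectral_name: str


@dataclass
class DbSearchData:
    """Hypothetical stand-in for a MetaPepDbSearch object."""
    file_name: str
    source_files: list = field(default_factory=list)


def build_de_novo_dict(de_novo_files):
    """Keep the first de novo file seen for each raw spectral name."""
    de_novo_dict = {}
    for de_novo in de_novo_files:
        if de_novo.raw_spectral_name not in de_novo_dict:
            de_novo_dict[de_novo.raw_spectral_name] = de_novo
    return de_novo_dict


def check_unique_sources(db_search_files):
    """Raise if two DB search files report the same raw spectrum file."""
    seen = {}
    for db_search in db_search_files:
        for source in db_search.source_files:
            if source in seen:
                raise ValueError(
                    f"{source!r} occurs in both {seen[source]!r} "
                    f"and {db_search.file_name!r}"
                )
            seen[source] = db_search.file_name


de_novo_dict = build_de_novo_dict([
    DeNovoData("novo_a.csv", "run_01"),
    DeNovoData("novo_b.csv", "run_02"),
    DeNovoData("novo_c.csv", "run_01"),  # skipped: run_01 already registered
])
print(sorted(de_novo_dict))              # ['run_01', 'run_02']
print(de_novo_dict["run_01"].file_name)  # novo_a.csv
```

Raising an error on a repeated source file is one possible reaction; the point of the check is that the same spectra never contribute to peptide counts twice within a merged sample.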
Convert source specific metaproteomics format into MetaPepView format
- Load the correct DB search class based on the DB search/de novo format (e.g. Sage).
- Read the data into the class and add the sample name to it.
- Convert the format-specific DB search/de novo object to a MetaPep{DbSearch | DeNovo} object.
    - Rename columns to a consistent format (save the confidence format as a variable).
    - Extract the amino acid sequence from peptides (equate leucine and isoleucine, remove PTMs).
    - Filter out cRAP peptides.
    - Extract all unique source file names and store them in a list in the MetaPep{DbSearch | DeNovo} object.
    - Remove the file type suffix from the source file name and format the protein ID delimiter.
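The sequence and source file clean-up can be illustrated with a minimal sketch. It assumes that modifications are written in parentheses or square brackets and that residues are uppercase letters; real formats differ, and the function names as well as the choice to map I to L (rather than L to I) are assumptions.

```python
import re
from pathlib import PurePath

# Assumed PTM notation: anything inside (...) or [...].
_MOD_PATTERN = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def normalize_sequence(peptide):
    """Strip modifications and equate isoleucine with leucine."""
    bare = _MOD_PATTERN.sub("", peptide)
    # Keep uppercase residue letters only (drops terminal markers such as "n").
    bare = re.sub(r"[^A-Z]", "", bare)
    return bare.replace("I", "L")


def strip_file_suffix(source_file):
    """Return the source file name without directory and file type suffix."""
    return PurePath(source_file).stem


print(normalize_sequence("PEPTM(+15.99)IDE"))    # PEPTMLDE
print(normalize_sequence("n[+42.01]ACDEFGHIK"))  # ACDEFGHLK
print(strip_file_suffix("run_01.mzML"))          # run_01
```

Equating the two residues matters because leucine and isoleucine have identical mass, so peptides that differ only at those positions would otherwise be split into separate sequence groups.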
Process data into new sample
- If DB search data supplied:
    - Apply the confidence threshold cutoff and aggregate spectrum matches into peptide sequence groups: create a PSM column that counts the number of observations of each peptide sequence, sum the MS1 precursor signal intensities, and store the maximum spectrum confidence as the peptide sequence confidence. From the highest-confidence scan, take the spectrum information as peptide information (e.g. retention time, m/z, ppm, scan number, etc.).
    - taxonomic_annotation() (if protein-taxonomy map present): supplement the peptide-grouped data with taxonomy ID and lineage information. If no mapping file is present, add empty columns instead. If multiple proteins match a peptide, store the LCA of the protein taxa.
    - functional_annotation() (if protein-function map present): supplement the peptide data with KEGG KO information from the function mapper. If multiple proteins map to a peptide, either store the information only if the proteins' IDs do not conflict (empty values are ignored), or concatenate the IDs into a combined string.
- If de novo data supplied:
    - include_de_novo():
        - If all DB search files are one single sample, take all de novo file data and concatenate it into a single MetaPepDeNovo (all de novo peptides are included in the sample, no matter the spectrum file source).
        - If each DB search file is a separate sample: fetch the de novo files that match the source files in the DB search data file and concatenate only these files (one DB search file is one sample, so it only matches de novo files that come from the same MS runs as that DB search file).
    - process_de_novo_data(): apply the confidence cutoff and peptide length cutoff, and group spectra in the concatenated de novo object by peptide sequence and sample name (see process_db_search_data() for aggregation rules).
    - merge_de_novo_db_search(): add de novo fields to the DB search peptide dataset, matched by peptide sequence.
    - Store de novo metadata (confidence format, import status, de novo file format).
- If Unipept taxonomy selected: global_taxonomic_annotation()
- Set metadata fields: formats, what data was imported, etc.
- Combine the peptide dataset with the metadata into a MetaPepTable object.
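The aggregation rules and the LCA step for DB search data can be sketched in plain Python as follows. This is not the backend implementation: the record keys, both function names, and the assumption that a higher confidence value is better are illustrative only (some formats score the other way around), and lineages are assumed to be ordered from root to leaf.

```python
from collections import defaultdict


def aggregate_psms(psms, min_confidence):
    """Collapse spectrum matches into one record per peptide sequence.

    Assumes higher confidence is better and that each match is a dict with
    the (hypothetical) keys used below.
    """
    groups = defaultdict(list)
    for psm in psms:
        if psm["confidence"] >= min_confidence:
            groups[psm["sequence"]].append(psm)

    peptides = {}
    for sequence, matches in groups.items():
        best = max(matches, key=lambda m: m["confidence"])
        peptides[sequence] = {
            "psm_count": len(matches),                          # observations
            "intensity": sum(m["intensity"] for m in matches),  # summed MS1 signal
            "confidence": best["confidence"],                   # maximum confidence
            "rt": best["rt"],                                   # from the best scan
            "scan": best["scan"],
        }
    return peptides


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None."""
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


psms = [
    {"sequence": "PEPTLDE", "confidence": 0.95, "intensity": 1000.0, "rt": 12.1, "scan": 101},
    {"sequence": "PEPTLDE", "confidence": 0.99, "intensity": 500.0, "rt": 12.3, "scan": 140},
    {"sequence": "PEPTLDE", "confidence": 0.40, "intensity": 300.0, "rt": 12.5, "scan": 150},
    {"sequence": "AAAK", "confidence": 0.97, "intensity": 200.0, "rt": 3.4, "scan": 20},
]
peptides = aggregate_psms(psms, min_confidence=0.9)
print(peptides["PEPTLDE"]["psm_count"], peptides["PEPTLDE"]["scan"])  # 2 140

lineages = [
    ["Bacteria", "Pseudomonadota", "Gammaproteobacteria"],
    ["Bacteria", "Pseudomonadota", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(lineages))  # Pseudomonadota
```

In the example, the low-confidence match is dropped before counting, so it contributes neither to the PSM count nor to the summed intensity.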