Search Engines

quantms supports three peptide database search engines: Comet, MS-GF+, and Sage. Each engine takes centroided MS2 spectra and a protein FASTA database as input and produces a list of scored peptide-spectrum matches (PSMs). The engines can be run individually or in combination, with ConsensusID merging their results for improved sensitivity.

Overview

Engine	Speed	Sensitivity	Best For
Comet	Fast	Good	General use, large datasets, default choice
MS-GF+	Moderate	Highest	High-resolution MS2, phosphoproteomics, ~15% more PSMs than Comet
Sage	Fastest	Good	Very large datasets, cloud-scale, Rust-based

Comet

Comet is the default search engine in quantms. It is widely used in the field, well-validated, and fast enough for routine datasets. Comet uses a cross-correlation scoring function (XCorr) originally developed for SEQUEST.

--search_engines comet

Comet-Specific Parameters

Parameter	Default	Description
`--precursor_mass_tolerance`	`10`	Precursor mass tolerance
`--precursor_mass_tolerance_unit`	`ppm`	Unit: `ppm` or `Da`
`--fragment_mass_tolerance`	`0.02`	Fragment ion tolerance
`--fragment_mass_tolerance_unit`	`Da`	Unit: `Da` or `ppm`
`--num_hits`	`1`	Number of PSM candidates reported per spectrum
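
The two tolerance units behave differently: a ppm tolerance scales with m/z, while a Da tolerance is an absolute window. The following is a minimal sketch of that arithmetic. The function names are illustrative only and are not part of quantms or Comet.

```python
def ppm_to_da(mz: float, tol_ppm: float) -> float:
    """Absolute window (in Da) that corresponds to a relative ppm tolerance at a given m/z."""
    return mz * tol_ppm / 1e6


def within_tolerance(observed: float, theoretical: float,
                     tol: float, unit: str = "ppm") -> bool:
    """Return True if the observed m/z matches the theoretical m/z within the tolerance."""
    if unit == "ppm":
        window = ppm_to_da(theoretical, tol)
    elif unit == "Da":
        window = tol
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return abs(observed - theoretical) <= window
```

With the defaults listed above, a 10 ppm precursor tolerance at m/z 1000 corresponds to a window of 0.01 Da, whereas the 0.02 Da fragment tolerance is the same width at every m/z.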

Comet is the recommended starting point for most datasets. It handles both high-resolution (Orbitrap) and low-resolution (ion trap) MS2 data effectively.

MS-GF+

MS-GF+ uses a probabilistic scoring model based on the generating function of a fragmentation model. It is particularly well-suited for:

High-resolution MS2 data (Orbitrap, QTOF)
Unusual proteases (non-tryptic or multienzyme experiments)
Phosphoproteomics and other PTM-rich datasets

MS-GF+ typically identifies ~15% more PSMs than Comet on high-resolution MS2 data at the same FDR, making it valuable for sensitive analyses.

--search_engines msgf

MS-GF+ requires centroided spectra. quantms performs centroiding automatically if peak picking has not been applied upstream.

Sage

Sage is a high-performance search engine written in Rust. It is the fastest option in quantms — often 5–10× faster than Comet on the same hardware.

--search_engines sage

Sage uses a fragment-ion index for rapid spectrum matching and is designed for modern multi-core hardware. It is the recommended engine for:

Datasets with hundreds of raw files
Cloud-scale reanalysis of public proteomics data
Rapid exploratory searches before final analysis

Sage's sensitivity is comparable to Comet's for standard tryptic datasets, though it may be slightly less sensitive than MS-GF+ for highly modified peptides.

Multi-Engine Consensus (ConsensusID)

Running two or more engines in combination and merging results with ConsensusID improves both sensitivity (more PSMs identified by at least one engine) and specificity (PSMs confirmed by multiple engines are more reliable).

--search_engines "comet msgf"
--search_engines "comet sage"
--search_engines "comet msgf sage"

When more than one engine is specified, quantms automatically runs all selected engines and pipes their results through the OpenMS ConsensusID module.

How ConsensusID Works

ConsensusID merges PSMs from multiple engines at the spectrum level:

For each spectrum, PSM candidates from all engines are collected
A consensus score is computed — by default using the PEPMatrix method, which combines posterior error probabilities from each engine
The merged PSM list is passed to Percolator for unified FDR estimation
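
The spectrum-level merge can be illustrated with a deliberately simplified sketch. It is not the PEPMatrix algorithm and not the OpenMS implementation: it assumes the engines are independent, multiplies the posterior error probabilities (PEPs) of the engines that agree on a peptide, and keeps the peptide with the lowest product. The `PSM` type and `naive_consensus` function are hypothetical names used only for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class PSM(NamedTuple):
    spectrum_id: str
    peptide: str
    engine: str
    pep: float  # posterior error probability, lower is better


def naive_consensus(psms):
    """Return {spectrum_id: (peptide, combined_pep, n_engines)}.

    Simplified rule (an assumption, NOT PEPMatrix): multiply the PEPs of the
    engines that report the same peptide and keep the lowest product.
    """
    # spectrum_id -> peptide -> engine -> best PEP from that engine
    by_spectrum = defaultdict(lambda: defaultdict(dict))
    for psm in psms:
        per_engine = by_spectrum[psm.spectrum_id][psm.peptide]
        per_engine[psm.engine] = min(psm.pep, per_engine.get(psm.engine, 1.0))

    consensus = {}
    for spectrum_id, candidates in by_spectrum.items():
        best = None
        for peptide, per_engine in candidates.items():
            combined = 1.0
            for pep in per_engine.values():
                combined *= pep
            if best is None or combined < best[1]:
                best = (peptide, combined, len(per_engine))
        consensus[spectrum_id] = best
    return consensus


example = [
    PSM("s1", "PEPTIDEK", "comet", 0.05),
    PSM("s1", "PEPTIDEK", "msgf", 0.02),
    PSM("s1", "PEPTLDEK", "sage", 0.01),
    PSM("s2", "ELVISK", "comet", 0.2),
]
result = naive_consensus(example)
```

In this toy input, two engines agree on `PEPTIDEK` for spectrum `s1`, so it outranks the single-engine candidate even though that candidate has the lowest individual PEP.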

The recommended combination for most datasets is Comet + MS-GF+, which provides the best sensitivity-to-speed trade-off. Replacing MS-GF+ with Sage (Comet + Sage) reduces wall-clock time compared to Comet + MS-GF+ and often produces comparable results.

Which Engine to Choose

Is your dataset very large (>100 files) or do you need rapid turnaround?
  → Use Sage, or Sage + Comet

Do you need maximum sensitivity (phosphoproteomics, low-abundance proteome)?
  → Use MS-GF+, or Comet + MS-GF+

Standard tryptic proteomics, balanced speed and sensitivity?
  → Use Comet (default), or Comet + MS-GF+ for best results

Re-analyzing public PRIDE data at scale?
  → Use Sage for speed; add Comet if sensitivity is critical
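
The questions above can be written as a small helper that returns a value for `--search_engines`. This is only a sketch of the guidance on this page: `suggest_engines` and its arguments are hypothetical, the questions are evaluated in the order listed, and large-scale PRIDE reanalysis is treated the same as the "very large dataset" case.

```python
def suggest_engines(n_files: int,
                    rapid_turnaround: bool = False,
                    max_sensitivity: bool = False,
                    add_second_engine: bool = False) -> str:
    """Return a --search_engines value following the questions above, in order."""
    if n_files > 100 or rapid_turnaround:
        # Very large dataset or fast turnaround: Sage, optionally with Comet.
        engines = ["comet", "sage"] if add_second_engine else ["sage"]
    elif max_sensitivity:
        # Phosphoproteomics or low-abundance proteome: MS-GF+, optionally with Comet.
        engines = ["comet", "msgf"] if add_second_engine else ["msgf"]
    else:
        # Standard tryptic proteomics: Comet, optionally with MS-GF+.
        engines = ["comet", "msgf"] if add_second_engine else ["comet"]
    return " ".join(engines)
```

For example, `suggest_engines(250)` yields `sage`, while `suggest_engines(12, add_second_engine=True)` yields `comet msgf`.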

For publication-quality analysis, running two engines (Comet + MS-GF+) is recommended. The additional computation time is typically 1.5–2× a single-engine run, and the sensitivity gain at the protein level is typically 10–20%.

Common Search Parameters

These parameters apply to all engines:

Parameter	Default	Description
`--enzyme`	`Trypsin`	Digest enzyme (`Trypsin`, `LysC`, `GluC`, `chymotrypsin`, etc.)
`--num_enzyme_termini`	`2`	Enzyme termini: `2` (fully), `1` (semi), `0` (non-specific)
`--allowed_missed_cleavages`	`2`	Maximum missed cleavage sites
`--fixed_mods`	`Carbamidomethyl (C)`	Fixed modifications
`--variable_mods`	`Oxidation (M)`	Variable modifications (comma-separated)
`--add_decoys`	`true`	Auto-generate reversed decoy sequences
`--fdr_threshold`	`0.01`	PSM/peptide FDR threshold
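
To make the enzyme and missed-cleavage settings concrete, the sketch below performs a fully specific in-silico tryptic digest. It assumes the classic trypsin rule (cleave after K or R, but not before P) and is not the code the search engines use. `tryptic_peptides` is a hypothetical helper written for this example.

```python
import re


def tryptic_peptides(protein: str, missed_cleavages: int = 2, min_length: int = 1):
    """List fully tryptic peptides with up to `missed_cleavages` missed sites."""
    # Zero-width split after K or R unless the next residue is P (Python 3.7+).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for start in range(len(fragments)):
        # Joining n + 1 consecutive fragments corresponds to n missed cleavages.
        last = min(start + missed_cleavages + 1, len(fragments))
        for end in range(start + 1, last + 1):
            peptide = "".join(fragments[start:end])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


no_missed = tryptic_peptides("MKAAARPEPKTTR", missed_cleavages=0)
two_missed = tryptic_peptides("MKAAARPEPKTTR", missed_cleavages=2)
```

Here the `RP` site is not cleaved, so the sequence yields three fragments. Allowing two missed cleavages adds every concatenation of up to three adjacent fragments, which is why raising `--allowed_missed_cleavages` enlarges the search space.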

When running from an SDRF file, search parameters encoded in the SDRF (comment[cleavage agent details], comment[modification parameters], tolerances) take precedence over command-line values.
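
The precedence rule amounts to an override merge, sketched below. The dictionary keys and the `effective_params` function are hypothetical. How quantms actually parses the SDRF columns, and how it treats empty SDRF fields, is not shown here; skipping empty values is an assumption of this example.

```python
def effective_params(cli_params: dict, sdrf_params: dict) -> dict:
    """Merge parameters so that values present in the SDRF override CLI values."""
    # Assumption: empty or missing SDRF entries do not override the CLI value.
    overrides = {k: v for k, v in sdrf_params.items() if v not in (None, "")}
    return {**cli_params, **overrides}


cli = {"enzyme": "Trypsin", "allowed_missed_cleavages": 2, "precursor_mass_tolerance": 10}
sdrf = {"enzyme": "LysC", "precursor_mass_tolerance": None}
merged = effective_params(cli, sdrf)
```

In this example the enzyme comes from the SDRF, while the missed-cleavage limit and precursor tolerance fall back to the command-line values.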