File Formats

This page describes the file formats used as input to and output from quantms.

SDRF — Sample and Data Relationship Format

The SDRF-Proteomics format is the recommended input for quantms. It is a tab-delimited file that encodes both sample metadata and search parameters in a single table, making experiments fully self-describing and reproducible.

Column Conventions

Column names follow a controlled vocabulary: source name, comment[...], characteristics[...], factor value[...]
One row per raw file (or per channel for isobaric experiments)
File URIs in comment[file uri] support local paths (file:///) and remote URIs (s3://, gs://, https://)
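
As a minimal sketch of the URI convention above, the following Python snippet classifies a comment[file uri] value by its scheme. The function name `classify_file_uri` and the set of remote schemes are assumptions made for illustration (the schemes are those named on this page); this is not how quantms itself resolves files.

```python
from urllib.parse import urlparse

# Remote schemes named on this page; assumed here, not an exhaustive list.
REMOTE_SCHEMES = {"s3", "gs", "https"}


def classify_file_uri(uri):
    """Return 'local', 'remote', or 'unknown' for a comment[file uri] value."""
    scheme = urlparse(uri).scheme.lower()
    if scheme == "file":
        return "local"
    if scheme in REMOTE_SCHEMES:
        return "remote"
    # Bare paths (no scheme) and unrecognized schemes end up here.
    return "unknown"


for uri in ("file:///data/run1.mzML", "s3://bucket/run1.raw", "/data/run1.mzML"):
    print(uri, "->", classify_file_uri(uri))
```

A bare path such as `/data/run1.mzML` has no scheme and is reported as unknown, which makes a missing `file:///` prefix easy to spot before launching a run.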

Required Columns

Column	Description	Example
`source name`	Unique sample identifier	`ctrl_1`
`comment[file uri]`	URI of the spectra file	`file:///data/run1.mzML`
`comment[label]`	Labeling type and channel	`label free sample` or `TMT10plex-126`
`comment[fraction identifier]`	Fraction number for fractionated samples	`1`
`comment[data file]`	Raw file name	`run1.raw`
`characteristics[organism]`	Species	`Homo sapiens`
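
Before running the pipeline it can help to confirm that an SDRF header contains every column listed above. The sketch below is one way to do that with Python's standard library; the helper name `missing_required_columns` is hypothetical, and the check only compares exact column names (after trimming whitespace), which is an assumption about how strict the match should be.

```python
import csv
import io

# Required columns as listed in the table above.
REQUIRED_COLUMNS = [
    "source name",
    "comment[file uri]",
    "comment[label]",
    "comment[fraction identifier]",
    "comment[data file]",
    "characteristics[organism]",
]


def missing_required_columns(sdrf_text):
    """Return the required columns that are absent from the SDRF header row."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]


header_only = "source name\tcomment[file uri]\tcomment[label]\n"
print(missing_required_columns(header_only))
```

For a real file, the same function can be applied to the text returned by `open(path, encoding="utf-8").read()`.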

Common Optional Columns

Column	Description
`comment[cleavage agent details]`	Enzyme (PSI-MS CV term, e.g. `MS:1001251` for Trypsin)
`comment[modification parameters]`	Fixed and variable modifications (UNIMOD terms, e.g. `NT=Oxidation;MT=Variable;TA=M;AC=UNIMOD:35`)
`comment[precursor mass tolerance]`	e.g. `10 ppm`
`comment[fragment mass tolerance]`	e.g. `0.02 Da`
`comment[dissociation method]`	`HCD`, `CID`, `ETD`
`factor value[disease]`	Experimental condition for DE analysis
`characteristics[pooled sample]`	Marks a reference channel in TMT experiments

Example SDRF Fragment

source name	comment[file uri]	comment[label]	comment[fraction identifier]	characteristics[organism]	factor value[disease]
ctrl_1	file:///data/ctrl_1.mzML	label free sample	1	Homo sapiens	healthy
ctrl_2	file:///data/ctrl_2.mzML	label free sample	1	Homo sapiens	healthy
case_1	file:///data/case_1.mzML	label free sample	1	Homo sapiens	NASH
case_2	file:///data/case_2.mzML	label free sample	1	Homo sapiens	NASH
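
To illustrate how the factor value column defines the comparison groups, the sketch below rebuilds the fragment above as tab-delimited text and groups source names by condition. The helper name `samples_by_factor` is hypothetical and not part of quantms.

```python
import csv
import io
from collections import defaultdict

HEADER = [
    "source name", "comment[file uri]", "comment[label]",
    "comment[fraction identifier]", "characteristics[organism]",
    "factor value[disease]",
]
ROWS = [
    ["ctrl_1", "file:///data/ctrl_1.mzML", "label free sample", "1", "Homo sapiens", "healthy"],
    ["ctrl_2", "file:///data/ctrl_2.mzML", "label free sample", "1", "Homo sapiens", "healthy"],
    ["case_1", "file:///data/case_1.mzML", "label free sample", "1", "Homo sapiens", "NASH"],
    ["case_2", "file:///data/case_2.mzML", "label free sample", "1", "Homo sapiens", "NASH"],
]
# SDRF is tab-delimited: join fields with tabs and rows with newlines.
EXAMPLE_SDRF = "\n".join("\t".join(fields) for fields in [HEADER, *ROWS])


def samples_by_factor(sdrf_text, factor_column="factor value[disease]"):
    """Map each value of the factor column to the source names that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"):
        groups[row[factor_column]].append(row["source name"])
    return dict(groups)


print(samples_by_factor(EXAMPLE_SDRF))
```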

mzTab — Primary Result Format

mzTab is the HUPO-PSI standard for reporting mass spectrometry-based proteomics and metabolomics results. quantms produces mzTab as the primary output for both LFQ and TMT/iTRAQ workflows.

mzTab is a multi-section tab-delimited text file. Each line begins with a row type prefix:

Prefix	Section
`MTD`	Metadata (software versions, study variables, assay definitions)
`PRH`	Protein table header
`PRT`	Protein row
`PEH`	Peptide table header
`PEP`	Peptide row
`PSH`	PSM table header
`PSM`	PSM row

The protein section contains per-study-variable abundance values. The PSM section contains one row per identified spectrum with scores, FDR, and modified sequences.
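
Because every line carries its row type as the first tab-separated field, an mzTab file can be split into sections with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the sample lines are toy data rather than real quantms output, and a dedicated mzTab reader is a better choice for production use.

```python
from collections import defaultdict


def split_mztab_sections(lines):
    """Group mzTab lines by row-type prefix (MTD, PRH, PRT, PEH, PEP, PSH, PSM)."""
    sections = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank separator lines between sections
        prefix, _, rest = line.partition("\t")
        sections[prefix].append(rest.split("\t") if rest else [])
    return dict(sections)


def rows_as_dicts(sections, header_prefix, row_prefix):
    """Pair a header line (e.g. PSH) with its data rows (e.g. PSM)."""
    headers = sections.get(header_prefix, [])
    if not headers:
        return []
    return [dict(zip(headers[0], row)) for row in sections.get(row_prefix, [])]


# Toy lines for illustration; real files contain many more columns.
toy_lines = [
    "MTD\tmzTab-version\t1.0.0",
    "",
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tPEPTIDEK\t1\tP12345",
    "PSM\tELVISK\t2\tP67890",
]
sections = split_mztab_sections(toy_lines)
for psm in rows_as_dicts(sections, "PSH", "PSM"):
    print(psm["PSM_ID"], psm["sequence"], psm["accession"])
```

Passing an open file handle instead of `toy_lines` works the same way, since the function only iterates over lines.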

mzTab files produced by quantms are suitable for:

Submission to PRIDE (the EMBL-EBI repository in the ProteomeXchange consortium)
Direct loading into mokume for protein quantification
Conversion to other formats with qpx

MSstats CSV — Differential Expression Input

The MSstats CSV (also called msstats_in.csv) is a long-format table compatible with the OpenMStoMSstatsFormat() function in the MSstats R package.

Column Structure

Column	Description
`ProteinName`	Protein accession
`PeptideSequence`	Peptide sequence (with modifications)
`PrecursorCharge`	Precursor charge state
`FragmentIon`	Fragment ion identifier (LFQ) or channel (TMT)
`ProductCharge`	Product ion charge
`IsotopeLabelType`	`L` (light/LFQ) or `H` (heavy/reference)
`Condition`	Experimental condition
`BioReplicate`	Biological replicate identifier
`Run`	Raw file or run identifier
`Intensity`	Measured intensity (MS1 feature or reporter ion)

quantms exports this file as input for downstream statistical analysis with the MSstats R package; it does not run MSstats itself.
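
A quick sanity check on the exported table can catch missing columns before it reaches R. The sketch below verifies the column set from the table above and sums intensities per protein and run. The function name is hypothetical, the rows are toy data, treating empty or `NA` intensities as missing is an assumption, and the plain sum is only a sanity check, not the summarization MSstats performs.

```python
import csv
import io
from collections import defaultdict

MSSTATS_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
    "ProductCharge", "IsotopeLabelType", "Condition", "BioReplicate",
    "Run", "Intensity",
]


def summed_intensity_per_protein_run(csv_text):
    """Check the expected columns, then sum Intensity per (ProteinName, Run)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in MSSTATS_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing MSstats columns: {missing}")
    totals = defaultdict(float)
    for row in reader:
        value = row["Intensity"]
        if value in ("", "NA"):  # assumed markers for missing intensities
            continue
        totals[(row["ProteinName"], row["Run"])] += float(value)
    return dict(totals)


# Toy rows for illustration only.
toy_csv = "\n".join([
    ",".join(MSSTATS_COLUMNS),
    "P1,PEPTIDEK,2,NA,0,L,healthy,1,run1,100.0",
    "P1,ELVISK,2,NA,0,L,healthy,1,run1,50.5",
    "P1,PEPTIDEK,2,NA,0,L,NASH,2,run2,NA",
    "P2,AAAK,3,NA,0,L,NASH,2,run2,10",
])
print(summed_intensity_per_protein_run(toy_csv))
```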

Raw File Formats

quantms natively supports the following spectra file formats:

Format	Extension	Notes
mzML	`.mzML`	Open standard; preferred input; used directly without conversion
Thermo RAW	`.raw`	Auto-converted to mzML via ThermoRawFileParser
Bruker timsTOF	`.d`	Converted when `--convert_dotd true`

Compressed or archived variants of all formats are accepted:

Archive / compression	Extension
gzip	`.gz`
tar	`.tar`
tar + gzip	`.tar.gz`, `.tgz`
zip	`.zip`

Files are decompressed automatically before processing. The SDRF comment[file uri] column accepts both local (file:///) and remote (s3://, gs://, https://) URIs.
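
The two tables above combine naturally: a file name may carry a format extension followed by a compression extension. The sketch below shows one way to separate them. The function name `detect_spectra_format` is hypothetical, case-insensitive matching is an assumption, and this is not the detection logic quantms uses internally.

```python
# Longest suffixes first so that ".tar.gz" is matched before ".gz".
COMPRESSION_SUFFIXES = (".tar.gz", ".tgz", ".gz", ".tar", ".zip")
FORMAT_SUFFIXES = {
    ".mzml": "mzML",
    ".raw": "Thermo RAW",
    ".d": "Bruker timsTOF",
}


def detect_spectra_format(file_name):
    """Return (format, compression_suffix); either element may be None."""
    lowered = file_name.lower()
    compression = None
    for suffix in COMPRESSION_SUFFIXES:
        if lowered.endswith(suffix):
            compression = suffix
            lowered = lowered[: -len(suffix)]  # peel off the outer layer
            break
    for suffix, spectra_format in FORMAT_SUFFIXES.items():
        if lowered.endswith(suffix):
            return spectra_format, compression
    return None, compression


for name in ("run1.mzML", "run1.raw.gz", "sample.d.tar.gz", "notes.txt"):
    print(name, "->", detect_spectra_format(name))
```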

QPX Parquet Format

When running the identification-only subworkflow, quantms writes PSM results in the QPX parquet format. QPX is a columnar binary format designed for efficient storage and querying of large-scale proteomics datasets.

QPX files are stored under psm_tables/ in the output directory. They can be loaded with:

Python / pandas: pd.read_parquet("psm_tables/psms.parquet")
DuckDB: SELECT * FROM read_parquet('psm_tables/*.parquet')
qpx CLI: qpx convert --input psms.parquet --output out.mzTab

The QPX format is the native input format for mokume and qpx. For full column definitions and the QPX specification, see the quantms.io documentation.

Protein FASTA

Standard FASTA format. Decoy sequences are generated automatically by the pipeline unless --add_decoys false is set. Remove stop-codon characters (*) before use, as different search engines handle them differently.

--database /path/to/proteome.fasta
--add_decoys true     # default
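
The stop-codon cleanup mentioned above can be done with a short script before passing the database to the pipeline. The sketch below removes `*` from sequence lines and leaves header lines untouched; the function name `strip_stop_codons` is hypothetical and the records are toy data.

```python
def strip_stop_codons(fasta_lines):
    """Yield FASTA lines with '*' removed from sequence lines only."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line  # header lines are passed through unchanged
        else:
            yield line.replace("*", "")


# Toy records for illustration only.
toy_fasta = [
    ">sp|P12345|TEST_HUMAN Example protein\n",
    "MKTAYIAKQR*\n",
    ">sp|P67890|DEMO_HUMAN Another example\n",
    "MSE*QNNTEMTFQ\n",
]
print("".join(strip_stop_codons(toy_fasta)), end="")
```

To clean a file on disk, iterate over the open input file and write each yielded line to a new output file, keeping the original database unchanged.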