File Formats¶
This page describes the file formats used as input to and output from quantms.
SDRF — Sample and Data Relationship Format¶
The SDRF-Proteomics format is the recommended input for quantms. It is a tab-delimited file that encodes both sample metadata and search parameters in a single table, making experiments fully self-describing and reproducible.
Column Conventions¶
- Column names follow a controlled vocabulary: source name, comment[...], characteristics[...], factor value[...]
- One row per raw file (or per channel for isobaric experiments)
- File URIs in comment[file uri] support local paths (file:///) and remote URIs (s3://, https://)
Required Columns¶
| Column | Description | Example |
|---|---|---|
| source name | Unique sample identifier | ctrl_1 |
| comment[file uri] | URI of the spectra file | file:///data/run1.mzML |
| comment[label] | Labeling type and channel | label free sample or TMT10plex-126 |
| comment[fraction identifier] | Fraction number for fractionated samples | 1 |
| comment[data file] | Raw file name | run1.raw |
| characteristics[organism] | Species | Homo sapiens |
Common Optional Columns¶
| Column | Description |
|---|---|
| comment[cleavage agent details] | Enzyme (PSI-MS CV term, e.g. MS:1001251 for Trypsin) |
| comment[modification parameters] | Fixed and variable modifications (PSI-MS CV terms) |
| comment[precursor mass tolerance] | e.g. 10 ppm |
| comment[fragment mass tolerance] | e.g. 0.02 Da |
| comment[dissociation method] | HCD, CID, ETD |
| factor value[disease] | Experimental condition for DE analysis |
| characteristics[pooled sample] | Marks a reference channel in TMT experiments |
Example SDRF Fragment¶
| source name | comment[file uri] | comment[label] | comment[fraction identifier] | characteristics[organism] | factor value[disease] |
|---|---|---|---|---|---|
| ctrl_1 | file:///data/ctrl_1.mzML | label free sample | 1 | Homo sapiens | healthy |
| ctrl_2 | file:///data/ctrl_2.mzML | label free sample | 1 | Homo sapiens | healthy |
| case_1 | file:///data/case_1.mzML | label free sample | 1 | Homo sapiens | NASH |
| case_2 | file:///data/case_2.mzML | label free sample | 1 | Homo sapiens | NASH |
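A quick pre-flight check on an SDRF file can catch missing columns before a run. The following is a minimal sketch using only Python's standard library; the function name missing_sdrf_columns and the inline sample are illustrative assumptions and not part of quantms. The list of required columns is copied from the Required Columns table.

```python
import csv
import io

# Required columns as listed in the "Required Columns" table.
REQUIRED_COLUMNS = [
    "source name",
    "comment[file uri]",
    "comment[label]",
    "comment[fraction identifier]",
    "comment[data file]",
    "characteristics[organism]",
]


def missing_sdrf_columns(sdrf_text):
    """Return the required columns absent from the SDRF header (case-insensitive)."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = [name.strip().lower() for name in next(reader)]
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Hypothetical two-row SDRF used only for illustration (tab-separated).
sample = (
    "source name\tcomment[file uri]\tcomment[label]\t"
    "comment[fraction identifier]\tcharacteristics[organism]\tfactor value[disease]\n"
    "ctrl_1\tfile:///data/ctrl_1.mzML\tlabel free sample\t1\tHomo sapiens\thealthy\n"
    "case_1\tfile:///data/case_1.mzML\tlabel free sample\t1\tHomo sapiens\tNASH\n"
)

print(missing_sdrf_columns(sample))  # ['comment[data file]']
```

The sample deliberately omits comment[data file], so the check reports that single column.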
SDRF Resources¶
mzTab — Primary Result Format¶
mzTab is the HUPO-PSI standard for reporting mass spectrometry-based proteomics and metabolomics results. quantms produces mzTab as the primary output for both LFQ and TMT/iTRAQ workflows.
mzTab is a multi-section tab-delimited text file. Each line begins with a row type prefix:
| Prefix | Section |
|---|---|
| MTD | Metadata (software versions, study variables, assay definitions) |
| PRH | Protein table header |
| PRT | Protein row |
| PEH | Peptide table header |
| PEP | Peptide row |
| PSH | PSM table header |
| PSM | PSM row |
The protein section contains per-study-variable abundance values. The PSM section contains one row per identified spectrum with scores, FDR, and modified sequences.
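Because every line carries its row type as the first tab-separated field, an mzTab file can be split into sections with a few lines of code. The sketch below is illustrative only: split_mztab_sections is a hypothetical helper, and the example content is a heavily truncated stand-in, not real quantms output or a complete set of mzTab columns.

```python
from collections import defaultdict


def split_mztab_sections(lines):
    """Group mzTab lines by their row-type prefix (MTD, PRH, PRT, ...)."""
    sections = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sections
        prefix, _, rest = line.partition("\t")
        sections[prefix].append(rest.split("\t") if rest else [])
    return dict(sections)


# Hypothetical, heavily truncated mzTab content for illustration only.
example = [
    "MTD\tmzTab-version\t1.0.0",
    "",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tExample protein",
    "",
    "PSH\tsequence\tcharge",
    "PSM\tPEPTIDEK\t2",
    "PSM\tSAMPLER\t3",
]

sections = split_mztab_sections(example)
print({prefix: len(rows) for prefix, rows in sections.items()})
# {'MTD': 1, 'PRH': 1, 'PRT': 1, 'PSH': 1, 'PSM': 2}
```

Pairing each header row (PRH, PEH, PSH) with its data rows (PRT, PEP, PSM) then gives one table per section.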
mzTab files produced by quantms are suitable for:
- Submission to PRIDE (EBI ProteomeXchange repository)
- Direct loading into mokume for protein quantification
- Conversion to other formats with qpx
MSstats CSV — Differential Expression Input¶
The MSstats CSV (also called msstats_in.csv) is a long-format table compatible with the OpenMStoMSstatsFormat() function in the MSstats R package.
Column Structure¶
| Column | Description |
|---|---|
| ProteinName | Protein accession |
| PeptideSequence | Peptide sequence (with modifications) |
| PrecursorCharge | Precursor charge state |
| FragmentIon | Fragment ion identifier (LFQ) or channel (TMT) |
| ProductCharge | Product ion charge |
| IsotopeLabelType | L (light/LFQ) or H (heavy/reference) |
| Condition | Experimental condition |
| BioReplicate | Biological replicate identifier |
| Run | Raw file or run identifier |
| Intensity | Measured intensity (MS1 feature or reporter ion) |
This format is exported by quantms as input for downstream statistical analysis with the MSstats R package. quantms does not run MSstats — it produces the input file.
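Since the table is in long format (one row per measured intensity), a quick sanity check outside R is straightforward. The sketch below verifies the header against the column list above and sums intensities per protein and condition. The helper name intensity_by_condition and the sample rows are made up for illustration, and the code assumes a comma-separated file in which missing intensities appear as empty fields or NA.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
    "ProductCharge", "IsotopeLabelType", "Condition", "BioReplicate",
    "Run", "Intensity",
]


def intensity_by_condition(csv_text):
    """Sum Intensity per (ProteinName, Condition) after checking the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    totals = defaultdict(float)
    for row in reader:
        value = row["Intensity"].strip()
        if value in ("", "NA"):
            continue  # assumption: missing values are empty or NA
        totals[(row["ProteinName"], row["Condition"])] += float(value)
    return dict(totals)


# Hypothetical rows for illustration only.
sample_csv = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "P12345,PEPTIDEK,2,NA,0,L,healthy,1,ctrl_1,1000\n"
    "P12345,SAMPLER,3,NA,0,L,healthy,1,ctrl_1,2000\n"
    "P12345,PEPTIDEK,2,NA,0,L,NASH,3,case_1,500\n"
    "P12345,SAMPLER,3,NA,0,L,NASH,3,case_1,NA\n"
)

print(intensity_by_condition(sample_csv))
# {('P12345', 'healthy'): 3000.0, ('P12345', 'NASH'): 500.0}
```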
Raw File Formats¶
quantms natively supports the following spectra file formats:
| Format | Extension | Notes |
|---|---|---|
| mzML | .mzML | Open standard; preferred input; used directly without conversion |
| Thermo RAW | .raw | Auto-converted to mzML via ThermoRawFileParser |
| Bruker timsTOF | .d | Converted when --convert_dotd true |
Compressed variants of all formats are accepted:
| Compression | Extension |
|---|---|
| gzip | .gz |
| tar | .tar |
| tar + gzip | .tar.gz, .tgz |
| zip | .zip |
Files are decompressed automatically before processing. Remote files are supported via URI in the SDRF comment[file uri] column: file:///, s3://, gs://, https://.
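The extension rules in the two tables above can be expressed as a small lookup. This is a sketch of the naming convention only, not the pipeline's actual detection logic; classify_spectra_file is a hypothetical helper that works purely on the file name or URI string.

```python
# Extension tables transcribed from the format and compression tables.
COMPRESSION_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".gz", ".zip")
FORMAT_SUFFIXES = {".mzml": "mzML", ".raw": "Thermo RAW", ".d": "Bruker timsTOF"}


def classify_spectra_file(name):
    """Return (format_or_None, compression_suffix_or_None) for a file name or URI."""
    lowered = name.lower().rstrip("/")
    compression = None
    for suffix in COMPRESSION_SUFFIXES:  # compound suffixes are checked first
        if lowered.endswith(suffix):
            compression = suffix
            lowered = lowered[: -len(suffix)]
            break
    for suffix, fmt in FORMAT_SUFFIXES.items():
        if lowered.endswith(suffix):
            return fmt, compression
    return None, compression


print(classify_spectra_file("file:///data/run1.mzML"))     # ('mzML', None)
print(classify_spectra_file("s3://bucket/run1.raw.gz"))    # ('Thermo RAW', '.gz')
print(classify_spectra_file("https://host/run2.d.tar.gz")) # ('Bruker timsTOF', '.tar.gz')
```

Checking .tar.gz before .gz keeps a compound suffix from being read as plain gzip.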
QPX Parquet Format¶
When running the identification-only subworkflow, quantms writes PSM results in the QPX parquet format. QPX is a columnar binary format designed for efficient storage and querying of large-scale proteomics datasets.
QPX files are stored under psm_tables/ in the output directory. They can be loaded with:
- Python / pandas: pd.read_parquet("psm_tables/psms.parquet")
- DuckDB: SELECT * FROM read_parquet('psm_tables/*.parquet')
- qpx CLI: qpx convert --input psms.parquet --output out.mzTab
The QPX format is the native input format for mokume and qpx. For full column definitions and the QPX specification, see the quantms.io documentation.
Protein FASTA¶
Standard FASTA format. Decoy sequences are generated automatically by the pipeline unless --add_decoys false is set. Remove stop-codon characters (*) before use, as different search engines handle them differently.
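One way to remove stop-codon characters is a short preprocessing step that leaves header lines untouched. The sketch below is an illustrative helper (strip_stop_codons is not a quantms function), and the record shown is a made-up example.

```python
def strip_stop_codons(fasta_text):
    """Remove '*' from sequence lines while leaving '>' header lines untouched."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
            continue
        stripped = line.replace("*", "")
        if stripped:  # drop lines that contained only '*'
            cleaned.append(stripped)
    return "\n".join(cleaned) + "\n"


# Hypothetical record for illustration only.
record = ">sp|P12345|EXAMPLE_HUMAN Example protein\nMKTAYIAK*QR\nLLG*\n*\n"
print(strip_stop_codons(record), end="")
# >sp|P12345|EXAMPLE_HUMAN Example protein
# MKTAYIAKQR
# LLG
```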