run_query_clocks.py

General-purpose CLI that applies all three aging clocks (BayesAge2, PCR, and elastic net) to any genes × samples count matrix.

Input format

A CSV or TSV where:

Rows are genes (Ensembl ENSNFUG... IDs or Atlas gene names)
Columns are samples named TISSUE_repN (e.g. Liver_rep1, Muscle_Rep3, SpinalCord_rep2)

Column names are parsed to extract the tissue label automatically. An optional --metadata CSV can override tissue assignments and provide age_days.
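As a rough illustration of the TISSUE_repN convention, the sketch below splits a column name into a tissue label and a replicate number. The regular expression and the `parse_sample_name` helper are assumptions made for this example, not the script's actual parser; in particular, case-insensitive matching is assumed so that `Muscle_Rep3` is accepted.

```python
import re

# Assumed pattern: everything before a trailing "_rep<N>" is the tissue label.
# IGNORECASE is an assumption so that both "_rep1" and "_Rep3" match.
SAMPLE_PATTERN = re.compile(r"^(?P<tissue>.+)_rep(?P<rep>\d+)$", re.IGNORECASE)


def parse_sample_name(column):
    """Return (tissue, replicate) for a TISSUE_repN column name, or None."""
    match = SAMPLE_PATTERN.match(column)
    if match is None:
        return None
    return match.group("tissue"), int(match.group("rep"))


for name in ["Liver_rep1", "Muscle_Rep3", "SpinalCord_rep2", "Old_wt_fed_fat_1"]:
    print(name, "->", parse_sample_name(name))
```

Names that do not fit the convention come back as `None` in this sketch; such samples are the kind that would need a `--metadata` file to supply the tissue explicitly.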

Usage

Scripts are in scripts/. Run from the repo root:

# Minimal — all clocks, auto-detect tissues, all available genes:
python scripts/run_query_clocks.py --counts query_data/toy.csv

# Select tissues and clocks, skip batch correction:
python scripts/run_query_clocks.py --counts my_counts.csv \
    --tissues Liver Muscle \
    --clocks bayesage2 pcr \
    --no-batch-correct

# Provide metadata explicitly:
python scripts/run_query_clocks.py --counts my_counts.csv \
    --metadata my_meta.csv   # columns: sample_id, tissue [, age_days]

# Atlas gene names already (skip Ensembl mapping):
python scripts/run_query_clocks.py --counts my_counts.csv \
    --gene-id-type atlas

# Custom output directory:
python scripts/run_query_clocks.py --counts my_counts.csv --out-dir results/
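For the `--metadata` option, a minimal sketch of how a file with the columns `sample_id`, `tissue`, and `age_days` could be generated is shown here. The sample names and ages are placeholder values invented for the example, and `metadata_csv` is a hypothetical helper; only the column names come from the usage comment above.

```python
import csv
import io

# Placeholder rows for illustration only; the ages are not real data.
rows = [
    {"sample_id": "Liver_rep1", "tissue": "Liver", "age_days": 35},
    {"sample_id": "Liver_rep2", "tissue": "Liver", "age_days": 98},
    {"sample_id": "Muscle_rep1", "tissue": "Muscle", "age_days": 35},
]


def metadata_csv(rows):
    """Serialize sample rows to CSV text with the expected column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["sample_id", "tissue", "age_days"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = metadata_csv(rows)
print(text)
# To use it with the CLI, write the text to a file such as my_meta.csv:
# with open("my_meta.csv", "w") as handle:
#     handle.write(text)
```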

Options

Flag	Default	Description
`--counts`	(required)	genes × samples CSV or TSV
`--metadata`	—	optional sample metadata CSV with columns `sample_id`, `tissue`, and optionally `age_days`
`--tissues`	auto-detected	Atlas tissue labels to include
`--clocks`	`bayesage2 pcr en`	subset of clocks to run
`--gene-id-type`	`auto`	`ensembl` (ENSNFUG IDs) / `atlas` (gene names) / `auto`
`--no-batch-correct`	off	skip ComBat-seq batch correction
`--m-values`	`25..200 step 5`	BayesAge2 gene-set sizes
`--n-components`	`5 10 15 20`	PCR components tested via LOSO-CV
`--top-n-var`	all genes	pre-filter to top-N most variable genes
`--out-dir`	`outputs/`	output base directory
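To make two of the defaults in the table concrete, the sketch below expands the `--m-values` grid and shows one way a top-N variance pre-filter could work. Both helpers are hypothetical: the grid is assumed to include both endpoints, and the filter is assumed to rank genes by population variance of the raw values, which may differ from what the script does.

```python
from statistics import pvariance


def default_m_values(start=25, stop=200, step=5):
    """Expand '25..200 step 5', assuming both endpoints are included."""
    return list(range(start, stop + 1, step))


def top_n_variable(counts, n):
    """Return the n gene names with the highest variance across samples.

    counts maps a gene name to its list of per-sample values.
    """
    ranked = sorted(counts, key=lambda gene: pvariance(counts[gene]), reverse=True)
    return ranked[:n]


m_values = default_m_values()
print(len(m_values), m_values[:3], m_values[-1])  # 36 [25, 30, 35] 200

# Toy matrix with made-up gene names and values.
toy = {"geneA": [1, 1, 1], "geneB": [0, 10, 20], "geneC": [4, 5, 6]}
print(top_n_variable(toy, 2))  # ['geneB', 'geneC']
```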

Outputs

outputs/
  bayesage2/{tissue}_BayesAge2_predictions.csv
  bayesage2/{tissue}_BayesAge2_feature_importance.csv
  bayesage2/references/{tissue}_reference.tsv
  pcr/{tissue}_PCR_predictions.csv
  pcr/{tissue}_PCR_cv_metrics.csv
  pcr/{tissue}_PCR_feature_importance.csv
  pcr/{tissue}_PCR_loadings.tsv
  elastic_net/{tissue}_EN_predictions.csv
  elastic_net/{tissue}_EN_feature_importance.csv
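The layout above can be turned into a quick check for which files a run should have produced. This is a sketch only: `expected_outputs` is a hypothetical helper, and the mapping from the `en` clock name to the `elastic_net/` directory is inferred from the listing rather than confirmed.

```python
from pathlib import Path

# Filename templates copied from the output listing, keyed by --clocks name.
# Assumption: the "en" clock writes to the elastic_net/ directory.
OUTPUT_TEMPLATES = {
    "bayesage2": [
        "bayesage2/{tissue}_BayesAge2_predictions.csv",
        "bayesage2/{tissue}_BayesAge2_feature_importance.csv",
        "bayesage2/references/{tissue}_reference.tsv",
    ],
    "pcr": [
        "pcr/{tissue}_PCR_predictions.csv",
        "pcr/{tissue}_PCR_cv_metrics.csv",
        "pcr/{tissue}_PCR_feature_importance.csv",
        "pcr/{tissue}_PCR_loadings.tsv",
    ],
    "en": [
        "elastic_net/{tissue}_EN_predictions.csv",
        "elastic_net/{tissue}_EN_feature_importance.csv",
    ],
}


def expected_outputs(tissue, clocks=("bayesage2", "pcr", "en"), out_dir="outputs"):
    """List the output paths expected for one tissue and a set of clocks."""
    return [
        Path(out_dir) / template.format(tissue=tissue)
        for clock in clocks
        for template in OUTPUT_TEMPLATES[clock]
    ]


paths = expected_outputs("Liver")
missing = [path for path in paths if not path.exists()]
print(f"{len(missing)} of {len(paths)} expected Liver outputs are missing")
```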

Gene ID handling

run_query_clocks.py supports two gene ID formats:

Gene ID type	Example	What happens
`ensembl`	`ENSNFUG00015001234`	`GeneMapper.convert()` renames rows to Atlas names; unmapped genes are dropped
`atlas`	`actb`, `LOC107374091`	No mapping; genes are intersected with the Atlas directly by name

When --gene-id-type is auto (the default), the script inspects the first 30 entries of the row index:

Any ENSNFUG-prefixed entry → ensembl mode
Otherwise → atlas mode

Pass --gene-id-type ensembl or --gene-id-type atlas to force one mode.
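The auto-detection rule described above can be sketched in a few lines. `detect_gene_id_type` is a hypothetical stand-in for whatever the script does internally, and the case-sensitive prefix check is an assumption.

```python
def detect_gene_id_type(gene_ids, n_inspect=30):
    """Return 'ensembl' if any of the first n_inspect IDs starts with ENSNFUG."""
    head = list(gene_ids)[:n_inspect]
    # Assumption: the prefix match is case-sensitive.
    if any(str(gene_id).startswith("ENSNFUG") for gene_id in head):
        return "ensembl"
    return "atlas"


print(detect_gene_id_type(["ENSNFUG00015001234", "actb"]))
print(detect_gene_id_type(["actb", "LOC107374091"]))
```

Because only the first 30 entries are inspected, a matrix whose Ensembl IDs appear later in the index would be treated as `atlas` under this rule, which is one reason to pass the mode explicitly for mixed or unusually ordered inputs.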

When to use `--gene-id-type atlas`

Datasets produced by raw_RNAseq_process/run_rnaseq.sh (NCBI GTF pipeline) already use Atlas-compatible NCBI gene names. Pass --gene-id-type atlas (or rely on auto-detection); GeneMapper is then skipped and genes are matched by name.

Note: LOC139XXXXXX genes produced by the NfurGRZ-RIMD1 GTF have 0% Atlas coverage because they come from a newer NCBI annotation that is not present in the Atlas. They are silently dropped at the intersection step. See Gene ID Mapping for coverage details.
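Since genes absent from the Atlas are dropped without a warning, it can help to measure the overlap before running the clocks. The sketch below assumes you can obtain the Atlas gene names as a plain collection; `atlas_coverage` is a hypothetical helper, and the `LOC139...` IDs in the toy input are made up for illustration.

```python
def atlas_coverage(query_genes, atlas_genes):
    """Return (fraction of query genes found in the Atlas, sorted dropped genes)."""
    query = set(query_genes)
    shared = query & set(atlas_genes)
    dropped = sorted(query - shared)
    coverage = len(shared) / len(query) if query else 0.0
    return coverage, dropped


# Toy inputs: the two LOC139 IDs are invented placeholders.
atlas = {"actb", "LOC107374091"}
query = ["actb", "LOC107374091", "LOC139000001", "LOC139000002"]

coverage, dropped = atlas_coverage(query, atlas)
print(f"coverage: {coverage:.0%}")
print("would be dropped:", dropped)
```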

PRJNA817434

PRJNA817434 uses Atlas-style gene IDs but its column names (Old_wt_fed_fat_1) do not follow the TISSUE_repN convention expected by run_query_clocks.py. Use the dedicated wrapper instead:

python scripts/run_PRJNA817434_clocks.py

Prerequisites

Run this once before using the PCR or EN (elastic net) clocks:

python scripts/normalize_reference.py