preprocessing — `Preprocessor`

Gene filtering, sample stratification, and outlier removal before clock training.

File: src/preprocessing.py
Test: unittests/test_preprocessing.py

Methods

`filter_genes(counts, min_count=1)`

Removes genes whose total count across all samples is below min_count. With the default min_count=1, only genes with a total count of zero (absent from every sample) are removed.

from src.preprocessing import Preprocessor

pp = Preprocessor()
filtered = pp.filter_genes(raw_counts)
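
The filtering rule can be sketched with pandas. This is a minimal illustration, not the code in src/preprocessing.py. It assumes counts is a DataFrame with samples as rows and genes as columns, and that a gene is kept when its total is at least min_count, which matches the stated default of dropping only all-zero genes. The name filter_genes_sketch and the toy data are hypothetical.

```python
import pandas as pd


def filter_genes_sketch(counts: pd.DataFrame, min_count: int = 1) -> pd.DataFrame:
    """Keep genes whose total count across all samples is at least min_count.

    Assumption (not confirmed by the source): rows are samples, columns are genes.
    """
    totals = counts.sum(axis=0)      # one total per gene (column)
    keep = totals >= min_count       # boolean mask over genes
    return counts.loc[:, keep]


# Toy matrix: GeneB is absent from every sample, GeneC has a single read.
raw_counts = pd.DataFrame(
    {"GeneA": [5, 3, 0], "GeneB": [0, 0, 0], "GeneC": [0, 1, 0]},
    index=["s1", "s2", "s3"],
)

filtered = filter_genes_sketch(raw_counts)
print(list(filtered.columns))  # ['GeneA', 'GeneC']
```

Under these assumptions, raising min_count to 2 would also drop GeneC, since its total is 1.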

`stratify(counts, metadata, tissue, sex=None)`

Subsets the count matrix and metadata to a specific tissue, and optionally to a single sex.

# All samples for Liver, both sexes (sex-combined mode used in this repo)
liver_counts, liver_meta = pp.stratify(counts, meta, tissue="Liver", sex=None)

# Female Liver samples only
liver_f_counts, liver_f_meta = pp.stratify(counts, meta, tissue="Liver", sex="F")
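
The subsetting logic can be approximated with boolean masks on the metadata. The sketch below is an illustration under stated assumptions, not the repository's implementation: metadata is assumed to be indexed by sample ID with columns named "tissue" and "sex", and counts is assumed to have samples as rows. The function name stratify_sketch, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def stratify_sketch(counts, metadata, tissue, sex=None,
                    tissue_col="tissue", sex_col="sex"):
    """Subset counts and metadata to one tissue and, optionally, one sex.

    Assumptions (not confirmed by the source): metadata is indexed by
    sample ID, counts has samples as rows, and the column names are
    "tissue" and "sex".
    """
    mask = metadata[tissue_col] == tissue
    if sex is not None:
        mask = mask & (metadata[sex_col] == sex)
    sample_ids = metadata.index[mask]
    # Index both tables with the same IDs so they stay aligned.
    return counts.loc[sample_ids], metadata.loc[sample_ids]


meta = pd.DataFrame(
    {"tissue": ["Liver", "Liver", "Kidney", "Liver"],
     "sex": ["F", "M", "F", "F"]},
    index=["s1", "s2", "s3", "s4"],
)
counts = pd.DataFrame(
    {"GeneA": [5, 3, 0, 2], "GeneB": [1, 0, 4, 7]},
    index=meta.index,
)

liver_counts, liver_meta = stratify_sketch(counts, meta, tissue="Liver")
liver_f_counts, liver_f_meta = stratify_sketch(counts, meta, tissue="Liver", sex="F")
print(list(liver_counts.index))    # ['s1', 's2', 's4']
print(list(liver_f_counts.index))  # ['s1', 's4']
```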

`detect_outliers(counts, n_sd=2.0)`

PCA-based outlier detection. Computes PC1 scores across all samples and flags any sample whose PC1 score deviates more than n_sd standard deviations from the mean.

Outlier detection is applied before BayesAge 2.0 and PCR. It is not applied before Elastic Net.

Returns a 2-tuple (counts_clean, outlier_ids). Metadata is not modified by this method; the caller must filter it separately using the returned outlier_ids.

clean_counts, outlier_ids = pp.detect_outliers(counts, n_sd=2.0)
clean_meta = meta.drop(index=outlier_ids)
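
The PC1 rule described above can be sketched with NumPy's SVD. This is a rough illustration rather than the actual method body. It assumes samples are rows, that genes are mean-centered but not otherwise scaled or transformed before PCA, and that the standard deviation is the population one (ddof=0). None of these details is stated in the source. The name detect_outliers_sketch and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def detect_outliers_sketch(counts: pd.DataFrame, n_sd: float = 2.0):
    """Flag samples whose PC1 score lies more than n_sd SDs from the mean.

    Assumptions (not confirmed by the source): rows are samples, genes are
    mean-centered only, and the SD is computed with ddof=0.
    """
    X = counts.to_numpy(dtype=float)
    X = X - X.mean(axis=0)                      # center each gene
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    pc1 = U[:, 0] * S[0]                        # PC1 score per sample
    sd = pc1.std()
    if sd == 0:                                 # no variation, nothing to flag
        return counts, []
    is_outlier = np.abs(pc1 - pc1.mean()) > n_sd * sd
    outlier_ids = counts.index[is_outlier].tolist()
    return counts.loc[~is_outlier], outlier_ids


# Ten samples with small variation; s9 is shifted far from the rest.
values = 100.0 + (np.arange(50).reshape(10, 5) % 3)
values[9] += 50
counts = pd.DataFrame(values, index=[f"s{i}" for i in range(10)])

clean_counts, outlier_ids = detect_outliers_sketch(counts, n_sd=2.0)
print(outlier_ids)         # ['s9']
print(clean_counts.shape)  # (9, 5)
```

The absolute deviation is used because the sign of a principal component is arbitrary, so an outlier may sit at either end of PC1.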

Order of operations

raw_counts
    └─► filter_genes()         # remove zero-count genes
    └─► stratify()             # subset to tissue (and sex)
    └─► detect_outliers()      # PCA-based outlier removal
    └─► normalize()            # FrequencyNormalizer or DESeq2Normalizer
    └─► clock.build_reference() / clock.fit()
