gene_mapping — `GeneMapper`¶

Maps query Ensembl gene IDs (ENSNFUG...) to Atlas NCBI gene names.

File: src/gene_mapping.py
Test: unittests/test_gene_mapping.py

When mapping is needed¶

| Dataset source | Gene ID format | Mapping needed |
| --- | --- | --- |
| Ensembl pipeline | `ENSNFUG00000001234` | Yes, via `GeneMapper.convert()` |
| `raw_RNAseq_process` output (NCBI GTF) | `actb`, `LOC107XXXXXX` | No (Atlas names are used directly) |

`GeneMapper` is invoked only when the input index contains ENSNFUG IDs. Datasets that already use Atlas NCBI gene names (gene symbols and `LOC107XXXXXX` identifiers) skip mapping entirely and are intersected with the Atlas gene set by name.
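The dispatch described above can be sketched as follows. This is a minimal illustration, not the project's code: `needs_mapping` and `intersect_with_atlas` are hypothetical helper names, and the check assumes that an ID counts as Ensembl whenever it starts with the `ENSNFUG` prefix.

```python
from typing import Iterable, List, Set

# Prefix of N. furzeri Ensembl gene IDs, as shown in the table above.
ENSEMBL_PREFIX = "ENSNFUG"


def needs_mapping(gene_ids: Iterable[str]) -> bool:
    """Return True if any ID in the index looks like an ENSNFUG ID (assumed rule)."""
    return any(str(g).startswith(ENSEMBL_PREFIX) for g in gene_ids)


def intersect_with_atlas(gene_ids: Iterable[str], atlas_genes: Set[str]) -> List[str]:
    """Keep only genes whose name is already present in the Atlas gene set."""
    return [g for g in gene_ids if g in atlas_genes]


if __name__ == "__main__":
    atlas = {"actb", "mb21d1", "LOC107374091"}
    index = ["actb", "LOC107374091", "unknown_gene"]
    if needs_mapping(index):
        print("Ensembl IDs detected: run GeneMapper.convert()")
    else:
        print(intersect_with_atlas(index, atlas))
```

Any iterable of strings works here, so a row index from a count matrix could be passed in directly.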

The mapping problem (ENSNFUG datasets)¶

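The KillifishAtlas uses NCBI gene names (e.g. `actb`, `mb21d1`, `LOC107374091`). Query datasets produced from newer NCBI genome assemblies use Ensembl IDs (e.g. `ENSNFUG00000001234`). Direct name matching alone (layer 1 below) covers only 7,775 of 23,991 query genes (about 32%), and even with all three layers about 48% of query genes remain unmapped.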

Three-layer mapping strategy¶

| Layer | Source | Genes mapped |
| --- | --- | --- |
| 1: Direct | lowercase `gene_name` → Atlas gene name | 7,775 genes |
| 2: BioMart | Ensembl `external_gene_name` for unmatched genes | +393 genes |
| 3: LOC107 via NCBI gene2ensembl | ENSNFUG → GeneID 107XXXXXX → `LOC107XXXXXX` | +4,685 genes |

Total: 12,482 / 23,991 query genes mapped (52%)
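The layered fallback can be sketched roughly as below. This is an illustration under assumptions rather than the actual `GeneMapper` logic: `map_to_atlas` and its parameters are hypothetical, the lookups are plain in-memory dicts and sets, and lowercasing the BioMart name in layer 2 is an assumption (the table only states lowercasing for layer 1).

```python
from typing import Dict, Optional, Set


def map_to_atlas(
    ensembl_id: str,
    gtf_name: Optional[str],          # gene_name from the GTF, if any
    biomart_name: Optional[str],      # BioMart external_gene_name, if any
    ensembl_to_ncbi: Dict[str, str],  # ENSNFUG ID -> NCBI GeneID (gene2ensembl)
    atlas_genes: Set[str],
) -> Optional[str]:
    """Try each layer in order and return the first Atlas name found, else None."""
    # Layer 1: lowercase gene_name matched directly against Atlas names.
    if gtf_name and gtf_name.lower() in atlas_genes:
        return gtf_name.lower()
    # Layer 2: BioMart external_gene_name (lowercasing is an assumption).
    if biomart_name and biomart_name.lower() in atlas_genes:
        return biomart_name.lower()
    # Layer 3: ENSNFUG -> NCBI GeneID -> "LOC" + GeneID.
    gene_id = ensembl_to_ncbi.get(ensembl_id)
    if gene_id is not None:
        loc_name = f"LOC{gene_id}"
        if loc_name in atlas_genes:
            return loc_name
    return None  # unmapped; such genes are dropped later


if __name__ == "__main__":
    atlas = {"actb", "mb21d1", "LOC107374091"}
    g2e = {"ENSNFUG00000000003": "107374091"}
    print(map_to_atlas("ENSNFUG00000000003", None, None, g2e, atlas))
```

Ordering the layers from cheapest to most indirect means a gene resolved by name never reaches the GeneID lookup.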

Methods¶

`convert(counts)`¶

Maps the ENSNFUG row index to Atlas gene names. Drops unmapped genes and deduplicates any many-to-one mappings.

```python
from src.gene_mapping import GeneMapper

# ensembl_counts: count matrix whose row index holds ENSNFUG IDs
mapper = GeneMapper()
atlas_counts = mapper.convert(ensembl_counts)
```
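To make the "drop unmapped, deduplicate many-to-one" behaviour concrete, here is a small stand-alone sketch. It is not the implementation of `convert()`: `convert_counts` is a hypothetical function, counts are plain dicts instead of a count matrix, and summing the rows that collapse onto the same Atlas name is an assumption (the real method may resolve duplicates differently, for example by keeping one row).

```python
from typing import Dict, List


def convert_counts(
    counts: Dict[str, List[int]],  # ENSNFUG ID -> per-sample counts
    id_map: Dict[str, str],        # ENSNFUG ID -> Atlas gene name
) -> Dict[str, List[int]]:
    """Re-key counts by Atlas name, dropping unmapped genes."""
    out: Dict[str, List[int]] = {}
    for ensembl_id, row in counts.items():
        atlas_name = id_map.get(ensembl_id)
        if atlas_name is None:
            continue  # unmapped gene: dropped
        if atlas_name in out:
            # Many-to-one mapping: rows are summed here (assumed policy).
            out[atlas_name] = [a + b for a, b in zip(out[atlas_name], row)]
        else:
            out[atlas_name] = list(row)
    return out


if __name__ == "__main__":
    counts = {"E1": [1, 2], "E2": [3, 4], "E3": [5, 6]}
    id_map = {"E1": "actb", "E2": "actb"}  # E3 has no mapping
    print(convert_counts(counts, id_map))
```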

`build_and_save(...)`¶

Rebuilds the mapping table from scratch. Requires internet access for BioMart queries. Saves to data/gene_id_map.csv.

Not needed for normal use — the pre-built CSV is included in the repo.

Pre-built mapping files¶

Located in data/:

| File | Description |
| --- | --- |
| `gene_id_map.csv` | Full unified mapping table (GTF + NCBI gene2ensembl + Atlas names) |
| `query_to_atlas_gene_mapping.csv` | Ensembl ID → Atlas gene name (used by `GeneMapper`) |
| `ncbi_gene2ensembl_nfurzeri.csv` | NCBI gene2ensembl mapping for N. furzeri |

See Gene ID Mapping for details on how these files were built.
