Reference genome
Species: Nothobranchius furzeri (African turquoise killifish)
Strain: GRZ
Assembly: GCF_043380555.1 NfurGRZ-RIMD1
Source: NCBI RefSeq, December 2024
Genome size: ~1.3 Gb
GTF file: data/GCF_043380555.1_NfurGRZ-RIMD1_genomic.gtf
The GTF is also used by raw_RNAseq_process/run_rnaseq.sh to build the STAR index
and quantify gene counts. See Raw RNA-seq Pipeline.
Gene ID Mapping
The KillifishAtlas uses NCBI-style gene names (e.g. actb, mb21d1, LOC107374091).
Query datasets produced from the NfurGRZ-RIMD1 genome assembly (NCBI RefSeq GCF_043380555.1)
use Ensembl IDs (ENSNFUG00000001234). Bridging these two namespaces requires a
multi-source mapping pipeline.
Building the unified map (data/build_gene_map.py)
Script: data/build_gene_map.py
Merges three sources into data/gene_id_map.csv:
Source 1: GCF_043380555.1_NfurGRZ-RIMD1_genomic.gtf
→ gtf_gene_id, gtf_gene_name, ncbi_gene_id
Source 2: ncbi_gene2ensembl_nfurzeri.csv
→ ncbi_gene_id ↔ ensembl_gene_id
Source 3: query_to_atlas_gene_mapping.csv
→ ensembl_gene_id → atlas_gene
Merge strategy:
- Parse gene records from the GTF, extracting gene_id, gene name, and GeneID db_xref.
- Join with gene2ensembl on ncbi_gene_id to add ensembl_gene_id.
- Join with the Atlas mapping on ensembl_gene_id to add atlas_gene.
- Supplement with Ensembl IDs in gene2ensembl not reachable via the GTF.
- Supplement with Atlas entries not reachable via either route.
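The merge steps above can be sketched in plain Python. This is a minimal illustration, not the contents of data/build_gene_map.py: the function names `parse_gene_record` and `build_gene_map`, the one-to-one dictionaries standing in for Sources 2 and 3, and every ID in the demo data are assumptions made for the example. It also assumes the GTF stores the gene symbol in a `gene` attribute and the NCBI GeneID in a `db_xref "GeneID:..."` attribute.

```python
import re

ATTR_RE = re.compile(r'(\w+) "([^"]*)"')
COLUMNS = ("gtf_gene_id", "gtf_gene_name", "ncbi_gene_id",
           "ensembl_gene_id", "atlas_gene")


def parse_gene_record(line):
    """Return the GTF-derived columns for a 'gene' line, or None for any other line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9 or fields[2] != "gene":
        return None
    row = dict.fromkeys(COLUMNS)
    # Attributes are key "value" pairs; db_xref may appear more than once.
    for key, value in ATTR_RE.findall(fields[8]):
        if key == "gene_id":
            row["gtf_gene_id"] = value
        elif key == "gene":
            row["gtf_gene_name"] = value
        elif key == "db_xref" and value.startswith("GeneID:"):
            row["ncbi_gene_id"] = int(value.split(":", 1)[1])
    return row


def build_gene_map(gtf_lines, ncbi_to_ensembl, ensembl_to_atlas):
    """Merge the three sources; the dict arguments are simplified one-to-one maps."""
    rows, seen = [], set()
    # Steps 1-3: GTF gene records, joined to Ensembl IDs, then to Atlas names.
    for line in gtf_lines:
        row = parse_gene_record(line)
        if row is None:
            continue
        ensembl_id = ncbi_to_ensembl.get(row["ncbi_gene_id"])
        row["ensembl_gene_id"] = ensembl_id
        row["atlas_gene"] = ensembl_to_atlas.get(ensembl_id)
        rows.append(row)
        seen.add(ensembl_id)
    # Step 4: Ensembl IDs present in gene2ensembl but not reached via the GTF.
    for ncbi_id, ensembl_id in ncbi_to_ensembl.items():
        if ensembl_id not in seen:
            rows.append({**dict.fromkeys(COLUMNS), "ncbi_gene_id": ncbi_id,
                         "ensembl_gene_id": ensembl_id,
                         "atlas_gene": ensembl_to_atlas.get(ensembl_id)})
            seen.add(ensembl_id)
    # Step 5: Atlas entries not reached by either route.
    for ensembl_id, atlas_gene in ensembl_to_atlas.items():
        if ensembl_id not in seen:
            rows.append({**dict.fromkeys(COLUMNS), "ensembl_gene_id": ensembl_id,
                         "atlas_gene": atlas_gene})
    return rows


# Demo with made-up records (none of these IDs come from the real files).
demo_gtf = [
    "#gtf-version 2.2",
    'chr1\tGnomon\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "gene1234"; db_xref "GeneID:107000001"; gene "actb";',
    'chr1\tGnomon\texon\t100\t300\t.\t+\t.\tgene_id "gene1234";',
]
demo_ncbi_to_ensembl = {107000001: "ENSNFUG00000000001",
                        107000002: "ENSNFUG00000000002"}
demo_ensembl_to_atlas = {"ENSNFUG00000000001": "actb",
                         "ENSNFUG00000000002": "LOC107000002",
                         "ENSNFUG00000000003": "mb21d1"}
gene_map = build_gene_map(demo_gtf, demo_ncbi_to_ensembl, demo_ensembl_to_atlas)
```

In the demo, one row comes from each route: the GTF, the gene2ensembl supplement, and the Atlas supplement. Rows added by the supplements leave the GTF-derived columns empty. Real inputs may contain one-to-many relationships, which the dictionaries here do not model.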
Output columns
| Column | Description |
|---|---|
| gtf_gene_id | RefSeq gene ID from GTF (e.g. gene1234) |
| gtf_gene_name | Gene symbol from GTF (e.g. actb) |
| ncbi_gene_id | NCBI GeneID integer |
| ensembl_gene_id | Ensembl ID (ENSNFUG...) |
| atlas_gene | Atlas gene name used as row index in count matrices |
Three-layer mapping in GeneMapper
src/gene_mapping.py applies three layers in order:
| Layer | Method |
|---|---|
| 1 | Direct lowercase gene_name → Atlas |
| 2 | BioMart external_gene_name fallback |
| 3 | ENSNFUG → GeneID 107XXXXXX → LOC107XXXXXX |
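The layered fallback in the table can be illustrated as follows. This is a hedged sketch, not the GeneMapper code in src/gene_mapping.py: the function `map_to_atlas`, its arguments, and the lookup dictionaries are hypothetical, and lowercasing the BioMart name in layer 2 is an assumption.

```python
def map_to_atlas(ensembl_id, gene_name, atlas_genes, biomart_names, ensembl_to_geneid):
    """Return (atlas_gene, layer) for the first layer that matches, else (None, None).

    atlas_genes:       set of Atlas gene names
    biomart_names:     hypothetical dict, Ensembl ID -> BioMart external_gene_name
    ensembl_to_geneid: hypothetical dict, Ensembl ID -> NCBI GeneID (int)
    """
    # Layer 1: the query's own gene name, lowercased, looked up directly in the Atlas.
    if gene_name and gene_name.lower() in atlas_genes:
        return gene_name.lower(), 1
    # Layer 2: fall back to the BioMart external_gene_name for this Ensembl ID.
    external = biomart_names.get(ensembl_id)
    if external and external.lower() in atlas_genes:
        return external.lower(), 2
    # Layer 3: Ensembl ID -> NCBI GeneID -> LOC<GeneID> symbol.
    gene_id = ensembl_to_geneid.get(ensembl_id)
    if gene_id is not None and f"LOC{gene_id}" in atlas_genes:
        return f"LOC{gene_id}", 3
    return None, None


# Demo with made-up Ensembl IDs and lookups.
atlas = {"actb", "mb21d1", "LOC107374091"}
biomart = {"ENSNFUG00000000002": "MB21D1"}
geneids = {"ENSNFUG00000000003": 107374091}

layer1 = map_to_atlas("ENSNFUG00000000001", "ACTB", atlas, biomart, geneids)
layer2 = map_to_atlas("ENSNFUG00000000002", None, atlas, biomart, geneids)
layer3 = map_to_atlas("ENSNFUG00000000003", None, atlas, biomart, geneids)
unmapped = map_to_atlas("ENSNFUG00000000004", None, atlas, biomart, geneids)
```

Returning the layer number alongside the gene name makes it possible to count how many genes each layer resolves.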
Coverage against the Atlas gene symbols → ENSNFUGxxx
Measured by unittests/test_gene_mapping.py against GSE308970_Counts_Atlas_allbatches_merged_v3.csv
(25,122 Atlas genes × 677 samples; run 2026-05-26). The test asks: if a query experiment reported every Atlas gene by its ENSNFUG ID, how many could GeneMapper successfully translate?
| Gene type | Atlas total | Covered by GeneMapper | Coverage |
|---|---|---|---|
| Named symbols (e.g. actb) | 10,533 | 7,706 | 73.2 % |
| LOC genes (e.g. LOC107374091) | 14,589 | 4,311 | 29.5 % |
| All genes | 25,122 | 12,017 | 47.8 % |
The lower LOC-gene coverage reflects an annotation gap: most LOC107XXXXXX entries
in the Atlas are unannotated loci that lack Ensembl cross-references in both BioMart and
the NCBI gene2ensembl table. The 13,105 uncovered genes (25,122 − 12,017) are dropped before
clock training and have negligible impact on clock accuracy because unannotated loci carry
little age-predictive signal.
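The percentages in the table follow directly from the counts. As a quick arithmetic check, the sketch below recomputes them; the helper `coverage_pct` is a name introduced here for illustration and is not part of unittests/test_gene_mapping.py.

```python
def coverage_pct(covered, total):
    """Coverage as a percentage, rounded to one decimal place."""
    return round(100 * covered / total, 1)


# (covered, total) pairs copied from the table above.
counts = {"named": (7706, 10533), "loc": (4311, 14589)}

covered_all = sum(covered for covered, _ in counts.values())
total_all = sum(total for _, total in counts.values())

named_pct = coverage_pct(*counts["named"])
loc_pct = coverage_pct(*counts["loc"])
all_pct = coverage_pct(covered_all, total_all)
uncovered = total_all - covered_all
```

The two rows sum to the "All genes" row (12,017 of 25,122), and the difference gives the 13,105 genes dropped before training.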
Coverage: raw_RNAseq_process output (PRJNA817434)
Data produced by raw_RNAseq_process/run_rnaseq.sh uses NCBI-style gene names taken directly
from the GTF and contains no ENSNFUG IDs, so GeneMapper does not apply. Coverage is measured
as a direct set intersection with Atlas gene names.
Measured against raw_RNAseq_process/results/PRJNA817434/PRJNA817434_raw_count.csv
(36,530 genes × 9 samples; run 2026-05-26):
| Gene type | PRJNA817434 total | In Atlas | Coverage |
|---|---|---|---|
| Named symbols (e.g. acsl4a) | 17,657 | 7,098 | 40.2 % |
| LOC107 genes (e.g. LOC107374091) | 6,203 | 5,103 | 82.3 % |
| LOC139 genes (e.g. LOC139071432) | 7,031 | 0 | 0.0 % |
| tRNA / KEG92 entries | 5,639 | — | excluded |
| Usable total | 36,530 | 12,249 | 33.5 % |
LOC139XXXXXX genes come from a newer NCBI annotation (GeneID range 139M) that does not
exist in the Atlas (which uses the older 107M range). They are unmappable without rebuilding
the Atlas with the updated GTF.
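The set-intersection measurement for this section can be sketched as below. This is an illustration under assumptions, not the script that produced the table: the functions `classify` and `atlas_overlap` are hypothetical, gene types are inferred from the name prefix only, and the tRNA / KEG92 exclusion is not modeled.

```python
def classify(gene):
    """Bucket a gene name by prefix; anything that is not LOC107/LOC139 counts as named."""
    if gene.startswith("LOC107"):
        return "LOC107"
    if gene.startswith("LOC139"):
        return "LOC139"
    return "named"


def atlas_overlap(query_genes, atlas_genes):
    """Return {gene type: (total in query, found in Atlas)} using exact name matches."""
    atlas = set(atlas_genes)
    report = {}
    for gene in set(query_genes):
        kind = classify(gene)
        total, found = report.get(kind, (0, 0))
        report[kind] = (total + 1, found + (gene in atlas))
    return report


# Demo with a handful of names; the Atlas membership shown here is made up.
query = ["acsl4a", "actb", "LOC107374091", "LOC139071432"]
atlas = {"actb", "mb21d1", "LOC107374091"}
overlap = atlas_overlap(query, atlas)
```

Because matching is by exact name, a LOC139 gene can only match if the Atlas itself contains LOC139 names, which is why that row shows zero coverage.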