Raw RNA-seq Processing Pipeline
Location: raw_RNAseq_process/
Standalone pipeline for total RNA-seq data: GEO/SRA download → fastp QC → STAR alignment → count matrix. Self-contained; runs inside an Apptainer container.
Files
| File | Purpose |
|---|---|
| `TotalRNAseq.def` | Apptainer container definition: fastp, STAR 2.7.11b, samtools 1.22.1, Python + pandas |
| `setup_genome.sh` | Download killifish genome + GTF from NCBI; build STAR index |
| `run_rnaseq.sh` | Pipeline: GEO/SRA download → QC → alignment → count matrix |
Quick start
```bash
cd raw_RNAseq_process/

# 1. Build container (once, requires Apptainer)
apptainer build TotalRNAseq.sif TotalRNAseq.def

# 2. Download genome + build STAR index (once, ~30–60 min, ~30 GB RAM)
./setup_genome.sh -o /path/to/ref

# 3. Run pipeline from a GEO accession
./run_rnaseq.sh -g GSE123456 -i /path/to/ref/star_index

# 4. Or from explicit SRR accessions
./run_rnaseq.sh -i /path/to/ref/star_index SRR12345678 SRR12345679

# 5. Or from a TSV mapping file
./run_rnaseq.sh -m samples.tsv -i /path/to/ref/star_index
```
Pipeline steps
| Step | Tool | Input → Output |
|---|---|---|
| 1 | `prefetch` + `fasterq-dump` | SRR accession → `tmp/<s>_R1.fastq.gz`, `_R2.fastq.gz` |
| 2 | fastp | paired/single FASTQ → trimmed FASTQ + QC JSON/HTML |
| 3 | STAR | trimmed FASTQ → sorted BAM + `ReadsPerGene.out.tab` |
| 3b | samtools | BAM → `.bam.bai` index |
| 4 | Python | all `ReadsPerGene.out.tab` → `count_matrix.csv` |
Trimmed FASTQs are deleted after STAR alignment (use -k to keep).
Steps are idempotent — completed samples are skipped on re-runs.
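The skip-on-rerun behavior could be sketched as follows. This is a minimal illustration, not the logic in `run_rnaseq.sh`: it assumes a sample counts as complete once its `ReadsPerGene.out.tab` and BAM index both exist and are non-empty, and the helper names `sample_is_complete` and `samples_to_process` are hypothetical.

```python
from pathlib import Path


def sample_is_complete(results_dir: Path, sample: str) -> bool:
    """Return True if the assumed end-of-sample outputs exist and are non-empty."""
    prefix = f"STAR_out_{sample}_"
    expected = [
        results_dir / f"{prefix}ReadsPerGene.out.tab",
        results_dir / f"{prefix}Aligned.sortedByCoord.out.bam.bai",
    ]
    return all(p.is_file() and p.stat().st_size > 0 for p in expected)


def samples_to_process(results_dir: Path, samples: list[str]) -> list[str]:
    """Keep only the samples that still need to be run, preserving input order."""
    return [s for s in samples if not sample_is_complete(results_dir, s)]
```

Checking for non-empty files rather than mere existence guards against treating a run that was interrupted mid-write as finished.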
run_rnaseq.sh options
| Flag | Default | Description |
|---|---|---|
| `-g GSE_ID` | — | GEO series ID; fetches SRR list via `esearch`/`efetch` |
| `-m FILE` | — | TSV: `sample_id <tab> SRR` or `sample_id <tab> R1.fastq.gz,R2.fastq.gz` |
| `-i DIR` | required | STAR genome index directory (must contain `SA` file) |
| `-o DIR` | `./results` | Output directory |
| `-s STRAND` | `reverse` | `reverse` / `forward` / `unstranded` |
| `-e END` | `paired` | `paired` / `single` |
| `-t INT` | `8` | Threads |
| `-k` | off | Keep trimmed FASTQs after alignment |
| `--skip-merge` | off | Process samples but skip final count matrix merge (for SLURM arrays) |
| `--merge-only` | off | Skip sample processing; only merge existing tab files |
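To make the `-m FILE` format concrete, here is a rough sketch of how such a mapping file could be parsed. It is not the parser used by `run_rnaseq.sh`; the names `SampleEntry` and `parse_sample_sheet` are hypothetical, and it assumes there is no header row, that blank lines and lines starting with `#` are ignored, and that a second column containing `.fastq` or `.fq` denotes local files rather than an SRA run.

```python
import csv
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleEntry:
    sample_id: str
    srr: Optional[str] = None                        # SRA run accession, if given
    fastqs: list[str] = field(default_factory=list)  # local FASTQ paths, if given


def parse_sample_sheet(handle) -> list[SampleEntry]:
    """Parse tab-separated lines of `sample_id <tab> SRR` or `sample_id <tab> R1,R2`."""
    entries = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and '#' comment lines are skipped.
        if not any(cell.strip() for cell in row) or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"expected two tab-separated columns, got: {row!r}")
        sample_id, source = row[0].strip(), row[1].strip()
        if ".fastq" in source or ".fq" in source:
            # One path means single-end, two comma-separated paths mean paired-end.
            paths = [p.strip() for p in source.split(",") if p.strip()]
            entries.append(SampleEntry(sample_id, fastqs=paths))
        else:
            entries.append(SampleEntry(sample_id, srr=source))
    return entries
```

With a file containing `ctrl_1<tab>SRR12345678` on one line and `old_1<tab>old_1_R1.fastq.gz,old_1_R2.fastq.gz` on the next, the first entry would carry an accession and the second a pair of FASTQ paths.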
Strandedness
| Library type | Flag | STAR column used |
|---|---|---|
| dUTP/TruSeq Stranded | `-s reverse` (default) | column 4 |
| Forward-stranded | `-s forward` | column 3 |
| Unstranded | `-s unstranded` | column 2 |
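When the library type is not known in advance, columns 3 and 4 of a single `ReadsPerGene.out.tab` can be compared to suggest which `-s` value fits. The following is a rough sketch and not part of the pipeline as documented; the function name `suggest_strand_flag` and the 0.2 / 0.8 cut-offs are arbitrary assumptions chosen for illustration.

```python
def suggest_strand_flag(lines, low=0.2, high=0.8):
    """Suggest a -s value from the lines of a ReadsPerGene.out.tab file.

    Columns (1-based): 1 gene ID, 2 unstranded, 3 forward, 4 reverse counts.
    """
    forward = reverse = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue  # skip malformed lines and STAR summary rows
        forward += int(fields[2])
        reverse += int(fields[3])
    total = forward + reverse
    if total == 0:
        raise ValueError("no gene-level counts found")
    reverse_fraction = reverse / total
    if reverse_fraction >= high:
        return "reverse"
    if reverse_fraction <= low:
        return "forward"
    return "unstranded"
```

A dUTP library should put nearly all gene-level counts in column 4, so a result other than `reverse` for such data is worth investigating before trusting the count matrix.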
Outputs
```
results/
  count_matrix.csv                                     ← genes × samples raw count matrix
  qc_reports/<sample>_fastp.html/.json
  STAR_out_<sample>_Aligned.sortedByCoord.out.bam
  STAR_out_<sample>_Aligned.sortedByCoord.out.bam.bai
  STAR_out_<sample>_ReadsPerGene.out.tab
```
The count_matrix.csv rows are gene IDs from the GTF; summary rows prefixed N_
(unmapped, multimapping, noFeature, ambiguous) are excluded automatically.
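The merge step can be pictured as below. This sketch uses only the standard library rather than pandas, and it is not the code shipped in `run_rnaseq.sh`; the function names, the `gene_id` header, the derivation of sample names from the `STAR_out_<sample>_ReadsPerGene.out.tab` pattern, and the zero fill for genes missing from a sample are all assumptions.

```python
import csv
from pathlib import Path

# 0-based column index in ReadsPerGene.out.tab for each -s value
STRAND_COLUMN = {"unstranded": 1, "forward": 2, "reverse": 3}
PREFIX, SUFFIX = "STAR_out_", "_ReadsPerGene.out.tab"


def read_gene_counts(tab_path, strand="reverse"):
    """Read one STAR counts file, dropping the N_ summary rows."""
    column = STRAND_COLUMN[strand]
    counts = {}
    with open(tab_path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < 4 or fields[0].startswith("N_"):
                continue
            counts[fields[0]] = int(fields[column])
    return counts


def merge_counts(results_dir, out_csv, strand="reverse"):
    """Write a genes x samples CSV from every counts file in results_dir."""
    per_sample = {}
    for tab in sorted(Path(results_dir).glob(f"{PREFIX}*{SUFFIX}")):
        sample = tab.name[len(PREFIX):-len(SUFFIX)]
        per_sample[sample] = read_gene_counts(tab, strand)
    samples = list(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id", *samples])
        for gene in genes:
            # Assumption: a gene absent from one sample's file is written as 0.
            writer.writerow([gene, *(per_sample[s].get(gene, 0) for s in samples)])
    return samples, genes
```

A call such as `merge_counts("results", "results/count_matrix.csv", strand="reverse")` would mirror the default `-s reverse` setting.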
Setting up the genome reference (setup_genome.sh)
Downloads and indexes the N. furzeri GRZ reference genome for STAR alignment.
```bash
./setup_genome.sh                      # download + index into ./ref/
./setup_genome.sh -o /data/killifish   # custom output directory
./setup_genome.sh -t 16                # more threads for STAR index build
./setup_genome.sh --skip-download      # index only (files already present)
```
| Flag | Default | Description |
|---|---|---|
| `-o DIR` | `./ref` | Output directory |
| `-t INT` | `8` | Threads for `STAR --runMode genomeGenerate` |
| `--skip-download` | off | Skip `wget`/`gunzip`; go straight to index build |
Reference genome details:
| Property | Value |
|---|---|
| Species | Nothobranchius furzeri (GRZ strain) |
| Assembly | GCF_043380555.1 NfurGRZ-RIMD1 |
| Source | NCBI RefSeq FTP, December 2024 |
| FASTA size | ~1.3 GB |
| STAR index size | ~27 GB |
| `genomeSAindexNbases` | 14 (STAR default; appropriate for a genome > 1 Gb) |
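The value of 14 can be checked against the rule of thumb in the STAR manual, which recommends scaling `--genomeSAindexNbases` to `min(14, log2(GenomeLength)/2 - 1)` for smaller genomes. A small sketch of that arithmetic follows; the function name is hypothetical, and the 1.3 Gb length is taken loosely from the FASTA size above, which only approximates the number of bases.

```python
import math


def recommended_sa_index_nbases(genome_length_bp: int) -> int:
    """STAR manual rule of thumb: min(14, log2(GenomeLength)/2 - 1), rounded down."""
    return min(14, int(math.log2(genome_length_bp) / 2 - 1))


print(recommended_sa_index_nbases(1_300_000_000))  # ~1.3 Gb genome -> 14
print(recommended_sa_index_nbases(100_000_000))    # 100 Mb genome, for comparison -> 12
```

At roughly 1.3 Gb the formula sits just above the cap of 14, so the setting only needs lowering for substantially smaller references.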
Prerequisites
- Apptainer installed (`apptainer build` for container)
- SRA Toolkit (`prefetch`, `fasterq-dump`) for SRA/GEO input
- entrez-direct (`esearch`, `efetch`) for `-g GSE_ID` mode only