Raw RNA-seq Processing Pipeline

Location: raw_RNAseq_process/

Standalone pipeline for total RNA-seq data: GEO/SRA download → fastp QC → STAR alignment → count matrix. Self-contained; runs inside an Apptainer container.

Files

| File | Purpose |
| --- | --- |
| TotalRNAseq.def | Apptainer container definition: fastp, STAR 2.7.11b, samtools 1.22.1, Python + pandas |
| setup_genome.sh | Download killifish genome + GTF from NCBI; build STAR index |
| run_rnaseq.sh | Pipeline: GEO/SRA download → QC → alignment → count matrix |

Quick start

cd raw_RNAseq_process/

# 1. Build container (once, requires Apptainer)
apptainer build TotalRNAseq.sif TotalRNAseq.def

# 2. Download genome + build STAR index (once, ~30–60 min, ~30 GB RAM)
./setup_genome.sh -o /path/to/ref

# 3. Run pipeline from a GEO accession
./run_rnaseq.sh -g GSE123456 -i /path/to/ref/star_index

# 4. Or from explicit SRR accessions
./run_rnaseq.sh -i /path/to/ref/star_index SRR12345678 SRR12345679

# 5. Or from a TSV mapping file
./run_rnaseq.sh -m samples.tsv -i /path/to/ref/star_index

Pipeline steps

| Step | Tool | Input → Output |
| --- | --- | --- |
| 1 | prefetch + fasterq-dump | SRR accession → tmp/<s>_R1.fastq.gz, tmp/<s>_R2.fastq.gz |
| 2 | fastp | paired/single FASTQ → trimmed FASTQ + QC JSON/HTML |
| 3 | STAR | trimmed FASTQ → sorted BAM + ReadsPerGene.out.tab |
| 3b | samtools | BAM → .bam.bai index |
| 4 | Python | all ReadsPerGene.out.tab → count_matrix.csv |

Trimmed FASTQs are deleted after STAR alignment (use -k to keep). Steps are idempotent — completed samples are skipped on re-runs.
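
The skip-on-re-run behaviour can be pictured with a small check like the one below. This is a sketch only: the README does not show how run_rnaseq.sh decides that a sample is complete, so the helper names `sample_done` and `pending_samples`, and the choice of which files to test, are assumptions based on the file names listed under Outputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from run_rnaseq.sh): treat a sample as
# complete when its gene counts and BAM index exist and are non-empty.
sample_done() {
  # $1 = results directory, $2 = sample ID
  [ -s "$1/STAR_out_${2}_ReadsPerGene.out.tab" ] &&
  [ -s "$1/STAR_out_${2}_Aligned.sortedByCoord.out.bam.bai" ]
}

# Print the sample IDs that still need processing, one per line.
pending_samples() {
  dir=$1
  shift
  for s in "$@"; do
    sample_done "$dir" "$s" || printf '%s\n' "$s"
  done
}
```

For example, `pending_samples ./results sampleA sampleB` would list only the samples whose outputs are missing or empty, which is a quick way to see what a re-run would actually do.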


run_rnaseq.sh options

| Flag | Default | Description |
| --- | --- | --- |
| -g GSE_ID | | GEO series ID; fetches SRR list via esearch/efetch |
| -m FILE | | TSV: sample_id <tab> SRR or sample_id <tab> R1.fastq.gz,R2.fastq.gz |
| -i DIR | required | STAR genome index directory (must contain SA file) |
| -o DIR | ./results | Output directory |
| -s STRAND | reverse | reverse / forward / unstranded |
| -e END | paired | paired / single |
| -t INT | 8 | Threads |
| -k | off | Keep trimmed FASTQs after alignment |
| --skip-merge | off | Process samples but skip final count matrix merge (for SLURM arrays) |
| --merge-only | off | Skip sample processing; only merge existing tab files |
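
One way to combine -m with --skip-merge in a SLURM array is to give each task a one-line mapping file. The job script below is a hedged sketch, not part of this repository: the helper `task_tsv`, the array range, the file name samples.tsv, and the assumption that the mapping file has no header line are all illustrative.

```shell
#!/usr/bin/env bash
#SBATCH --array=1-3          # assumption: one array task per line of samples.tsv
#SBATCH --cpus-per-task=8

# Hypothetical helper: copy line N (1-based) of a mapping file into its
# own one-sample TSV; fails if that line does not exist.
task_tsv() {
  # $1 = line number, $2 = full mapping file, $3 = output TSV
  sed -n "${1}p" "$2" > "$3"
  [ -s "$3" ]
}

# Run the pipeline only when SLURM launched this script as an array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  one=$(mktemp)
  task_tsv "$SLURM_ARRAY_TASK_ID" samples.tsv "$one" || exit 1
  ./run_rnaseq.sh -m "$one" -i /path/to/ref/star_index -o ./results \
      -t "${SLURM_CPUS_PER_TASK:-8}" --skip-merge
fi
```

Once every array task has finished, a single follow-up run such as `./run_rnaseq.sh --merge-only -i /path/to/ref/star_index -o ./results` would build the count matrix. Whether --merge-only still needs -i is not stated here; it is passed because -i is listed as required.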

Strandedness

| Library type | Flag | STAR column used |
| --- | --- | --- |
| dUTP/TruSeq Stranded | -s reverse (default) | column 4 |
| Forward-stranded | -s forward | column 3 |
| Unstranded | -s unstranded | column 2 |
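
If the library type is unknown, comparing the three count columns of one ReadsPerGene.out.tab usually settles it: in a stranded library, one of columns 3 and 4 carries nearly all gene counts and the other carries very few, while roughly equal totals suggest an unstranded library. The function below is a minimal sketch of that check; the name `strand_totals` is hypothetical and the function is not part of run_rnaseq.sh.

```shell
#!/usr/bin/env bash
# Sum columns 2-4 of a STAR ReadsPerGene.out.tab over gene rows only
# (the N_ summary rows at the top of the file are skipped).
strand_totals() {
  # $1 = path to one ReadsPerGene.out.tab
  awk -F'\t' '
    $1 !~ /^N_/ { u += $2; f += $3; r += $4 }
    END { printf "unstranded=%d forward=%d reverse=%d\n", u, f, r }
  ' "$1"
}
```

Running `strand_totals results/STAR_out_<sample>_ReadsPerGene.out.tab` on one finished sample before processing a whole series is a cheap way to confirm that the -s default fits the data.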

Outputs

results/
  count_matrix.csv                            ← genes × samples raw count matrix
  qc_reports/<sample>_fastp.html/.json
  STAR_out_<sample>_Aligned.sortedByCoord.out.bam
  STAR_out_<sample>_Aligned.sortedByCoord.out.bam.bai
  STAR_out_<sample>_ReadsPerGene.out.tab

Rows of count_matrix.csv are gene IDs from the GTF. STAR's four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) are excluded automatically.
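
The merge logic can be illustrated with an awk stand-in. The pipeline itself performs this step in Python, so the code below is not the author's implementation: the function names `strand_column` and `merge_counts`, the `gene_id` header, and the derivation of sample names from file names are assumptions made for the sketch.

```shell
#!/usr/bin/env bash
# Map a strandedness setting to the ReadsPerGene.out.tab column to read.
strand_column() {
  case $1 in
    unstranded) echo 2 ;;
    forward)    echo 3 ;;
    reverse)    echo 4 ;;
    *)          return 1 ;;
  esac
}

# merge_counts STRAND FILE...: print a genes x samples CSV to stdout.
merge_counts() {
  col=$(strand_column "$1") || return 1
  shift
  awk -F'\t' -v col="$col" '
    FNR == 1 {                        # first line of each input file
      name = FILENAME
      sub(/.*\//, "", name)           # drop the directory part
      sub(/^STAR_out_/, "", name)
      sub(/_ReadsPerGene\.out\.tab$/, "", name)
      samples[++n] = name
    }
    $1 ~ /^N_/ { next }               # skip STAR summary rows
    {
      if (!($1 in seen)) { seen[$1] = 1; genes[++g] = $1 }
      count[$1, n] = $col
    }
    END {
      line = "gene_id"
      for (j = 1; j <= n; j++) line = line "," samples[j]
      print line
      for (i = 1; i <= g; i++) {
        line = genes[i]
        for (j = 1; j <= n; j++)
          line = line "," (((genes[i], j) in count) ? count[genes[i], j] : 0)
        print line
      }
    }
  ' "$@"
}
```

A call such as `merge_counts reverse results/STAR_out_*_ReadsPerGene.out.tab > check.csv` gives an independent matrix to compare against count_matrix.csv. Genes keep the order of the first file, and a gene missing from a later file is filled with 0.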


Setting up the genome reference (setup_genome.sh)

Downloads and indexes the N. furzeri GRZ reference genome for STAR alignment.

./setup_genome.sh                         # download + index into ./ref/
./setup_genome.sh -o /data/killifish      # custom output directory
./setup_genome.sh -t 16                   # more threads for STAR index build
./setup_genome.sh --skip-download         # index only (files already present)

| Flag | Default | Description |
| --- | --- | --- |
| -o DIR | ./ref | Output directory |
| -t INT | 8 | Threads for STAR --runMode genomeGenerate |
| --skip-download | off | Skip wget/gunzip; go straight to index build |

Reference genome details:

| Property | Value |
| --- | --- |
| Species | Nothobranchius furzeri (GRZ strain) |
| Assembly | GCF_043380555.1 NfurGRZ-RIMD1 |
| Source | NCBI RefSeq FTP, December 2024 |
| FASTA size | ~1.3 GB |
| STAR index size | ~27 GB |
| genomeSAindexNbases | 14 (correct for genome > 1 Gb) |
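
The value 14 follows from the scaling rule in the STAR manual, min(14, log2(GenomeLength)/2 - 1), which only drops below 14 for genomes shorter than about 2^30 bases. The helper below is a small sketch for checking this against another genome; the name `sa_index_nbases` is hypothetical, and the length in the usage note is a rough figure taken from the FASTA size above, not an exact base count.

```shell
#!/usr/bin/env bash
# Suggested --genomeSAindexNbases for a genome of the given length (bases):
# min(14, floor(log2(length) / 2 - 1)).
sa_index_nbases() {
  awk -v len="$1" 'BEGIN {
    v = int(log(len) / log(2) / 2 - 1)
    if (v > 14) v = 14
    print v
  }'
}
```

With a length of roughly 1.3 Gb, `sa_index_nbases 1300000000` prints 14, matching the table. A 1 Mb test genome would need a smaller value, and `sa_index_nbases 1000000` prints 8.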

Prerequisites

  • Apptainer installed (apptainer build for container)
  • SRA Toolkit (prefetch, fasterq-dump) for SRA/GEO input
  • entrez-direct (esearch, efetch) for -g GSE_ID mode only
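
A quick preflight check can confirm that these tools are on the PATH before a long run starts. This is a minimal sketch: the function name `missing_tools` is hypothetical, and the check is not part of run_rnaseq.sh as far as this README shows.

```shell
#!/usr/bin/env bash
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || printf '%s\n' "$t"
  done
}
```

For example, `missing_tools apptainer prefetch fasterq-dump esearch efetch` prints nothing when everything is installed. Drop esearch and efetch from the list if you are not using -g GSE_ID mode.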