RNA-seq Part 2

Installation, FASTQ to Quantification

This is the main practical page for the current release. It tracks the working local workflow from setup through alignment, BAM processing, and gene-level counting.

Hands-on Workflow | STAR for Reference | HISAT2 Recommended

Quick recap

Part 2 takes the conceptual workflow from Part 1 and turns it into terminal-based execution. The target output of this stage is a mapped and processed BAM file plus a gene-level count matrix that is ready for downstream differential expression work.

Repositories and resources used

  • Bioinformatics with BB video series for step-by-step explanation
  • Conda environment for reproducible tool installation
  • Reference genome and GTF annotation from a matched release
  • FastQC, Cutadapt, HISAT2, SAMtools, and Subread (which provides featureCounts)

System setup notes

Set up a clean working directory before running commands. Keep raw data, trimmed reads, references, alignments, counts, and QC outputs in clearly named folders. This avoids path confusion later in the workflow.
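A minimal sketch of one such layout follows. The project name rnaseq_part2 and the folder names are illustrative assumptions, not names taken from the tutorial's own files.

```shell
# Hypothetical project root; change the name to suit your own run.
PROJECT="rnaseq_part2"

# One folder per kind of file, so raw data is never mixed with outputs.
for dir in raw trimmed reference alignments counts qc; do
    mkdir -p "$PROJECT/$dir"
done

# List what was created.
ls "$PROJECT"
```

Because mkdir -p does nothing when a folder already exists, the loop is safe to rerun at any point.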

If you are working on an Apple Silicon Mac (arm64), some tools or prebuilt binaries may not install or behave as expected. That is one reason the practical workflow can switch tools, as it does here from the originally planned STAR to HISAT2.

Terminal environment notes

  • Activate the correct Conda environment before every tool run.
  • Confirm the working directory with pwd.
  • Use mkdir -p for output directories before running any command that writes files.
  • Keep genome FASTA and annotation GTF from the same build.
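The first two checks can be sketched as a short pre-flight step. The helper name check_tools is made up for this example, and the environment name in the comment is a placeholder.

```shell
# Activate the environment first in your interactive shell, for example:
#   conda activate rnaseq      (hypothetical environment name)

# Show which directory the following commands will run in.
pwd

# Report which workflow tools are visible on PATH.
# A "missing" line usually means the Conda environment is not active.
check_tools() {
    for tool in fastqc cutadapt hisat2 samtools featureCounts; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

check_tools
```

Running this at the start of each session catches a forgotten conda activate before a long command fails partway through.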

Installation notes

Install the tools inside a dedicated environment rather than into the base environment. This keeps the RNA-seq workflow isolated and easier to rerun. For the current tutorial run, HISAT2 is the recommended aligner path.
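One way to make the dedicated environment reproducible is to describe it in a file. The sketch below assumes the tools are installed from the bioconda channel and uses rnaseq as a placeholder environment name; no versions are pinned because the tutorial does not state any.

```shell
# Write a Conda environment file in the current directory.
# "rnaseq" is a hypothetical environment name.
cat > environment.yml <<'EOF'
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - cutadapt
  - hisat2
  - samtools
  - subread   # the package that provides featureCounts
EOF

# Then, in an interactive shell:
#   conda env create -f environment.yml
#   conda activate rnaseq
```

Keeping the file next to the project makes it easy to recreate the same environment on another machine.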

Dataset download

Download your practice dataset into a dedicated raw-data folder. Keep filenames consistent and confirm whether the reads are paired-end or single-end before proceeding to QC and trimming.
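A quick way to check the layout is to look for mate files. This sketch assumes the common _1.fastq.gz and _2.fastq.gz naming convention and a hypothetical raw folder; adjust the suffixes if your files use a different pattern such as _R1 and _R2.

```shell
RAW_DIR="raw"        # hypothetical raw-data folder
mkdir -p "$RAW_DIR"

# For every *_1.fastq.gz file, report whether a matching *_2 file exists.
classify_reads() {
    for r1 in "$1"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                  # the glob matched nothing
        sample=$(basename "$r1" _1.fastq.gz)
        if [ -e "$1/${sample}_2.fastq.gz" ]; then
            echo "$sample paired-end"
        else
            echo "$sample no mate file found (single-end, or R2 missing)"
        fi
    done
}

classify_reads "$RAW_DIR"
```

The check relies on filenames only, so it is a first pass and not proof of how the library was sequenced.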

Raw FastQC

Start with a baseline quality check on the raw FASTQ files. This helps verify read quality, adapter content, duplication patterns, and whether trimming is necessary before alignment.
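A hedged sketch of the raw QC step is shown below. The folder names and the thread count are assumptions; the command runs only if fastqc is on PATH and input files are present.

```shell
RAW_DIR="raw"              # hypothetical input folder
QC_DIR="qc/raw_fastqc"     # hypothetical output folder
mkdir -p "$QC_DIR"         # FastQC expects the output folder to exist

if ! command -v fastqc >/dev/null 2>&1; then
    echo "fastqc not found: activate the Conda environment first" >&2
elif ls "$RAW_DIR"/*.fastq.gz >/dev/null 2>&1; then
    # -o sets the output folder; -t sets how many files are processed at once.
    fastqc -o "$QC_DIR" -t 2 "$RAW_DIR"/*.fastq.gz
else
    echo "no .fastq.gz files found in $RAW_DIR" >&2
fi
```

FastQC writes one HTML report and one zip archive per input file into the output folder.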

Trimming

Use Cutadapt to remove adapters and low-quality sequence when appropriate. Keep trimmed outputs in a separate folder rather than overwriting the raw reads.
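The sketch below shows one paired-end Cutadapt call. The sample name, paths, and thresholds are illustrative, and the adapter is an assumption: AGATCGGAAGAGC is the shared prefix of common Illumina adapters, so confirm it against the adapter content reported by FastQC for your own library.

```shell
SAMPLE="sample"                         # hypothetical sample name
R1="raw/${SAMPLE}_1.fastq.gz"
R2="raw/${SAMPLE}_2.fastq.gz"
ADAPTER="AGATCGGAAGAGC"                 # assumption: common Illumina adapter prefix
mkdir -p trimmed

if ! command -v cutadapt >/dev/null 2>&1; then
    echo "cutadapt not found: activate the Conda environment first" >&2
elif [ -f "$R1" ] && [ -f "$R2" ]; then
    # -a / -A : 3' adapter for read 1 / read 2
    # -q 20   : trim low-quality bases from the 3' end
    # -m 20   : drop pairs in which a read ends up shorter than 20 bases
    # -o / -p : trimmed output for read 1 / read 2
    cutadapt -a "$ADAPTER" -A "$ADAPTER" -q 20 -m 20 \
        -o "trimmed/${SAMPLE}_1.trimmed.fastq.gz" \
        -p "trimmed/${SAMPLE}_2.trimmed.fastq.gz" \
        "$R1" "$R2" > "trimmed/${SAMPLE}.cutadapt.log"
else
    echo "input FASTQ files not found for $SAMPLE" >&2
fi
```

Writing to a separate trimmed folder leaves the raw reads untouched, so the step can be repeated with different settings.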

Reference genome and annotation

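Prepare the reference genome FASTA and annotation GTF in a reference directory, and take both from the same source and release. A mismatch between the genome build and the annotation file is a common cause of poor alignment or counting problems. For example, if the FASTA names a chromosome chr1 and the GTF names it 1, reads can align normally yet almost none are assigned to genes.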
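One simple consistency check is to compare sequence names between the two files. The helper name missing_in_fasta and the file paths are hypothetical, and the sketch assumes both files are uncompressed.

```shell
# Print sequence names that appear in the GTF but not in the FASTA.
missing_in_fasta() {
    fasta="$1"
    gtf="$2"
    tmp=$(mktemp -d)
    # FASTA names: text after ">" up to the first space.
    grep '^>' "$fasta" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "$tmp/fasta_names.txt"
    # GTF names: first column of every non-comment line.
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmp/gtf_names.txt"
    # comm -13 keeps only the lines unique to the second file.
    comm -13 "$tmp/fasta_names.txt" "$tmp/gtf_names.txt"
    rm -rf "$tmp"
}

FASTA="reference/genome.fa"          # hypothetical paths
GTF="reference/annotation.gtf"
if [ -f "$FASTA" ] && [ -f "$GTF" ]; then
    missing_in_fasta "$FASTA" "$GTF"
fi
```

Empty output means every sequence named in the annotation exists in the genome. Matching names do not prove the builds are identical, but mismatched names are a clear sign that they are not.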

Aligner step

The planned learning path includes STAR, but the recommended local workflow for this tutorial run uses HISAT2. Keeping both documented is useful: STAR remains a reference path, while HISAT2 reflects the working commands used in practice.
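A sketch of the HISAT2 path for one paired-end sample follows. The paths, index prefix, sample name, and thread count are assumptions, and strandedness options are left out because the tutorial does not state the library type.

```shell
GENOME="reference/genome.fa"                 # hypothetical paths
INDEX="reference/hisat2_index/genome"        # index prefix, not a single file
SAMPLE="sample"
R1="trimmed/${SAMPLE}_1.trimmed.fastq.gz"
R2="trimmed/${SAMPLE}_2.trimmed.fastq.gz"
mkdir -p reference/hisat2_index alignments

if ! command -v hisat2 >/dev/null 2>&1; then
    echo "hisat2 not found: activate the Conda environment first" >&2
elif [ -f "$GENOME" ] && [ -f "$R1" ] && [ -f "$R2" ]; then
    # Build the index once per genome; it can be reused for every sample.
    hisat2-build -p 4 "$GENOME" "$INDEX"

    # -x index prefix, -1/-2 paired reads, -S SAM output.
    hisat2 -p 4 -x "$INDEX" -1 "$R1" -2 "$R2" \
        -S "alignments/${SAMPLE}.sam" \
        --summary-file "alignments/${SAMPLE}.hisat2_summary.txt"
else
    echo "genome or trimmed reads not found" >&2
fi
```

The summary file records the overall alignment rate, which is worth checking before moving on.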

BAM processing

After alignment, convert SAM to BAM if needed, sort the BAM file by coordinate, build an index, and run a quick summary check such as samtools flagstat. Indexing requires a coordinate-sorted file. These steps make the alignment output ready for counting and later inspection.
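These steps can be sketched with SAMtools as follows. The file names and thread count are assumptions.

```shell
SAMPLE="sample"                               # hypothetical sample name
SAM="alignments/${SAMPLE}.sam"
BAM="alignments/${SAMPLE}.sorted.bam"
mkdir -p alignments

if ! command -v samtools >/dev/null 2>&1; then
    echo "samtools not found: activate the Conda environment first" >&2
elif [ -f "$SAM" ]; then
    # samtools sort reads SAM directly and writes coordinate-sorted BAM,
    # so a separate conversion is optional. To convert without sorting:
    #   samtools view -b -o alignments/sample.bam alignments/sample.sam
    samtools sort -@ 4 -o "$BAM" "$SAM"

    # Creates the .bai index next to the sorted BAM.
    samtools index "$BAM"

    # Save the quick alignment summary alongside the BAM.
    samtools flagstat "$BAM" > "alignments/${SAMPLE}.flagstat.txt"
else
    echo "SAM file not found: $SAM" >&2
fi
```

Once the sorted BAM and its index exist, the large intermediate SAM file can usually be deleted to save disk space.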

featureCounts

Use featureCounts with the matching GTF annotation to generate gene-level counts. This count file is the bridge into downstream normalization and differential expression analysis.
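A hedged sketch of a paired-end counting run is shown below. The paths are hypothetical, and the flag pair -p --countReadPairs assumes Subread 2.0.2 or newer; in older releases -p alone counted fragments, so check featureCounts -v first.

```shell
GTF="reference/annotation.gtf"                # hypothetical paths
BAM="alignments/sample.sorted.bam"
OUT="counts/gene_counts.txt"
mkdir -p counts

if ! command -v featureCounts >/dev/null 2>&1; then
    echo "featureCounts not found: activate the Conda environment first" >&2
elif [ -f "$GTF" ] && [ -f "$BAM" ]; then
    # -t exon    : count reads overlapping exon features
    # -g gene_id : summarize exon counts to the gene level
    # -p --countReadPairs : paired-end input, count fragments (Subread >= 2.0.2)
    featureCounts -T 4 -p --countReadPairs -t exon -g gene_id \
        -a "$GTF" -o "$OUT" "$BAM"

    # Keep only the gene ID and the count columns (column 7 onward).
    grep -v '^#' "$OUT" | cut -f1,7- > counts/gene_counts.matrix.txt
else
    echo "GTF or BAM file not found" >&2
fi
```

featureCounts also writes a .summary file next to the output, which reports how many reads were assigned and why the rest were not.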

Output summary

  • QC reports for raw reads
  • Trimmed FASTQ files
  • Reference indices
  • Sorted and indexed BAM files
  • flagstat alignment summaries
  • Gene count matrix from featureCounts

What comes next

Part 3 will extend this workflow into normalization, differential expression analysis with DESeq2, plotting, and interpretation. Until then, the counts produced here are the main handoff point for downstream analysis.