Installation, FASTQ to Quantification
This is the main practical page for the current release. It tracks the working local workflow from setup through alignment, BAM processing, and gene-level counting.
Quick recap
Part 2 takes the conceptual workflow from Part 1 and turns it into terminal-based execution. The target output of this stage is a mapped and processed BAM file plus a gene-level count matrix that is ready for downstream differential expression work.
Repositories and resources used
- Bioinformatics with BB video series for step-by-step explanation
- Conda environment for reproducible tool installation
- Reference genome and GTF annotation from a matched release
- FastQC, Cutadapt, HISAT2, SAMtools, and Subread tools
System setup notes
Set up a clean working directory before running commands. Keep raw data, trimmed reads, references, alignments, counts, and QC outputs in clearly named folders. This avoids path confusion later in the workflow.
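As a minimal sketch of that layout, the commands below create one folder per stage. The project name rnaseq_project and the individual folder names are assumptions chosen for illustration, not names prescribed by the tutorial.

```shell
# Hypothetical project root; rename to suit your machine.
PROJECT_DIR="rnaseq_project"

# One folder per stage: raw data, trimmed reads, references,
# alignments, counts, and QC outputs (split into raw and trimmed).
for d in raw trimmed reference alignments counts qc/raw qc/trimmed; do
  mkdir -p "$PROJECT_DIR/$d"
done

# Show the resulting layout.
find "$PROJECT_DIR" -type d | sort
```

Because mkdir -p does not complain about existing directories, the snippet is safe to rerun.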
If you are working on Mac arm64 (Apple Silicon), some tools or prebuilt binaries may behave differently than expected. That is one reason the practical workflow may switch tools even when the original plan used something else, as in the aligner step, where HISAT2 is used in place of the planned STAR.
Terminal environment notes
- Activate the correct Conda environment before every tool run.
- Confirm the working directory with pwd.
- Use mkdir -p for output directories before running any command that writes files.
- Keep genome FASTA and annotation GTF from the same build.
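The first checks in the list above could be scripted roughly as follows. This is only a sketch: the environment name rnaseq is an assumption, and check_tools is a hypothetical helper, not part of any of the tools.

```shell
# Activate the environment first (the name "rnaseq" is an assumption):
#   conda activate rnaseq

# Print the working directory so relative paths are unambiguous.
pwd

# Hypothetical helper: report which tools are visible on PATH.
# Returns the number of missing tools (0 means all were found).
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

check_tools fastqc cutadapt hisat2 samtools featureCounts \
  || echo "Some tools are missing; activate the Conda environment and retry."
```

A "missing" line usually means the environment is not active in the current terminal session.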
Installation notes
Install the tools inside a dedicated environment rather than into the base environment. This keeps the RNA-seq workflow isolated and easier to rerun. For the current tutorial run, HISAT2 is the recommended aligner path.
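One way to keep the installation reproducible is to write the tool list to an environment file and build the environment from it. The sketch below assumes the environment name rnaseq and leaves versions unpinned; both are illustrative choices, and pinning versions once a working combination is known is a reasonable next step.

```shell
# Write a Conda environment spec for the workflow's tools.
# The name "rnaseq" and the unpinned versions are assumptions.
cat > environment.yml <<'EOF'
name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - cutadapt
  - hisat2
  - samtools
  - subread   # provides featureCounts
EOF

# Then create and activate the environment manually:
#   conda env create -f environment.yml
#   conda activate rnaseq
```

Keeping environment.yml alongside the project makes it easier to rebuild the same environment on another machine.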
Dataset download
Download your practice dataset into a dedicated raw-data folder. Keep filenames consistent and confirm whether the reads are paired-end or single-end before proceeding to QC and trimming.
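The paired-end versus single-end check can be sketched from filenames alone. The helpers below are hypothetical and assume pairs are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, with single-end reads named <sample>.fastq.gz; adjust the patterns if your dataset uses a different convention.

```shell
# Hypothetical helper: infer the read layout from filenames.
# Assumed naming: <sample>_1.fastq.gz / <sample>_2.fastq.gz for pairs,
# <sample>.fastq.gz for single-end reads.
detect_layout() {
  dir=$1
  sample=$2
  if [ -f "$dir/${sample}_1.fastq.gz" ] && [ -f "$dir/${sample}_2.fastq.gz" ]; then
    echo "paired"
  elif [ -f "$dir/${sample}.fastq.gz" ]; then
    echo "single"
  else
    echo "unknown"
    return 1
  fi
}

# Hypothetical helper: count reads in a gzipped FASTQ file,
# assuming the usual four lines per record.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example: detect_layout raw sampleA
# Example: count_reads raw/sampleA_1.fastq.gz
```

For paired-end data, both mates should report the same read count; a difference suggests an incomplete download.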
Raw FastQC
Start with a baseline quality check on the raw FASTQ files. This helps verify read quality, adapter content, duplication patterns, and whether trimming is necessary before alignment.
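A baseline FastQC run might look like the sketch below. The wrapper run_fastqc, the directory names, and the thread count are assumptions for illustration.

```shell
# Hypothetical wrapper: run FastQC on every gzipped FASTQ in a directory.
run_fastqc() {
  in_dir=$1
  out_dir=$2
  # FastQC expects the output directory to exist already.
  mkdir -p "$out_dir"
  fastqc --threads 2 --outdir "$out_dir" "$in_dir"/*.fastq.gz
}

# Example: run_fastqc raw qc/raw
```

Each input file produces an HTML report and a zip archive in the output directory.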
Trimming
Use Cutadapt to remove adapters and low-quality sequence when appropriate. Keep trimmed outputs in a separate folder rather than overwriting the raw reads.
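For paired-end reads, a Cutadapt call could be sketched as follows. The adapter sequence AGATCGGAAGAGC (a common Illumina adapter prefix), the quality cutoff of 20, the minimum length of 20, and the file naming are all assumptions; check the adapter content panel of the FastQC report and your library kit before reusing them.

```shell
# Hypothetical wrapper: trim one paired-end sample with Cutadapt.
# Assumed input naming: <sample>_1.fastq.gz and <sample>_2.fastq.gz.
trim_pair() {
  sample=$1
  in_dir=$2
  out_dir=$3
  # Adapter is an assumption; pass a fourth argument to override it.
  adapter=${4:-AGATCGGAAGAGC}
  mkdir -p "$out_dir"
  # -a / -A: 3' adapter for read 1 / read 2
  # -q 20:   trim low-quality 3' ends
  # -m 20:   drop pairs where a read is shorter than 20 bases after trimming
  cutadapt \
    -a "$adapter" -A "$adapter" \
    -q 20 -m 20 \
    -o "$out_dir/${sample}_1.trimmed.fastq.gz" \
    -p "$out_dir/${sample}_2.trimmed.fastq.gz" \
    "$in_dir/${sample}_1.fastq.gz" "$in_dir/${sample}_2.fastq.gz"
}

# Example: trim_pair sampleA raw trimmed
```

Writing to a separate trimmed folder leaves the raw reads untouched, so trimming can be repeated with different settings.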
Reference genome and annotation
Prepare the reference genome FASTA and annotation GTF in a reference directory. Version consistency matters. A mismatch between the genome build and annotation file, for example chromosomes named chr1 in one file and 1 in the other, is a common reason for poor alignment or counting problems.
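A quick consistency check is to compare sequence names between the two files. The helper below is a hypothetical sketch that assumes an uncompressed FASTA and GTF; it lists the names present in both, so an empty result points to a naming or build mismatch.

```shell
# Hypothetical helper: list sequence names shared by a FASTA and a GTF.
shared_seqnames() {
  fasta=$1
  gtf=$2
  fa_names=$(mktemp)
  gtf_names=$(mktemp)
  # FASTA: header lines start with ">"; the name ends at the first whitespace.
  grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' | sort -u > "$fa_names"
  # GTF: column 1 is the sequence name; lines starting with "#" are comments.
  grep -v '^#' "$gtf" | cut -f1 | sort -u > "$gtf_names"
  # comm -12 prints only the lines common to both sorted files.
  comm -12 "$fa_names" "$gtf_names"
  rm -f "$fa_names" "$gtf_names"
}

# Example: shared_seqnames reference/genome.fa reference/annotation.gtf
```

Matching names do not prove the builds are identical, but a complete mismatch is a reliable early warning.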
Aligner step
The planned learning path includes STAR, but the recommended local workflow for this tutorial run uses HISAT2. Keeping both documented is useful: STAR remains a reference path, while HISAT2 reflects the working commands used in practice.
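For the HISAT2 path, the index build and a paired-end alignment could be sketched as below. The wrapper names, thread count, and file naming are assumptions; the exact commands used in the tutorial run may differ.

```shell
# Hypothetical wrapper: build a HISAT2 index from the genome FASTA.
# $1 = genome FASTA, $2 = index basename (prefix for the index files)
build_index() {
  hisat2-build -p 4 "$1" "$2"
}

# Hypothetical wrapper: align one trimmed paired-end sample.
# Assumed input naming: <sample>_1.trimmed.fastq.gz and <sample>_2.trimmed.fastq.gz.
align_pair() {
  sample=$1
  index=$2
  in_dir=$3
  out_dir=$4
  mkdir -p "$out_dir"
  # -x: index basename, -1/-2: mate files, -S: SAM output,
  # --summary-file: write the alignment summary to a file.
  hisat2 -p 4 -x "$index" \
    -1 "$in_dir/${sample}_1.trimmed.fastq.gz" \
    -2 "$in_dir/${sample}_2.trimmed.fastq.gz" \
    -S "$out_dir/${sample}.sam" \
    --summary-file "$out_dir/${sample}.hisat2.txt"
}

# Example: build_index reference/genome.fa reference/genome
# Example: align_pair sampleA reference/genome trimmed alignments
```

For single-end reads, HISAT2 takes the input with -U instead of the -1 and -2 pair.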
BAM processing
After alignment, convert SAM to BAM if needed, sort the BAM file, build an index, and use a quick summary check such as samtools flagstat. These steps make the alignment output ready for counting and later inspection.
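The sort, index, and summary steps can be sketched as one small function. The wrapper name, thread count, and output naming are assumptions. Recent SAMtools releases let samtools sort read SAM directly and write BAM, so the separate conversion step is folded into the sort here.

```shell
# Hypothetical wrapper: turn one SAM file into a sorted, indexed BAM
# plus a flagstat summary. $1 = path to the SAM file.
process_alignment() {
  sam=$1
  prefix=${sam%.sam}
  # Sort by coordinate; the .bam extension on -o selects BAM output.
  samtools sort -@ 4 -o "$prefix.sorted.bam" "$sam" &&
    # Build the .bai index next to the sorted BAM.
    samtools index "$prefix.sorted.bam" &&
    # Save the quick alignment summary for later inspection.
    samtools flagstat "$prefix.sorted.bam" > "$prefix.flagstat.txt"
}

# Example: process_alignment alignments/sampleA.sam
```

Once the sorted BAM and its summary look reasonable, the intermediate SAM file can usually be deleted to save disk space.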
FeatureCounts
Use featureCounts with the matching GTF annotation to generate gene-level counts. This count file is the bridge into downstream normalization and differential expression analysis.
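A gene-level counting run for paired-end BAM files might look like the sketch below. The wrapper names, thread count, and paths are assumptions. The --countReadPairs flag applies to Subread 2.0.2 and later; older releases count fragments with -p alone, so check featureCounts -v first. Strandedness (-s) is left at its unstranded default here and should be set to match the library prep.

```shell
# Hypothetical wrapper: count read pairs per gene across several BAM files.
# $1 = GTF annotation, $2 = output file, remaining arguments = BAM files
count_genes() {
  gtf=$1
  out=$2
  shift 2
  mkdir -p "$(dirname "$out")"
  # -p --countReadPairs: paired-end input, count fragments (Subread >= 2.0.2)
  # -t exon -g gene_id:  count reads over exons, summarised per gene
  featureCounts -T 4 -p --countReadPairs \
    -t exon -g gene_id \
    -a "$gtf" -o "$out" "$@"
}

# Hypothetical helper: reduce featureCounts output to a plain matrix
# by dropping the leading comment line and the Chr/Start/End/Strand/Length columns.
counts_to_matrix() {
  grep -v '^#' "$1" | cut -f1,7-
}

# Example: count_genes reference/annotation.gtf counts/gene_counts.txt alignments/*.sorted.bam
# Example: counts_to_matrix counts/gene_counts.txt > counts/gene_matrix.tsv
```

featureCounts also writes a .summary file next to the output, which reports how many reads were assigned and why others were not.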
Output summary
- QC reports for raw reads
- Trimmed FASTQ files
- Reference indices
- Sorted and indexed BAM files
- flagstat alignment summaries
- Gene count matrix from featureCounts
What comes next
Part 3 will extend this workflow into normalization, differential expression analysis with DESeq2, plotting, and interpretation. Until then, the counts produced here are the main handoff point for downstream analysis.