NGS: From FastQ to BAM

In this chapter we explain the journey from raw FASTQ data to clinically relevant information on single-nucleotide polymorphisms (SNPs). Our example data is purely illustrative and mimics Illumina short-read sequencing data from the MiSeq platform.

File Formats

We describe the file formats you will encounter in detail, because many beginners have never seen them and even some experienced users have never opened a raw file. You don't need to understand the specifications in full; a general understanding of which information is stored is enough.

FASTQ: Raw Sequencing Reads

FASTQ files encode (nucleotide) sequences coupled with quality values. They are typically the starting point of an in silico analysis in NGS (although some sequencers write files in a proprietary data format that must be converted into FASTQ beforehand). Each sequenced DNA molecule (also called a "read") is written as a separate record that spans four lines.

  1. Line: The sequence identifier, starting with '@'. It stores information about the machine used, the ID of the run and the location of the read inside the flow cell.
  2. Line: The raw (nucleotide) sequence
  3. Line: A '+' character, which separates the sequence from the quality scores
  4. Line: Quality scores; each character encodes the quality of the base at the same position in line 2.
@M02092:100:000000000-C9F6R:1:1101:13338:2973 2:N:0:41
TAGTTAAGCAAAATACTAGATTTGAGGCACACAAACTCCTCTCCCTGCAGATTCATCATGCGGAACCGAGATGATGTAGCCAGCAGCATGTCGAAGATCTCCACCATGCCCTCTACACATTTTCCCTGGTTCCTATGAAAACATAGCAAAA
+
CCCRREFFEFFEGGGAFGGGGGHHHGHGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHGDGGGGGGGGGGHHHHHHHHHHHGHHHHHHHHHHGGGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHGHGHHHHHHHHFHG
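
To make the four-line layout concrete, here is a minimal sketch that reads FASTQ records and converts the quality characters into numeric Phred scores. It assumes the Phred+33 encoding (score = ASCII code minus 33) that current Illumina instruments use; the function name `read_fastq` and the shortened toy record are made up for this illustration.

```python
from io import StringIO
from itertools import islice


def read_fastq(handle):
    """Yield (identifier, sequence, phred_scores) for each four-line record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or an incomplete trailing record)
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a valid FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Toy record (shortened, not the full example read from above).
toy_fastq = "@read1 1:N:0:41\nTAGT\n+\nCCHG\n"
records = list(read_fastq(StringIO(toy_fastq)))
identifier, sequence, scores = records[0]

# A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
error_probabilities = [10 ** (-q / 10) for q in scores]
```

In this sketch the character `H` maps to a score of 39, which corresponds to an error probability of roughly 1 in 8,000 for that base.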

Sequence Alignment Map (SAM)

After aligning the reads of a FASTQ file to a reference genome, we receive a sequence alignment map (SAM) file or a binary alignment map (BAM) file. Both formats store the same information and can be converted into each other; the binary BAM file is optimized for faster access but is not human-readable.

Detailed Structure

Each alignment line consists of 11 mandatory columns, which may be followed by optional fields:

#    Column   Description
1    QNAME    Query template NAME (e.g., sequence identifier)
2    FLAG     bitwise FLAG (e.g., if read is paired)
3    RNAME    Reference sequence NAME (e.g., chromosome name)
4    POS      leftmost mapping position
5    MAPQ     mapping quality
6    CIGAR    Concise Idiosyncratic Gapped Alignment Report (CIGAR) string
7    RNEXT    reference name of the mate/next read
8    PNEXT    position of the mate/next read
9    TLEN     observed template length
10   SEQ      segment sequence
11   QUAL     quality score
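
As a rough illustration of this column layout, the sketch below splits one tab-separated SAM alignment line into named fields. The names `SamRecord` and `parse_sam_line` and the toy line are assumptions made for this example; header lines (which start with `@`) are not handled here.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual optional",
)


def parse_sam_line(line):
    """Split a single SAM alignment line into its 11 mandatory columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual,
        fields[11:],  # anything after column 11 is an optional TAG:TYPE:VALUE field
    )


# Toy line with invented values, only to show the mechanics.
toy_line = "read1\t99\tchr6\t100\t42\t4M\t=\t120\t24\tACGT\tHHHH\tNM:i:0"
record = parse_sam_line(toy_line)
```

For real data, an established library is usually a better choice than hand-written parsing; the point here is only to show how the columns map onto one line.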

Each line corresponds to an aligned read. The following example shows the same read as in the FASTQ example. In a SAM file it spans only one line; the line breaks were inserted here for readability.

M02092:100:000000000-C9F6R:1:1101:13338:2973 99 chr6 152060961 42 151M = 152060972 162
TATTTATTTATTTTTGCTATGTTTTCATAGGAACCAGGGAAAATGTGTAGAGGGCATGGTGGAGATCTTCGACATGCTGCTGGCTACATCATCTCGGTTCCGCATGATGAATCTGCAGGGAGAGGAGTTTGTGCCTCAAATCTATTATT
BECCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHGHHGGGHhhhhhhhhhhhhggghhhhhghhhhhhhhhhggghunhhhhhghnuuuuhhhhHhgggghggggghtthhhhhHhhhhgggggghghhh
handranaaaaaaaaaaaah
AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:-13 YT:Z:CP
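
Two of the columns in this record are compact encodings: the FLAG (99) and the CIGAR string (151M). The following sketch decodes both. The bit values follow the SAM specification, while the short descriptions, the function names `decode_flag` and `reference_span`, and the choice to report a 1-based inclusive interval are this example's own.

```python
import re

# Bit values as defined by the SAM specification; descriptions are paraphrased.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(pos, cigar):
    """Return the 1-based, inclusive reference interval covered by an alignment."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Only M, D, N, = and X consume positions on the reference.
    ref_length = sum(int(count) for count, op in operations if op in "MDN=X")
    return pos, pos + ref_length - 1


flag_meaning = decode_flag(99)
span = reference_span(152060961, "151M")
```

For the example read, 99 = 1 + 2 + 32 + 64, so the read is paired, mapped in a proper pair, first in the pair, and its mate aligns to the reverse strand. The CIGAR string 151M means that all 151 bases align to the reference without insertions or deletions, covering positions 152060961 to 152061111.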

Variant Call Format (VCF)

The variant call format (VCF) stores the deviations from the reference genome that were found in our sample.

Detailed Structure

There are eight mandatory columns.

#    Column    Description
1    CHROM     Chromosome
2    POS       leftmost 1-based position
3    ID        identifier, e.g., a dbSNP rs identifier; if unknown a "."
4    REF       reference base(s)
5    ALT       list of alternative allele(s)
6    QUAL      quality score
7    FILTER    "PASS" or reason of failure; "." if unknown
8    INFO      list of key-value pairs (fields) describing the variation
9    FORMAT    (optional) list of fields for describing the samples
+    SAMPLEs   For each (optional) sample described in the file, values are given for the fields listed in FORMAT

Each line contains information about a single variant.

chr6	152011739	.	C	A	0.0	.	AS_SB_TABLE=34,31|19,35;DP=120;ECNT=9;MBQ=38,39;MFRL=205,205;MMQ=60,60;MPOS=64;POPAF=7.30;TLOD=199.26;ANN=A|structural_interaction_variant|HIGH|ESR1|ENSG00000091831|interaction|2B23:B_353-B_394:ENST00000206249|protein_coding|5/8|c.1180C>A||||||	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:65,54:0.456:119:65,54:0,0:34,31,19,35
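
The INFO, FORMAT and sample columns are easier to read once they are split into key-value pairs. The sketch below does this for a shortened version of the line above (only two INFO fields and four FORMAT fields are kept). The function name `parse_vcf_line` and the returned dictionary layout are assumptions made for this illustration, and all values are left as strings except POS.

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filter_value, info = fields[:8]

    info_fields = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, separator, value = entry.partition("=")
        # INFO entries without "=" are flags; store them as True.
        info_fields[key] = value if separator else True

    samples = []
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        # Pair each FORMAT key with the value at the same position per sample.
        samples = [dict(zip(format_keys, column.split(":"))) for column in fields[9:]]

    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": filter_value,
        "info": info_fields, "samples": samples,
    }


# Shortened version of the example line; most INFO and FORMAT fields are omitted.
toy_line = "chr6\t152011739\t.\tC\tA\t.\t.\tDP=120;TLOD=199.26\tGT:AD:AF:DP\t0/1:65,54:0.456:119"
variant = parse_vcf_line(toy_line)
ref_depth, alt_depth = (int(x) for x in variant["samples"][0]["AD"].split(","))
```

Here the genotype `0/1` describes a heterozygous call, and the AD field lists 65 reads supporting the reference allele and 54 supporting the alternative allele.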

Processing Steps

Our analysis journey begins with raw NGS reads in the FASTQ file format. First, we align the reads to a reference genome (e.g., hg38) to determine where each read most likely originates in the human genome. The alignment yields a SAM or BAM file, which stores the original reads together with their positional information. Next, pre-processing steps such as read deduplication and base quality score recalibration are usually performed. These steps increase the reliability of the data and reduce biases. The result is an analysis-ready BAM file.
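
The steps above could be chained as in the following sketch, which only assembles and prints the command lines without running them. The choice of tools (Bowtie 2, samtools and GATK) is an assumption made for this illustration, as are all file names, the read-group values and the function name `preprocessing_commands`; check each tool's documentation before running anything.

```python
import shlex


def preprocessing_commands(sample, reference_fasta, bowtie2_index, known_sites_vcf):
    """Build (but do not run) commands from paired FASTQ files to an analysis-ready BAM."""
    # All file names below are hypothetical and derived from the sample name.
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    final_bam = f"{sample}.analysis_ready.bam"
    return [
        # 1. Align the read pairs to the reference and attach read-group information.
        ["bowtie2", "-x", bowtie2_index,
         "-1", f"{sample}_R1.fastq.gz", "-2", f"{sample}_R2.fastq.gz",
         "--rg-id", sample, "--rg", f"SM:{sample}", "--rg", "PL:ILLUMINA",
         "-S", sam],
        # 2. Sort by coordinate and convert SAM to BAM.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # 3. Mark duplicate reads.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        # 4. Base quality score recalibration: build the model, then apply it.
        ["gatk", "BaseRecalibrator", "-R", reference_fasta, "-I", dedup_bam,
         "--known-sites", known_sites_vcf, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", reference_fasta, "-I", dedup_bam,
         "--bqsr-recal-file", recal_table, "-O", final_bam],
        # 5. Index the final BAM for fast random access.
        ["samtools", "index", final_bam],
    ]


commands = preprocessing_commands("sample1", "hg38.fa", "hg38_index", "known_sites.vcf.gz")
for command in commands:
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it straightforward to hand them to `subprocess.run` later, once the paths and parameters have been verified for the actual data.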

These analysis-ready BAM files then undergo variant calling, which yields a VCF file listing all detected deviations from the reference genome. Afterwards, the raw variants are filtered and annotated to produce a list of candidate variants, which are then analysed by a clinician or geneticist.
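
As a simplified picture of the filtering step, the sketch below keeps only those VCF data lines whose FILTER column is "PASS" and whose INFO field reports a minimum read depth (DP). The threshold of 20 reads, the function names and the toy lines are assumptions made for this illustration; real pipelines rely on the filters of the variant caller and on dedicated annotation tools.

```python
def passes_filters(vcf_line, min_depth=20):
    """Return True if a VCF data line has FILTER == PASS and INFO DP >= min_depth."""
    fields = vcf_line.rstrip("\n").split("\t")
    filter_value, info = fields[6], fields[7]
    depth = None
    for entry in info.split(";"):
        if entry.startswith("DP="):
            depth = int(entry[len("DP="):])
    # Lines without a DP entry are rejected rather than guessed at.
    return filter_value == "PASS" and depth is not None and depth >= min_depth


def candidate_variants(lines, min_depth=20):
    """Skip header lines (starting with '#') and keep lines that pass the filters."""
    return [
        line for line in lines
        if not line.startswith("#") and passes_filters(line, min_depth)
    ]


# Toy lines with invented values, only to show the mechanics.
toy_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr6\t100\t.\tC\tA\t.\tPASS\tDP=120",
    "chr6\t200\t.\tG\tT\t.\tPASS\tDP=8",
    "chr6\t300\t.\tA\tG\t.\tweak_evidence\tDP=95",
    "chr6\t400\t.\tT\tC\t.\t.\tDP=60",
]
candidates = candidate_variants(toy_lines)
```

Note that a line with "." in the FILTER column, such as an unfiltered raw call, is not kept by this sketch, because no filter has been applied to it yet.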

Sources & Further Reading