Monday, September 5, 2011

FASTQ file

A FASTQ file is a plain-text file that stores the reads generated by a next-generation sequencer.
Each read is represented by four lines in the FASTQ file:
1. The first line always starts with the special character '@', followed by the read ID.
2. The second line is the nucleotide sequence of the read identified in the first line.
3. The third line always starts with the special character '+', optionally followed by a description. The description can be empty.
4. The fourth line holds the base qualities of the sequence in line 2, one ASCII character per base, which keeps the file compact. Note that the base quality encoding can be Phred+33, Phred+64 or Solexa+64.
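
The four-line layout can be read with a few lines of Python. The following is only a minimal sketch: the names `FastqRead` and `parse_fastq` are made up for illustration, and it assumes each read occupies exactly four lines (no wrapped sequences).

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRead(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    read_id: str
    sequence: str
    description: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRead]:
    """Yield one FastqRead per four-line record."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or plus is None or quality is None:
            raise ValueError("truncated record: " + header)
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must start with '+': " + plus)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Drop the leading '@' and '+' markers before storing the fields.
        yield FastqRead(header[1:], sequence, plus[1:], quality)


# Usage with an in-memory record; a file opened in text mode works the same way.
demo = ["@demo_read\n", "ACGT\n", "+\n", "!!**\n"]
for read in parse_fastq(demo):
    print(read.read_id, read.sequence, read.quality)
```

Checking that lines 2 and 4 have the same length is a cheap way to catch a damaged file early.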

An example of a read (32 bp long) represented in a FASTQ file is as follows:
@read1
ACGTACGTACGTACGTACGTACGTACGTACTG
+
**(01+(*!!!9999987234963024-3+34


The base qualities in the above example are encoded using Phred+33. According to the ASCII table, the character '*' has the code 42. Since the encoding is Phred+33, the Phred score is 42 - 33 = 9.
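
The same subtraction can be applied to every character of the quality line. This is a small sketch, not a reference implementation: `decode_phred` and `error_probability` are hypothetical helper names, and the code only covers Phred-scaled encodings (offset 33 or 64), not Solexa+64, which uses a different score scale. The error probability uses the standard Phred definition P = 10^(-Q/10).

```python
from typing import List


def decode_phred(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into Phred scores by subtracting the offset."""
    scores = [ord(ch) - offset for ch in quality]
    if scores and min(scores) < 0:
        # A negative Phred score means the offset was guessed too high.
        raise ValueError("negative score: wrong offset for this quality string?")
    return scores


def error_probability(phred: int) -> float:
    """Probability that a base call is wrong, from the Phred definition."""
    return 10 ** (-phred / 10)


# Quality line of the example read above, decoded as Phred+33.
scores = decode_phred("**(01+(*!!!9999987234963024-3+34")
print(scores[:8])                        # [9, 9, 7, 15, 16, 10, 7, 9]
print(round(error_probability(9), 3))    # 0.126
```

So a Phred score of 9 corresponds to roughly a 13% chance that the base call is wrong.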


It is important to know the base quality encoding prior to alignment, because an aligner that assumes the wrong offset will misinterpret every base quality.
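
If the encoding is not documented, a rough check is to look at the lowest ASCII code used in the quality lines. The sketch below is a heuristic, not a guarantee: `guess_encoding` is a made-up name, and the thresholds come from the lowest character each encoding can produce ('!' = 33 for Phred+33, ';' = 59 for Solexa+64, '@' = 64 for Phred+64). Only the first test is conclusive; the other two answers are best guesses and get more reliable as more reads are examined.

```python
from typing import Iterable


def guess_encoding(quality_lines: Iterable[str]) -> str:
    """Guess the quality encoding from the lowest character seen."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters given")
    lowest = min(codes)
    if lowest < 59:
        # Characters below ';' cannot occur in either +64 encoding.
        return "Phred+33"
    if lowest < 64:
        # ';' to '?' are impossible in Phred+64, but a Phred+33 file
        # with uniformly high qualities could also produce them.
        return "Solexa+64 (likely)"
    # '@' and above fits all three encodings, so this stays a guess.
    return "Phred+64 or Solexa+64 (likely)"


print(guess_encoding(["**(01+(*!!!9999987234963024-3+34"]))  # Phred+33
```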
