Tuesday, September 6, 2011

Convert BED to BAM format

BAM file has become the standard format for sequence alignment file.
Occasionally, we will see that we have BED file in hand, however some programs can only recognize BAM file. So, we need a way to convert the BED format to BAM format.

BedTools is a nice tool for this purpose. Download BedTools and install it in your computer.

Below is the command of how you can convert BED to BAM using bedTools:
bedToBam -i input.bed -g hg18.txt > input.bam

You need to create the genome file (in this case hg18.txt). This file should contains the chromosome name and the length of the chromosome.
The excerpt from hg18.txt:

chr1 247249719
chr10 135374737
chr11 134452384
chr12 132349534
chr13 114142980
chr14 106368585
chr15 100338915
...

Monday, September 5, 2011

Convert QSEQ to FASTQ

I have written a PERL script to convert Illumina QSEQ or Export file to FASTQ.
In order to run it, you will need PERL installed.
type in the command: "perl illumina2fastq" to get the help.
typical command:
perl illumina2fastq -i input_qseq.txt -o output.fastq

The script is as below (Please save the script as "illumina2fastq"):
===============================================

#!/usr/bin/perl -w

### Script: illumina2fastq
### Author: ngs-interest.blogspot.com
### Last Update: 7 September 2011

use strict;
use Getopt::Long;
use Pod::Usage;
use POSIX;

=head1 NAME

illumina2fastq

=head1 SYNOPSIS

illumina2fastq -[options]

-h - help
-i - input file (qseq or export file)
-o - output file (default: write to out.fastq)

example:
illumina2fastq -i input_qseq.txt ### input_qseq.txt is the "qseq" file and write output to out.fastq
illumina2fastq -i input_qseq.txt -o outputfile.fastq ### input_qseq.txt is the "qseq" file and write output to outputfile.fastq


=head1 DESCRIPTION

This script aims to create FASTQ file from the "qseq.txt" or "export.txt" file.

=cut

### option variables
my $help;
my $inFile;
my $outFile;
my $phredOffset;

### initialize option
Getopt::Long::Configure ('bundling');

if ( !GetOptions ('h'=>\$help, 'i=s'=>\$inFile, 'o=s'=>\$outFile) || !defined($inFile) ) {
if ($help) {
pod2usage(-verbose => 2);
} else {
pod2usage(1);
}
}

### Initialize all variables
if (!defined($outFile)) {
$outFile = "out.fastq";
}

if (!defined($phredOffset)) {
$phredOffset = 0;
}

### Printing Information
print STDERR "Input File\t\t: $inFile\n";
print STDERR "Output File\t\t: $outFile\n";

print STDERR "\n";
print "Creating $outFile file...\n";

### File Initialization
open (IN, $inFile) || die "Cannot open $inFile for reading!\n";
open (OUT, ">$outFile") || die "Cannot open $outFile for writing!\n";

while (<IN>) {
chomp;

my @f = split(/\t/);
$f[8] =~ s/\./N/g;

my $id = "$f[0]-$f[1]-$f[2]-$f[3]-$f[4]-$f[5]";

print OUT "\@$id\n$f[8]\n+$id\n$f[9]\n";
}
close (IN);
close (OUT);

print STDERR "Finish writing $outFile\n";
##=============================

FASTQ file

FASTQ file is actually just a text file which store the reads generated from Next Generation Sequencer.
Each read is represented as four lines in the FASTQ file:
1. First line is always started by a special character '@' followed by the read ID
2. Second line is the nucleotide sequence of the read ID in first line.
3. Third line is always started by a special character '+' followed by any description. The description can be empty.
4. Fourth line is the base quality of the sequence in line 2. The base quality is encoded using ASCII character for brevity. Please note that the base quality encoding can be Phred+33, Phred+64 or Solexa+64

An example of a read (length 32bp) being represented in FASTQ file is as follow:
@read1
ACGTACGTACGTACGTACGTACGTACGTACTG
+
**(01+(*!!!9999987234963024-3+34


The base quality for the above example is encoded using Phred+33. Therefore, referring to the ASCII table, the character '*' is actually equal to 42. Since this is Phred+33 encoding, therefore the Phred score = 42-33 = 9. 


It is important to know the base quality encoding prior to alignment.

Solexa Quality Score

The Solexa quality score is calculated differently from Phred Score. Instead of using the probability p, solexa quality score uses the odds p/(1-p).

Qsolexa = -10 log [p/(1-p)]
log is log base 10.
p = probability that the base calling is incorrect.

Solexa quality score is generated from the Illumina pipeline prior to version 1.3.

Phred Quality Score

Phred quality score is the standard quality score to determine the quality of a base calling.
Q = -10 log p
where,
Q = phred quality score
log is base 10.
p = probability that the base call is incorrect. Therefore, the smaller the p, the higher the Q.

Since p is the probability that the base call is incorrect, therefore we can view (1-p) as the correctness as well. If p = 0.01, then the probability that it is correct is 0.99 (i.e. 99% correct). This corresponds to Q = 20.

Q=20 ==> 99% accuracy
Q=30 ==> 99.9% accuracy
Q=40 ==> 99.99% accuracy

There is another variant of quality score, i.e. Solexa quality score which is generated from the Solexa pipeline (Illumina prior to version 1.3).

Illumina qseq file format

Each record is one line with tab separator in the following format:

  • Machine name: unique identifier of the sequencer.
  • Run number: unique number to identify the run on the sequencer.
  • Lane number: positive integer (currently 1-8).
  • Tile number: positive integer.
  • X: x coordinate of the spot. Integer (can be negative).
  • Y: y coordinate of the spot. Integer (can be negative).
  • Index: positive integer. No indexing should have a value of 1.
  • Read Number: 1 for single reads; 1 or 2 for paired ends.
  • Sequence (BASES)
  • Quality: the calibrated quality string. (QUALITIES)
  • Filter: Did the read pass filtering? 0 - No, 1 - Yes.

For more information, please refer to page 6 of: ftp://ftp.era.ebi.ac.uk/meta/doc/sra_1_1/SRA_File_Formats_Guide.pdf

Is there any Illumina directional paired-end mRNA-seq assay?

Today I am looking for information about paired-end data and am wondering whether Illumina has directional paired-end assay. Throughout my search, I can't find any information.
A quick search in Illumina website: http://www.illumina.com/applications/sequencing/rna.ilmn#strand_specific_rna_seq, it was mentioned that: "The Directional mRNA-Seq Sample Prep protocol is an experimental application of Illumina technology, requiring Illumina small RNA sequencing primers and the mRNA-Seq Sample Prep kit. This protocol is compatible with single-read flow cells only."

I am not sure if this means that the directional mRNA-seq experiment only applicable for single-end read?

After searching through the website, I bumped into this website: http://microarrays.nki.nl/index.php?page=samples-seq. It was mentioned that:

"4) Single read libraries cannot be used on Paired End flowcells."
"5) Paired End libraries can be read on Single read flowcells for the first read only."

Now, I think that there is no directional paired-end mRNA-seq assay in Illumina. I might be wrong. Waiting for correction from anybody out there.