PacBio Tutorial¶

Launch a generic AMI (m3.2xlarge), update and install basic software. You can use the generic ami-864d84ee or any other Ubuntu machine.

#update stuff
sudo apt-get update

#install basic software
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config

#install Perl modules required by PBcR, paste in to terminal one at a time..
#Will be a couple of prompts (answer YES to both)
sudo cpan App::cpanminus
sudo cpanm Statistics::Descriptive

#Install wgs-assembler
wget http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.2beta/wgs-8.2beta-Linux_amd64.tar.bz2
tar -jxf wgs-8.2beta-Linux_amd64.tar.bz2

#add wgs to $PATH
PATH=$PATH:$HOME/wgs-8.2beta/Linux-amd64/bin/

Download sample Lambda phage dataset. We are using this only because it is very small and can be assembled quickly and with limited hardware requirements. For a more challenging test (read: expert with a big computer) try one of publicly available PacBio datasets here: https://github.com/PacificBiosciences/DevNet/wiki/Datasets

#make sure you have the appropriate permissions to read and write.
sudo chown -R ubuntu:ubuntu /mnt
mkdir /mnt/data
cd /mnt/data

#Download the sample data
wget http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz
tar -zxf sampleData.tar.gz
cd sampleData/

Convert fastA to faux-fastQ

#This is really old PacBio data, provided in fastA format. Look at the reads - note that they are not actually as long as I just told you they should be. The PacBio tech has improved massively over the past few years.

java -jar convertFastaAndQualToFastq.jar \
pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq

Run the assembly, using wgs, after error-correcting the reads. You could do the error correction separately, but no need to, here, for our purposes.

PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec \
-fastq pacbio.filtered_subreads.fastq genomeSize=50000

Look at the output. The phage genome has been assembled into 2 contigs (meh). Try a larger dataset for a more difficult (and rewarding challenge)

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.