SEQanswers

09-06-2017, 06:42 AM   #1
anotherSAM
Junior Member
Location: Paris
Join Date: Sep 2017
Posts: 5
PacBio raw .bam file

I have just received data from my first PacBio sequencing run and was unaware that raw reads are now delivered as a .bam file, which PacBio is calling the 'better fastq'.
My pipeline begins with canu, which requires a fastq file. However, when I use either samtools or bamtools to generate a fastq file from the bam file, the quality line contains only exclamation marks:

samtools bam2fq data.bam > data.fastq
bamtools convert -format fastq -in data.bam -out data.fastq

e.g.
@read1
ATGCATGCAGCTGATGCTAGCATGCTACTAGTCGATCGTAGCTAGTCGATCGATGCTAGCATCGATGCTAGCTAGTCGATGCTAGCTGCGTAGCTGATGATGCTAGTCGACTGATACGAT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Additional output files were a bam.pbi, an xml and a fasta file.

Does anyone know how to handle the raw-read bam files in order to generate fastq files with appropriate quality scores?

Last edited by anotherSAM; 09-06-2017 at 07:42 AM. Reason: grammar

09-06-2017, 09:22 AM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,668

It's possible that the bam file does not contain quality scores. Try using samtools view to convert the bam to sam, and post the first couple of reads from it...

09-06-2017, 09:35 AM   #3
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,435

Have you tried bam2fastx from PacBio?

09-07-2017, 02:58 AM   #4
sklages
Senior Member
Location: Berlin, DE
Join Date: May 2008
Posts: 619

Which platform? At least for Sequel data, the subread BAM/FASTQ files do not have quality values associated with the bases, so each base carries a dummy value of zero, which is encoded as "!".
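
A quick way to confirm that a converted FASTQ carries only these placeholder qualities is to test whether every quality line consists solely of "!" characters. The following is a minimal sketch, not something posted in this thread: the function name `all_placeholder_quals` is hypothetical, and it assumes a FASTQ with exactly four lines per record (no wrapped sequence lines).

```shell
# Hypothetical helper: exits 0 when every quality line of a
# 4-line-per-record FASTQ stream consists only of "!" characters,
# and exits 1 as soon as any other quality character is present.
all_placeholder_quals() {
    awk 'NR % 4 == 0 && $0 !~ /^!+$/ { bad = 1 } END { exit bad }'
}

# Self-contained demonstration with made-up reads.
printf '@r1\nACGT\n+\n!!!!\n' | all_placeholder_quals && echo "only placeholder qualities"
printf '@r2\nACGT\n+\nII?I\n' | all_placeholder_quals || echo "real qualities present"
```

On real data the function would read the converted file on standard input, e.g. `all_placeholder_quals < data.fastq`.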

09-07-2017, 04:42 AM   #5
anotherSAM
Junior Member
Location: Paris
Join Date: Sep 2017
Posts: 5

Code:
head -n1 test.sam 
m54072_170901_055052/5112430/0_2238	4	*	0	255	*	*	0	0	TTCCGGGGATGGGGGGTCTTGGTATTGGACATCTATATGGTTCCTTTCCACTAAACTTGAGGCATCAGGCCTGTTTGGACCGGAGTACGTAAATTTCGTTTTCGTTATTTTCGATCCATGGCTCATTCTTCGTTGGCGCTTTTCTATCAAGAGATGAGGAACCAGCTCTACTCTATGTT	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	bc:B:S,1,1	bq:i:85	cx:i:10	ip:B:C,1,61,22,90,8,8,72,3,7,40,9,68,4,15,8,12,72,34,156,61,10,3,50,9,70,74,16,17,2,28,6,38,60,13,11,13,122,14,42,70,7,12,14,32,25,21,38,14,7,8,34,2,1,7,5,12,60,61,6,5,8,9,52,3,16,182,7,86,44,15,56	np:i:1	pw:B:C,7,5,13,13,2,11,2,5,6,10,3,6,47,13,24,41,7,6,7,6,39,34,12,44,5,11,18,21,8,15,14,7,7,2,6,5,8,30,28,3,4,2,22,15,12,13,4,15,2,12,10,7,8,17,5,9,2,22,26,24,7,13,11,6,9,6,6,10,3,11,3,3,14,6,4,17,10     qe:i:2238	qs:i:0	rq:f:0.8	sn:B:f,7.81349,14.5671,6.42793,10.4147	zm:i:5112430RG:Z:25f8c430
Brian, this is what the corresponding sam file looks like (I removed the majority of the nucleotides, numeric values and exclamation marks for clarity).
Sklages, this is Sequel data. However, if it is true that the bam files do not carry quality scores, where are they stored? From what I had read, the new bam output was meant to replace fastq.
GenoMax, I followed the link for installing the bam2fastx tools. Since no error messages occurred, I thought the installation had succeeded, but when I went to use the bam2fastq command, it was unavailable.

Thank you all for your replies

Last edited by GenoMax; 09-07-2017 at 04:57 AM. Reason: more clarity; added [CODE] tags

09-07-2017, 04:59 AM   #6
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,435

Quote:
Originally Posted by anotherSAM
GenoMax, I followed the link for installing the bam2fastx tools. Since no error messages occurred, I thought the installation had succeeded, but when I went to use the bam2fastq command, it was unavailable.

Did you check the directory where you ran the install for a subdirectory called bam2fastx or something similar? The executables should be inside that directory if the build was successful.

09-07-2017, 05:17 AM   #7
sklages
Senior Member
Location: Berlin, DE
Join Date: May 2008
Posts: 619

Quote:
Sklages, this is Sequel data. However, if it is true that the bam files do not carry quality scores, where are they stored? From what I had read, the new bam output was meant to replace fastq.

Which BAM are you reading? "subreads"? PacBio stated back in February:

"The subread-bam/fastq files do not have quality values associated to the bases. In fact none of our SMRT Analysis tools use or need them (e.g. the much improved CCS2 algorithm doesn't need the qualities anymore), and because there is no "placeholder" quality value as far as I know (like N for bases), the qualities in the BAM files are set to the lowest value "!"."


And as our current BAM files also have "!" values, I assume that this has not changed (yet).

09-07-2017, 05:26 AM   #8
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,435

Interesting that PacBio chose to set the quality values to the lowest setting rather than the highest (or somewhere north of Q30).

09-07-2017, 05:33 AM   #9
sklages
Senior Member
Location: Berlin, DE
Join Date: May 2008
Posts: 619

That was my first thought too ... choosing what is basically a "junk value" is not a very smart idea :-)

09-07-2017, 06:43 AM   #10
anotherSAM
Junior Member
Location: Paris
Join Date: Sep 2017
Posts: 5

GenoMax, I managed to get the bam2fastx tools working, but they produced the same output as samtools and bamtools.
As we have come to understand, this is no fault of the tools and is actually due to the data itself.
Sklages, does this mean the new bam files can only be analysed using SMRT software?

09-07-2017, 06:49 AM   #11
sklages
Senior Member
Location: Berlin, DE
Join Date: May 2008
Posts: 619

It's probably a good idea to adjust the dummy quality values to something "more or less good", e.g. 30 or more, before using third-party tools.

09-07-2017, 06:58 AM   #12
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,435

You could use reformat.sh from the BBMap suite to set a fake Q score of 30 for each base, like this:

Code:
reformat.sh in=pbio_input.bam out=stdout.fa | reformat.sh in=stdin.fa out=new.fq.gz qfake=30
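
If a FASTQ with placeholder qualities has already been produced (for example by samtools or bamtools), the same substitution can be approximated with plain awk. This is only a sketch of the idea and not how reformat.sh works internally: the function name `fake_q30` is hypothetical, and it assumes four lines per record and Phred+33 encoding, where "?" stands for Q30.

```shell
# Hypothetical helper: rewrite every quality line of a 4-line-per-record
# FASTQ stream so that each position becomes "?" (Q30 in Phred+33).
# The length of each quality line is preserved; other lines pass through.
fake_q30() {
    awk 'NR % 4 == 0 { gsub(/./, "?") } { print }'
}

# Self-contained demonstration with a made-up read.
printf '@r1\nACGT\n+\n!!!!\n' | fake_q30
```

On real data this would be used as a filter, e.g. `fake_q30 < data.fastq > data.q30.fastq`.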

09-07-2017, 07:15 AM   #13
anotherSAM
Junior Member
Location: Paris
Join Date: Sep 2017
Posts: 5

I will most likely substitute the quality scores, but how are they encoded in PacBio data?
Scores in fastq are encoded as characters, so 30 corresponds to "?" in the standard Phred+33 encoding.
This should work. It will mean canu is unable to trim reads and improve the real quality, but I will also run the result against Illumina data to correct errors afterwards.
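
For reference, in the standard Phred+33 FASTQ encoding a score Q is written as the ASCII character with code Q + 33, so Q0 is "!" and Q30 is "?". The sketch below illustrates that arithmetic; the function name `phred33_char` is hypothetical and not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: print the Phred+33 character for a quality score.
# The score plus 33 is converted to octal and emitted via printf's %b,
# which interprets \0NNN as the byte with that octal value.
phred33_char() {
    printf '%b' "\\0$(printf '%03o' "$(( $1 + 33 ))")"
}

phred33_char 0;  echo    # prints "!"
phred33_char 30; echo    # prints "?"
```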

Thank you all

UPDATE: After substituting a Q score of 30 for every base, the canu assembly has problems and produces a large number of contigs (~130). Possibly canu has trouble ranking overlap probabilities.

Last edited by anotherSAM; 09-11-2017 at 02:20 AM.

Tags
bam, fastq, pacbio, raw data
