Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • anotherSAM
    Junior Member
    • Sep 2017
    • 5

    PacBio raw .bam file

    I have just received data from my first PacBio sequencing operation and was unaware that the new output format was in .bam file for raw reads, as they are calling the 'better fastq'.
    I am using a pipeline, beginning with canu, which requires a fastq file however when using both samtools and bamtools to generate a fastq file from the bam file, the quality row just contains exclamation marks

    samtools bam2fq data.bam > data.fastq
    bamtools convert -format fastq -in data.bam -out data.fastq

    e.g.
    @read1
    ATGCATGCAGCTGATGCTAGCATGCTACTAGTCGATCGTAGCTAGTCGATCGATGCTAGCATCGATGCTAGCTAGTCGATGCTAGCTGCGTAGCTGATGATGCTAGTCGACTGATACGAT
    +
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


    Additional output files were a bam.pbi, an xml and a fasta file

    Does anyone know how to handle the raw read bam files in order to generate fastq files with the appropriate quality score?
    Last edited by anotherSAM; 09-06-2017, 07:42 AM. Reason: grammar
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    It's possible that the bam file does not contain quality scores. Try using samtools view to convert the bam to sam, and post the first couple reads from it...

    Comment

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      Have you tried bam2fastx from PacBio?

      Comment

      • sklages
        Senior Member
        • May 2008
        • 628

        #4
        What platform? At least for Sequel data the subread-bam/fastq files do not have quality values associated to the bases. So there is a dummy value for each base, zero, "!".

        Comment

        • anotherSAM
          Junior Member
          • Sep 2017
          • 5

          #5
          Code:
          head -n1 test.sam 
          m54072_170901_055052/5112430/0_2238	4	*	0	255	*	*	0	0	TTCCGGGGATGGGGGGTCTTGGTATTGGACATCTATATGGTTCCTTTCCACTAAACTTGAGGCATCAGGCCTGTTTGGACCGGAGTACGTAAATTTCGTTTTCGTTATTTTCGATCCATGGCTCATTCTTCGTTGGCGCTTTTCTATCAAGAGATGAGGAACCAGCTCTACTCTATGTT	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	bc:B:S,1,1	bq:i:85	cx:i:10	ip:B:C,1,61,22,90,8,8,72,3,7,40,9,68,4,15,8,12,72,34,156,61,10,3,50,9,70,74,16,17,2,28,6,38,60,13,11,13,122,14,42,70,7,12,14,32,25,21,38,14,7,8,34,2,1,7,5,12,60,61,6,5,8,9,52,3,16,182,7,86,44,15,56	np:i:1	pw:B:C,7,5,13,13,2,11,2,5,6,10,3,6,47,13,24,41,7,6,7,6,39,34,12,44,5,11,18,21,8,15,14,7,7,2,6,5,8,30,28,3,4,2,22,15,12,13,4,15,2,12,10,7,8,17,5,9,2,22,26,24,7,13,11,6,9,6,6,10,3,11,3,3,14,6,4,17,10     qe:i:2238	qs:i:0	rq:f:0.8	sn:B:f,7.81349,14.5671,6.42793,10.4147	zm:i:5112430RG:Z:25f8c430
          Brian, This is what the respective sam file looks like (i removed the majority of the nucleotides, numeric values and exclamation points for better clarity).
          Sklages, This is with sequel data, however if it is true that the bam files do not have a quality score, where is it? From what I had read, the new bam output was meant to replace fastq.
          Genomax, I followed the link for installing the bam2fastx library tool, which considering no error messages occurred I thought had succeeded, but when I went to use the bam2fastq command, it was unavailable.

          Thank you all for your replies
          Last edited by GenoMax; 09-07-2017, 04:57 AM. Reason: more clarity; added [CODE] tags

          Comment

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Originally posted by anotherSAM View Post
            Genomax, I followed the link for installing the bam2fastx library tool, which considering no error messages occurred I thought had succeeded, but when I went to use the bam2fastq command, it was unavailable.
            Did you check the local directory where you did the install for one called bam2fastx or something similar. The executables should be inside that directory, if the build was successful.

            Comment

            • sklages
              Senior Member
              • May 2008
              • 628

              #7
              Sklages, This is with sequel data, however if it is true that the bam files do not have a quality score, where is it? From what I had read, the new bam output was meant to replace fastq.
              Which BAM are you reading? "subreads"? PacBio stated back in february:

              "The subread-bam/fastq files do not have quality values associated to the bases. In fact none of our SMRT Analysis tools use or need them (e.g. the much improved CCS2 algorithm doesn't need the qualities anymore), and because there is no "placeholder" quality value as far as I know (like N for bases), the qualities in the BAM files are set to the lowest value "!"."


              And as our current BAM files also have "!" values, I assume that this has not changed (yet).

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                Interesting that PacBio chose to set the quality values to lowest setting rather than highest (or somewhere north of Q30).

                Comment

                • sklages
                  Senior Member
                  • May 2008
                  • 628

                  #9
                  That what my first thought too ... it is not a very smart idea choosing basically a "junk value" :-)

                  Comment

                  • anotherSAM
                    Junior Member
                    • Sep 2017
                    • 5

                    #10
                    Genomax, I managed to get the bam2fastx tools working but they provided the same information as samtools and bamtools.
                    But as we have come to understand this is of no fault of the tools and is actually due to the output.
                    Sklages, So in this way the new bam files are only able to be analysed using SMRT software?

                    Comment

                    • sklages
                      Senior Member
                      • May 2008
                      • 628

                      #11
                      It's probably a good idea to adjust the quality dummy values to something "more or less good", e.g. 30 or more before using third-party tools ..

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        You could use reformat.sh from BBMap suite to set a fake Q score of 30 for each base like this.

                        Code:
                        reformat.sh in=pbio_input.bam out=stdout.fa | reformat.sh in=stdin.fa out=new.fq.gz qfake=30

                        Comment

                        • anotherSAM
                          Junior Member
                          • Sep 2017
                          • 5

                          #13
                          I will most likely substitute the quality scores, but how are they written in PacBio data.
                          But it is coded scores in fastq so 30 corresponds to something like >
                          This should work. It will mean canu is unable to trim reads and increase the real quality but I will also run it against illumina data to correct errors afterwards.

                          Thank you all

                          UPDATE: With substitution of a qscore of 30 for each base canu assembly has issues and produces a large number of contigs ~130. Possibly has issues in ranking overlap probabilities.
                          Last edited by anotherSAM; 09-11-2017, 02:20 AM.

                          Comment

                          • Lsterck
                            Junior Member
                            • Feb 2011
                            • 2

                            #14
                            Hi,

                            interesting approach, I might give that a try as well.

                            I was just wondering why canu would have more trouble with all qscores put to 30 than when all were still 0 ?
                            Also, canu accepts both fasta as fastq as input (or even a mixture) might it then not simply ignore any quality score present?

                            Comment

                            • gringer
                              David Eccles (gringer)
                              • May 2011
                              • 845

                              #15
                              As fas as I know, canu also doesn't use quality scores for assembly. By default it does its own error correction based on an all-vs-all read mapping, and produces corrected FASTA files (i.e. without quality scores) as intermediate data files prior to assembly.

                              Comment

                              Latest Articles

                              Collapse

                              • GATTACAT
                                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by GATTACAT
                                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                07-01-2026, 11:43 AM
                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 11:08 AM
                              0 responses
                              6 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-30-2026, 05:37 AM
                              0 responses
                              11 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              53 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...