Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [Problem] The pileup inconsistence between 0.1.8 and 0.1.16 samtools

    Dear all,

    [I sent the following message to samtools-help mailling list, but no luck. Therefore, I post it here. I am sorry for sending the message twice.]

    I tried to use samtools version 0.1.8 and version 0.1.16 to generate .plp files on the same bam file:

    samtools.0.1.8 pileup -f genome_hg19.fa -c accepted_hits.sorted.bam > normal.raw.plp
    samtools.0.1.16 pileup -f genome_hg19.fa -c accepted_hits.sorted.bam > normal.raw.plp

    However, I found that the plp file generated by 0.1.16 version is several times bigger than the plp file compiled by 0.1.8 version of samtools.

    I tried -B option to disable BAQ, but the result is still the same. I checked the coverage on the genome of the two plp files, and I found they have huge difference.

    What happened? I googled the question and found someone has ever asked the same question recently:
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    It seemed like not just me suffering this problem.

    Could someone please help me explain the problem? Thanks!

    Bests,

    Woody

  • #2
    Further Information

    I tried to use a relatively small bam file to see how big the difference is:

    ###Generate plp file with different versions of samtools###
    09:19:38 woody@normal:samtools pileup -f ~/Documents/cDNA/chromosome/hg19/tmp/genome_hg19.fa -c normal_chr6.sorted.bam > normal_chr6.raw.plp
    09:20:17 woody@normal:samtools.0.1.8 pileup -f ~/Documents/cDNA/chromosome/hg19/tmp/genome_hg19.fa -c normal_chr6.sorted.bam > normal_chr6.old.plp

    ###Check file size###
    -rw-r--r-- 1 woody staff 17M 5 31 09:20 normal_chr6.old.plp
    -rw-r--r-- 1 woody staff 781M 5 31 09:19 normal_chr6.raw.plp

    ###Count the lines in the files###
    09:22:15 woody@normal:wc -l normal_chr6.raw.plp
    88556 normal_chr6.raw.plp
    09:22:22 woody@normal:wc -l normal_chr6.old.plp
    8872 normal_chr6.old.plp

    As you can see, the file size is a lot bigger for the normal_chr6.raw.plp (compiled by 0.1.16 version of samtools). The number of lines is 10 times larger.

    I have also checked the plp files with my naked eyes, and I found that the alignment results are slightly different. The numbers of hits on the genomic position are different between the two files.

    Could anyone helps? Please.

    Bests,

    Woody

    Comment


    • #3
      mpileup generates different results too

      [Command]
      10:32:54 woody@normal:samtools.0.1.8 mpileup -f genome_hg19.fa normal_chr6.sorted.bam > normal_chr6.old.mplp
      [mpileup] 1 samples in 1 input files
      10:36:57 woody@normal:samtools mpileup -f genome_hg19.fa normal_chr6.sorted.bam > normal_chr6.raw.mplp
      [mpileup] 1 samples in 1 input files
      <mpileup> Set max per-file depth to 8000

      [File sizes]
      -rw-r--r-- 1 woody staff 5.4M 3 9 15:18 normal_chr6.bam
      -rw-r--r-- 1 woody staff 17M 6 1 10:32 normal_chr6.old.mplp
      -rw-r--r-- 1 woody staff 17M 5 31 09:20 normal_chr6.old.plp
      -rw-r--r-- 1 woody staff 781M 6 1 10:30 normal_chr6.raw.mplp
      -rw-r--r-- 1 woody staff 781M 5 31 09:19 normal_chr6.raw.plp

      Even I used mpileup, the difference is still very obvious.

      Please somebody help answering the question.

      Are you sure the latest version is the correct one? or the corrupted one?

      Woody

      Comment


      • #4
        Could post the input BAM for us to take a look at it? Also, you could try to narrow it down for us by:
        A: trying the releases between 0.1.8 and 0.1.16.
        B: after A, finding the commit that produces the increase, if it exists.

        Looking through the commit log, over 9 months of development exists between 0.1.8 and 0.1.16, so I am not shocked if there are differences (improvements), but the behavior you are seeing is rather odd.

        Comment


        • #5
          Thanks for answering

          Thanks, nilshomer.

          I have already tried the step A and posted results previously.

          I wish I could find the commit.
          Last edited by woodydon; 06-01-2011, 04:48 AM. Reason: removed hyperlinks

          Comment


          • #6
            Originally posted by woodydon View Post
            Thanks, nilshomer.

            I have already tried the step A and posted results previously.

            I wish I could find the commit.
            I only see 0.1.8 and 0.1.16, and not 0.1.9 through 0.1.15. What is your impediment to finding the commit? Go through each intermediate commit until you find the difference.

            Comment


            • #7
              Happy now?

              Originally posted by nilshomer View Post
              I only see 0.1.8 and 0.1.16, and not 0.1.9 through 0.1.15. What is your impediment to finding the commit? Go through each intermediate commit until you find the difference.
              I use another Ubuntu Linux server to generate these files since I originally use Mac OS.

              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 17:06 chr6.1.4.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 17:05 chr6.1.5.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:57 chr6.1.6.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:51 chr6.1.7a.plp
              -rw-r--r-- 1 aiia aiia 18M 2011-06-02 16:35 chr6.1.8.plp
              -rw-r--r-- 1 aiia aiia 797M 2011-06-02 16:53 chr6.1.9.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 16:45 chr6.1.10.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:14 chr6.1.11.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:15 chr6.1.12a.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:16 chr6.1.13.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:20 chr6.1.14.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 17:21 chr6.1.15.plp
              -rw-r--r-- 1 aiia aiia 782M 2011-06-02 16:41 chr6.1.16.plp

              I have tested all compilable versions of samtools. My results showed that after version 0.1.9 the file size became very big.

              The 0.1.9 version had the following changes:

              * A reference skip (the N CIGAR operator) is shown as '<' or '>' depending on the strand. Tview is also changed accordingly.
              * Accelerated pileup. The plain pileup is about 50% faster.

              Any ideas?

              I will write a script to generate my own plp-alike file and see what I can find.

              Comment


              • #8
                If you have time to run multiple versions, you should first have a look at the small and big files. It is very easy to figure out what make one file big. I guess you are using RNA-seq.

                Anyway, it does not matter. pileup has gone forever.

                Comment


                • #9
                  Ok, so now we know a change occurred between 0.1.8 and 0.1.9, can you identify the commit (use svn checkout to iterate through the commits)? If not, send us the input file(s) and someone here could try.

                  Like Heng (lh3) said, pileup will be removed in future versions but the size increase seems to be also affecting mpileup, which will not be removed.

                  Comment


                  • #10
                    Identified the source causing the problem

                    Dear all,

                    Sorry about the delay.

                    I have figured out why newer samtools generates larger plp files.

                    I am using RNA-Seq data. It is the main reason why samtools went wrong.

                    Apparently, after 0.1.9, pileup/mpileup will fulfill the gap between the read mates automatically while converting the sorted bam file to plp file.

                    In other words, when dealing with the alignment results of junction reads in RNA-Seq, pileup/mpileup will automatically fill the gap (intron) with a serials of Ns.

                    I guess that the gap fulfilling process is the reason of "50% faster" since the later pileup/mpileup doesn't have to take care about the gap anymore.

                    However, the change will totally mess up the alignment results; especially, for my RNA-Seq data.

                    Here's an illustration:

                    Comment


                    • #11
                      Originally posted by woodydon View Post
                      Dear all,

                      Sorry about the delay.

                      I have figured out why newer samtools generates larger plp files.

                      I am using RNA-Seq data. It is the main reason why samtools went wrong.

                      Apparently, after 0.1.9, pileup/mpileup will fulfill the gap between the read mates automatically while converting the sorted bam file to plp file.

                      In other words, when dealing with the alignment results of junction reads in RNA-Seq, pileup/mpileup will automatically fill the gap (intron) with a serials of Ns.
                      That's interesting. I'd actually wondered about doing coverage calculations with and without the span of Ns between properly mapped forward and reverse reads - in the context of paired end genomic reads, I think you can make a good case either way. Might be worth asking about this change on the 0.1.9 release notes thread:
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                      Comment


                      • #12
                        Thank you for replying

                        Originally posted by maubp View Post
                        That's interesting. I'd actually wondered about doing coverage calculations with and without the span of Ns between properly mapped forward and reverse reads - in the context of paired end genomic reads, I think you can make a good case either way. Might be worth asking about this change on the 0.1.9 release notes thread:
                        http://seqanswers.com/forums/showthread.php?t=7542

                        You are right about the coverage calculation. With and without the span of Ns will make quite different coverage calculations based on my own experiments.

                        By the way, from my own experiments, 0.1.9 pileup seemed only take properly mapped pairs.

                        Comment


                        • #13
                          I got the same problem when converting bam to pileup. Did you find a solution?

                          Comment


                          • #14
                            Originally posted by zengzheng123 View Post
                            I got the same problem when converting bam to pileup. Did you find a solution?
                            Hi Zeng Zheng,

                            You could just use the samtools 0.1.8. Sorry for late replying.

                            Woody

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            27 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            30 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            26 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X