Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • A doubt about mpileup

    Hi,

    I used the command samtools mpileup -f /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample_align_sorted.bam.bam > sample_mpileup

    It was surprisingly quick, then I put

    ls -lah sample02187A_mpileup
    -rw-rw-r-- 1 meusebio meusebio 5.8G Feb 10 15:03 sample_mpileup

    just to see if it looked ok.
    Then:
    [meusebio@IMMGene1 meusebio]$ bcftools view -vcg sample_mpileup > sample.vcf
    [bcf_sync] incorrect number of fields (0 != 5) at 0:0
    [afs] 0:0.000


    What should I do?

    Thank you in advance

  • #2
    The mpileup command you used generates a flat text pileup that is not suitable for import into bcftools, though it should not be quick. Try looking at that pileup file, see if it looks like it should.

    To make a vcf, try something like

    samtools mpileup -ugf ref.fa file.bam | bcftools view -bcvg - > file.bcf
    bcftools view file.bcf > file.vcf
    Personally, I always add a -B to the mpileup command, because I've occasionally found that SNPs got missed if I didn't, and I'd much rather deal with false positives than false negatives.

    Comment


    • #3
      To day I used:

      samtools mpileup -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bvcg - > varSample02187A.bcf
      [mpileup] 1 samples in 1 input files
      [bcfview] 100000 sites processed.
      [afs] 0:73270.205 1:12061.094 2:14668.700
      [bcfview] 200000 sites processed.
      [afs] 0:93673.767 1:2809.202 2:3517.031
      [bcfview] 300000 sites processed.
      [afs] 0:62908.369 1:16718.513 2:20373.117
      [bcfview] 400000 sites processed.
      [afs] 0:62494.168 1:16735.723 2:20770.109
      [bcfview] 500000 sites processed.
      [afs] 0:67447.351 1:14587.030 2:17965.618
      [bcfview] 600000 sites processed.
      [afs] 0:67444.849 1:14410.756 2:18144.395
      [bcfview] 700000 sites processed.
      [afs] 0:69741.610 1:13586.786 2:16671.604
      [bcfview] 800000 sites processed.
      [afs] 0:66746.668 1:14996.990 2:18256.342
      [afs] 0:7849.588 1:1177.825 2:1518.587


      Only took me 9 minutes. It's really strange

      Comment


      • #4
        I started almost from the beginning.
        I also marked the duplicates.

        My mpileup as 14GB when I use the command -f, and he is smaller when I use -ugf

        The odd thing is when I use -ugf I see some strange symbols.

        In both cases the mpileup goes on for your if I only use the mpileup command, but if I save the output in a file or do the bcftools part, it takes only a few minutes

        Comment


        • #5
          I don't see how marking duplicates matters. I don't think mpileup pays attention to that.

          Using -ug makes a file that is suitable for import into bcftools for SNP calling. Without it, mpileup makes a human readable pileup. The pileup is going to have one line for every base in your reference, so if your reference is large, it'll be a huge file. But you can eyeball it to see how mpileup is interpreting your .bam

          Comment


          • #6
            This was my command:

            samtools mpileup -P ILLUMINA -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bcvg - > varSample02187A.bcf
            [mpileup] 1 samples in 1 input files
            [bcfview] 100000 sites processed.
            [afs] 0:73270.205 1:12061.094 2:14668.700
            [bcfview] 200000 sites processed.
            [afs] 0:93673.767 1:2809.202 2:3517.031
            [bcfview] 300000 sites processed.
            [afs] 0:62908.369 1:16718.513 2:20373.117
            [bcfview] 400000 sites processed.
            [afs] 0:62494.168 1:16735.723 2:20770.109
            [bcfview] 500000 sites processed.
            [afs] 0:67447.351 1:14587.030 2:17965.618
            [bcfview] 600000 sites processed.
            [afs] 0:67444.849 1:14410.756 2:18144.395
            [bcfview] 700000 sites processed.
            [afs] 0:69741.610 1:13586.786 2:16671.604
            [bcfview] 800000 sites processed.
            [afs] 0:66746.668 1:14996.990 2:18256.342
            [afs] 0:7849.588 1:1177.825 2:1518.587

            It always stops in here, it took 9 minutes to do this and generated a .bcf file empty.
            I really can't understand what is wrong with the command

            Comment


            • #7
              I really can't understand why mpileup is so quick just when I save the output for a file or piping with bcftools.
              While doing the following command I saw

              samtools mpileup -P ILLUMINA -d8000 -A -B -C50 -uf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam
              [mpileup] 1 samples in 1 input files
              \ufffdBC\ufffd\ufffd9\ufffdBCF\ufffdchr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrXchrYchrMsample%##samtoolsVersion=0.1.18 (r982:295)

              It's normal?
              Another thing that a thought was if I should need my memory swap to be the double of my ram memory. This is true?

              I know I'm being really annoying. I'm really sorry.

              Comment


              • #8
                For the third time, why don't you look at the human readable pileup output of mpileup?

                Comment


                • #9
                  It looks like this

                  chr1 10018 c 1 ^!. D
                  chr1 10019 t 1 . F
                  chr1 10020 a 1 . F
                  chr1 10021 a 1 . F
                  chr1 10022 c 1 . H
                  chr1 10023 c 1 . H
                  chr1 10024 c 1 . H
                  chr1 10025 t 1 . H
                  chr1 10026 a 1 . H
                  chr1 10027 a 2 .^!. IF
                  chr1 10028 c 2 .. JD
                  chr1 10029 c 2 .. J+
                  chr1 10030 c 2 .. J2
                  chr1 10031 t 2 .. J=
                  chr1 10032 a 2 .. JC
                  chr1 10033 a 2 .. CD
                  chr1 10034 c 2 .. HC
                  chr1 10035 c 2 .. GC
                  chr1 10036 c 2 .. G@
                  chr1 10037 t 2 .. JF
                  chr1 10038 a 2 .. GH
                  chr1 10039 a 2 .. GG
                  chr1 10040 c 2 .. IH
                  chr1 10041 c 2 .. CI
                  chr1 10042 c 2 .. ED
                  chr1 10043 t 2 .. HG
                  chr1 10044 a 2 .. GC
                  chr1 10045 a 3 ..^2c I@#
                  chr1 10046 c 3 .., JB#
                  chr1 10047 c 3 .., IBC
                  chr1 10048 c 4 ..,^+. I98F

                  Comment


                  • #10
                    The good news is at least mpileup has figure out that the reference fasta you gave it is the one used to make the .bam file. So it has the reference all squared away.

                    This part has a coverage of 2, and 1 in places, but in isn't this region telomeric junk in the human genome? Is the coverage higher in a more representaitve region of the genome?

                    What % of your reads aligned? How many reads aligned total? Maybe this run was junk, and that's why nothing is working.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Advancing Precision Medicine for Rare Diseases in Children
                      by seqadmin




                      Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                      12-16-2024, 07:57 AM
                    • seqadmin
                      Recent Advances in Sequencing Technologies
                      by seqadmin



                      Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                      Long-Read Sequencing
                      Long-read sequencing has seen remarkable advancements,...
                      12-02-2024, 01:49 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 12-17-2024, 10:28 AM
                    0 responses
                    27 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-13-2024, 08:24 AM
                    0 responses
                    43 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-12-2024, 07:41 AM
                    0 responses
                    29 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-11-2024, 07:45 AM
                    0 responses
                    42 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X