Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • A doubt about mpileup

    Hi,

    I used the command samtools mpileup -f /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample_align_sorted.bam.bam > sample_mpileup

    It was surprisingly quick, then I put

    ls -lah sample02187A_mpileup
    -rw-rw-r-- 1 meusebio meusebio 5.8G Feb 10 15:03 sample_mpileup

    just to see if it looked ok.
    Then:
    [meusebio@IMMGene1 meusebio]$ bcftools view -vcg sample_mpileup > sample.vcf
    [bcf_sync] incorrect number of fields (0 != 5) at 0:0
    [afs] 0:0.000


    What should I do?

    Thank you in advance

  • #2
    The mpileup command you used generates a flat text pileup that is not suitable for import into bcftools, though it should not be quick. Try looking at that pileup file, see if it looks like it should.

    To make a vcf, try something like

    samtools mpileup -ugf ref.fa file.bam | bcftools view -bcvg - > file.bcf
    bcftools view file.bcf > file.vcf
    Personally, I always add a -B to the mpileup command, because I've occasionally found that SNPs got missed if I didn't, and I'd much rather deal with false positives than false negatives.

    Comment


    • #3
      To day I used:

      samtools mpileup -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bvcg - > varSample02187A.bcf
      [mpileup] 1 samples in 1 input files
      [bcfview] 100000 sites processed.
      [afs] 0:73270.205 1:12061.094 2:14668.700
      [bcfview] 200000 sites processed.
      [afs] 0:93673.767 1:2809.202 2:3517.031
      [bcfview] 300000 sites processed.
      [afs] 0:62908.369 1:16718.513 2:20373.117
      [bcfview] 400000 sites processed.
      [afs] 0:62494.168 1:16735.723 2:20770.109
      [bcfview] 500000 sites processed.
      [afs] 0:67447.351 1:14587.030 2:17965.618
      [bcfview] 600000 sites processed.
      [afs] 0:67444.849 1:14410.756 2:18144.395
      [bcfview] 700000 sites processed.
      [afs] 0:69741.610 1:13586.786 2:16671.604
      [bcfview] 800000 sites processed.
      [afs] 0:66746.668 1:14996.990 2:18256.342
      [afs] 0:7849.588 1:1177.825 2:1518.587


      Only took me 9 minutes. It's really strange

      Comment


      • #4
        I started almost from the beginning.
        I also marked the duplicates.

        My mpileup as 14GB when I use the command -f, and he is smaller when I use -ugf

        The odd thing is when I use -ugf I see some strange symbols.

        In both cases the mpileup goes on for your if I only use the mpileup command, but if I save the output in a file or do the bcftools part, it takes only a few minutes

        Comment


        • #5
          I don't see how marking duplicates matters. I don't think mpileup pays attention to that.

          Using -ug makes a file that is suitable for import into bcftools for SNP calling. Without it, mpileup makes a human readable pileup. The pileup is going to have one line for every base in your reference, so if your reference is large, it'll be a huge file. But you can eyeball it to see how mpileup is interpreting your .bam

          Comment


          • #6
            This was my command:

            samtools mpileup -P ILLUMINA -d8000 -B -ugf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam | bcftools view -bcvg - > varSample02187A.bcf
            [mpileup] 1 samples in 1 input files
            [bcfview] 100000 sites processed.
            [afs] 0:73270.205 1:12061.094 2:14668.700
            [bcfview] 200000 sites processed.
            [afs] 0:93673.767 1:2809.202 2:3517.031
            [bcfview] 300000 sites processed.
            [afs] 0:62908.369 1:16718.513 2:20373.117
            [bcfview] 400000 sites processed.
            [afs] 0:62494.168 1:16735.723 2:20770.109
            [bcfview] 500000 sites processed.
            [afs] 0:67447.351 1:14587.030 2:17965.618
            [bcfview] 600000 sites processed.
            [afs] 0:67444.849 1:14410.756 2:18144.395
            [bcfview] 700000 sites processed.
            [afs] 0:69741.610 1:13586.786 2:16671.604
            [bcfview] 800000 sites processed.
            [afs] 0:66746.668 1:14996.990 2:18256.342
            [afs] 0:7849.588 1:1177.825 2:1518.587

            It always stops in here, it took 9 minutes to do this and generated a .bcf file empty.
            I really can't understand what is wrong with the command

            Comment


            • #7
              I really can't understand why mpileup is so quick just when I save the output for a file or piping with bcftools.
              While doing the following command I saw

              samtools mpileup -P ILLUMINA -d8000 -A -B -C50 -uf /GenoStorage/Genomas/hg19/hg19RefGenome.fa sample02187A_align_sorted.bam
              [mpileup] 1 samples in 1 input files
              \ufffdBC\ufffd\ufffd9\ufffdBCF\ufffdchr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrXchrYchrMsample%##samtoolsVersion=0.1.18 (r982:295)

              It's normal?
              Another thing that a thought was if I should need my memory swap to be the double of my ram memory. This is true?

              I know I'm being really annoying. I'm really sorry.

              Comment


              • #8
                For the third time, why don't you look at the human readable pileup output of mpileup?

                Comment


                • #9
                  It looks like this

                  chr1 10018 c 1 ^!. D
                  chr1 10019 t 1 . F
                  chr1 10020 a 1 . F
                  chr1 10021 a 1 . F
                  chr1 10022 c 1 . H
                  chr1 10023 c 1 . H
                  chr1 10024 c 1 . H
                  chr1 10025 t 1 . H
                  chr1 10026 a 1 . H
                  chr1 10027 a 2 .^!. IF
                  chr1 10028 c 2 .. JD
                  chr1 10029 c 2 .. J+
                  chr1 10030 c 2 .. J2
                  chr1 10031 t 2 .. J=
                  chr1 10032 a 2 .. JC
                  chr1 10033 a 2 .. CD
                  chr1 10034 c 2 .. HC
                  chr1 10035 c 2 .. GC
                  chr1 10036 c 2 .. G@
                  chr1 10037 t 2 .. JF
                  chr1 10038 a 2 .. GH
                  chr1 10039 a 2 .. GG
                  chr1 10040 c 2 .. IH
                  chr1 10041 c 2 .. CI
                  chr1 10042 c 2 .. ED
                  chr1 10043 t 2 .. HG
                  chr1 10044 a 2 .. GC
                  chr1 10045 a 3 ..^2c I@#
                  chr1 10046 c 3 .., JB#
                  chr1 10047 c 3 .., IBC
                  chr1 10048 c 4 ..,^+. I98F

                  Comment


                  • #10
                    The good news is at least mpileup has figure out that the reference fasta you gave it is the one used to make the .bam file. So it has the reference all squared away.

                    This part has a coverage of 2, and 1 in places, but in isn't this region telomeric junk in the human genome? Is the coverage higher in a more representaitve region of the genome?

                    What % of your reads aligned? How many reads aligned total? Maybe this run was junk, and that's why nothing is working.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 11:49 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-24-2024, 08:47 AM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    61 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X