Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samtools problems with reference in pileup file

    Hello,

    I am using samtools to generate a pileup file from a Mosaik genome assembly. My problem is that the pileup file contains all "N's" in the reference sequence field (column 3). I have double checked that I am using the exact same multi-sequence fasta file as the reference for both Mosaik and samtools.

    Has anyone ever run into this before? I am using MosaikSort and MosaikText to generate a sorted bam file for input into samtools. Without the reference, I cannot get SNP quality scores, which is the whole point of my analysis.

    Thanks for your help,
    Sarah
    [email protected]

  • #2
    Hi,

    I am having the same problem. I am mapping reads to a list of contigs from a de novo assembly using bwa. I have converted the bwa sam alignment file to a sorted bam file using samtools. When I try running pileup to identify variants, I get only N's in the reference base position. I have checked the reference fasta file and don't find anything wrong.

    Thanks!

    Renee
    [email protected]

    Comment


    • #3
      Originally posted by skingan View Post
      Hello,

      I am using samtools to generate a pileup file from a Mosaik genome assembly. My problem is that the pileup file contains all "N's" in the reference sequence field (column 3). I have double checked that I am using the exact same multi-sequence fasta file as the reference for both Mosaik and samtools.

      Has anyone ever run into this before? I am using MosaikSort and MosaikText to generate a sorted bam file for input into samtools. Without the reference, I cannot get SNP quality scores, which is the whole point of my analysis.

      Thanks for your help,
      Sarah
      [email protected]
      Post the command you are using so we can help
      Originally posted by reneeg36 View Post
      Hi,

      I am having the same problem. I am mapping reads to a list of contigs from a de novo assembly using bwa. I have converted the bwa sam alignment file to a sorted bam file using samtools. When I try running pileup to identify variants, I get only N's in the reference base position. I have checked the reference fasta file and don't find anything wrong.

      Thanks!

      Renee
      [email protected]
      Did you specify the FASTA file with the "-f" option?

      Comment


      • #4
        Originally posted by nilshomer View Post
        Post the command you are using so we can help

        Did you specify the FASTA file with the "-f" option?
        Yes, here is what I have been trying:

        command:
        samtools pileup -f reference.fa sorted.bam > raw.txt

        output (e.g.):
        chr7:150849-150991 22 N 2 GG CC


        Renee

        Comment


        • #5
          my command:

          samtools pileup -f myref.fas my_sorted_assembly.bam -c > mydata.pileup

          my output (e.g.):

          NT_033777 27895205 N A 24 0 6 17 aaaaaaaaaaaaaaaaa $))))))()')))")))

          Comment


          • #6
            Originally posted by skingan View Post
            my command:

            samtools pileup -f myref.fas my_sorted_assembly.bam -c > mydata.pileup

            my output (e.g.):

            NT_033777 27895205 N A 24 0 6 17 aaaaaaaaaaaaaaaaa $))))))()')))")))
            Is there an N at that position in the reference?

            Comment


            • #7
              No. All of the positions in the pileup file are N's and my reference sequences do not consist of all N's.
              Sarah

              Comment


              • #8
                I had this same problem, and after seeing no solution here did some more digging, and have a possible solution for you.

                I noticed that the ref.fa.fai file for my whole genome was 0 kb. The .fai is used by samtools when building the pileup. When I ran the command to re-build the .fai:

                samtools faidx reference.fa

                I got the following error message:

                [fai_build_core] different line length in sequence 'scaffold_14'.
                Segmentation fault

                No doubt this same message occurred the first time I ran the pileup command (which also builds the .fai if it doesn't exist), but I apparently didn't pay attention. After that first time, the .fai file EXISTED so no errors were subsequently reported when I ran pileup again.

                In my case, there was an extra line after scaffold_14. I removed this, and re-built the .fai using the samtools faidx command and then re-ran the pileup command. My pileup then contained the reference base as intended!

                Hope this helps y'all find the solution to your problem.
                Best,
                Shannon
                University of Texas at Austin

                Comment


                • #9
                  Ran into the same problem. It may be worth someone adding this to the faidx documentation regarding null strings in the reference or make the thrown error more descriptive.

                  Comment


                  • #10
                    It turned out to be a similar problem to the one SMHfrog had. In my reference file, each chromosome sequence was on a single line, so when samtools built the .fai file there was a segmentation fault because of the length of the sequence. I used a different reference with line breaks and it worked. I used the same reference file for the Mosaik run and the pileup build.
                    Sarah

                    Comment


                    • #11
                      I got the same problem.
                      chr17 418628 N 54
                      chr17 418629 N 58
                      chr17 418630 N 57

                      Comment


                      • #12
                        I had the same problem. In my case, I had colons in the reference sequence names, something like "Region1:1-100". When I removed them, samtools pileup worked as expected.

                        Comment


                        • #13
                          I also had this experience, in my case the problem disappeared when I removed spaces in the reference sequence name.

                          Comment


                          • #14
                            Hi
                            I'm having the same problem with Ns in my pileup file and have tried everything mentioned above (thanks for suggestions!). I am using:

                            ./samtools pileup data.sorted.bam -f reference.fasta > data.pileup

                            My reference .fai file looks like this:

                            chr2L 49364325 7 60 61
                            chr2R 61545105 50187078 60 61
                            chr3L 41963435 112757942 60 61
                            chr3R 53200684 155420775 60 61
                            chrUNKN 42389979 209508147 60 61
                            chrX 24393108 252604632 60 61
                            chrY 237045 277404298 60 61

                            Any ideas?

                            Comment


                            • #15
                              Here's another possible solution - the headers are not consistent between SAM/BAM and the original fasta:

                              Even though the reference file was the same one in both cases, sometimes aligners just write a substring out into the SAM file. Samtools seems to take the full header.

                              For example the first contiguous part of my genome header is
                              gi|110645304|ref|NC_002516.2|

                              However in my SAM file the aligner has only written
                              NC_002516.2

                              Samtools has written the full header to the .fa.fai index
                              gi|110645304|ref|NC_002516.2|

                              .. and this does not match.

                              Solution:

                              Try correcting the original header on the reference fasta to just the substring which the aligner uses.
                              eg
                              gi|110645304|ref|NC_002516.2|
                              to
                              NC_002516.2

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X