
  • N in pileup file

    Hi,

    I have sequence data generated on an Illumina sequencer. The alignment was done with ELAND and the output is a .bam file.

    Now I want to generate a pileup file with the mpileup feature of samtools. I used this command:
    samtools mpileup -f REFERENCEFILE.fa SEQUENCES.bam > pileup.tab

    The problem is that I don't get a reference base, only N where I expect A, T, G or C.

    I also ran the sort command beforehand.
    Maybe something went wrong with faidx, because I only have one line in the .fai file.

    Does anyone have an answer for that?

    Thanks a lot.

  • #2
    You have to have a proper faidx file for pileup to recognize the equivalence between the reference sequence and the references named in the .bam file. The faidx file is supposed to be rather short, but if it's broken, that would explain what you are seeing.

    Sometimes, I make reference files where there aren't line breaks in the middle of the sequence, but samtools faidx won't tolerate this. So if you did this, remake the reference file so there's a line break every 60 or 80 bases, or whatever, and rerun the faidx command.

    The second thing to check is if the reference names in the .bam really match the reference names in your reference fasta. Spaces, or special characters may be treated differently between your aligner and samtools, so fixing the names might help.
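
    The re-wrapping suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name rewrap_fasta and the default width of 60 are made up for this example and are not part of samtools.

```python
def rewrap_fasta(text, width=60):
    """Return FASTA text with every sequence re-wrapped to `width` bases per line.

    Hypothetical helper for illustration; header lines are kept unchanged.
    """
    out = []
    seq_parts = []

    def flush():
        # Emit the sequence collected so far in fixed-width chunks.
        seq = "".join(seq_parts)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
        seq_parts.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq_parts.append(line)
    flush()
    return "".join(l + "\n" for l in out)
```

    After writing the re-wrapped text back to the reference file, the old .fai would need to be deleted and samtools faidx rerun, since the byte offsets in the index change.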


    • #3
      Hi,

      thanks for the ideas.

      I used the "view" command to have a look at the .bam file. It says that the reads were aligned against a file called "chr1.fa". So I renamed my reference file to chr1.fa and changed the fasta header to that name as well. Afterwards I ran faidx again.
      But unfortunately it didn't work and I still have N.


      • #4
        It's not about the name of the files, it's about the name in the .sam file, and the name of each sequence in the reference multi-fasta.

        You need the text after the '>' to match the text in column 3 of the .sam file.
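
        A minimal Python sketch of that check, assuming the fasta and the .sam are small enough to read as text. The function names are hypothetical. It takes the fasta name as the text after '>' up to the first whitespace, which is how samtools faidx names sequences, and compares it with column 3 (RNAME) of the alignment lines.

```python
def fasta_names(fasta_text):
    """Names after '>' up to the first whitespace, in file order."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def sam_reference_names(sam_text):
    """Distinct values of column 3 (RNAME) over the alignment lines."""
    names = set()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] != "*":  # '*' marks unmapped reads
            names.add(cols[2])
    return names


def names_missing_from_fasta(fasta_text, sam_text):
    """Reference names used in the .sam that have no fasta record."""
    return sorted(sam_reference_names(sam_text) - set(fasta_names(fasta_text)))
```

        For example, with a fasta header of ">chr1.fa" and reads whose column 3 is "chr1", names_missing_from_fasta returns ['chr1'], which is the kind of mismatch that would leave mpileup without a reference base.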


        • #5
          I am having this N problem too at the moment and it is driving me nuts. I have simplified the headers in my fasta file to just numbers from 1 to over 1 million (each record is only 64 bases long). I then created a bowtie index and mapped short reads back to these 'consensus tags' with bowtie. I converted the resulting sam file to bam and sorted it. Running mpileup results in SNPs being called at every base position because the reference is treated as N.

          When I try to create an index from my fasta using faidx, I get a .fai file. It is quite small and doesn't contain any actual sequence data, just the following:

          1 64 3 64 65
          2 64 71 64 65
          3 64 139 64 65
          4 64 207 64 65
          5 64 275 64 65
          ............................................

          Anyone any ideas?


          • #6
            That's a perfectly normal faidx file, if your chromosomes are named 1,2,3,4, and 5, and each is 64 letters long.
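
            To see why, it may help to recompute the index by hand. The five columns of a .fai line are the sequence name, its length in bases, the byte offset of its first base, the number of bases per line, and the number of bytes per line including the newline. The sketch below is a simplified re-implementation for illustration only (build_fai is a hypothetical name, and it does not do the consistency checks samtools performs).

```python
def build_fai(fasta_bytes):
    """Return .fai-style rows: name, length, offset, linebases, linewidth."""
    rows = []
    pos = 0          # byte offset of the current line in the file
    current = None   # [name, length, offset, linebases, linewidth]
    for raw in fasta_bytes.splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")
        if raw.startswith(b">"):
            if current:
                rows.append(current)
            name = (line[1:].split() or [b""])[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, pos + len(raw), 0, 0]
        elif current is not None and line:
            if current[3] == 0:
                current[3] = len(line)  # bases on the first sequence line
                current[4] = len(raw)   # bytes on that line, newline included
            current[1] += len(line)
        pos += len(raw)
    if current:
        rows.append(current)
    return ["\t".join(str(v) for v in row) for row in rows]


# Five records named 1..5, each 64 bases on a single line.
fasta = "".join(f">{i}\n{'ACGT' * 16}\n" for i in range(1, 6)).encode()
for row in build_fai(fasta):
    print(row)
```

            For this input the rows come out as 1 64 3 64 65, 2 64 71 64 65, and so on up to 5 64 275 64 65 (tab-separated), matching the file quoted above: each header such as ">1" plus its newline takes 3 bytes, and each 64-base line takes 65 bytes.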


            • #7
              OK, I can generate a .fai file from the fasta file without problems. Any ideas why my pileup is treating the reference as all Ns and therefore calling SNPs at every position?

              Cheers


              • #8
                Generally, it means there is a mismatch between what your .bam file says each reference sequence is named, and what your reference fasta says each reference contig is named. So check first to make sure that you really are using the same reference file in the mpileup that you used in aligning.

                Second, I'd try simplifying the names of the reference sequences. Maybe there are spaces, or special characters in the names of your reference sequences, and your aligner handles that differently than mpileup.
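
                One way to run the first check is to compare the @SQ lines of the .bam header (as printed by samtools view -H) with the .fai index of the reference. The following Python sketch assumes both have already been saved as text; the function names are hypothetical, and a length mismatch is only a hint that a different reference was used, not proof.

```python
def parse_sq_lines(header_text):
    """Map reference name (SN) to length (LN) from @SQ header lines."""
    refs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            refs[tags["SN"]] = int(tags["LN"])
    return refs


def parse_fai(fai_text):
    """Map sequence name to length from the first two .fai columns."""
    refs = {}
    for line in fai_text.splitlines():
        cols = line.split("\t")
        if len(cols) >= 2:
            refs[cols[0]] = int(cols[1])
    return refs


def compare_references(header_text, fai_text):
    """Return (names absent from the .fai, names whose lengths differ)."""
    bam = parse_sq_lines(header_text)
    fai = parse_fai(fai_text)
    missing = sorted(n for n in bam if n not in fai)
    wrong_length = sorted(n for n in bam if n in fai and bam[n] != fai[n])
    return missing, wrong_length
```

                If both returned lists are empty, the names and lengths in the .bam header agree with the reference index, and the cause of the N bases probably lies elsewhere.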


                • #9
                  Thanks for the suggestions.

                  This is what I initially thought the problem was after searching around on this forum, so I renamed the records in my fasta reference using a simple numeric naming scheme. This is the file I used to build my index, and after mapping with bowtie I checked the 3rd column of the sam file and found the same simple numbering scheme there.

                  I guess the fact that each fasta record in my reference file is 64bp and doesn't even fill a single line is hardly an issue.


                  • #10
                    Just use the latest version of samtools.

                    I ran into the same problem as Firebird in the original post; after I updated samtools, everything worked.
