Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling SNPs from RNAseq bam files

    Hello SeqAnswers community,

    I am trying to use Samtools to call SNPs from my RNAseq data. I already have bam files, so I tried to use the following commands.

    #Method 1 -
    samtools sort file.bam file_sorted
    samtools mpileup -d8000 -uf genome.fasta bamfile1.bam | bcftools view -vcg - > s.vcf
    #When I opened the s.vcf file, it was a very large file (MB), it was showing a varient in every position, "N" was the reference for every position, and almost every "ALT" was multiple bases (A, T, G)

    #Method 2 -
    samtool sort file.bam file_sorted
    samtools mpileup -d8000 -uf sequence_genome.fasta file_sorted.bam | bcftools view -bvcg - > file_samtools.raw.bcf
    bcftools view file_samtools.raw.bcf | vcfutils.pl varFilter -D100 > file_samtools.vcf
    #When I opened this vcf file, there were no SNPs. Below are the last four lines.

    ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
    ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
    ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT file_sorted.bam

    Does anyone know what is happening? I also had trouble using GATK, so please don't stear me in that direction.

    Thanks and God bless,
    Jason

  • #2
    What does "samtools flagstat" tell you?
    Can you view your mapped reads in SAM format using "samtools view"?

    Can you dump a random couple of dozen mapped alignments? Do these look right?

    Comment


    • #3
      Thanks for your quick response Richard! I was already home when I saw it though. Here's what I could get.

      ##"samtool flagstat" ##
      samtool flagstat file.bam

      11959974 + 0 in total (QC-passed reads + QC-failed reads)
      0 + 0 duplicates
      8460790 + 0 mapped (70.74%:-nan%)
      0 + 0 paired in sequencing
      0 + 0 read1
      0 + 0 read2
      0 + 0 properly paired (-nan%:-nan%)
      0 + 0 with itself and mate mapped
      0 + 0 singletons (-nan%:-nan%)
      0 + 0 with mate mapped to a different chr
      0 + 0 with mate mapped to a different chr (mapQ>=5)


      # To get only the mapped reads use the parameter 'F', which works like -v of grep.
      samtool view -b -F 4 file_sorted.bam > file_sorted_mapped.bam


      ## Bam to Sam ###
      samtool view -h -o out.file_sorted_mapped.sam file_sorted_mapped.bam

      ## samtools view ##

      samtool view -c -S out.file_sorted_mapped.sam

      [samopen] SAM header is present: 4184 sequences.
      mapped count (8460790) #Value matched to above matched value in flagstat.

      ## Sam to Bam ##
      samtool view -bS out.file_sorted_mapped.sam > out.file_sorted_mapped.bam
      [samopen] SAM header is present: 4184 sequences.

      # Find SNPs using SAMtools
      samtool mpileup -d8000 -uf sequence_genome.fasta out.CLJUaligned_0112-02_sorted_mapped.bam | /root/Tools/samtools-0.1.19/bcftools/bcftools view -vcg - > s.vcf

      [bam_header_read] EOF marker is absent. The input is probably truncated.
      [mpileup] 1 samples in 1 input files
      [bcfview] 100000 sites processed.
      [afs] 0:6507.765 1:2715.281 2:90776.954
      [bcfview] 200000 sites processed.
      [afs] 0:8265.636 1:2648.100 2:89086.264
      [bcfview] 300000 sites processed.
      [afs] 0:10133.539 1:2403.082 2:87463.378
      [afs] 0:289.935 1:75.121 2:2426.943

      # Opened vcf file -- same result as before i.e. "N" position in referene file.

      Comment


      • #4
        The N's in mpileup tell me that there is likely a mismatch between what you said the reference was, and what you actually aligned to. So start by double-checking that the reference names in your .bam really match the ones in the reference. For instance, maybe there are weird characters, or spaces, or something that are being handled differently between the aligner and samtools.

        Once that checks out, I'd remake the reference index with samtools fiadx. Make sure it completes with no errors.

        For the one that is all header, you see that you have some filtering steps in your command lines. Why don't you start by taking those away. Maybe they are filtering away all your SNPs.

        Also, I think with all your bam to sam and sam to bam, you might have lost your headers. You need -h to make sure those get converted too.
        Last edited by swbarnes2; 05-08-2014, 09:21 AM.

        Comment


        • #5
          Thank you for the suggestions swbarnes2. I agree that the N's make it seem like there is something wrong with the reference. My reference has a single header line and each subsequent line has 70 nucleotides. In the .fai file there is one line, which looks similar to the header of the fasta reference file. The last part of this line, which I discovered provides information about the "offset within the FASTA file of this sequence's first base", "the number of bases on each line", and "the number of bytes in each line, including the newline" (http://manpages.ubuntu.com/manpages/...5/faidx.5.html), has the numbers 80, 70, and 71, which seem to be correct since the header is 78 letters/ numbers long (plus two bytes for the header) and every subsequent line is 70 letters (plus one byte for the line).

          I used "samtools view out.file_sorted_mapped.bam" to view the bam file. Just looking at the first few lines, I noticed that the lines (aka the header, quality, sequence lines) were wrapping. Not sure if this would cause a problem.

          I also tried this command (samtools tview out.file_sorted_mapped.bam sequence_genome.fasta). I found that my second line was N's and my third line was my reference sequence. So the same sort of problem seen here (http://seqanswers.com/forums/showthread.php?t=5360). However, none of their solutions seemed correct.

          I remade the index file with samtools faidx as you recommended and it came back with no errors. When I opened it, it looked the same as my orginial .fai file (described above).

          I also tried using the "-h" (header option) in the samtools mpileup command as you recommended, and this got rid of the " EOF marker is absent. The input is probably truncated" comment. However, there were still no SNPs in the .vcf file.

          When I take away the varfilter, the file becomes giant, and if I use -D10, there are no SNPs found.

          So, I'm not sure how tview and the mpileup command are related, but I think that the N's in the tview are the same N's in the mpileup output.

          So, how can we get mpileup to use the third line (the reference sequence) of the tview rather than the second line (N's)?

          Thanks,
          Jason

          Comment


          • #6
            I also just saw that the reference in tview stops at around 1350 bases. What does that mean?

            Comment


            • #7
              Hi, jmwhitha. Have you solved your problem? I got the same problem with so many N's on ref in my vcf.file. If you have solved it, Hope to hear from you.
              Thank you.

              Comment


              • #8
                I would still like to know how, but I ended up using Tablet to confirm the mutations that were found when I did genome sequencing.

                God bless,
                Jason

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                9 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X