Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calling SNPs from RNAseq bam files

    Hello SeqAnswers community,

    I am trying to use Samtools to call SNPs from my RNAseq data. I already have bam files, so I tried to use the following commands.

    #Method 1 -
    samtools sort file.bam file_sorted
    samtools mpileup -d8000 -uf genome.fasta bamfile1.bam | bcftools view -vcg - > s.vcf
    #When I opened the s.vcf file, it was a very large file (MB), it was showing a varient in every position, "N" was the reference for every position, and almost every "ALT" was multiple bases (A, T, G)

    #Method 2 -
    samtool sort file.bam file_sorted
    samtools mpileup -d8000 -uf sequence_genome.fasta file_sorted.bam | bcftools view -bvcg - > file_samtools.raw.bcf
    bcftools view file_samtools.raw.bcf | vcfutils.pl varFilter -D100 > file_samtools.vcf
    #When I opened this vcf file, there were no SNPs. Below are the last four lines.

    ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
    ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
    ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT file_sorted.bam

    Does anyone know what is happening? I also had trouble using GATK, so please don't stear me in that direction.

    Thanks and God bless,
    Jason

  • #2
    What does "samtools flagstat" tell you?
    Can you view your mapped reads in SAM format using "samtools view"?

    Can you dump a random couple of dozen mapped alignments? Do these look right?

    Comment


    • #3
      Thanks for your quick response Richard! I was already home when I saw it though. Here's what I could get.

      ##"samtool flagstat" ##
      samtool flagstat file.bam

      11959974 + 0 in total (QC-passed reads + QC-failed reads)
      0 + 0 duplicates
      8460790 + 0 mapped (70.74%:-nan%)
      0 + 0 paired in sequencing
      0 + 0 read1
      0 + 0 read2
      0 + 0 properly paired (-nan%:-nan%)
      0 + 0 with itself and mate mapped
      0 + 0 singletons (-nan%:-nan%)
      0 + 0 with mate mapped to a different chr
      0 + 0 with mate mapped to a different chr (mapQ>=5)


      # To get only the mapped reads use the parameter 'F', which works like -v of grep.
      samtool view -b -F 4 file_sorted.bam > file_sorted_mapped.bam


      ## Bam to Sam ###
      samtool view -h -o out.file_sorted_mapped.sam file_sorted_mapped.bam

      ## samtools view ##

      samtool view -c -S out.file_sorted_mapped.sam

      [samopen] SAM header is present: 4184 sequences.
      mapped count (8460790) #Value matched to above matched value in flagstat.

      ## Sam to Bam ##
      samtool view -bS out.file_sorted_mapped.sam > out.file_sorted_mapped.bam
      [samopen] SAM header is present: 4184 sequences.

      # Find SNPs using SAMtools
      samtool mpileup -d8000 -uf sequence_genome.fasta out.CLJUaligned_0112-02_sorted_mapped.bam | /root/Tools/samtools-0.1.19/bcftools/bcftools view -vcg - > s.vcf

      [bam_header_read] EOF marker is absent. The input is probably truncated.
      [mpileup] 1 samples in 1 input files
      [bcfview] 100000 sites processed.
      [afs] 0:6507.765 1:2715.281 2:90776.954
      [bcfview] 200000 sites processed.
      [afs] 0:8265.636 1:2648.100 2:89086.264
      [bcfview] 300000 sites processed.
      [afs] 0:10133.539 1:2403.082 2:87463.378
      [afs] 0:289.935 1:75.121 2:2426.943

      # Opened vcf file -- same result as before i.e. "N" position in referene file.

      Comment


      • #4
        The N's in mpileup tell me that there is likely a mismatch between what you said the reference was, and what you actually aligned to. So start by double-checking that the reference names in your .bam really match the ones in the reference. For instance, maybe there are weird characters, or spaces, or something that are being handled differently between the aligner and samtools.

        Once that checks out, I'd remake the reference index with samtools fiadx. Make sure it completes with no errors.

        For the one that is all header, you see that you have some filtering steps in your command lines. Why don't you start by taking those away. Maybe they are filtering away all your SNPs.

        Also, I think with all your bam to sam and sam to bam, you might have lost your headers. You need -h to make sure those get converted too.
        Last edited by swbarnes2; 05-08-2014, 09:21 AM.

        Comment


        • #5
          Thank you for the suggestions swbarnes2. I agree that the N's make it seem like there is something wrong with the reference. My reference has a single header line and each subsequent line has 70 nucleotides. In the .fai file there is one line, which looks similar to the header of the fasta reference file. The last part of this line, which I discovered provides information about the "offset within the FASTA file of this sequence's first base", "the number of bases on each line", and "the number of bytes in each line, including the newline" (http://manpages.ubuntu.com/manpages/...5/faidx.5.html), has the numbers 80, 70, and 71, which seem to be correct since the header is 78 letters/ numbers long (plus two bytes for the header) and every subsequent line is 70 letters (plus one byte for the line).

          I used "samtools view out.file_sorted_mapped.bam" to view the bam file. Just looking at the first few lines, I noticed that the lines (aka the header, quality, sequence lines) were wrapping. Not sure if this would cause a problem.

          I also tried this command (samtools tview out.file_sorted_mapped.bam sequence_genome.fasta). I found that my second line was N's and my third line was my reference sequence. So the same sort of problem seen here (http://seqanswers.com/forums/showthread.php?t=5360). However, none of their solutions seemed correct.

          I remade the index file with samtools faidx as you recommended and it came back with no errors. When I opened it, it looked the same as my orginial .fai file (described above).

          I also tried using the "-h" (header option) in the samtools mpileup command as you recommended, and this got rid of the " EOF marker is absent. The input is probably truncated" comment. However, there were still no SNPs in the .vcf file.

          When I take away the varfilter, the file becomes giant, and if I use -D10, there are no SNPs found.

          So, I'm not sure how tview and the mpileup command are related, but I think that the N's in the tview are the same N's in the mpileup output.

          So, how can we get mpileup to use the third line (the reference sequence) of the tview rather than the second line (N's)?

          Thanks,
          Jason

          Comment


          • #6
            I also just saw that the reference in tview stops at around 1350 bases. What does that mean?

            Comment


            • #7
              Hi, jmwhitha. Have you solved your problem? I got the same problem with so many N's on ref in my vcf.file. If you have solved it, Hope to hear from you.
              Thank you.

              Comment


              • #8
                I would still like to know how, but I ended up using Tablet to confirm the mutations that were found when I did genome sequencing.

                God bless,
                Jason

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Investigating the Gut Microbiome Through Diet and Spatial Biology
                  by seqadmin




                  The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                  02-24-2025, 06:31 AM
                • seqadmin
                  Quality Control Essentials for Next-Generation Sequencing Workflows
                  by seqadmin




                  Like all molecular biology applications, next-generation sequencing (NGS) workflows require diligent quality control (QC) measures to ensure accurate and reproducible results. Proper QC begins at nucleic acid extraction and continues all the way through to data analysis. This article outlines the key QC steps in an NGS workflow, along with the commonly used tools and techniques.

                  Nucleic Acid Quality Control
                  Preparing for NGS starts with isolating the...
                  02-10-2025, 01:58 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 03-03-2025, 01:15 PM
                0 responses
                46 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 02-28-2025, 12:58 PM
                0 responses
                164 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 02-24-2025, 02:48 PM
                0 responses
                522 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 02-21-2025, 02:46 PM
                0 responses
                255 views
                0 likes
                Last Post seqadmin  
                Working...
                X