Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing V600E mutation in WES of A375 cell line? Samtools problem?

    I have whole-exome data from A375 cell line generated using Illumina HiSeq. So I created the following pipeline to find SNPs and INDELs. In particular, this cell line should have BRAF V600E mutation which I did not find. Could anyone help me with what went wrong in my pipeline and how to correct it so I can re-discover V600E mutation in BRAF?

    1) bwa aln -t 8 hg19.fa S375_R1.fastq > S375_1.sai
    2) bwa aln -t 8 hg19.fa S375_R2.fastq > S375_2.sai
    3) bwa sampe -a 200 hg19.fa S375_1.sai S375_2.sai S375_R1.fastq S375_R2.fastq > S375_NoIndex_L007.sam
    4) samtools view -bS S375_NoIndex_L007.sam > S375_NoIndex_L007.bam
    5) samtools sort S375_NoIndex_L007.bam S375_NoIndex_L007.sorted
    6) samtools index S375_NoIndex_L007.sorted.bam
    7) samtools mpileup -uf hg19.fa S375_NoIndex_L007.sorted.bam | bcftools view -bvcg - > S375_NoIndex_L007.raw.bcf
    8) bcftools view S375_NoIndex_L007.raw.bcf | vcfutils.pl varFilter -D200 > S375_NoIndex_L007_var_d200.flt.vcf


    I only found BRAF introns using snpEff as below which dosen't make sense.


    # Command line: SnpEff eff -v -i vcf -o txt hg19 S375_NoIndex_L007_var_d200.flt.vcf
    7 140434607 * +A INS Hom 26.5 169 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140434607 * +AA INS Hom 26.5 169 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140437592 G A SNP Hom 13.7 4 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140453315 T C SNP Hom 176 28 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140470691 C T SNP Hom 21.8 2 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140495573 T C SNP Hom 15.9 2 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140499107 C T SNP Hom 22.8 3 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140510939 A G SNP Het 24 6 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140526728 T C SNP Het 7.8 6 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140529784 A G SNP Hom 13 2 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140532269 G C SNP Hom 123 5 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140535237 A G SNP Het 6.2 6 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140538991 T C SNP Hom 106 4 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140547384 A G SNP Hom 6.02 4 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140554320 T C SNP Het 3.01 5 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140560023 T C SNP Hom 32 3 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140561044 G A SNP Het 4.13 6 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140561121 A C SNP Hom 65.3 5 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301
    7 140562068 G T SNP Het 8.64 4 BRAF.7 BRAF mRNA NM_004333
    INTRON 2301

  • #2
    What depth of coverage are you averaging for you sample?
    You have specified a maximum cut off of 200 for your variant calling, is it possible that the site you are looking at has coverage >200?

    Comment


    • #3
      When all else fails, eyeball the raw data. You can look at the .sam file at that locus, and look at the pileup at that locus. Or grep the raw reads. Do your reads really show the thing you think they should? Sometimes, submitters mess up.

      If you aren't seeing things you think you should be seeing, stop filtering so much. You are setting a rather low max insert size, and you are setting a coverage cap. Maybe those figures are wrong for your data.

      Also, you don't actually have to index to use mpileup. And you can save space by piping the output from sampe into samtools view, so that it goes straight into a .bam.

      Comment


      • #4
        Missing V600E mutation in WES of A375 cell line? Samtools problem?

        So I converted bam to bed format. I am not able to figure out if the bam file contains this mutation or not. Any help?

        Genomic Position for V600E = chr7:140453136-140453136

        Mutation Syntax CDS: c.1799T>A

        Mutation NT: t>a

        240 rows contain the coordinate 140453136 as seen below. How do I find if it contains the mutation V600E?

        chr7 140453135 140453235 CRKLB09060:5209CKACXX:7:1102:17012:134505/1 60 +

        chr7 140453136 140453236 CRKLB09060:5209CKACXX:7:1303:20425:8020/1 60 +

        chr7 140453136 140453236 CRKLB09060:5209CKACXX:7:1305:11651:47745/1 60 -

        Comment


        • #5
          Why not load the bam into IGV and look at that position? Other thing to consider, are you really sure you have A375?

          Comment


          • #6
            I don't see why converting the .bam to .bed would help. Use samtools view to look at the .sam entries that cross your mutation. Look at the actual
            DNA sequence of those reads. Or, make a pileup of the region, and look at that.

            Or, if you have coordinates of reads that cross your mutation locus, use grep to examine those reads out of the fastq.

            Comment


            • #7
              Mutation found in bam, but not vcf file

              So I found the mutation present in bam file (after I aligned just to chr7 and viewed it in IGV). My sense is if it's present in just aligning to chr7, it is also present in aligning to entire hg19.

              So my follow-up question is can I replace the rest of the pipeline with something else? What other options I can try in mpileup or how can I view the output of mpileup to see where the mutation got filtered?

              I obviously don't see it in the vcf output (this is only SNPs I get, other than alterations pointed to in introns)

              # Command line: SnpEff eff -minQ 0 -i vcf hg19 S375_d500.flt.vcf
              7 140430228 T C SNP Het 9.52 5 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 3585 bases
              7 140430271 A G SNP Het 15.1 9 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 3542 bases
              7 140432218 G A SNP Het 4.13 7 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1595 bases
              7 140432223 G T SNP Het 10.4 7 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1590 bases
              7 140432312 C T SNP Het 3.01 8 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1501 bases
              7 140432360 C T SNP Het 3.01 8 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1453 bases

              Comment


              • #8
                I'm confused why you realigned to chromosome 7 then view in igv instead of just looking at the initial hg19 alignment. But based on what you said the most likely issue is filtering.

                What is the coverage at that position? As swbarnes suggested do an mpileup for that specific position. Alternatively, in igv mouse over each non-reference read a your position and record the mapping quality and base phred value. Most likely there is an issue with the input data quality if the mutation is visible in igv causing low mapping qualities and/orlow base qualities.

                Comment


                • #9
                  You still haven't provided all the information that would be useful. What are the quality scores of the letters that show the alternate letter? I'm guessing that the region you are interested in in uniquely mappable, but maybe you should check that by looking at the mapping quality of reads in this region from the whole genome alignment.

                  If the .bam shows the SNP, but mpileup won't call it, then it's probably because of your mpileup settings. Do you have a good reason for that -a 200 option in your command line? What happens if you leave that out? Have you actually looked at your data to see if that stringent an insert size is appropriate? Have you tried looking at the pileup file that mpileup generated with those settings? Does that show the SNP? Sometimes the BAQ calculations can wrongly mess up the quality of real SNPs, such that the SNP caller won't call them. You can see this in the pileup, the quality values will be terrible, while the .sam will show that the quality is fine, so look at that. If that's the case, BAQ can be disengaged in the pileup command with -B

                  Comment


                  • #10
                    OK, good news - so now I am able to see the mutation when I tried "-D 1000" (max depth coverage option in vcftools). Just to answer "Jon_Keats" question that I tried just aligning to chr7 becuase the files are so huge that it take me forever to copy it to local computer to view anthing in IGV. So now I see mutation (just aligning to chr 7) after using "-D 1000" option.

                    But I don't understand what this means biologically and I am exploring. Any suggestions will be appreciated. I saw the mutation at "coverage" = 513 as outputted by snpeff software. This is the reason I didn't see it as I was using -D 200 and -D 500.

                    I am also now getting the results from aligning to entire hg19 and using the same parameters. WIll let you know what I get. Please comment on why is there a need to use such a high "D" option and what is the biological significance. I really appreciate your comments and suggestions.

                    Comment


                    • #11
                      By the sounds of it, hard to transfer and coverage >500x I'd guess you have very high likely excessive coverage of this sample. You are running into one of filters designed to prevent erroneous reads that map to repeats from being called in SNPs. the -D 100 makes a lot of sense for whole genomes at 30x average coverage or -D 200 for whole exomes with ~100x coverage on average.

                      Suggestions
                      1) Looking at your commands I'd second swbarnes and remove the -a 200 option from bwa sampe (let bwa do it on the fly...things work better that way)
                      2) You should mark the duplicate reads using Picard I'd bet many of your reads are duplicates.
                      3) If this filter is freaking you out try using GATK for SNP calling instead as it does a downsampling to get around this issue

                      Comment


                      • #12
                        Hey,

                        I will start the run from your option (1) right now to see what difference it makes, and will update.

                        Regarding your suggestion of option (2), I tried Picard but it keeps complaining about memory.

                        Haven't tried option (3) yet, but first will run based on your option (1) and will update. Thanks very much.

                        Comment


                        • #13
                          Also, I forgot to mention that I did two things as suggested by swbarnes2, first use:

                          1) -B option and -d1000000 in mpileup
                          2) used -D 1000 option in vcftools.

                          So will also check if -B option was necessary.

                          Comment


                          • #14
                            Hey Guys,

                            Just want to update that I see BRAF mutation mainly because of the -d 1000 option.

                            Now, I am in a position to decide cut-offs. I did little background read and saw Q>20 and coverage>100 being used. I couldn't find what cut-off for "maximum coverage" could be. Certainly it has to be atleast 600, but I don't know where to draw the line.

                            I am not happy with Picard as it keep complaning about disk space as I saw some people using it and re-aligning etc to re-calibrate quality scores. Is this really necessary?

                            Can't I use some "standard" cut-offs? What do you recommend?

                            Comment


                            • #15
                              Originally posted by angerusso View Post
                              Hey Guys,

                              Just want to update that I see BRAF mutation mainly because of the -d 1000 option.

                              Now, I am in a position to decide cut-offs. I did little background read and saw Q>20 and coverage>100 being used. I couldn't find what cut-off for "maximum coverage" could be. Certainly it has to be atleast 600, but I don't know where to draw the line.

                              I am not happy with Picard as it keep complaning about disk space as I saw some people using it and re-aligning etc to re-calibrate quality scores. Is this really necessary?

                              Can't I use some "standard" cut-offs? What do you recommend?
                              the last step: vcfutils.pl varFilter option : -d minimum read depth
                              -Q: minimum RMS mapping quality for SNP and other filter opition you can set.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X