SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Request for sequencing a whole chromosome cell line Iredc27 Illumina/Solexa 1 10-06-2011 02:49 AM
Request for sequencing a whole chromosome cell line Iredc27 General 2 10-05-2011 02:37 PM
Need Reference Annotation for MCF-7 Cell Line Transcripts AndrewCarpenter Bioinformatics 0 07-20-2010 07:52 AM
MCF-7 Cell Line Transcript Reference Annotation Needed AndrewCarpenter De novo discovery 0 07-19-2010 09:18 AM
Mutation of FOXL2 in granulosa-cell tumors of the ovary. Chien-Yuan Chen Literature Watch 0 07-20-2009 04:01 PM

Reply
 
Thread Tools
Old 02-08-2012, 01:59 PM   #1
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default Missing V600E mutation in WES of A375 cell line? Samtools problem?

I have whole-exome data from A375 cell line generated using Illumina HiSeq. So I created the following pipeline to find SNPs and INDELs. In particular, this cell line should have BRAF V600E mutation which I did not find. Could anyone help me with what went wrong in my pipeline and how to correct it so I can re-discover V600E mutation in BRAF?

1) bwa aln -t 8 hg19.fa S375_R1.fastq > S375_1.sai
2) bwa aln -t 8 hg19.fa S375_R2.fastq > S375_2.sai
3) bwa sampe -a 200 hg19.fa S375_1.sai S375_2.sai S375_R1.fastq S375_R2.fastq > S375_NoIndex_L007.sam
4) samtools view -bS S375_NoIndex_L007.sam > S375_NoIndex_L007.bam
5) samtools sort S375_NoIndex_L007.bam S375_NoIndex_L007.sorted
6) samtools index S375_NoIndex_L007.sorted.bam
7) samtools mpileup -uf hg19.fa S375_NoIndex_L007.sorted.bam | bcftools view -bvcg - > S375_NoIndex_L007.raw.bcf
8) bcftools view S375_NoIndex_L007.raw.bcf | vcfutils.pl varFilter -D200 > S375_NoIndex_L007_var_d200.flt.vcf


I only found BRAF introns using snpEff as below which dosen't make sense.


# Command line: SnpEff eff -v -i vcf -o txt hg19 S375_NoIndex_L007_var_d200.flt.vcf
7 140434607 * +A INS Hom 26.5 169 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140434607 * +AA INS Hom 26.5 169 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140437592 G A SNP Hom 13.7 4 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140453315 T C SNP Hom 176 28 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140470691 C T SNP Hom 21.8 2 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140495573 T C SNP Hom 15.9 2 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140499107 C T SNP Hom 22.8 3 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140510939 A G SNP Het 24 6 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140526728 T C SNP Het 7.8 6 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140529784 A G SNP Hom 13 2 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140532269 G C SNP Hom 123 5 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140535237 A G SNP Het 6.2 6 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140538991 T C SNP Hom 106 4 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140547384 A G SNP Hom 6.02 4 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140554320 T C SNP Het 3.01 5 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140560023 T C SNP Hom 32 3 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140561044 G A SNP Het 4.13 6 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140561121 A C SNP Hom 65.3 5 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
7 140562068 G T SNP Het 8.64 4 BRAF.7 BRAF mRNA NM_004333
INTRON 2301
angerusso is offline   Reply With Quote
Old 02-10-2012, 09:24 AM   #2
prm36
Member
 
Location: Edinburgh

Join Date: Mar 2010
Posts: 16
Default

What depth of coverage are you averaging for you sample?
You have specified a maximum cut off of 200 for your variant calling, is it possible that the site you are looking at has coverage >200?
prm36 is offline   Reply With Quote
Old 02-10-2012, 11:26 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

When all else fails, eyeball the raw data. You can look at the .sam file at that locus, and look at the pileup at that locus. Or grep the raw reads. Do your reads really show the thing you think they should? Sometimes, submitters mess up.

If you aren't seeing things you think you should be seeing, stop filtering so much. You are setting a rather low max insert size, and you are setting a coverage cap. Maybe those figures are wrong for your data.

Also, you don't actually have to index to use mpileup. And you can save space by piping the output from sampe into samtools view, so that it goes straight into a .bam.
swbarnes2 is offline   Reply With Quote
Old 02-10-2012, 05:17 PM   #4
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Question Missing V600E mutation in WES of A375 cell line? Samtools problem?

So I converted bam to bed format. I am not able to figure out if the bam file contains this mutation or not. Any help?

Genomic Position for V600E = chr7:140453136-140453136

Mutation Syntax CDS: c.1799T>A

Mutation NT: t>a

240 rows contain the coordinate 140453136 as seen below. How do I find if it contains the mutation V600E?

chr7 140453135 140453235 CRKLB09060:5209CKACXX:7:1102:17012:134505/1 60 +

chr7 140453136 140453236 CRKLB09060:5209CKACXX:7:1303:20425:8020/1 60 +

chr7 140453136 140453236 CRKLB09060:5209CKACXX:7:1305:11651:47745/1 60 -
angerusso is offline   Reply With Quote
Old 02-10-2012, 08:59 PM   #5
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Why not load the bam into IGV and look at that position? Other thing to consider, are you really sure you have A375?
Jon_Keats is offline   Reply With Quote
Old 02-10-2012, 09:26 PM   #6
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I don't see why converting the .bam to .bed would help. Use samtools view to look at the .sam entries that cross your mutation. Look at the actual
DNA sequence of those reads. Or, make a pileup of the region, and look at that.

Or, if you have coordinates of reads that cross your mutation locus, use grep to examine those reads out of the fastq.
swbarnes2 is offline   Reply With Quote
Old 02-12-2012, 09:04 AM   #7
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default Mutation found in bam, but not vcf file

So I found the mutation present in bam file (after I aligned just to chr7 and viewed it in IGV). My sense is if it's present in just aligning to chr7, it is also present in aligning to entire hg19.

So my follow-up question is can I replace the rest of the pipeline with something else? What other options I can try in mpileup or how can I view the output of mpileup to see where the mutation got filtered?

I obviously don't see it in the vcf output (this is only SNPs I get, other than alterations pointed to in introns)

# Command line: SnpEff eff -minQ 0 -i vcf hg19 S375_d500.flt.vcf
7 140430228 T C SNP Het 9.52 5 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 3585 bases
7 140430271 A G SNP Het 15.1 9 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 3542 bases
7 140432218 G A SNP Het 4.13 7 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1595 bases
7 140432223 G T SNP Het 10.4 7 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1590 bases
7 140432312 C T SNP Het 3.01 8 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1501 bases
7 140432360 C T SNP Het 3.01 8 BRAF.7 BRAF mRNA NM_004333 DOWNSTREAM: 1453 bases
angerusso is offline   Reply With Quote
Old 02-12-2012, 11:27 AM   #8
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

I'm confused why you realigned to chromosome 7 then view in igv instead of just looking at the initial hg19 alignment. But based on what you said the most likely issue is filtering.

What is the coverage at that position? As swbarnes suggested do an mpileup for that specific position. Alternatively, in igv mouse over each non-reference read a your position and record the mapping quality and base phred value. Most likely there is an issue with the input data quality if the mutation is visible in igv causing low mapping qualities and/orlow base qualities.
Jon_Keats is offline   Reply With Quote
Old 02-12-2012, 11:57 AM   #9
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

You still haven't provided all the information that would be useful. What are the quality scores of the letters that show the alternate letter? I'm guessing that the region you are interested in in uniquely mappable, but maybe you should check that by looking at the mapping quality of reads in this region from the whole genome alignment.

If the .bam shows the SNP, but mpileup won't call it, then it's probably because of your mpileup settings. Do you have a good reason for that -a 200 option in your command line? What happens if you leave that out? Have you actually looked at your data to see if that stringent an insert size is appropriate? Have you tried looking at the pileup file that mpileup generated with those settings? Does that show the SNP? Sometimes the BAQ calculations can wrongly mess up the quality of real SNPs, such that the SNP caller won't call them. You can see this in the pileup, the quality values will be terrible, while the .sam will show that the quality is fine, so look at that. If that's the case, BAQ can be disengaged in the pileup command with -B
swbarnes2 is offline   Reply With Quote
Old 02-13-2012, 03:41 PM   #10
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default

OK, good news - so now I am able to see the mutation when I tried "-D 1000" (max depth coverage option in vcftools). Just to answer "Jon_Keats" question that I tried just aligning to chr7 becuase the files are so huge that it take me forever to copy it to local computer to view anthing in IGV. So now I see mutation (just aligning to chr 7) after using "-D 1000" option.

But I don't understand what this means biologically and I am exploring. Any suggestions will be appreciated. I saw the mutation at "coverage" = 513 as outputted by snpeff software. This is the reason I didn't see it as I was using -D 200 and -D 500.

I am also now getting the results from aligning to entire hg19 and using the same parameters. WIll let you know what I get. Please comment on why is there a need to use such a high "D" option and what is the biological significance. I really appreciate your comments and suggestions.
angerusso is offline   Reply With Quote
Old 02-13-2012, 04:00 PM   #11
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

By the sounds of it, hard to transfer and coverage >500x I'd guess you have very high likely excessive coverage of this sample. You are running into one of filters designed to prevent erroneous reads that map to repeats from being called in SNPs. the -D 100 makes a lot of sense for whole genomes at 30x average coverage or -D 200 for whole exomes with ~100x coverage on average.

Suggestions
1) Looking at your commands I'd second swbarnes and remove the -a 200 option from bwa sampe (let bwa do it on the fly...things work better that way)
2) You should mark the duplicate reads using Picard I'd bet many of your reads are duplicates.
3) If this filter is freaking you out try using GATK for SNP calling instead as it does a downsampling to get around this issue
Jon_Keats is offline   Reply With Quote
Old 02-13-2012, 04:49 PM   #12
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default

Hey,

I will start the run from your option (1) right now to see what difference it makes, and will update.

Regarding your suggestion of option (2), I tried Picard but it keeps complaining about memory.

Haven't tried option (3) yet, but first will run based on your option (1) and will update. Thanks very much.
angerusso is offline   Reply With Quote
Old 02-13-2012, 05:02 PM   #13
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default

Also, I forgot to mention that I did two things as suggested by swbarnes2, first use:

1) -B option and -d1000000 in mpileup
2) used -D 1000 option in vcftools.

So will also check if -B option was necessary.
angerusso is offline   Reply With Quote
Old 02-15-2012, 09:04 PM   #14
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default

Hey Guys,

Just want to update that I see BRAF mutation mainly because of the -d 1000 option.

Now, I am in a position to decide cut-offs. I did little background read and saw Q>20 and coverage>100 being used. I couldn't find what cut-off for "maximum coverage" could be. Certainly it has to be atleast 600, but I don't know where to draw the line.

I am not happy with Picard as it keep complaning about disk space as I saw some people using it and re-aligning etc to re-calibrate quality scores. Is this really necessary?

Can't I use some "standard" cut-offs? What do you recommend?
angerusso is offline   Reply With Quote
Old 02-17-2012, 03:04 AM   #15
heiya
Member
 
Location: China

Join Date: Nov 2011
Posts: 14
Default

Quote:
Originally Posted by angerusso View Post
Hey Guys,

Just want to update that I see BRAF mutation mainly because of the -d 1000 option.

Now, I am in a position to decide cut-offs. I did little background read and saw Q>20 and coverage>100 being used. I couldn't find what cut-off for "maximum coverage" could be. Certainly it has to be atleast 600, but I don't know where to draw the line.

I am not happy with Picard as it keep complaning about disk space as I saw some people using it and re-aligning etc to re-calibrate quality scores. Is this really necessary?

Can't I use some "standard" cut-offs? What do you recommend?
the last step: vcfutils.pl varFilter option : -d minimum read depth
-Q: minimum RMS mapping quality for SNP and other filter opition you can set.
heiya is offline   Reply With Quote
Old 02-19-2012, 06:02 PM   #16
angerusso
Member
 
Location: US

Join Date: Oct 2011
Posts: 47
Default

Quote:
Originally Posted by heiya View Post
the last step: vcfutils.pl varFilter option : -d minimum read depth
-Q: minimum RMS mapping quality for SNP and other filter opition you can set.
Thank you very much. Yes, I am not using these to set-up my filters.
angerusso is offline   Reply With Quote
Reply

Tags
missing v600e mutation, whole exome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:37 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO