Old 08-05-2009, 09:39 AM   #1
dgmacarthur
Variant calling for high-coverage Illumina data

Hi,

I've got a fairly unusual data-set: Illumina PE 50 bp data on a 25 kb region of human genomic DNA sequenced in ~1,500 samples to very high coverage, typically over 300X. (For the curious, it was generated using barcoding of 96 samples per lane.)

I've mapped all of the reads with Maq and finished BAM conversion and pileup with SAMtools. Now my mission is to call genotypes for SNPs, indels and SVs as comprehensively as possible for all of the samples.

I've started playing around with SAMtools varFilter for SNP and indel calling, but before I get too involved in building up a pipeline for this I was hoping to get some insight from the various experts here: given this data-set, what approaches would you use for SNP, indel and SV calling? Are there reliable pipelines and parameters established for these analyses that anyone can recommend?

Cheers,


Daniel.

Old 02-03-2010, 11:36 AM   #2
erichpowell

Daniel,

How did the variant calling work out for you? I, too, have next-gen Illumina data, but mine covers the whole exome and so is not at quite such a high depth.

I have been using MAQ's built-in variant caller. The problem I'm finding is that with the easyrun parameters I get >180,000 SNPs in the filtered SNP list (cns.filter.snp).

This seems like far too many to be biologically plausible -- I was expecting something closer to 20K -- but I don't know how to go about distinguishing the true calls from the false positives.

Did you encounter any similar issues?

Old 02-03-2010, 11:55 AM   #3
NextGenSeq

I would look at
http://www.nature.com/nature/journal...ture08250.html
for the parameters used for SNP calling in whole-exome sequencing.

Old 03-31-2010, 11:50 AM   #4
erichpowell

I finally got to the bottom of why I had so many variants. I had neglected to run the sol2sanger command, which, as Heng Li warns, results in unreliable SNP calls, because all of the base qualities appear higher than they actually are.

After running this command and re-aligning the data, the number of variants (in cns.final.snp) dropped to about 50,000.
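
For anyone hitting the same mistake, the missing step was roughly this (a sketch; the file names are placeholders):

# convert Solexa/Illumina-scaled qualities to the Sanger scale before mapping
maq sol2sanger reads.solexa.fastq reads.sanger.fastq
# then re-run the usual maq mapping/easyrun on the converted file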

Old 03-31-2010, 12:02 PM   #5
erichpowell
Reported read depth equals 255, but pileup shows otherwise

I have come across another issue. It seems that in the cns.snp file the maximum read depth MAQ will report is 255, even though the pileup command indicates that many more reads covered the location.

(The only explanation I can think of is that capping the count at 255 lets MAQ store it in a single-byte data type and thus reduce its memory footprint.)

Regardless of why it's done, this raises questions about the reads that are not reported. In particular, we are developing our own genotype caller and, for the purposes of comparison, we need to know exactly which reads MAQ is using when it makes a call. How can we get the IDs of the reads that are used?

Another pertinent question: if only 255 reads are used to make the call, how are those reads chosen? Are the highest-quality ones used?
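
(In the meantime, the best I can do is list every read overlapping a site from the BAM, even though that doesn't reveal which 255 of them MAQ kept; a sketch, with placeholder file and region names:)

# list the IDs (column 1) of all reads overlapping a position of interest;
# requires a sorted, indexed BAM (samtools index aln.sorted.bam)
samtools view aln.sorted.bam chr1:1234567-1234567 | cut -f1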

Old 03-31-2010, 05:13 PM   #6
jeffhsu3

I think the read ceiling is set to guard against background noise.

I'm not sure about MAQ, but if you use the SAMtools pileup command with consensus calling (which uses the same model as MAQ), it reports all the reads, and the calls can then be filtered with the samtools.pl varFilter script. The default maximum read depth there is 100, though.
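
Something along these lines, I believe (a sketch; the file names and the -D value are placeholders):

# consensus calling with the MAQ model (-c) against the reference (-f),
# then filter; -D raises varFilter's maximum read depth from its default of 100
samtools pileup -cf ref.fa aln.bam | samtools.pl varFilter -D 500 > var.flt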

Someone correct me if I'm wrong as I'm new to all this.

Old 04-01-2010, 01:02 PM   #7
bioinfosm

Correct. MAQ is tested at 20-40x depth of coverage, I believe, and more depth adds noise, leading to more SNP calls.

Perhaps you could randomly downsample your samples to 40-60x average coverage, and then run the maq alignment and call SNPs.
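
One rough way to do that downsampling on paired FASTQ files before alignment (a sketch, assuming the mates are in the same order in both files; the fraction and seed are arbitrary):

# keep ~20% of read pairs; the identical srand() seed makes awk draw the
# same random sequence for both files, so mates stay in sync
paste - - - - < reads_1.fastq | awk 'BEGIN{srand(42)} rand()<0.2' | tr '\t' '\n' > sub_1.fastq
paste - - - - < reads_2.fastq | awk 'BEGIN{srand(42)} rand()<0.2' | tr '\t' '\n' > sub_2.fastq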

Old 04-02-2010, 06:26 AM   #8
NextGenSeq

High coverage usually means it is a repetitive region.

Old 04-06-2010, 08:50 AM   #9
erichpowell

Quote:
Correct. MAQ is tested at 20-40x depth of coverage, I believe, and more depth adds noise, leading to more SNP calls.
I am having a tough time understanding what "noise" you are referring to. Do you mean "noise" from reads showing other nucleotides? My understanding is that, when doing consensus calling, MAQ considers only the reads corresponding to the two most frequently observed nucleotides, so reads showing additional (different) nucleotides are discarded anyway.

Old 04-09-2010, 08:33 AM   #10
bioinfosm

By noise I mean the instrument error rate. My understanding is that at very high depth of coverage the errors accumulated at each position make it hard to call accurately.

We wanted to work with high-depth sequencing data to call rare variants, but were unable to go below 2%. For regular homozygous or heterozygous calls, 20-40x depth of coverage seems to give the best results.

Old 03-30-2011, 02:29 PM   #11
Papillon

I've got a similar problem: most of the exome is covered at 200x or more.

I already convert Illumina quality scores to Sanger scores by using the latest version of BWA and adding -I, but I still have too many false positives and, worse, too many false negatives, since highly covered areas are treated as background noise or as collapsed repeats and get a low score.

Even when I filter my pileup file to a minimum coverage of 10x and consensus, SNP and RMS mapping-quality scores of at least 15, I still have over 80,000 variants left.
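
For reference, the filter I'm applying is essentially this (a sketch, assuming the ten-column samtools pileup -c layout):

# col 5 = consensus quality, col 6 = SNP quality, col 7 = RMS mapping
# quality, col 8 = read depth in the consensus pileup format
awk '$8 >= 10 && $5 >= 15 && $6 >= 15 && $7 >= 15' raw.pileup > filtered.pileup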

Does anyone have any thoughts on how to tackle this problem?

Thanks a bunch!


Old 04-04-2011, 11:03 AM   #12
bioinfosm

You mention BWA... what variant caller did you use? Maybe trying a different variant caller would give you a better perspective.

Old 04-04-2011, 11:24 AM   #13
Papillon

Thank you for responding! I'm using SAMTools to do my variant calling.

I am still a student, quite new to exome analysis, and since I am self-educating I sometimes make rookie mistakes. But I have changed tactics:

I now first filter out reads with a low mapping quality from the SAM file and reads whose bases are more than 90% 'B'-flagged by Illumina from the FASTQ files, and then remove duplicate reads with Picard. I also increased the maximum allowed coverage to reduce the number of false negatives.
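
The mapping-quality and duplicate-removal steps look roughly like this (a sketch; the threshold and file names are placeholders, and I'm assuming Picard's old per-tool jar layout):

# drop alignments with mapping quality below 20
samtools view -b -q 20 aln.bam > aln.q20.bam
# mark and remove PCR duplicates with Picard
java -jar MarkDuplicates.jar INPUT=aln.q20.bam OUTPUT=aln.dedup.bam \
    METRICS_FILE=dup_metrics.txt REMOVE_DUPLICATES=true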

I hope the extreme coverage peaks will come down so that I can set the maximum allowed coverage back to a sensible value (maybe 2x the average coverage). I also expect my false-positive and false-negative rates to drop. It is running now, so I will know soon.

Old 04-04-2011, 12:47 PM   #14
NextGenSeq

Actually, with whole-exome data you should be getting 200K to 250K SNPs; with whole-genome data, about 3 million SNPs.

Old 04-04-2011, 01:14 PM   #15
Papillon

Wouldn't that mean that there are, relatively speaking, more SNPs in the exome than in the genome as a whole? The exome is much more conserved, so my estimate is that 200K to 250K would be far too many.

Besides, most studies report about 15,000 - 20,000, although I believe the actual number will be higher.

Please correct me if I'm making a wrong assumption somewhere.

Old 04-07-2011, 10:48 AM   #16
Papillon

Quote:
Originally Posted by Papillon
I now first filter out reads with a low mapping quality from the SAM file and reads whose bases are more than 90% 'B'-flagged by Illumina from the FASTQ files...
This doesn't seem to work. Although the alignment looks okay at first sight, creating the SAM file fails hopelessly (as you can see below). The same thing happened when I tried to map FASTQ files in which I had trimmed the bad read ends.

Since mapping went fine with untouched FASTQ files, is it safe to assume that tampering with these files can be quite tricky? The only other possible cause is using a newer version of BWA (v0.5.9b) with the -I option for Illumina scoring. Those are the only differences between the runs that mapped at all and the absolute failures.

In retrospect, this filtering method wasn't powerful enough to begin with, so I have removed it from the pipeline.

[bwa_sai2sam_pe_core] time elapses: 10.97 sec
[bwa_sai2sam_pe_core] changing coordinates of 18 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 19395 out of 60331 Q17 singletons are mated.
[bwa_paired_sw] 238 out of 193695 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 22668.70 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.81 sec
[bwa_sai2sam_pe_core] print alignments... 2.02 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (16981, 39972, 70412)
[infer_isize] low and high boundaries: 101 and 177274 for estimating avg and std
[infer_isize] inferred external isize from 39 pairs: 42807.359 +/- 29283.488
[infer_isize] skewness: 0.344; kurtosis: -1.096; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 220265 (6.06 sigma)
[bwa_sai2sam_pe_core] time elapses: 11.07 sec
[bwa_sai2sam_pe_core] changing coordinates of 12 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 20365 out of 59571 Q17 singletons are mated.
[bwa_paired_sw] 253 out of 194428 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 27222.88 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.83 sec
[bwa_sai2sam_pe_core] print alignments... 2.03 sec
[bwa_sai2sam_pe_core] 524288 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs

Old 04-07-2011, 12:52 PM   #17
genlyai

It shouldn't be tricky, but you need to remove both reads of a pair if you decide to remove one. I'm not familiar with BWA, but that seems like your problem.

Old 04-07-2011, 01:28 PM   #18
Papillon

In the end I removed only ~109,000 reads out of ~92,000,000 per FASTQ file, so I'd have expected BWA to have difficulties pairing only those few reads. Trimming reads (rather than removing them) seemed to cause similar problems, although I have to admit there were over ~50,000 reads with a 100% Q2/'B' flag, and trimming those is identical to removing them.

The same problem is described elsewhere on this forum (thread: 'BWA sampe hanging'): same symptoms, slightly different causes. It agrees with your answer that all pairs have to be lined up correctly, and apparently a small change can cause serious problems.

Old 04-08-2011, 06:53 AM   #19
genlyai

I'm not familiar with BWA, but Bowtie simply uses the read order in the two files to find read pairs, so as soon as you remove one read but not its mate, all subsequent reads go out of sync and can't be correctly paired.
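
If that's the cause, the fix is to filter both files in lockstep. A rough sketch (assuming the mates appear in the same order in reads_1.fastq and reads_2.fastq, and dropping any pair in which either mate's quality string is all 'B'):

# flatten each FASTQ to one record per line, join mates side by side,
# keep a pair only if neither quality string (fields 4 and 8) is all 'B',
# then split the survivors back into two in-sync FASTQ files
paste <(paste - - - - < reads_1.fastq) <(paste - - - - < reads_2.fastq) |
awk -F'\t' '$4 !~ /^B+$/ && $8 !~ /^B+$/ {
    print $1"\n"$2"\n"$3"\n"$4 > "filt_1.fastq"
    print $5"\n"$6"\n"$7"\n"$8 > "filt_2.fastq"
}'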

Old 04-08-2011, 07:34 AM   #20
Papillon

Thank you for responding! I remapped my data using the latest version of BWA and it worked out great, so the most likely cause of my previous failure is indeed that the read order was disturbed.

BWA does seem to try to fix broken pairing, but it takes an insane amount of time, and in the end the results are not what you would expect (see my earlier copy-paste of BWA's output).