
**erichpowell** · 02-03-2010, 11:36 AM

Daniel,

How did the variant calling work out for you? I, too, have next-gen Illumina data, but mine covers the whole exome and so is not at such a high depth.

I have been using MAQ's built-in variant caller. The problem I'm finding is that, using the easyrun parameters, I get >180,000 SNPs in the filtered SNP list (cns.filter.snp).

This seems like way too many to be biologically plausible (I was expecting something closer to 20K), but I don't know how to go about distinguishing the true calls from the false positives.

Did you encounter any similar issues?

**NextGenSeq** · 02-03-2010, 11:55 AM

I would see

http://www.nature.com/nature/journal/v461/n7261/full/nature08250.html

for the parameters used for SNP calling in whole exome sequencing.

**erichpowell** · 03-31-2010, 10:50 AM

I finally got to the bottom of why I had so many variants. I had neglected to run the sol2sanger command, which, as Heng Li warns, results in unreliable SNP calls. (This is because all of the base qualities appear higher than they actually are.)

After running this command and re-aligning the data, the number of variants (in cns.final.snp) dropped to about 50,000.
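To illustrate why skipping the conversion inflates the qualities, here is a minimal Python sketch of a Solexa-to-Sanger re-encoding. It assumes the input uses Solexa-scaled scores with an ASCII offset of 64 and applies the usual relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1). The function names are made up for illustration; this is not MAQ's actual code.

```python
import math

SOLEXA_OFFSET = 64  # assumed ASCII offset of the unconverted qualities
SANGER_OFFSET = 33  # ASCII offset of Sanger/Phred qualities


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the nearest integer Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10.0) + 1)))


def solexa_to_sanger(qual):
    """Re-encode a Solexa (offset 64) quality string as Sanger (offset 33)."""
    return "".join(
        chr(solexa_to_phred(ord(c) - SOLEXA_OFFSET) + SANGER_OFFSET) for c in qual
    )


def misread_as_sanger(qual):
    """Scores a Sanger-expecting tool would see if no conversion were done."""
    return [ord(c) - SANGER_OFFSET for c in qual]
```

For example, `solexa_to_sanger("h")` gives `"I"` (Q40), whereas `misread_as_sanger("h")` gives 71, i.e. the same base looks 31 quality points better than it really is.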

**erichpowell** · 03-31-2010, 11:02 AM

Reported read depth equals 255, but pileup shows otherwise

I have come across another issue. It seems that in the cns.snp file, the maximum read depth MAQ will report is 255. This is despite the fact that the pileup command indicates that many more reads cover the location.

(The only explanation I can think of is that limiting the number of reads to 255 allows MAQ to use a smaller data type and thus reduce its memory footprint.)

Regardless of why it's done, it raises a number of questions about the reads that are not reported. In particular, we are developing our own genotype caller and, for the purposes of comparison, we need to know exactly which reads MAQ is using when it makes a call. How can we get the IDs of the reads that are used?

Another pertinent question: if only 255 reads are used to make the call, how are those reads chosen? Are the highest-quality ones used?
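A minimal sketch of that hypothesis, assuming the depth is stored in a single unsigned byte. This has not been checked against MAQ's source, and `pack_depth` / `unpack_depth` are hypothetical helpers that only show why such a field would top out at 255.

```python
import struct

MAX_U8 = 255  # largest value an unsigned 8-bit field can hold


def pack_depth(depth):
    """Store a read depth in one byte, saturating at 255 (assumed behaviour)."""
    return struct.pack("B", min(depth, MAX_U8))


def unpack_depth(raw):
    """Read the depth back from its one-byte representation."""
    return struct.unpack("B", raw)[0]
```

Under this assumption, a true depth of 1,000 and a true depth of 255 are indistinguishable once written out, which would match what the cns.snp file shows.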

**jeffhsu3** · 03-31-2010, 04:13 PM

I think a read ceiling is set to guard against background noise.

I am not sure about MAQ, but if you use the SAMtools pileup command with consensus calling, which uses the same model as MAQ, it reports all the reads, and the calls can then be filtered with the samtools.pl varFilter function. The default max reads is set to 100, though.

Someone correct me if I'm wrong as I'm new to all this.

**bioinfosm** · 04-01-2010, 12:02 PM

Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.

Perhaps you could randomly select 40-60x average coverage for your samples, and then maq align and call SNPs.
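A rough Python sketch of that kind of random down-sampling, assuming you already know the current average coverage. The function names and the fixed seed are illustrative only, and for paired-end data you would want to keep or drop both mates together, which this sketch does not do.

```python
import random


def keep_fraction(current_cov, target_cov):
    """Fraction of reads to keep so average coverage drops to target_cov."""
    return min(1.0, target_cov / float(current_cov))


def downsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return [read for read in reads if rng.random() < fraction]
```

For instance, going from roughly 200x to roughly 50x means `keep_fraction(200, 50)`, i.e. keeping about a quarter of the reads before running maq align and SNP calling.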

**NextGenSeq** · 04-02-2010, 05:26 AM

High coverage usually means it is a repetitive region.

**erichpowell** · 04-06-2010, 07:50 AM

> Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.

I am having a tough time understanding what "noise" you are referring to. Are you talking about "noise" from reads showing other nucleotides? My understanding is that, when doing consensus calling, MAQ considers only the reads that correspond to the two most frequently observed nucleotides, which means reads showing any other nucleotide are discarded.

**bioinfosm** · 04-09-2010, 07:33 AM

By noise I mean the instrument error rate. My understanding is that too high a depth of coverage increases the number of errors at each position, making it hard to call accurately.

We wanted to work on high-depth sequencing data to call rare variants, but were unable to go below 2%. For regular homozygous or heterozygous calls, 20-40x depth of coverage seems to give the best results.

**Papillon** · 03-30-2011, 01:29 PM

I've got a similar problem: most of the exome is covered at 200x or more.

I already convert Illumina scores to Sanger scores by using the latest version of BWA and adding -I, but I still have too many false positives and, worse, too many false negatives, since highly covered areas are treated as background noise or contig collapsing due to repetitive regions and get a low score.

When I clean up my pileup file to a minimum coverage of 10x and a consensus, SNP and RMS score of at least 15, I still have over 80,000 variants left.

Does anyone have any thoughts on how to tackle this problem?
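A minimal Python sketch of this kind of pileup clean-up. It assumes the tab-separated layout of `samtools pileup -c` output, with consensus quality, SNP quality, RMS mapping quality and read depth in columns 5 to 8; check this against your own file before relying on it. The function name and defaults are illustrative.

```python
def passes_filter(line, min_depth=10, min_qual=15):
    """Return True if a consensus pileup line meets the depth and score cutoffs.

    Assumed columns (1-based): 5 consensus quality, 6 SNP quality,
    7 RMS mapping quality, 8 read depth.
    """
    fields = line.rstrip("\n").split("\t")
    cons_q, snp_q, rms_q, depth = (int(x) for x in fields[4:8])
    return depth >= min_depth and min(cons_q, snp_q, rms_q) >= min_qual
```

It can be applied line by line, e.g. `kept = [l for l in open("sample.pileup") if passes_filter(l)]`, where `sample.pileup` is a placeholder file name.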

Thanks a bunch!

**bioinfosm** · 04-04-2011, 10:03 AM

You mention BWA. Which variant caller did you use? Maybe trying a different variant caller would give you a better perspective.

**Papillon** · 04-04-2011, 10:24 AM

Thank you for responding! I'm using SAMTools to do my variant calling.

I am still a student, quite new to exome analysis, and since I am self-taught I sometimes make rookie mistakes. But I have changed tactics:

I now first filter out reads with a low mapping quality in the SAM file and reads that Illumina has 'B'-flagged over more than 90% of their length in the FASTQ files, and then remove duplicate reads with Picard. I also increased the maximum allowed coverage to reduce the number of false negatives.

I hope the extremely high coverage will come down so that I can set the maximum allowed coverage back to a reasonable value (maybe 2x the average coverage).
I expect my false-positive and false-negative rates to drop. It is running now, so I will know soon.
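The two read-level filters can be sketched in Python roughly as follows. This assumes the FASTQ qualities are still in the original Illumina encoding, where 'B' marks low-quality bases, and that mapping quality is the fifth tab-separated SAM field. The MAPQ cutoff of 20 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def b_fraction(qual):
    """Fraction of bases in a FASTQ quality string that carry the 'B' flag."""
    return qual.count("B") / float(len(qual)) if qual else 0.0


def fastq_read_ok(qual, max_b_fraction=0.9):
    """Reject reads that are 'B'-flagged over more than 90% of their length."""
    return b_fraction(qual) <= max_b_fraction


def sam_mapq_ok(line, min_mapq=20):
    """Keep SAM header lines and alignments with MAPQ >= min_mapq (assumed cutoff)."""
    if line.startswith("@"):
        return True
    mapq = int(line.split("\t")[4])  # MAPQ is the 5th SAM column
    return mapq >= min_mapq
```

Duplicate removal is left to Picard, as in the post; it is not reproduced here.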

**NextGenSeq** · 04-04-2011, 11:47 AM

Actually in whole exome data you should be getting 200K to 250K SNPs, whole genome data about 3 million SNPs.

**Papillon** · 04-04-2011, 12:14 PM

Wouldn't that mean that there are, relatively speaking, more SNPs in the exome than in the genome? The exome is much more conserved, so my estimate would be that 200K to 250K is way too many.

Besides, most studies speak of 15,000 - 20,000, although I believe the actual number will be higher.

Please correct me if I am making a mistaken assumption somewhere.
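The density argument can be made concrete with a quick back-of-the-envelope calculation. The sizes used here (about 3 Gb for the genome and about 30 Mb for the exome) are round-number assumptions that are not stated anywhere in this thread; the SNP counts are the ones quoted above.

```python
GENOME_BP = 3.0e9  # assumed genome size, round number
EXOME_BP = 3.0e7   # assumed exome size, roughly 1% of the genome


def bp_per_snp(region_bp, n_snps):
    """Average spacing between SNPs, in base pairs."""
    return region_bp / float(n_snps)


genome_spacing = bp_per_snp(GENOME_BP, 3.0e6)     # 3 million genome SNPs
exome_spacing_200k = bp_per_snp(EXOME_BP, 2.0e5)  # 200K exome SNPs
exome_spacing_20k = bp_per_snp(EXOME_BP, 2.0e4)   # 20K exome SNPs
```

Under these assumptions, 3 million genome SNPs works out to one per 1,000 bp, 200K exome SNPs to one per 150 bp, and 20K exome SNPs to one per 1,500 bp, so only the lower exome figure is consistent with the exome being less variable than the genome as a whole.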


Variant calling for high-coverage Illumina data
