  • too high number of variants for human whole exome sequencing

    Dear all,
    my lab performed whole exome sequencing of 15 human samples and I performed the variant calling. However, even after stringent filtering, I ended up with about 75.800 variants, (present in at least 1/15 samples, unique variants per sample: 35.000-42.000), Ti/Tv ratio=2.6. Compared to other studies, this number is too high. For example O'Rawe at al. http://genomemedicine.com/content/5/3/28 performed per-sample variant calling (not multi-sample variant calling like me) and checked the concordance between 5 variant calling pipelines. Considering only GATK and samtools, they found 26.323 variants in the call intersection (average over all samples, I think). I ask you all for your opinon on why I have this high number of variants with a low Ti/Tv ratio and ask for proposals what I can do about it. I realize that people often apply additional filters to extract a subset of interesting variants (e.g. only variants with AF < 0.05), but I think some of my variants must be false positives and I want to eliminate those before I do downstream analysis. Here are the necessary information, thank you very much in advance.
    • Samples: 4 samples are affected by familial PD, 6 samples are healthy (or as yet undiagnosed) family members, 5 samples are unrelated healthy people
    • Exome Enrichment Kit: Illumina TruSeq (~62 Mb target region)
    • Sequenced on an Illumina MiSeq over multiple runs; for each run and each sample an alignment was created with BWA, and the per-run alignments were then merged per sample with Picard
    • Coverage across the 15 samples (not uniform between samples): median coverage 20X-40X, with about 40-60% of the target region (TruSeq) covered at 30X or more
    • 1st processing: BWA + GATK best practice (MD, IR, BQSR, multi-sample variant calling with UnifiedGenotyper only in the target region + 20 bp offset, VQSR) --> ~124,000 variants
      then a manual filter based on alternative allele count: only variants where at least one sample had an alt allele frequency >= 0.35 and at least 10 reads supporting the alt allele were accepted --> 98,000 variants remaining
    • 2nd processing: BWA + GATK MD, IR, BQSR + variant calling with samtools/bcftools only in the target region + bcftools varfilter (total min cov 60, max cov 1500) --> 109,000 variants
      then the same manual filter on alternative allele count as in the 1st processing --> 81,000 variants remaining
    • intersection of the 98,000 GATK variants and the 81,000 samtools variants --> ~75,800 variants present in at least one sample; unique variants per sample: 35,000-42,000
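    The manual alt-allele filter described above can be illustrated on toy counts. The three-column layout here (site ID, alt-supporting reads, total reads, one line per sample per site) is invented for the example and is not an actual VCF format; a site is kept if any sample passes both thresholds:

    ```shell
    # Keep a site if ANY sample has alt-allele fraction >= 0.35
    # AND at least 10 reads supporting the alt allele.
    printf 'rs1 12 30\nrs1 2 40\nrs2 3 50\n' |
    awk '$3 > 0 && $2 >= 10 && $2/$3 >= 0.35 { print $1 }' | sort -u
    # rs1 passes (12/30 = 0.40 with 12 alt reads); rs2 fails both thresholds.
    ```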

  • #2
    Hi evakoe,

    One explanation may be that the Illumina TruSeq exome (62Mb) includes untranslated regions, both 5' and 3', in the target regions - it has one of the largest target regions of all the exome enrichment kits. In the paper you mentioned, they used the Agilent SureSelect version 2 capture kit, which does not include these regions (although there is an extended version that does).

    Also, VQSR may not be the best approach for genotyping such a small cohort of samples. GATK recommends at least 30 exomes for it to work well. Perhaps try the alternative hard-filter approach, also detailed in the best practice documents. There are many exome datasets from the 1000 Genomes Project that can be downloaded to increase your sample size, and to check the genotypes you get against published data.
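    As a sketch of that hard-filter alternative, a GATK 2.x-era VariantFiltration command might look like the following; the file names are placeholders, and the thresholds shown are the best-practice SNP hard-filter defaults documented at the time:

    ```shell
    # Hard-filter SNPs instead of VQSR (GATK 2.x-era syntax; file names are placeholders).
    # Sites failing the expression are tagged "snp_hard_filter" in the FILTER column.
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V raw_variants.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o hard_filtered.vcf
    ```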

    In general, I have found that each exome has 18-20,000 SNVs in the coding regions plus first/last 2bp of introns (i.e. synonymous, non-synonymous, nonsense and splice site).

    Also, does your group of samples have close family members that allow you to check the genotypes for mendelian consistency, i.e. parents and children? This could highlight if there are many false positives.
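    A trio-style consistency check of that kind can be sketched on toy genotype calls; the tabular layout here (site, father GT, mother GT, child GT) is made up for the example, not a real VCF:

    ```shell
    # Flag sites where the child carries an allele seen in neither parent.
    printf 'site1 0/0 0/0 0/1\nsite2 0/1 0/0 0/1\n' |
    awk '{
        split($2, f, "/"); split($3, m, "/"); split($4, c, "/")
        for (i = 1; i <= 2; i++)
            if (c[i] != f[1] && c[i] != f[2] && c[i] != m[1] && c[i] != m[2]) { print $1; next }
    }'
    # site1 is flagged: the child's "1" allele appears in neither 0/0 parent.
    ```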
    Last edited by rbagnall; 10-21-2013, 08:41 PM.



    • #3
      Hi rbagnall,

      thank you very much for your reply. I think you made a good point with the large enrichment region of the TruSeq kit. I will look into this further.

      I realize that 15 samples is too few for accurate genotyping, but I have always been a bit hesitant to include 1000 Genomes samples because of the technical differences in the datasets. I can imagine that this introduces different kinds of biases that skew the results, but I should probably still give it a shot.

      I did check for Mendelian consistency as far as my samples allowed (which wasn't very far, since I have only one father and two sons that are closely related), but no obvious problem showed up.

      Thanks again, I do appreciate your opinion.



      • #4
        Hello again rbagnall and others,

        maybe I should have formulated my last post more as a question.

        1. Do you think that adding 1000 Genomes samples to my cohort will introduce biases into the variant calling due to the differences in data generation?

        2. I don't only have non-uniform coverage between the samples, but also non-uniform coverage across the target regions in one sample. Do you think this could be one reason for the high number of variants observed?

        Thank you



        • #5
          Hi,

          Since you already have BAM files for your samples, it would only take a couple of hours to run the GATK UnifiedGenotyper with hard filters, following the best practice guidelines. If the variant you are looking for has reasonable coverage in the 4 affected family members (autosomal dominant?), then this approach should call it. Since you also have some other non-affected family members, you may strike lucky and find what you are looking for without much further variant calling.
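          For reference, a multi-sample UnifiedGenotyper run of the kind suggested above might look like this (GATK 2.x-era syntax; the file names are placeholders for your BAMs and target BED):

          ```shell
          # Multi-sample variant calling restricted to the capture targets
          # (GATK 2.x-era syntax; file names are placeholders).
          java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
              -R reference.fasta \
              -I sample1.bam -I sample2.bam \
              -L target_regions.bed \
              -glm BOTH \
              -o raw_variants.vcf
          ```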

          If you want to use VQSR and add in extra 1000 Genomes exome data there are a couple of things I would consider.

          1. The 1000 Genomes exome data was generated on Illumina GAIIx, Illumina HiSeq 2000, ABI SOLiD and 454 machines, and each has its own genotyping/error biases. Since you used a MiSeq, I would lean towards exome data generated on the HiSeq 2000 rather than SOLiD or 454.


          2. The exome capture kits used by the 1000 Genomes Project varied over the course of the project, reflecting the release of newer kits as they came out. I would try to use data captured with one of the kits with the larger target regions, such as Nimblegen SeqCap EZ Human Exome v2.0, as used by Baylor College of Medicine (BCM). As far as I am aware, the 1000 Genomes Project did not use the TruSeq kit that you used. You should make a new target interval list containing the overlap of your TruSeq targets and the Nimblegen SeqCap EZ Human Exome v2.0 targets (e.g. use Bedtools' intersectBed to do this). This ensures that all of your remaining target regions are present in both your data and the 1000 Genomes data.
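          Building that combined interval list could be sketched as follows; the BED file names are placeholders for your TruSeq targets and the Nimblegen v2.0 targets:

          ```shell
          # Keep only the regions present in BOTH capture designs
          # (file names are placeholders).
          intersectBed -a truseq_targets.bed -b nimblegen_v2_targets.bed > shared_targets.bed
          ```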


          See the supplementary information from the 1000 Genomes paper for more on this, particularly from page 14.
          3. You might also want to select samples that have the same ethnicity as your samples, and you can find this info from the 1000 Genomes website

          ftp://ftp.1000genomes.ebi.ac.uk/vol1...es_samples.xls

          If you are looking for European samples (CEPH, GBR), you can download the data from the short read archives.
          But really, the UnifiedGenotyper with hard filters may be a lot quicker; then look for rare, damaging variants shared by all affected family members, possibly in reasonable candidate genes for the disease. Then look at the variants (using IGV or your favourite viewer) to see if they seem real, before genotyping to show co-segregation with disease, and publishing before home time.
          Last edited by rbagnall; 10-23-2013, 09:45 PM.



          • #6
            Dear rbagnall, thank you very much for your great suggestions.
            Eva

