SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Indel detection in high coverage amplicons sequenced by MiSeq lpalacios Bioinformatics 2 02-24-2015 07:04 AM
find areas of very low and extra high coverage in NGS data willMD Bioinformatics 4 04-04-2013 11:33 AM
Variant calling on custom amplicons with extremely high coverage memento Bioinformatics 2 02-18-2013 09:38 PM
Low frequency indel detection NestorNotabilis Bioinformatics 3 05-31-2012 12:25 PM
MAQ and indel detection fadista Bioinformatics 1 09-03-2008 01:07 AM

Reply
 
Thread Tools
Old 02-17-2015, 12:01 PM   #1
alexholman
Junior Member
 
Location: Boston, MA

Join Date: Feb 2015
Posts: 6
Default Indel detection in NGS high coverage amplicons

I am attempting to detect indels from a panel of clones resulting from CRISPR targeted deletion. Regions around the target were PCR amplified to produce a roughly 160bp amplicon, which was then sequenced with as a PE150 run.
I've been banging my head against finding a tool that can:

1) Detect which clones have indels
2) Identify the location of these indels, ideally in a VCF or similar file such that the full panel of 96 clones can visualized (i.e. in IGV).
3) Provide annotation details about the read depth at the indel position and percentage of the sequences that contain an indel.
4) Is a stand-alone tool that I can install on my Unix box.

This question is very similar to one asked back in 2012, which was never sufficiently answered.
http://seqanswers.com/forums/showthread.php?t=25190

So far, I have tried a number of methods, each of which have some key failings.

CRISPR Genome Analyzer:
crispr-ga.net
Doesn't localize the indel detected. Only available as a web app; I need something that I can install locally and hammer on.

Outknocker:
Doesn't give localizations. Web based.

Pindel:
Pindel ends up missing a number of indels that are clearly there when viewing the reads from the BAM file. Also, pindel format is not good for looking across many clones, and the conversion pindel-to-vcf loses any depth and read-percentage information.

GATK HaplotypeCaller:
Misses a lot of very obvious indels. The caller appears to not be optimized for non-tiled non-fragmented reads.

GASVPRO and SVseq:
Simply don't detect any indels where they clearly exist. Again, the callers seem non-optimized for amplicon sequencing.

XHMM:
Refuses to do the PCA normalization with fewer targets than samples

CONTRA:
Doesn't deal well with small analysis windows for detecting indels shorter than 11bp.

CRISPR-GA uses a BLAT based alignment for their backend, but I can't find any additional information on how I would do further indel detection after a basic BLAT alignment. Any ideas on a BLAT based indel calling solution?
This seems like it should be a dead simple problem, something that was solved back in the 80's, but I can't find a good reference to a tool to use. Any help is appreciated.

Thanks
alexholman is offline   Reply With Quote
Old 02-18-2015, 12:11 AM   #2
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

I have the same experience as you - all popular variant calling tools out there are optimized for shotgun libraries. For targeted resequencing with PCR amplicons I do a pileup with samtools and parse the output, calling variants at suitable thresholds.
sarvidsson is offline   Reply With Quote
Old 02-18-2015, 08:38 AM   #3
KaiYe
Senior Member
 
Location: amsterdam

Join Date: Jun 2009
Posts: 133
Default

Pindel is able to call indels from many samples at the same time and give you indication of which samples are the carriers. The new pindel2vcf gives you a clean way to view the result with the numbers of reads supporting ref and alt alleles.

you could put a list of bam files in the config file and run all samples together.
KaiYe is offline   Reply With Quote
Old 02-18-2015, 12:21 PM   #4
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Just wondering if you've tried Geneious? It's commercial software, so maybe that doesn't meet your requirements, and I'm biased since I wrote the variant caller in it, but I've never seen it fail to call an obvious variant.

If you're willing to share your data along with the locations of some obvious indels that aren't getting called, then I can run it through Geneious for you and let you know how it goes.
Matt Kearse is offline   Reply With Quote
Old 02-24-2015, 06:59 AM   #5
alexholman
Junior Member
 
Location: Boston, MA

Join Date: Feb 2015
Posts: 6
Default

Well, after much frustration, I stumbled onto the package freebayes "Bayesian haplotype-based polymorphism discovery and genotyping"
https://github.com/ekg/freebayes

The caller seems to work well on amplicon data and ended up being the cleanest and most complete VCF file (with ref and alt allele frequencies).
alexholman is offline   Reply With Quote
Old 02-24-2015, 07:02 AM   #6
alexholman
Junior Member
 
Location: Boston, MA

Join Date: Feb 2015
Posts: 6
Default

sorry, double post

Last edited by alexholman; 02-24-2015 at 07:03 AM. Reason: double post
alexholman is offline   Reply With Quote
Old 02-24-2015, 10:32 PM   #7
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Alex shared some of his data with me to run through Geneious and we corresponded a bit by email, so I thought I'd share the results with anyone else who's interested.

Geneious called the indels although it split them into multiple adjacent indels in some situations which isn't ideal. I hope to improve this soon.

I also ran his data through FreeBayes, which also found the obvious indels he expected, but it didn't find other 'obvious' indels he wasn't aware of.

The main problem was that the data was poorly aligned. For example, in one sample, one allele had a 29bp deletion and the other allele a 44bp deletion in the same region. The alignment created using BWA mem had failed to span the 44bp deletion, so no neither Geneious nor FreeBayes would call this indel from this alignment. I generated a better alignment using Geneious, and then Geneious called both indels, although split it a way that made it difficult to infer the two alleles. FreeBayes still failed to identify the two alleles in this case even when provided with an improved alignment.

For indels like this, I recommend aligning using either Geneious, or BBMap which both successfully span large indels. Or maybe other aligners have settings to tweak that will improve results around indels.

And for variant calling on this type of data, both Geneious or FreeBayes do OK, although neither works perfectly on the data Alex provided even when I generated a better alignment.
Matt Kearse is offline   Reply With Quote
Old 02-25-2015, 02:43 AM   #8
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by Matt Kearse View Post
I also ran his data through FreeBayes, which also found the obvious indels he expected, but it didn't find other 'obvious' indels he wasn't aware of.
That's my experience with Freebayes as well, I haven't found a good set of parameters to work with the amplicon data I have. I typically use GMAP (when I have Sanger data) and GSNAP (for Illumina data) to align this kind of data, and with my own custom caller on pileups from samtools am quite happy...
sarvidsson is offline   Reply With Quote
Old 02-25-2015, 02:46 AM   #9
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by Matt Kearse View Post
For indels like this, I recommend aligning using either Geneious, or BBMap which both successfully span large indels. Or maybe other aligners have settings to tweak that will improve results around indels.
One of my colleagues has a Geneious license, so I might try it. However, I'm a command line guy - is there a way to automate Geneious for amplicon data? (I typically have thousands of samples with custom inline barcoding schemes)
sarvidsson is offline   Reply With Quote
Old 02-25-2015, 02:03 PM   #10
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Quote:
Originally Posted by sarvidsson View Post
One of my colleagues has a Geneious license, so I might try it. However, I'm a command line guy - is there a way to automate Geneious for amplicon data? (I typically have thousands of samples with custom inline barcoding schemes)
Unfortunately no, there isn't a Geneious command line interface. You can align or variant call in bulk by selecting all the data sets and choosing the options once.

Or if that's not sufficient you can put together workflows with optional custom code fragments. See https://www.youtube.com/watch?v=uvgB2_YBmD4 for a short demo of workflows.

Also, one limitation of Geneious is that you can't yet export to VCF format so you'll have to settle for CSV export for now.
Matt Kearse is offline   Reply With Quote
Old 02-26-2015, 12:20 AM   #11
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

Quote:
Originally Posted by Matt Kearse View Post
Unfortunately no, there isn't a Geneious command line interface. You can align or variant call in bulk by selecting all the data sets and choosing the options once.
Or if that's not sufficient you can put together workflows with optional custom code fragments. See https://www.youtube.com/watch?v=uvgB2_YBmD4 for a short demo of workflows.
I'll have a look at the custom workflows. Would it be possible to bulk import thousands of (typically paired) FASTQ files and assigning sample IDs to them?

Quote:
Originally Posted by Matt Kearse View Post
Also, one limitation of Geneious is that you can't yet export to VCF format so you'll have to settle for CSV export for now.
VCF would be nice but is not a must. If the CSV format contain enough data I could genereate VCF from it where needed. First I'd like to compare the aligner/caller to our current amplicon re-sequencing pipeline.
sarvidsson is offline   Reply With Quote
Old 02-26-2015, 01:58 PM   #12
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Quote:
Originally Posted by sarvidsson View Post
Would it be possible to bulk import thousands of (typically paired) FASTQ files and assigning sample IDs to them?
If prior to import you give the FASTQ files names that match their sample ID then their file name becomes the effective sample ID. Paired files should have an suffix (e.g 1 or 2) which will get stripped from the name when you pair them within Geneious which can be done in bulk.

It's probably best you just try it with a sample or two to start with to see if Geneious gives acceptable results on your data.
Matt Kearse is offline   Reply With Quote
Old 02-27-2015, 05:50 AM   #13
maxsalm
Member
 
Location: London

Join Date: Feb 2015
Posts: 18
Default

You may also find Scalpel useful (http://scalpel.sourceforge.net/) which uses an assembly step during indel calling (http://www.ncbi.nlm.nih.gov/pubmed/25128977 ) that may help with some of the alignment-derived false negatives.
maxsalm is offline   Reply With Quote
Old 03-09-2015, 04:57 PM   #14
gerbarinov
Junior Member
 
Location: RUSSIA MOSCOW

Join Date: Mar 2015
Posts: 3
Default

Quote:
Originally Posted by alexholman View Post
I am attempting to detect indels from a panel of clones resulting from CRISPR targeted deletion. Regions around the target were PCR amplified to produce a roughly 160bp amplicon, which was then sequenced with as a PE150 run.
I've been banging my head against finding a tool that can:

1) Detect which clones have indels
2) Identify the location of these indels, ideally in a VCF or similar file such that the full panel of 96 clones can visualized (i.e. in IGV).
3) Provide annotation details about the read depth at the indel position and percentage of the sequences that contain an indel.
4) Is a stand-alone tool that I can install on my Unix bo
Thank you, for your very informative topic with a happy ending
gerbarinov is offline   Reply With Quote
Old 07-24-2015, 01:02 PM   #15
Robin
Member
 
Location: US

Join Date: Nov 2009
Posts: 10
Default

Quote:
Originally Posted by alexholman View Post
Well, after much frustration, I stumbled onto the package freebayes "Bayesian haplotype-based polymorphism discovery and genotyping"
https://github.com/ekg/freebayes

The caller seems to work well on amplicon data and ended up being the cleanest and most complete VCF file (with ref and alt allele frequencies).
I have Crisp dataset with PE 250bp, and I tried Bayesian software tool, but I see very little INDEL in the vcf result file, and I can see the INDEL in IGV when the bam file is loaded into IGV view. I would like you to share some of detail info with me.
1) I used bwa-mem as an aligner, and which one you used?
2) my freebayes command line:
freebayes -f /home/db/chr5.fa --region chr5:112818960-112819204 my_sorted.bam > results.vcf
Robin is offline   Reply With Quote
Old 07-27-2015, 08:51 AM   #16
alexholman
Junior Member
 
Location: Boston, MA

Join Date: Feb 2015
Posts: 6
Default

There are two flags that I'm using that are not in your command.
--no-snps
suppresses calling SNPs, because for CRISPR analysis you don't really care about them
and
--use-duplicate-reads
I have a feeling that this one is what you need. When next-gen sequencing amplicons most of your reads are going to be duplicates, simply because you are sequencing identical amplicons. You need to keep these in to get proper depth for the analysis. Make sure your alignment pipeline is not removing duplicates, and use the above flag in freebayes to make sure the indel analysis is using them.
alexholman is offline   Reply With Quote
Old 07-27-2015, 10:20 AM   #17
Robin
Member
 
Location: US

Join Date: Nov 2009
Posts: 10
Default

Quote:
Originally Posted by alexholman View Post
There are two flags that I'm using that are not in your command.
--no-snps
suppresses calling SNPs, because for CRISPR analysis you don't really care about them
and
--use-duplicate-reads
I have a feeling that this one is what you need. When next-gen sequencing amplicons most of your reads are going to be duplicates, simply because you are sequencing identical amplicons. You need to keep these in to get proper depth for the analysis. Make sure your alignment pipeline is not removing duplicates, and use the above flag in freebayes to make sure the indel analysis is using them.
I just tried with two additional parameters on the command line, and the result is still the same as my early command line's results. Just let you know that my paired-end reads (250bp) is completed overlap with each other along the amplicons sequences region. I used the BWA-MEM aligned them with genomic reference hg38-chr5.fa only.

here is the samtools flaystat info:

-bash-4.1$ samtools flagstat test_clean.sorted.bam
18255 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
18243 + 0 mapped (99.93%:-nan%)
18255 + 0 paired in sequencing
9104 + 0 read1
9151 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
18231 + 0 with itself and mate mapped
12 + 0 singletons (0.07%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

you can see there is no properly paired in the sample.
Thanks

R
Robin is offline   Reply With Quote
Old 07-27-2015, 10:38 AM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You have a different number of read 1 and read 2. It looks like the pairing got corrupted in preprocessing, which would explain why none are proper pairs. You should reprocess the raw reads using both files at the same time, and only pair-aware tools such as BBDuk, to keep the order intact. Then remap. And if you are interested in indels, I suggest using BBMap for mapping. Alternately, if the pairs are all supposed to overlap, you can get more accurate indel calls by merging them first and mapping the merged reads rather than mapping them as pairs.
Brian Bushnell is offline   Reply With Quote
Old 07-27-2015, 11:48 AM   #19
Robin
Member
 
Location: US

Join Date: Nov 2009
Posts: 10
Default

Quote:
Originally Posted by Brian Bushnell View Post
You have a different number of read 1 and read 2. It looks like the pairing got corrupted in preprocessing, which would explain why none are proper pairs. You should reprocess the raw reads using both files at the same time, and only pair-aware tools such as BBDuk, to keep the order intact. Then remap. And if you are interested in indels, I suggest using BBMap for mapping. Alternately, if the pairs are all supposed to overlap, you can get more accurate indel calls by merging them first and mapping the merged reads rather than mapping them as pairs.
The fastq files have low quality scores and contamination and I used the fastx-toolkit to trimmed some of reads base on the quality scores. The results of PE reads are not matched. I will use BBDuk tool to make PE reads match again, and realigning them with bwa-mem.

You suggested to use "merged" two reads into one fastq file and map them as single-end reads. I am not sure that freebayes software will work with the single-end reads file.

Thanks

R
Robin is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:38 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO