SEQanswers

Old 12-18-2016, 05:04 PM   #1
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 20
No peak in BBNorm kmer-frequency histogram

Hi,

I'm working on de novo assembly of an insect genome. Our paired-end libraries were made from 10 ng of sheared DNA using a Rubicon ThruPLEX kit with 9 cycles of PCR. The bioanalyzer trace shows a mean insert size of 476 bp, and we sequenced around 120 million 125 base read-pairs from this library. We're expecting a genome size of around 500 Mbp (but that's a pretty rough guess).
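As a rough sanity check on these figures, the expected coverage can be worked out directly. The sketch below is only illustrative: the k-mer length of 31 is an assumption (it is not stated in the thread), the reads are treated as untrimmed 125-base reads, and the 500 Mbp genome size is just the rough guess given above.

```python
def expected_coverage(read_pairs, read_len, genome_size, k=31):
    """Return (base coverage, k-mer coverage) for a paired-end library.

    k=31 is an assumed k-mer length for illustration, not a value
    taken from the thread.
    """
    total_bases = read_pairs * 2 * read_len      # two reads per pair
    base_cov = total_bases / genome_size
    # Each read of length L contributes L - k + 1 k-mers.
    kmer_cov = base_cov * (read_len - k + 1) / read_len
    return base_cov, kmer_cov


base_cov, kmer_cov = expected_coverage(120_000_000, 125, 500_000_000)
print(f"base coverage ~{base_cov:.1f}x, k-mer coverage ~{kmer_cov:.1f}x")
```

With the numbers quoted above this comes to about 60x base coverage and about 45.6x k-mer coverage at k = 31, before any trimming or filtering.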

I'm having trouble getting contiguous assemblies. Among others, I've tried edena (L50 = 287 bp as reported by BBTools stats.sh) and velvet (L50 = 482 bp). I'm new to de novo assembly, but 9 cycles of PCR sounds like a lot to me, and I'm wondering if the library complexity is too low. Also, fastqc reports a GC content of around 30%, which I know can exacerbate PCR bias.

To troubleshoot, I'm looking at the before and after kmer-frequency histograms generated by BBNorm during normalisation (below). I can't see a peak in either histogram, but I'm not sure what that means. Can anyone help me interpret these plots or suggest further troubleshooting steps?

In case it's relevant, the processing I did before assembly was: quality trimming (Q < 30 at the 3' end) and adaptor trimming (TruSeq indexed adaptor and TruSeq universal adaptor) with cutadapt; contaminant filtering against the PhiX, sequencing_artifacts and adapters_no_transposase references (BBDuk); and normalisation and error correction to a target k-mer coverage of 57 (BBNorm).

Please let me know if any more information would help.

Thanks for reading,

Tom
Attached Images
File Type: png khist.png (25.8 KB, 20 views)
Attached Files
File Type: pdf khist.pdf (134.0 KB, 22 views)
Old 12-18-2016, 05:48 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

It looks to me like low library complexity due to overamplification, though it's hard to say. Note that there is a peak at ~16x; to see it, you have to rotate the image to the left by 45 degrees. The fact that it is such a weak peak indicates a very wide spread of coverage, which points to overamplification or an extremely high error rate. Severe contamination is another possibility, and it can also be a problem with low-input amplified libraries. Do the reads generally BLAST to related insects?
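The "rotate by 45 degrees" trick has a rough numeric analogue, assuming the histogram is drawn on log-log axes (an assumption, since the plot itself is not reproduced here). Rotating such a plot 45 degrees counterclockwise makes the new vertical coordinate proportional to log(depth) + log(count), so weighting each count by its depth can expose a bump that the raw counts hide. The sketch below takes (depth, count) pairs; the function name and the toy data are hypothetical, and real histograms are noisy enough that some smoothing would probably be needed first.

```python
def depth_weighted_peak(hist):
    """Return the depth of the first peak in depth * count, or None.

    hist is an iterable of (depth, count) pairs. This is a simple
    illustration, not the method BBNorm itself uses.
    """
    weighted = [(d, d * c) for d, c in sorted(hist)]
    # Walk down the initial low-depth (error) slope to the first valley.
    i = 0
    while i + 1 < len(weighted) and weighted[i + 1][1] < weighted[i][1]:
        i += 1
    if i + 1 >= len(weighted):
        return None  # weighted curve only ever decreases: no peak
    # The peak is the largest weighted value at or beyond the valley.
    depth, _ = max(weighted[i:], key=lambda pair: pair[1])
    return depth


# Toy data: raw counts fall monotonically, so no peak is visible,
# but depth * count has a valley at depth 4 and a peak at depth 6.
toy = [(1, 1000), (2, 300), (3, 100), (4, 60),
       (5, 50), (6, 45), (7, 30), (8, 10)]
print(depth_weighted_peak(toy))
```

A very shallow bump in the weighted curve would be consistent with the wide spread of coverage described above.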
Old 12-18-2016, 07:48 PM   #3
TomHarrop

Hi Brian, thanks for the reply. Contamination sounds quite possible. I just BLASTed a random subset of the reads and got human, macaque, trees, zebrafish etc., as well as the occasional hit on other insects. Uh oh.

Our server is going offline tonight but I'll do a more systematic investigation tomorrow and post the results.
Old 12-21-2016, 02:13 PM   #4
TomHarrop

I blastn-ed 1000 R1 and 1000 R2 reads from this library against the 'nt' database. For R1, I got 544 hits with an E-value < 1, and 493 of them had usable taxon identifiers. Of those, 89 were plant hits (18%), 82 mammalian (17%), 65 insect (13%), 63 nematode and 57 fish, plus a scattering of other taxa. R2 numbers were similar.
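The percentages above are relative to the 493 hits with usable taxon identifiers. A minimal sketch of that tally, using only the R1 counts quoted in this post (the function name and the "other" bucket are illustrative additions):

```python
def taxon_percentages(counts, total):
    """Express per-taxon hit counts as whole-number percentages of total.

    Hits not covered by the named groups are reported as 'other'.
    """
    counts = dict(counts)
    counts["other"] = total - sum(counts.values())
    return {taxon: round(100 * n / total) for taxon, n in counts.items()}


r1_hits = {"plant": 89, "mammal": 82, "insect": 65, "nematode": 63, "fish": 57}
for taxon, pct in taxon_percentages(r1_hits, 493).items():
    print(f"{taxon}: {pct}%")
```

On these counts, insects account for roughly 13% of the classified hits, with the remaining groups spread across unrelated taxa.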

I don't know if E-value is the best way to look at BLAST results for NGS reads (i.e. short queries), but either way it looks like contamination to me.

Thanks for the hint.

Last edited by TomHarrop; 01-10-2017 at 06:39 PM. Reason: more concise
Old 12-22-2016, 12:00 AM   #5
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215

I'd suggest doing the contamination analysis in a more systematic way, using biobloomtools with a few of the top-hit plant, mammal, insect, ... genomes you got from BLAST. It will probably take some time, but I'd be rather surprised if you really have so many different contaminations, as long as the person doing your library preps isn't also a dedicated gardener or fisherman.
Old 12-22-2016, 01:43 AM   #6
Brian Bushnell

Yes, to be honest, this does sound strange. Normally, contamination comes from 1 or 2 sources... a grab-bag of taxa is very unusual. Are you getting 100% identity to anything, or just weak hits?
Old 12-22-2016, 04:51 AM   #7
WhatsOEver

Quote:
Originally Posted by Brian Bushnell
Yes, to be honest, this does sound strange. Normally, contamination comes from 1 or 2 sources... a grab-bag of taxa is very unusual. Are you getting 100% identity to anything, or just weak hits?

And in addition: do you get your complete read sequence aligned, or are your hits rather the tiny 20-40 bp local alignment crap BLAST may output if there is nothing more suitable?

EDIT: Just saw that you used blastn. I'd suggest megablast here.

Last edited by WhatsOEver; 12-22-2016 at 04:54 AM.
Old 12-27-2016, 11:12 PM   #8
TomHarrop

Thanks for the replies. Sorry about the slow response, I missed the email notification over the holidays.

You're correct, the hits are mostly less than 60 bp, not the full read. I did try megablast, but I get almost no hits: 11 out of 1000 reads had hits, about half of them to insects.
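One way to separate full-length matches from short local alignments is to filter BLAST tabular output (`-outfmt 6`, whose default columns start with qseqid, sseqid, pident, length) on alignment length and identity. The sketch below is only illustrative: the 125-base read length comes from the first post, while the 80% coverage and 90% identity cutoffs and the function name are arbitrary assumptions, not recommendations from the thread.

```python
import csv
import io


def full_length_hits(blast_tsv, read_len=125, min_cover=0.8, min_ident=90.0):
    """Keep hits whose alignment spans most of the read at high identity.

    blast_tsv is the text of a BLAST -outfmt 6 table. The thresholds
    are assumed values chosen for illustration.
    """
    kept = []
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if length / read_len >= min_cover and pident >= min_ident:
            kept.append((qseqid, sseqid, pident, length))
    return kept


# Two made-up hits: a 40 bp local alignment and a near full-length one.
example = (
    "read1\tsubjA\t100.0\t40\t0\t0\t1\t40\t500\t539\t1e-10\t75.0\n"
    "read2\tsubjB\t98.4\t120\t2\t0\t3\t122\t100\t219\t1e-50\t210\n"
)
print(full_length_hits(example))
```

Here only the second, near full-length hit is kept; hits shorter than 60 bp on a 125-base read would all be discarded under these assumed cutoffs.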
Tags
bbnorm, de novo assembly, l50
