SEQanswers

Old 08-01-2013, 05:20 PM   #1
fahmida
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 54
Strange FastQC k-mer plot even after trimming

Hi,
I'm getting the attached strange FastQC k-mer plot even after adapter and quality trimming. The data are from a 400bp PE library sequenced on a GAII. I used Trimmomatic to trim the TruSeq adapter:
Code:
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
Many other posts suggest not worrying too much about things like this. However, I am coming back to it only after getting a highly fragmented de novo assembly of a large genome. I understand that a de novo assembly can be fragmented for many reasons, but I want to make sure I'm supplying high-quality reads to the assembler, not to mention that the plot looks ugly.
Thanks for any suggestions.
Attached image: kmer_profiles.png
Old 08-01-2013, 05:36 PM   #2
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

Oh, for de novo assembly you should definitely worry about that (if you were just mapping reads to a genome, it likely wouldn't matter). Those 5-mers are being generated from two dinucleotide repeats (just in different frames and on different strands). That is going to screw up your assembly if very many of them are infiltrating your reads, which we can't tell from that plot, since it's only relative to the most abundant k-mer.
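
To illustrate how a single dinucleotide repeat shows up as several different 5-mers, here is a minimal Python sketch. The repeat unit `AC` is a hypothetical choice; the thread does not say which dinucleotides are in the plot.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Return the set of distinct k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical repeat; the actual repeat units are not named in the thread.
repeat = "AC" * 20

forward = kmers(repeat, 5)            # the two frames on the forward strand
reverse = kmers(revcomp(repeat), 5)   # the two frames on the reverse strand

print(sorted(forward))
print(sorted(reverse))
```

One repeat therefore contributes four distinct 5-mers once both frames and both strands are counted, so a handful of enriched 5-mers in the plot can trace back to very few underlying repeats.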

Are you sure you put in the correct adapter for trimming? The TruSeq adapter alone is often not enough; you usually need some set of indexed adapters, PCR primers, etc. I generally give Trimmomatic a pretty long list of every adapter/primer set that was used in the whole group of library preps being sequenced, just to be sure. If you don't, you'll find adapter/primer sequence from all kinds of stuff in your assemblies.
Old 08-01-2013, 05:50 PM   #3
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

Here are two k-mer plots from before (bottom) and after (top) aggressive trimming with Trimmomatic (including a quality trim) and overlapping with FLASH (do your 150bp reads overlap?). The CCCCC repeat in the top plot is very much lower than the spikes you see in the bottom one. Here is the adapter file I went with too; as you can see, it was a bit of the kitchen sink.
kmer_profiles.png

kmer_profiles_1.png

Adapters.txt
Old 08-01-2013, 06:59 PM   #4
fahmida
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 54

Hi Wallysb01,

Thanks for your reply. I haven't explored overlapping the reads.
I've used your adapters, and it seems most of those k-mers are still having fun out there.
Also, for your info, here is the Trimmomatic command I used:
Code:
java -classpath trimmomatic-0.30.jar org.usadellab.trimmomatic.TrimmomaticPE -threads 16 -phred33 ../lane2_NoIndex_L002_R1_001_val_1.fq ../lane2_NoIndex_L002_R2_001_val_2.fq paired21.fq unpaired21.fq paired22.fq unpaired22.fq ILLUMINACLIP:Adapters.txt:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:50
Attached image: kmer_profiles_2.png
Old 08-01-2013, 07:06 PM   #5
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

Eek, OK. Did the frequency drop much? You can tell from the table under that figure.
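
As an independent check on absolute abundance, one could also count what fraction of reads contain a given k-mer before and after trimming. The sketch below is only an illustration: the function names, the k-mer `ACACA`, and the toy reads are all hypothetical, and it assumes plain four-line FASTQ records.

```python
def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def fraction_with_kmer(reads, kmer):
    """Return the fraction of reads that contain kmer at least once."""
    total = 0
    hits = 0
    for read in reads:
        total += 1
        if kmer in read:
            hits += 1
    return hits / total if total else 0.0


# Toy data standing in for reads before and after trimming (hypothetical).
before = ["ACACACACAC", "GGATCCTTAG", "TTACACATGC", "CGCGTTAACG"]
after = ["GGATCCTTAG", "TTACACATGC", "CGCGTTAACG", "GATTACAGGC"]

print(fraction_with_kmer(before, "ACACA"))
print(fraction_with_kmer(after, "ACACA"))
```

On real data the same function could be fed `read_fastq_sequences("reads.fq")`, where the file name is again a placeholder.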

Also, how big are your inserts? Can you even attempt overlapping the reads?

And you may just want to trim off those first 10bp for your next assembly. That may help.

Finally, what kind of coverage do you have?
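
Coverage here means the usual back-of-the-envelope estimate: total sequenced bases divided by genome size. A small sketch follows; the read count and genome size are hypothetical, since the thread gives neither.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Estimate average coverage as total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical numbers: 200 million 150bp reads on a 1 Gb genome.
coverage = expected_coverage(200_000_000, 150, 1_000_000_000)
print(coverage)
```

For paired-end data, `n_reads` should count both mates, and trimming lowers the effective read length, so the estimate is an upper bound on what the assembler actually sees.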
Old 08-01-2013, 09:27 PM   #6
fahmida
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 54

Nope, the frequency doesn't drop much. Reads are 150bp, and the insert size is 400bp for two lanes and 700bp for another two lanes, hence not much chance of overlaps.
Yes, I did trim off 10bp in both directions and it's almost the same. It seems like I am running out of options.
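
The overlap arithmetic behind that statement can be sketched as follows. It assumes "insert size" means the full fragment length between the adapters, which the thread does not state explicitly.

```python
def expected_overlap(read_length, insert_size):
    """Bases shared by the two mates of a pair.

    A positive value is an overlap; a negative value is the size of the
    unsequenced gap between the mates. Assumes insert_size is the full
    fragment length.
    """
    return 2 * read_length - insert_size


for insert in (400, 700):
    print(insert, expected_overlap(150, insert))
```

With 150bp reads, both libraries come out negative, so the mates would only overlap if the real fragments are considerably shorter than the nominal sizes.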
Attached image: kmer_profiles_3.png
Old 08-01-2013, 09:50 PM   #7
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

You might try COPE (http://sourceforge.net/projects/coperead/). It can overlap reads using k-mers, so reads don't have to actually overlap; they just need to be close enough for high-frequency k-mers to span the gap. It may work pretty well with 2x150bp reads, because you could increase the k-mer size a little, assuming your coverage is pretty high too. And you can use the 700bp insert library to add to the k-mer pool without attempting overlaps on it.

With my shorter 170bp library, I found FLASH to work better, but that library really was that small, with very few fragments >190bp, so the k-mer method didn't seem to help much. And while your library may look like it's 400bp, I've generally found libraries to be shorter than what sequencing cores say.
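
For readers unfamiliar with read overlapping, the basic idea can be sketched in a few lines of Python. This is a naive exact-match illustration only, not the algorithm used by FLASH or COPE; the function name `merge_pair`, the `min_overlap` default, and the toy fragment are assumptions made for the example.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2.

    Looks for the longest exact match between the end of read1 and the
    start of the reverse-complemented read2. Returns the merged fragment,
    or None if no overlap of at least min_overlap bases is found.
    """
    mate = revcomp(read2)
    longest = min(len(read1), len(mate))
    for size in range(longest, min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


# Toy 30bp fragment sequenced with 20bp reads from both ends (hypothetical).
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTGCA"
read1 = fragment[:20]
read2 = revcomp(fragment[10:])

print(merge_pair(read1, read2))
```

Real tools tolerate mismatches and use base qualities. Note also that a dinucleotide-repeat read would satisfy an exact-match test like this at many offsets, which is one reason such reads are troublesome.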
Old 08-01-2013, 10:03 PM   #8
fahmida
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 54

Thanks for your suggestions. It would be good to have longer reads through overlaps; however, I think I need to get rid of those funny k-mers first, don't I? I can't find a way to deal with that. Once I have quality data I can move to the next step.
Old 08-01-2013, 10:23 PM   #9
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286

Those repetitive k-mers are really just dinucleotide repeats, so at k-mer lengths around 21, as used for error correction and overlapping, they may not be a huge obstacle.

In fact, you could up the k-mer length to 10bp in FastQC to see if these sequences continue to be a problem. It may be that certain reads are just filled with them, and they could be removed with very strict DUST filtering. Say, remove reads with a DUST score above 30? There is really no reason to keep reads with so much very, very low-complexity sequence. Ideally, of course, you'd want to try to assemble low-complexity sequences, but in this case they may be artifacts and cause more problems than they are worth.

Prinseq can do dust filtering, if you want to give it a shot. And it will separate out the good and bad seqs for inspection.

After playing with prinseq, you might actually want to drop that score a little lower, 20-ish?
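
To give a feel for what a DUST threshold of 20 to 30 would catch, here is a simplified DUST-style score in Python. It is a whole-read sketch built on trinucleotide counts and scaled so that a homopolymer scores 100; it is not PRINSEQ's implementation, and the function names and the `max_score` default are assumptions for illustration.

```python
from collections import Counter


def dust_score(seq):
    """Simplified DUST-style score on a 0-100 scale (higher = less complex).

    Counts overlapping trinucleotides, sums c*(c-1)/2 over their counts,
    divides by (number of triplets - 1), and rescales so that a
    homopolymer scores 100.
    """
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(triplets)
    if n < 2:
        return 0.0
    counts = Counter(triplets)
    raw = sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)
    return 100.0 * raw / (n / 2)  # n / 2 is the raw score of a homopolymer


def passes_filter(seq, max_score=30):
    """Keep a read only if its score does not exceed max_score."""
    return dust_score(seq) <= max_score


# Toy sequences (hypothetical): a dinucleotide repeat and a mixed read.
for seq in ("AC" * 32, "ACGTTGCAAGGCTTAACCGGATCGATTGCA"):
    print(round(dust_score(seq), 1), passes_filter(seq))
```

Under this scaling a pure dinucleotide repeat scores just under 50, so a cutoff of either 30 or 20 would remove it while leaving ordinary mixed sequence alone.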

Last edited by Wallysb01; 08-01-2013 at 10:27 PM.
All times are GMT -8.