SEQanswers

Old 08-04-2014, 08:33 AM   #1
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default weird kmer content in 5' end from genomic DNA PE reads

Hello

My name is Gabriel. I have asked this previously in the Illumina subforum but it seems that my post belongs here.

I'm writing because I'm analyzing Illumina reads (generated on a HiSeq 2000) from the genome of a particular insect species. The sequencing facility gave me the FASTQ files without adapters, but when checking the filtered FASTQ files with the latest FastQC version (V 0.11.2) I see a weird kmer pattern in the 5' region. It seems that a particular sequence is overrepresented, but the overrepresented sequences module does not show anything weird.

Also, the overrepresented Kmers have a strong bias towards GC (e.g. GGCCCGG, GCCCGGG and so on). I've managed to overlap the Kmers into this sequence: CTAGTATGGCCCGGGGGATCC, but so far I've not been able to find anything related to it. I'm concerned about whether it is OK to just trim this sequence, as I don't know what this particular pattern means. This sequence is present in both paired-end files, and FastQC shows the kmer content peak at the 5' end of both files.
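
Editorial sketch: overlapping the reported Kmers can be done programmatically rather than by eye. The Python below is a minimal illustration, assuming the Kmers form one unbranched chain in which each Kmer is the previous one shifted by 1 bp. The function name `merge_shifted_kmers` is made up for this example, and the 7-mers are simply cut from the consensus quoted in the post, not taken from the actual FastQC report.

```python
def merge_shifted_kmers(kmers):
    """Chain k-mers that each overlap the next by k-1 bases.

    Assumes an unbranched chain: every k-mer is the previous one
    shifted by 1 bp, with no repeated (k-1)-mer inside the chain.
    """
    kmers = list(dict.fromkeys(kmers))        # drop duplicates, keep order
    k = len(kmers[0])
    suffixes = {kmer[1:] for kmer in kmers}
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    if len(by_prefix) != len(kmers):
        raise ValueError("branching chain: two k-mers share a prefix")
    # The chain starts at the k-mer whose prefix is nobody's suffix.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("k-mers do not form a single linear chain")
    seq = starts[0]
    for _ in range(len(kmers) - 1):
        nxt = by_prefix.get(seq[-(k - 1):])
        if nxt is None:
            raise ValueError("chain is broken: a shifted k-mer is missing")
        seq += nxt[-1]                        # extend by the one new base
    return seq


# Illustration only: 7-mers cut from the consensus quoted in the post,
# fed in reverse order to show that input order does not matter.
consensus = "CTAGTATGGCCCGGGGGATCC"
seven_mers = [consensus[i:i + 7] for i in range(len(consensus) - 6)][::-1]
print(merge_shifted_kmers(seven_mers))
```

If the Kmers reported by FastQC branch or leave gaps, the function raises an error instead of guessing, which is a hint that more than one contaminating sequence may be involved.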

When searching for this pattern with grep in my files, I noticed that several reads seem to be duplicated, as the read sequence is exactly the same. I don't know whether these duplicated reads should be removed or kept.
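
Editorial sketch: the grep check described above can be turned into a count of exact duplicates among the reads that carry the motif. This is a minimal Python illustration, assuming standard 4-line FASTQ records. The function names and the tiny in-memory FASTQ are invented for the example and are not data from this thread.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for line_no, line in enumerate(handle):
        if line_no % 4 == 1:
            yield line.strip()


def duplicated_reads_with_motif(handle, motif):
    """Return {read sequence: copies} for reads containing `motif`
    that occur more than once as exact duplicates."""
    counts = Counter(seq for seq in fastq_sequences(handle) if motif in seq)
    return {seq: n for seq, n in counts.items() if n > 1}


# Tiny made-up FASTQ (not real data) standing in for a reads file.
demo = io.StringIO(
    "@r1\nACTAGTATGGCCCGGGGGATCCTA\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACTAGTATGGCCCGGGGGATCCTA\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r3\nTTTTGGCCCGGGAAAACCCCTTTT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(duplicated_reads_with_motif(demo, "GGCCCGG"))
```

For a real file, the handle would come from `open(path)` or, for compressed reads, `gzip.open(path, "rt")`. The counter keeps every motif-containing read in memory, which should be fine when only a small fraction of reads match.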

In my web search so far, I've only seen similar Kmer patterns reported for RNA-seq data, which is not my case. Also, the "bad sequence" example on the FastQC webpage shows a similar pattern, but at the 3' end, not at the 5' end as in my data.

It is worth noting that I have Paired end (2x100) files, and both files (1 and 2) have the same pattern.

I have attached the Kmer module graphs in these links:

http://seqanswers.com/forums/attachm...4&d=1405987233
http://seqanswers.com/forums/attachm...5&d=1405987245

I can add more information if needed.

Thank you very much (and sorry for my English :P)
Old 08-05-2014, 01:43 AM   #2
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,226
Default

What kit was used for library prep, and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters?
Old 08-05-2014, 07:45 AM   #3
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Hi nucacidhunter:

Thanks for replying. I'll answer by quoting what you posted.

Quote:
Originally Posted by nucacidhunter View Post
What kit was used for library prep
I sent the samples to another, external facility and I don't know which kit they used, so I'll find out ASAP.

I asked them to sequence my library on an Illumina HiSeq 2000 in paired-end mode (2x100 bp). Judging by the index and the adapter sequences they sent me later, the library was multiplexed.

Quote:
Originally Posted by nucacidhunter View Post
and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters.
They did tell me which adapters were used (when asked!), which would be these:

TruSeq Universal Adapter

5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

TruSeq Adapter, Index 5

5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG

Attached to the post are the plots for both read files. I have uploaded the plots for forward and reverse files (2nd plot of each category would be the reverse plot).

Finally the kmer content

These files should let you download the full FastQC report (Ver 0.11.2) in case you want to see it

https://dl.dropboxusercontent.com/u/...t-1_fastqc.zip
https://dl.dropboxusercontent.com/u/...t-2_fastqc.zip

Thank you very much,

Gabriel
Attached Images
File Type: png per sequence gc content-1.png (28.7 KB, 30 views)
File Type: png per sequence gc content-2.png (28.8 KB, 11 views)
File Type: png sequence duplication levels.png (21.5 KB, 26 views)
File Type: png sequence duplication levels-2.png (21.6 KB, 16 views)

Last edited by gab0; 08-07-2014 at 07:20 AM.
Old 08-05-2014, 04:34 PM   #4
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,226
Default

Apart from Kmer content, every parameter in the FastQC report looks fine. The number of over-represented Kmers is low (although it is unusual to see any in balanced genomes), and I do not think it should be of any concern. The over-represented Kmers could come from duplicate reads (there is a small bump in % total sequences at duplication levels above 10 in the duplication plot), which can be checked by removing duplicates and running FastQC again. Alternatively, they could be the result of bias in at least one step of library prep due to the AT-rich nature of the genome. Whether duplicates should be removed depends, I think, on the downstream application, and I will let a bioinformatician comment on that.
Old 08-07-2014, 07:18 AM   #5
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Hi

Thanks for your help! So apart from the Kmer problem, the files look OK for downstream analysis.

Well, I've found and (partially) fixed the kmer problem, so here I'll write out how I solved it:

When checking the files with FastQC V0.11.2, I saw this strange kmer pattern. Looking at the Kmers, I figured out that they were shifted by 1 bp relative to each other, so I started to assemble the Kmer sequence (just by eye). Then, searching for the Kmer pattern with grep, I found that there were some repeated sequences/reads, like this one:

"ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"

Looking further, I found a variant of this read, like this one:

"AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"

As you can see, the variant is shifted by 3 bp at both the 5' and 3' ends.
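
Editorial sketch: the shift between the two reads can be checked in a few lines instead of by eye. This is a minimal Python illustration that only handles the simple case where one read is the other slid along the same template. `find_shift` and its `min_overlap` parameter are names invented for this example.

```python
def find_shift(a, b, min_overlap=20):
    """Return the smallest s such that a[s:] matches the start of b,
    requiring at least `min_overlap` overlapping bases, else None."""
    for s in range(len(a) - min_overlap + 1):
        tail = a[s:]
        if b.startswith(tail[:len(b)]):
            return s
    return None


# The two reads quoted in the post above.
read_a = ("ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAG"
          "AGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA")
read_b = ("AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGT"
          "TGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT")
print(find_shift(read_a, read_b))
```

Reads that start at slightly different positions on the same molecule type are what a spiked-in control of fixed sequence would be expected to produce, which fits the explanation given further down.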

When searching the web again, I found a document from Illumina, the Illumina customer sequence letter. There I found some sequences that matched my reads, listed as: "Process Controls for TruSeq® Sample Preparation Kits Included in TruSeq DNA and RNA (v1/v2/LT/HT) and TruSeq Exome Kits"

So it seems that these reads came in as part of the library control, and they were not filtered by the sequencing facility.

I tested a couple of tools for removing the repeated reads. I used fastx_collapser, but it turns out that it produces FASTA files as output, not FASTQ files. Then I tested Fastq-mcf, which filtered out the repeated reads, both the genuine duplicate reads and the control library reads.
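
Editorial sketch: exact-duplicate removal that keeps FASTQ output (qualities included) is simple enough to write directly. This is a minimal Python illustration of the idea, not a description of how fastx_collapser or Fastq-mcf work internally. The function names and the toy records are invented for the example.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ handle."""
    while True:
        record = tuple(handle.readline().rstrip("\n") for _ in range(4))
        if not record[0]:          # empty header line means end of file
            return
        yield record


def drop_exact_duplicate_pairs(pairs):
    """Keep the first pair for each distinct (read 1, read 2) sequence
    combination; records stay in FASTQ form, qualities included."""
    seen = set()
    for r1, r2 in pairs:
        key = (r1[1], r2[1])
        if key not in seen:
            seen.add(key)
            yield r1, r2


# Toy paired files (made-up records): pair 1 and pair 2 are identical.
fq1 = io.StringIO("@p1/1\nACGTACGT\n+\nIIIIIIII\n@p2/1\nACGTACGT\n+\nHHHHHHHH\n"
                  "@p3/1\nTTTTCCCC\n+\nIIIIIIII\n")
fq2 = io.StringIO("@p1/2\nGGGGAAAA\n+\nIIIIIIII\n@p2/2\nGGGGAAAA\n+\nHHHHHHHH\n"
                  "@p3/2\nCCCCGGGG\n+\nIIIIIIII\n")
for r1, r2 in drop_exact_duplicate_pairs(zip(read_fastq(fq1), read_fastq(fq2))):
    print("\n".join(r1))
    print("\n".join(r2))
```

Note that this removes every duplicate, genuine or not, and holds all distinct sequence pairs in memory, so it only reproduces the "no more kmer warning" result and does not separate control reads from valid duplicates.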

After filtering out the repeated reads, now I had some FASTQ files without kmer warnings. Yoo-hoo!

Now I have to find another tool that removes only the control reads while keeping the valid duplicate reads. I was thinking of using prinseq for this.
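
Editorial sketch: removing only reads that match known control sequences, and leaving every other read (duplicates included) untouched, can be done by exact k-mer matching against a small reference set. The Python below is a minimal illustration under stated assumptions: k = 25 is an arbitrary choice, all function names are invented, and the "control" used here is just the first repeated read quoted above as a stand-in. The real control sequences would have to be taken from Illumina's sequence letter.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def build_index(controls, k=25):
    """Collect every k-mer of the control sequences, both strands."""
    index = set()
    for ctrl in controls:
        for strand in (ctrl, revcomp(ctrl)):
            index.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return index


def is_control(seq, index, k=25):
    """True if the read shares at least one exact k-mer with a control."""
    return any(seq[i:i + k] in index for i in range(len(seq) - k + 1))


def filter_pairs(pairs, index, k=25):
    """Drop a pair if either mate hits the index; keep everything else,
    so genuine duplicate pairs are retained."""
    for r1, r2 in pairs:
        if not (is_control(r1[1], index, k) or is_control(r2[1], index, k)):
            yield r1, r2


# Stand-in control: the first repeated read quoted in the post.
stand_in = ("ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAG"
            "AGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA")
index = build_index([stand_in])

genomic = ("@g", "ATATATTTAAACGCGATATTTAAATATGCATATTTA", "+", "I" * 36)
spiked = ("@c", stand_in[3:60], "+", "I" * 57)
pairs = [(genomic, genomic), (spiked, genomic), (genomic, genomic)]
print(len(list(filter_pairs(pairs, index))))
```

Exact matching means a control read with a sequencing error in every 25-mer would slip through. Dedicated tools usually allow mismatches for that reason.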

Thanks for your help!
Old 04-07-2015, 05:35 AM   #6
gauravdube
Junior Member
 
Location: India

Join Date: Feb 2014
Posts: 7
Default

Hi gab0,

I am facing exactly the same k-mer content issue, so I didn't create a separate thread when I came across yours. My question to you is: which tool did you use to retain the valid duplicate reads and remove only the control reads? Thanks in advance.
Old 04-07-2015, 06:05 AM   #7
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Hi gauravdube:

I found and used tools from the BBMap package. Brian helped me out, guiding me on how to use the bbduk tool.

I used the following command line:

bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq

The adapters file has all the adapters that I could find for Illumina platforms, including the control sequences from the libraries, in FASTA format. That worked for me; hopefully it will work for you too!
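
Editorial sketch: a reference file like adapters.fa is plain FASTA, so it can be assembled from whatever sequences are at hand. The Python below is a minimal illustration using only the two TruSeq adapter sequences quoted earlier in this thread. The record names and the `to_fasta` helper are invented for the example, and the process control sequences from Illumina's letter are deliberately not reproduced here, so they would need to be added by the reader.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapped at `width`."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Adapter sequences as posted earlier in the thread; names are made up.
records = [
    ("TruSeq_Universal_Adapter",
     "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"),
    ("TruSeq_Adapter_Index_5",
     "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG"),
    # Control sequences from Illumina's sequence letter would go here.
]
print(to_fasta(records), end="")
```

Writing the returned string to a file named adapters.fa gives something that can be passed as `ref=adapters.fa` in the command above.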

Best regards,

Gabriel
Old 06-12-2015, 06:33 AM   #8
nike00
Member
 
Location: italy

Join Date: Jul 2011
Posts: 10
Default

Dear Gabriel,

Very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.

Thanks a lot,
nike00
Old 06-12-2015, 07:02 AM   #9
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

It looks like Nextera bias to me.
Old 06-12-2015, 09:41 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by nike00 View Post
Dear Gabriel,

Very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.

Thanks a lot,
nike00
If you download the BBMap package, the adapters are in the resources directory - nextera.fa.gz, truseq.fa.gz, and truseq_rna.fa.gz. You can use all of them with the flag "ref=nextera.fa.gz,truseq.fa.gz,truseq_rna.fa.gz" (with the appropriate paths).
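
Editorial sketch: the comma-separated `ref=` flag with full paths is easy to get wrong by hand, so it can be assembled in a few lines. The Python below is a minimal illustration. The three file names come from the post above, while the resources path, the input and output file names, and the `bbduk_args` helper are hypothetical. The other arguments mirror the command posted earlier in the thread, written here in plain key=value form.

```python
from pathlib import PurePosixPath

REF_FILES = ("nextera.fa.gz", "truseq.fa.gz", "truseq_rna.fa.gz")


def bbduk_args(in1, in2, out1, out2, resources_dir, memory="4g"):
    """Build an argument list for bbduk.sh with all three adapter
    files from the BBMap resources directory joined into one ref= flag."""
    refs = [str(PurePosixPath(resources_dir) / name) for name in REF_FILES]
    return [
        "bbduk.sh",
        f"-Xmx{memory}",
        f"in={in1}",
        f"in2={in2}",
        f"out={out1}",
        f"out2={out2}",
        "ref=" + ",".join(refs),
    ]


# Hypothetical paths; adjust to wherever BBMap is unpacked.
args = bbduk_args("reads_1.fastq.gz", "reads_2.fastq.gz",
                  "out1.fastq", "out2.fastq", "/opt/bbmap/resources")
print(" ".join(args))
```

The list can be handed to `subprocess.run(args, check=True)` once bbduk.sh is on the PATH. Printing it first makes it easy to check the paths before running anything.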
Old 10-03-2015, 09:16 AM   #11
gauravdube
Junior Member
 
Location: India

Join Date: Feb 2014
Posts: 7
Default

Hi Gabriel,

Thank you so much. It worked for me.

Old 10-04-2015, 02:36 AM   #12
nike00
Member
 
Location: italy

Join Date: Jul 2011
Posts: 10
Default

Quote:
Originally Posted by Brian Bushnell View Post
If you download the BBMap package, the adapters are in the resources directory - nextera.fa.gz, truseq.fa.gz, and truseq_rna.fa.gz. You can use all of them with the flag "ref=nextera.fa.gz,truseq.fa.gz,truseq_rna.fa.gz" (with the appropriate paths).
Thank you very much!

All times are GMT -8.