SEQanswers

05-12-2015, 02:27 PM   #1
ty23991
Member
Location: New York NY
Join Date: May 2015
Posts: 24

Illumina HiSeq k-mers

Hi,
This is about Illumina HiSeq data from a TruSeq library.

Prior to trimming, FastQC showed some k-mer enrichment in the middle and at the 5' end of the reads.

I performed the following:
- quality trimming with seqtk,
- adapter trimming with Trimmomatic, using the adapters in trueseq3.fa

Then I ran FastQC again.

The FastQC report shows that certain k-mers continue to be enriched.

I understand that I should remove the ones at the 5' end.

But I am concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore these k-mers or whether further trimming is necessary before further processing.
Attached Images
File Type: png fqc_kmers.png (58.5 KB, 14 views)
05-12-2015, 02:39 PM   #2
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by ty23991
I understand that I should remove the ones at the 5' end.

But I am concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore these k-mers or whether further trimming is necessary before further processing.
You don't necessarily need to remove the ones at the 5' end; it depends on the library type and experiment. Was this library amplified using custom primers, for example?

I'm not really sure about the peak at 48. Can you describe the data?
05-12-2015, 02:52 PM   #3
ty23991
Member
Location: New York NY
Join Date: May 2015
Posts: 24

Thanks, Brian.
No custom primers were used.
The sequencing used a standard TruSeq library.
05-12-2015, 02:59 PM   #4
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

You can run BBMap to generate a histogram of read mismatch rates by position, like this:

bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qhist=qhist.txt reads=1m

You don't need to trim unless the histogram (mhist.txt) indicates a higher than expected error rate in the first few bases. But my question really was - what are you sequencing, what's the experiment, and how was read fragmentation/shearing performed?
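
To check mhist.txt without reading every row by eye, something like the following Python sketch could help. It assumes the file is whitespace-delimited with a header line of the form "#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1"; the helper names (read_mhist, flag_positions) and the 2x-median threshold are illustrative assumptions, not part of BBMap.

```python
import statistics


def read_mhist(path):
    """Parse an mhist-style file into a list of dicts, one per read position.

    Assumes a '#'-prefixed header line naming the columns, followed by
    whitespace-delimited numeric rows (an assumption about the layout).
    """
    rows = []
    header = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                # The last '#' line before the data is treated as the header.
                header = line.lstrip("#").split()
                continue
            if header is None:
                raise ValueError("no '#' header line found before data rows")
            fields = line.split()
            row = {header[0]: int(fields[0])}
            for name, value in zip(header[1:], fields[1:]):
                row[name] = float(value)
            rows.append(row)
    return rows


def flag_positions(rows, column="Sub1", factor=2.0):
    """Return positions whose rate in `column` exceeds factor x the median.

    The threshold is a hypothetical rule of thumb, not a BBMap setting.
    """
    median = statistics.median(r[column] for r in rows)
    return [r["BaseNum"] for r in rows if r[column] > factor * median]


# Example use (hypothetical file name):
#   rows = read_mhist("mhist.txt")
#   print(flag_positions(rows, column="Sub1"))
```

An empty result would suggest that no position stands out from the rest of the read by this rough criterion; a run of flagged positions at the start of the read would be the case where trimming is worth considering.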
05-12-2015, 04:09 PM   #5
ty23991
Member
Location: New York NY
Join Date: May 2015
Posts: 24

Thanks so much for the suggestion.
Will get back with the outcome.
05-13-2015, 12:45 PM   #6
ty23991
Member
Location: New York NY
Join Date: May 2015
Posts: 24

Hi Brian,
This is a sequencing experiment on a couple of available cancer cell lines.

Fragmentation followed the standard Illumina GAII protocol: exonuclease treatment, phosphorylation, and addition of an A-overhang, followed by ligation to the adapters. Cluster generation involves repeated cycles of bridge amplification, in which bridges form between the 5' and 3' ends. Could that lead to biased base composition in the center of the reads?

Here is the mhist (match histogram) output. The match and substitution rates at the first few positions are no different from those at the other positions. But your tool does show an indel rate of zero for the first few positions, and "Other" shows a rate of 0.00003, which I find a bit confusing. I am not sure whether the k-mers at the first few positions can be ignored.

Here is the output.
First 20 positions:
#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
7 0.99478 0.00508 0.00009 0.00013 0.00000 0.00001
8 0.99470 0.00514 0.00009 0.00015 0.00000 0.00001
9 0.99441 0.00545 0.00011 0.00014 0.00000 0.00001
10 0.99491 0.00495 0.00011 0.00014 0.00000 0.00001
11 0.99483 0.00501 0.00009 0.00016 0.00000 0.00001
12 0.99501 0.00482 0.00012 0.00016 0.00000 0.00001
13 0.99486 0.00496 0.00014 0.00017 0.00000 0.00001
14 0.99462 0.00519 0.00015 0.00018 0.00000 0.00000
15 0.99464 0.00518 0.00015 0.00018 0.00000 0.00000
16 0.99463 0.00520 0.00017 0.00017 0.00000 0.00000
17 0.99471 0.00510 0.00016 0.00019 0.00000 0.00000
18 0.99441 0.00541 0.00015 0.00018 0.00000 0.00000
19 0.99424 0.00558 0.00017 0.00018 0.00000 0.00000
20 0.99420 0.00562 0.00016 0.00018 0.00000 0.00000

Last 10 positions:
91 0.99478 0.00507 0.00011 0.00014 0.00000 0.00001
92 0.99451 0.00533 0.00009 0.00015 0.00000 0.00001
93 0.99421 0.00564 0.00010 0.00013 0.00000 0.00001
94 0.99382 0.00604 0.00010 0.00012 0.00000 0.00002
95 0.99407 0.00580 0.00010 0.00012 0.00000 0.00002
96 0.99457 0.00532 0.00006 0.00009 0.00000 0.00002
97 0.99382 0.00609 0.00005 0.00007 0.00000 0.00002
98 0.99395 0.00599 0.00002 0.00004 0.00000 0.00003
99 0.99422 0.00574 0.00000 0.00000 0.00000 0.00003
100 0.99373 0.00623 0.00000 0.00000 0.00000 0.00004

Last edited by ty23991; 05-13-2015 at 01:08 PM.
05-13-2015, 12:58 PM   #7
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

The match rate is over 99% for the first few bases, so the reads do not need to be trimmed; the enriched k-mers are genomic, not artifacts. They're just biased, I imagine due to the exonuclease. Sonication typically yields much less bias than enzymatic shearing.

BBMap does not allow indels in the first or last 2bp, which is why those are zero (you can't call them accurately at the tips of reads). "Other" means soft-clipped bases, where the read goes off the end of a reference sequence.

Again, I have no idea about the spike in the middle of the read.
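
One way to look into the mid-read spike might be to tally which k-mers actually start at those positions and see whether a few sequences dominate. The Python sketch below assumes an uncompressed FASTQ with four lines per record; the function names, the file name in the usage comment, and k=7 are illustrative assumptions and should be adjusted to match the k-mer size in your FastQC report.

```python
from collections import Counter


def read_sequences(path, limit=None):
    """Yield read sequences from an uncompressed 4-line-per-record FASTQ."""
    yielded = 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the second line of each record is the sequence
                yield line.strip().upper()
                yielded += 1
                if limit is not None and yielded >= limit:
                    return


def kmers_at(sequences, start, k=7):
    """Count k-mers that begin at the 1-based read position `start`.

    Reads too short to cover the window, and k-mers containing N,
    are skipped.
    """
    counts = Counter()
    for seq in sequences:
        kmer = seq[start - 1:start - 1 + k]
        if len(kmer) == k and "N" not in kmer:
            counts[kmer] += 1
    return counts


# Example use (hypothetical file name), sampling the first million reads:
#   for pos in range(48, 53):
#       counts = kmers_at(read_sequences("reads.fq", limit=1_000_000), pos)
#       print(pos, counts.most_common(5))
```

If the same few k-mers top the list at each of positions 48-52, those sequences could then be searched against the reference or the adapter file to see where they come from.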
05-13-2015, 01:07 PM   #8
ty23991
Member
Location: New York NY
Join Date: May 2015
Posts: 24

Thanks so much for the explanation.

Tags
fastqc, illumina hiseq, k-mer, trueseq
