**Brian Bushnell** · 05-12-2015, 02:39 PM

Originally posted by ty23991:

I understand that I should remove the ones at the 5' end.

But I'm concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore these k-mers or whether further trimming is necessary before further processing.

You don't necessarily need to remove the ones at the 5' end; it depends on the library type and experiment. Was this library amplified using custom primers, for example?

I'm not really sure about the peak at 48. Can you describe the data?

**ty23991** · 05-12-2015, 02:52 PM

Thanks, Brian.
No custom primers were used.
Sequencing was performed on a standard TruSeq library.

**Brian Bushnell** · 05-12-2015, 02:59 PM

You can run BBMap to generate a histogram of read mismatch rates by position, like this:

bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qhist=qhist.txt reads=1m

You don't need to trim unless the histogram (mhist.txt) indicates a higher than expected error rate in the first few bases. But my question really was - what are you sequencing, what's the experiment, and how was read fragmentation/shearing performed?

**ty23991** · 05-12-2015, 04:09 PM

Thanks so much for the suggestion.
Will get back with the outcome.

**ty23991** · 05-13-2015, 12:45 PM

Hi Brian,
This is a sequencing experiment on a couple of available cancer cell lines.

Read fragmentation was performed using the standard Illumina GAII protocol: exonuclease treatment, phosphorylation, and addition of an A-overhang, followed by ligation to the adapters. The cluster generation process involves repeated bridge amplification cycles until bridges are formed between the 5' and 3' ends. Could this lead to biased base composition in the center of the reads?

Here is the mhist (match histogram) output. The match and substitution rates at the first few positions are no different from those at the rest of the positions. But your tool does show that the indel rate is zero for the first few positions, and "Other" shows a rate of 0.00003, which is a bit confusing. I am not sure whether the k-mers at the first few positions can be ignored.

Here is the output.
First 20 positions:
#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
7 0.99478 0.00508 0.00009 0.00013 0.00000 0.00001
8 0.99470 0.00514 0.00009 0.00015 0.00000 0.00001
9 0.99441 0.00545 0.00011 0.00014 0.00000 0.00001
10 0.99491 0.00495 0.00011 0.00014 0.00000 0.00001
11 0.99483 0.00501 0.00009 0.00016 0.00000 0.00001
12 0.99501 0.00482 0.00012 0.00016 0.00000 0.00001
13 0.99486 0.00496 0.00014 0.00017 0.00000 0.00001
14 0.99462 0.00519 0.00015 0.00018 0.00000 0.00000
15 0.99464 0.00518 0.00015 0.00018 0.00000 0.00000
16 0.99463 0.00520 0.00017 0.00017 0.00000 0.00000
17 0.99471 0.00510 0.00016 0.00019 0.00000 0.00000
18 0.99441 0.00541 0.00015 0.00018 0.00000 0.00000
19 0.99424 0.00558 0.00017 0.00018 0.00000 0.00000
20 0.99420 0.00562 0.00016 0.00018 0.00000 0.00000

Last 10 positions:
91 0.99478 0.00507 0.00011 0.00014 0.00000 0.00001
92 0.99451 0.00533 0.00009 0.00015 0.00000 0.00001
93 0.99421 0.00564 0.00010 0.00013 0.00000 0.00001
94 0.99382 0.00604 0.00010 0.00012 0.00000 0.00002
95 0.99407 0.00580 0.00010 0.00012 0.00000 0.00002
96 0.99457 0.00532 0.00006 0.00009 0.00000 0.00002
97 0.99382 0.00609 0.00005 0.00007 0.00000 0.00002
98 0.99395 0.00599 0.00002 0.00004 0.00000 0.00003
99 0.99422 0.00574 0.00000 0.00000 0.00000 0.00003
100 0.99373 0.00623 0.00000 0.00000 0.00000 0.00004

**Brian Bushnell** · 05-13-2015, 12:58 PM

The match rate is over 99% for the first few bases, so the reads do not need to be trimmed; the enriched kmers are genomic, not artifact. They're just biased; I imagine due to the exonuclease. Sonication typically yields much less bias than enzymatic shearing.
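For anyone who wants to automate the "is the start of the read worse than the rest" check, here is a minimal Python sketch. It assumes the whitespace-separated column layout shown in the output posted above (a `#`-prefixed header followed by one row per position); real mhist files may carry additional columns, which the header-driven parser should tolerate. The function names, the `first_n` window, and the `factor` cutoff are illustrative assumptions, not part of BBMap, and the sample rows are the first ten positions copied from this thread.

```python
import statistics

# First ten rows of the mhist output posted in this thread.
SAMPLE = """\
#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
7 0.99478 0.00508 0.00009 0.00013 0.00000 0.00001
8 0.99470 0.00514 0.00009 0.00015 0.00000 0.00001
9 0.99441 0.00545 0.00011 0.00014 0.00000 0.00001
10 0.99491 0.00495 0.00011 0.00014 0.00000 0.00001
"""


def parse_mhist(lines):
    """Return {position: {column_name: rate}} using the '#' header line for names."""
    header = None
    rows = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split()
            continue
        fields = line.split()
        # Skip anything that does not line up with the header.
        if header is None or len(fields) != len(header):
            continue
        rows[int(fields[0])] = dict(zip(header[1:], map(float, fields[1:])))
    return rows


def elevated_start(rows, first_n=5, factor=2.0, column="Sub1"):
    """List positions <= first_n whose rate exceeds factor * the median over all rows.

    The median baseline and the factor are arbitrary choices for illustration.
    """
    baseline = statistics.median(r[column] for r in rows.values())
    return [pos for pos in sorted(rows)
            if pos <= first_n and rows[pos][column] > factor * baseline]


rows = parse_mhist(SAMPLE.splitlines())
print(elevated_start(rows))
```

With the default cutoff nothing in the first five positions is flagged for these rows, which matches the conclusion that no trimming is needed. To run it on a full file, pass `open("mhist.txt")` to `parse_mhist` instead of the sample.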

BBMap does not allow indels in the first or last 2bp, which is why those are zero (you can't call them accurately at the tips of reads). "other" means soft-clipped where the read goes off the end of a reference sequence.
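The zero-indel tips are easy to confirm from the posted numbers. The short Python sketch below uses the (Del1, Ins1) pairs for the first and last four positions, copied from the output earlier in this thread; the function name is hypothetical and the code only illustrates the pattern, it says nothing about how BBMap is implemented.

```python
# (Del1, Ins1) rates by read position, copied from the mhist output in this thread.
TIP_RATES = {
    1: (0.00000, 0.00000),
    2: (0.00000, 0.00000),
    3: (0.00000, 0.00004),
    4: (0.00002, 0.00006),
    97: (0.00005, 0.00007),
    98: (0.00002, 0.00004),
    99: (0.00000, 0.00000),
    100: (0.00000, 0.00000),
}


def indel_free_positions(rates):
    """Return the positions where both the deletion and insertion rates are zero."""
    return [pos for pos, (deletion, insertion) in sorted(rates.items())
            if deletion == 0.0 and insertion == 0.0]


print(indel_free_positions(TIP_RATES))
```

Only positions 1, 2, 99, and 100 come back, i.e. the first and last 2 bp of a 100 bp read.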

Again, I have no idea about the spike in the middle of the read.

**ty23991** · 05-13-2015, 01:07 PM

Thanks so much for the explanation.

Illumina HiSeq k-mers