  • ty23991
    Member
    • May 2015
    • 24

    Illumina HiSeq k-mers

    Hi,
    This is about Illumina HiSeq data from a TruSeq library.

    Prior to trimming, FastQC showed k-mer enrichment in the middle of the reads and at the 5' end.

    I performed the following:
    - quality trimming with seqtk,
    - adapter trimming with Trimmomatic, using the adapters in trueseq3.fa

    Then I ran FastQC again.

    The FastQC report shows that certain k-mers are still enriched.

    I understand that I should remove the ones at the 5' end.

    But I'm concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore these k-mers or whether further trimming is necessary before downstream processing.
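
To see what a positional k-mer plot is measuring, one could count k-mers by start position directly. The following is a minimal sketch, not FastQC's actual algorithm (FastQC applies its own statistical test). The function name `positional_kmer_enrichment`, the default `k=7`, the toy reads, and the "spread evenly over all windows" expectation are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def positional_kmer_enrichment(reads, k=7):
    """For each k-mer, return (1-based start position, observed/expected ratio)
    at the position where the k-mer is most over-represented.

    Assumption: the expected count at a position is the k-mer's total count
    spread evenly over all k-mer windows. This is a simplification.
    """
    per_position = defaultdict(Counter)  # start index -> k-mer counts there
    windows = Counter()                  # start index -> number of windows
    totals = Counter()                   # k-mer -> count over all positions

    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if "N" in kmer:
                continue  # skip windows containing ambiguous bases
            per_position[start][kmer] += 1
            windows[start] += 1
            totals[kmer] += 1

    all_windows = sum(windows.values())
    best = {}
    for kmer, total in totals.items():
        for start, counts in per_position.items():
            expected = total * windows[start] / all_windows
            ratio = counts[kmer] / expected
            if kmer not in best or ratio > best[kmer][1]:
                best[kmer] = (start + 1, ratio)
    return best


# Toy reads (made up): "GATTACA" always starts at position 3.
toy_reads = ["TTGATTACATT", "CCGATTACACC", "AAGATTACAGG", "GGGATTACATA"]
position, ratio = positional_kmer_enrichment(toy_reads)["GATTACA"]
print(position, round(ratio, 2))
```

A k-mer that is common everywhere gets a ratio near 1 at every position, while a k-mer tied to one position gets a high ratio there. The inner loop visits every k-mer at every position, so this is only practical on a small subsample of reads.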
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Originally posted by ty23991:
    I understand that I should remove the ones at the 5' end.

    But I'm concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore these k-mers or whether further trimming is necessary before downstream processing.
    You don't necessarily need to remove the ones at the 5' end; it depends on the library type and experiment. Was this library amplified using custom primers, for example?

    I'm not really sure about the peak at 48. Can you describe the data?


    • ty23991
      Member
      • May 2015
      • 24

      #3
      Thanks, Brian.
      No custom primers were used.
      Sequencing was performed on a standard TruSeq library.


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        You can run BBMap to generate a histogram of read mismatch rates by position, like this:

        bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qhist=qhist.txt reads=1m

        You don't need to trim unless the histogram (mhist.txt) indicates a higher-than-expected error rate in the first few bases. But my question really was: what are you sequencing, what's the experiment, and how was fragmentation/shearing performed?
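
One way to read mhist.txt programmatically is sketched below. It assumes the whitespace-separated layout with a `#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1` header, as in the output posted later in this thread. The helper names `parse_mhist` and `flag_high_error`, the "2x the median substitution rate" threshold, and the sample numbers are illustrative assumptions, not part of BBMap.

```python
import statistics


def parse_mhist(text):
    """Parse mhist text into a list of dicts keyed by the header column names."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split()  # e.g. BaseNum, Match1, Sub1, ...
            continue
        if header is None:
            raise ValueError("no '#' header line found before the data rows")
        fields = line.split()
        row = {header[0]: int(fields[0])}
        for name, value in zip(header[1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


def flag_high_error(rows, column="Sub1", factor=2.0):
    """Return base positions whose rate exceeds factor * median rate.

    The factor is an arbitrary choice for illustration, not a BBMap default.
    """
    median = statistics.median(row[column] for row in rows)
    return [row["BaseNum"] for row in rows if row[column] > factor * median]


# Made-up example in which only the first base has an elevated error rate.
sample = """#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
1 0.95000 0.05000 0.00000 0.00000 0.00000 0.00000
2 0.99500 0.00500 0.00000 0.00000 0.00000 0.00000
3 0.99500 0.00500 0.00000 0.00000 0.00000 0.00000
4 0.99500 0.00500 0.00000 0.00000 0.00000 0.00000
5 0.99500 0.00500 0.00000 0.00000 0.00000 0.00000
"""
print(flag_high_error(parse_mhist(sample)))
```

For a real run, `parse_mhist(open("mhist.txt").read())` would replace the inline sample. If nothing at the start of the read is flagged, that supports leaving the 5' end untrimmed.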


        • ty23991
          Member
          • May 2015
          • 24

          #5
          Thanks so much for the suggestion.
          Will get back with the outcome.


          • ty23991
            Member
            • May 2015
            • 24

            #6
            Hi Brian,
            This is a sequencing experiment on a couple of available cancer cell lines.

            Fragmentation was performed following the standard Illumina GAII protocol: exonuclease treatment, phosphorylation, and addition of an A-overhang, followed by adapter ligation. Cluster generation involves repeated bridge amplification cycles until bridges are formed between the 5' and 3' ends. Could this lead to biased base composition in the center of the reads?

            Here is the mhist (match histogram) output. The match and substitution rates at the first few positions are not different from those at the remaining positions. But your tool does show an indel rate of zero for the first few positions, and "Other" shows a rate of 0.00003, so it is a bit confusing. I am not sure whether the k-mers at the first few positions can be ignored.

            Here is the output.
            First 20 positions:
            #BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
            1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
            2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
            3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
            4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
            5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
            6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
            7 0.99478 0.00508 0.00009 0.00013 0.00000 0.00001
            8 0.99470 0.00514 0.00009 0.00015 0.00000 0.00001
            9 0.99441 0.00545 0.00011 0.00014 0.00000 0.00001
            10 0.99491 0.00495 0.00011 0.00014 0.00000 0.00001
            11 0.99483 0.00501 0.00009 0.00016 0.00000 0.00001
            12 0.99501 0.00482 0.00012 0.00016 0.00000 0.00001
            13 0.99486 0.00496 0.00014 0.00017 0.00000 0.00001
            14 0.99462 0.00519 0.00015 0.00018 0.00000 0.00000
            15 0.99464 0.00518 0.00015 0.00018 0.00000 0.00000
            16 0.99463 0.00520 0.00017 0.00017 0.00000 0.00000
            17 0.99471 0.00510 0.00016 0.00019 0.00000 0.00000
            18 0.99441 0.00541 0.00015 0.00018 0.00000 0.00000
            19 0.99424 0.00558 0.00017 0.00018 0.00000 0.00000
            20 0.99420 0.00562 0.00016 0.00018 0.00000 0.00000

            Last 10 positions (same columns):
            91 0.99478 0.00507 0.00011 0.00014 0.00000 0.00001
            92 0.99451 0.00533 0.00009 0.00015 0.00000 0.00001
            93 0.99421 0.00564 0.00010 0.00013 0.00000 0.00001
            94 0.99382 0.00604 0.00010 0.00012 0.00000 0.00002
            95 0.99407 0.00580 0.00010 0.00012 0.00000 0.00002
            96 0.99457 0.00532 0.00006 0.00009 0.00000 0.00002
            97 0.99382 0.00609 0.00005 0.00007 0.00000 0.00002
            98 0.99395 0.00599 0.00002 0.00004 0.00000 0.00003
            99 0.99422 0.00574 0.00000 0.00000 0.00000 0.00003
            100 0.99373 0.00623 0.00000 0.00000 0.00000 0.00004
            Last edited by ty23991; 05-13-2015, 01:08 PM.


            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              The match rate is over 99% for the first few bases, so the reads do not need to be trimmed; the enriched k-mers are genomic, not artifacts. They're just biased, I imagine due to the exonuclease. Sonication typically yields much less bias than enzymatic shearing.

              BBMap does not allow indels in the first or last 2 bp of a read, which is why those rates are zero (you can't call indels accurately at the tips of reads). "Other" means bases that were soft-clipped because the read runs off the end of a reference sequence.

              Again, I have no idea about the spike in the middle of the read.
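
The two observations above can be checked against the numbers posted in #6. This is a small sketch using a handful of rows copied from that table; the variable names, the `is_tip` helper, and the assumed read length of 100 are illustrative.

```python
# Rows copied from the mhist table in post #6:
# (BaseNum, Match1, Sub1, Del1, Ins1)
posted = [
    (1, 0.99319, 0.00678, 0.00000, 0.00000),
    (2, 0.99407, 0.00590, 0.00000, 0.00000),
    (3, 0.99411, 0.00583, 0.00000, 0.00004),
    (4, 0.99430, 0.00561, 0.00002, 0.00006),
    (5, 0.99452, 0.00537, 0.00006, 0.00009),
    (98, 0.99395, 0.00599, 0.00002, 0.00004),
    (99, 0.99422, 0.00574, 0.00000, 0.00000),
    (100, 0.99373, 0.00623, 0.00000, 0.00000),
]

READ_LENGTH = 100  # assumption: the posted table ends at base 100


def is_tip(base_num, read_length=READ_LENGTH, tip=2):
    """True for the first or last `tip` bases of the read."""
    return base_num <= tip or base_num > read_length - tip


# Lowest match rate among the copied rows.
lowest_match = min(row[1] for row in posted)

# Indel rates are zero exactly at the read tips, nonzero just inside them.
tips_indel_free = all(
    row[3] == 0 and row[4] == 0 for row in posted if is_tip(row[0])
)
interior_has_indels = any(
    row[3] > 0 or row[4] > 0 for row in posted if not is_tip(row[0])
)

print(lowest_match > 0.99, tips_indel_free, interior_has_indels)
```

All three checks hold for these rows, which is consistent with the zero indel rates at the tips being a property of the aligner and not of the library.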


              • ty23991
                Member
                • May 2015
                • 24

                #8
                Thanks so much for the explanation.

