  • Illumina HiSeq k-mers

    Hi
    This is about Illumina HiSeq data from a TruSeq library.

    Prior to trimming, FastQC showed some k-mer enrichment in the middle and at the 5' end of the reads.

    I performed the following:
    - quality trimming with seqtk,
    - adapter trimming with Trimmomatic, using the adapters in trueseq3.fa

    Then I ran FastQC again.

    The FastQC report shows that certain k-mers continue to be enriched.

    I understand that I should remove the ones at the 5' end.

    But I'm concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore the k-mers or whether further trimming is necessary before further processing.

  • #2
    Originally posted by ty23991
    I understand that I should remove the ones at the 5' end.

    But I'm concerned about the k-mers at positions 48-52. I have attached the plot. I would appreciate any suggestions on whether I should ignore the k-mers or whether further trimming is necessary before further processing.
    You don't necessarily need to remove the ones at the 5' end; it depends on the library type and experiment. Was this library amplified using custom primers, for example?

    I'm not really sure about the peak at 48. Can you describe the data?


    • #3
      Thanks, Brian.
      No custom primer was used.
      Sequencing was performed on a standard TruSeq library.


      • #4
        You can run BBMap to generate a histogram of read mismatch rates by position, like this:

        bbmap.sh in=reads.fq ref=reference.fa mhist=mhist.txt qhist=qhist.txt reads=1m

        You don't need to trim unless the histogram (mhist.txt) indicates a higher than expected error rate in the first few bases. But my question really was - what are you sequencing, what's the experiment, and how was read fragmentation/shearing performed?
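A minimal Python sketch of that check, assuming mhist.txt uses the column layout posted later in this thread (`#BaseNum Match1 Sub1 ...`) with rows in position order. The function names `read_mhist` and `head_error_excess` and the choice of `head` are hypothetical, not part of BBMap. It compares the mean error rate (1 minus the match rate) of the first few bases against the median of the remaining positions.

```python
from statistics import median

# Example rows copied from the mhist output posted in this thread (positions 1-6).
SAMPLE = """#BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
"""

def read_mhist(lines):
    """Parse mhist rows into (position, match_rate) pairs, skipping '#' header lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        # Assumption: column 0 is the base position and column 1 is Match1.
        rows.append((int(fields[0]), float(fields[1])))
    return rows

def head_error_excess(rows, head=5):
    """Mean error rate of the first `head` bases minus the median error rate of the rest."""
    errors = [1.0 - match for _, match in rows]
    head_errors, body_errors = errors[:head], errors[head:]
    return sum(head_errors) / len(head_errors) - median(body_errors)

rows = read_mhist(SAMPLE.splitlines())
print(f"excess error in first 2 bases: {head_error_excess(rows, head=2):.5f}")
```

For a real run, pass the lines of the file instead, for example `read_mhist(open("mhist.txt"))`. What counts as "higher than expected" is a judgment call; the sketch only reports the difference.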


        • #5
          Thanks so much for the suggestion.
          Will get back with the outcome.


          • #6
            Hi Brian,
            This is a sequencing experiment on a couple of available cancer cell lines.

            Read fragmentation was performed using the standard Illumina GAII protocol: exonuclease treatment, phosphorylation, and addition of an A-overhang, followed by ligation to the adapters. The cluster generation process involves repeated bridge amplification cycles until bridges are formed between the 5' and 3' ends. Could this lead to biased base composition in the center of the reads?

            Here is the mhist (match histogram) output. The match and substitution rates at the first few positions do not differ from those at the remaining positions. But your tool does show that the indel rate is zero for the first few positions. "Other" also shows a rate of 0.00003. So it seems a bit confusing. I am not sure whether the k-mers at the first few positions can be ignored.

            Here is the output:
            First 20 positions:
            #BaseNum Match1 Sub1 Del1 Ins1 N1 Other1
            1 0.99319 0.00678 0.00000 0.00000 0.00000 0.00003
            2 0.99407 0.00590 0.00000 0.00000 0.00000 0.00003
            3 0.99411 0.00583 0.00000 0.00004 0.00000 0.00002
            4 0.99430 0.00561 0.00002 0.00006 0.00000 0.00002
            5 0.99452 0.00537 0.00006 0.00009 0.00001 0.00002
            6 0.99453 0.00535 0.00009 0.00010 0.00000 0.00001
            7 0.99478 0.00508 0.00009 0.00013 0.00000 0.00001
            8 0.99470 0.00514 0.00009 0.00015 0.00000 0.00001
            9 0.99441 0.00545 0.00011 0.00014 0.00000 0.00001
            10 0.99491 0.00495 0.00011 0.00014 0.00000 0.00001
            11 0.99483 0.00501 0.00009 0.00016 0.00000 0.00001
            12 0.99501 0.00482 0.00012 0.00016 0.00000 0.00001
            13 0.99486 0.00496 0.00014 0.00017 0.00000 0.00001
            14 0.99462 0.00519 0.00015 0.00018 0.00000 0.00000
            15 0.99464 0.00518 0.00015 0.00018 0.00000 0.00000
            16 0.99463 0.00520 0.00017 0.00017 0.00000 0.00000
            17 0.99471 0.00510 0.00016 0.00019 0.00000 0.00000
            18 0.99441 0.00541 0.00015 0.00018 0.00000 0.00000
            19 0.99424 0.00558 0.00017 0.00018 0.00000 0.00000
            20 0.99420 0.00562 0.00016 0.00018 0.00000 0.00000

            Last 10 positions:
            91 0.99478 0.00507 0.00011 0.00014 0.00000 0.00001
            92 0.99451 0.00533 0.00009 0.00015 0.00000 0.00001
            93 0.99421 0.00564 0.00010 0.00013 0.00000 0.00001
            94 0.99382 0.00604 0.00010 0.00012 0.00000 0.00002
            95 0.99407 0.00580 0.00010 0.00012 0.00000 0.00002
            96 0.99457 0.00532 0.00006 0.00009 0.00000 0.00002
            97 0.99382 0.00609 0.00005 0.00007 0.00000 0.00002
            98 0.99395 0.00599 0.00002 0.00004 0.00000 0.00003
            99 0.99422 0.00574 0.00000 0.00000 0.00000 0.00003
            100 0.99373 0.00623 0.00000 0.00000 0.00000 0.00004
            Last edited by ty23991; 05-13-2015, 01:08 PM.


            • #7
              The match rate is over 99% for the first few bases, so the reads do not need to be trimmed; the enriched k-mers are genomic, not artifacts. They're just biased, I imagine due to the exonuclease. Sonication typically yields much less bias than enzymatic shearing.

              BBMap does not allow indels in the first or last 2 bp, which is why those are zero (you can't call them accurately at the tips of reads). "Other" means bases that were soft-clipped where the read goes off the end of a reference sequence.

              Again, I have no idea about the spike in the middle of the read.
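One way to look more closely at a spike like the one around positions 48-52 is to tabulate where each k-mer starts within the reads. The Python sketch below is only an illustration under stated assumptions: it expects plain four-line FASTQ records, uses k=7 as an arbitrary default, and computes a simple observed-over-mean ratio per position, which is not claimed to be the statistic FastQC reports. The names `fastq_sequences`, `kmer_position_counts`, and `positional_enrichment` are hypothetical.

```python
from collections import Counter, defaultdict

def fastq_sequences(handle):
    """Yield the sequence line of each record, assuming four lines per FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def kmer_position_counts(reads, k=7):
    """Map each k-mer to a Counter of the 1-based read positions where it starts."""
    counts = defaultdict(Counter)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if "N" not in kmer:  # skip k-mers containing uncalled bases
                counts[kmer][start + 1] += 1
    return counts

def positional_enrichment(counts, kmer, read_length, k=7):
    """Return {position: observed count / mean count per position} for one k-mer."""
    n_positions = read_length - k + 1
    per_position = counts[kmer]
    mean = sum(per_position.values()) / n_positions
    return {pos: n / mean for pos, n in sorted(per_position.items())}

# Tiny made-up example with k=3 and 6-base reads, just to show the calls.
toy_reads = ["ACGTTT", "GGACGA", "ACGCCC"]
toy_counts = kmer_position_counts(toy_reads, k=3)
print(positional_enrichment(toy_counts, "ACG", read_length=6, k=3))
```

On real data, the reads would come from something like `fastq_sequences(open("reads.fq"))`, and the k-mers worth inspecting are the ones FastQC lists as enriched. Checking whether those k-mers all come from one repeated sequence can help decide whether the spike is biological or technical.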


              • #8
                Thanks so much for the explanation.
