  • weird kmer content in 5' end from genomic DNA PE reads

    Hello

    My name is Gabriel. I have asked this previously in the Illumina subforum but it seems that my post belongs here.

    I'm analyzing Illumina reads (generated on a HiSeq 2000) from the genome of an insect species. The sequencing facility gave me the FASTQ files with the adapters already removed, but when I check the filtered FASTQ files with the latest FastQC version (v0.11.2) I see a weird kmer pattern in the 5' region. It looks as if a particular sequence is overrepresented, yet the overrepresented sequences module does not show anything unusual.

    The overrepresented kmers also show a strong GC bias (e.g. GGCCCGG, GCCCGGG and so on). By overlapping the kmers I reconstructed the sequence CTAGTATGGCCCGGGGGATCC, but so far I have not found anything related to it. I'm unsure whether it is OK to just trim this sequence, since I don't know what this pattern means. The sequence is present in both paired-end files, and FastQC shows the kmer content peak at the 5' end of both.
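The by-eye overlap of kmers described above can be sketched in a few lines of Python. This is only an illustration, not something FastQC does itself: it assumes the kmers are already listed in the order of their position in the read, and the function name `merge_shifted_kmers` is made up for this example. The first two kmers come from the post; the third is an assumed continuation of "and so on".

```python
def merge_shifted_kmers(kmers):
    """Merge position-ordered kmers by their longest suffix/prefix overlap."""
    if not kmers:
        return ""
    merged = kmers[0]
    for kmer in kmers[1:]:
        # Try the longest possible overlap first, then shorter ones.
        for overlap in range(min(len(merged), len(kmer)), 0, -1):
            if merged.endswith(kmer[:overlap]):
                merged += kmer[overlap:]
                break
        else:
            raise ValueError(f"{kmer} does not overlap {merged}")
    return merged


# Third kmer is an assumption, not taken from the FastQC report.
print(merge_shifted_kmers(["GGCCCGG", "GCCCGGG", "CCCGGGG"]))  # GGCCCGGGG
```

In GC-rich, repetitive stretches like this one the longest overlap is not always the true one, so the result is worth checking against the actual reads.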

    When searching this pattern with grep in my files I have noticed that there are several reads that seem to be duplicated, as the read sequence remains the same. I don't know if these duplicated reads should be removed or left.
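A rough Python equivalent of that grep check is sketched below: it counts how often each distinct read containing the motif occurs. It assumes plain four-line FASTQ records; the helper names and the toy reads are hypothetical and only stand in for a real file.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip()


def count_motif_reads(handle, motif):
    """Count how often each distinct read containing `motif` occurs."""
    return Counter(seq for seq in fastq_sequences(handle) if motif in seq)


# Toy reads (hypothetical, shortened) standing in for a real FASTQ file.
toy_reads = [
    "ACTAGTATGGCCCGGGGGATCCTACG",
    "ACTAGTATGGCCCGGGGGATCCTACG",
    "TTGACCATGCAAGTCCGATTAGCATG",
    "AGTATGGCCCGGGGGATCCTACGTTC",
]
toy_fastq = "".join(
    f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n" for i, seq in enumerate(toy_reads, 1)
)

counts = count_motif_reads(io.StringIO(toy_fastq), "GGCCCGGGGGATCC")
for seq, n in counts.most_common():
    print(n, seq)
```

For a gzipped file, the `io.StringIO` object could be replaced by `gzip.open(path, "rt")`.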

    In my web searches so far I have only seen similar kmer patterns reported for RNA-seq data, which is not my case. The "bad sequence" example on the FastQC web page also shows a similar pattern, but at the 3' end rather than the 5' end where I see it.

    It is worth noting that I have Paired end (2x100) files, and both files (1 and 2) have the same pattern.

    I have attached the Kmer module graphs in these links:




    I can add more information if needed.

    Thank you very much (and sorry for my English :P)

  • #2
    What kit was used for library prep? Could you also post the FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters?



    • #3
      Hi nucacidhunter:

      Thanks for replying. I'll answer by quoting what you posted.

      Originally posted by nucacidhunter
      What kit was used for library prep
      I sent the samples to another, external facility and I don't know which kit they used, so I'll find out ASAP.

      I asked them to sequence my library on an Illumina HiSeq 2000 in paired-end mode (2x100 bp). I later learned from the index and adapter sequences they sent me that the samples were multiplexed.

      Originally posted by nucacidhunter
      and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters.
      They did tell me the adapters used (when asked!), which are these:

      TruSeq Universal Adapter

      5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

      TruSeq Adapter, Index 5

      5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG

      Attached to the post are the plots for both read files. I have uploaded the plots for forward and reverse files (2nd plot of each category would be the reverse plot).









      Finally the kmer content




      These files should let you download the full FastQC report (Ver 0.11.2) in case you want to see it




      Thank you very much,

      Gabriel
      Last edited by gab0; 08-07-2014, 07:20 AM.



      • #4
        Apart from Kmer content, every parameter looks fine in the FastQC report. The number of over-represented Kmers is low (although it is unusual to see them in balanced genomes), and I do not think it should be of any concern. The over-represented Kmers could come from duplicate reads (the duplication plot shows a small bump in % total sequences at duplication levels above 10); this can be checked by removing duplicates and running FastQC again. Alternatively, they could result from bias in at least one step of library prep due to the AT-rich nature of the genome. Whether duplicates should be removed depends, I think, on the downstream application, and I will let a bioinformatician comment on that.



        • #5
          Originally posted by nucacidhunter
          Apart from Kmer content, every parameter looks fine in the FastQC report. The number of over-represented Kmers is low (although it is unusual to see them in balanced genomes), and I do not think it should be of any concern. The over-represented Kmers could come from duplicate reads (the duplication plot shows a small bump in % total sequences at duplication levels above 10); this can be checked by removing duplicates and running FastQC again. Alternatively, they could result from bias in at least one step of library prep due to the AT-rich nature of the genome. Whether duplicates should be removed depends, I think, on the downstream application, and I will let a bioinformatician comment on that.
          Hi

          thanks for your help! So apart from the Kmer problem, the files look ok for downstream analysis.

          Well, I've found and (partially) fixed the kmer problem, so I'll write out here how I solved it:

          When checking the files with FastQC v0.11.2, I saw this strange kmer pattern. Looking at the Kmers, I noticed that each one was shifted by 1 bp relative to the previous one, so I started to assemble the Kmer sequence (just by eye). Then, searching for the Kmer pattern with grep, I found that there were some repeated sequences/reads, like this one:

          "ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"

          Looking further, I found a variant of this read:

          "AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"

          As you can see, the variant is shifted by 3 bp: it lacks the first three bases (ACT) at the 5' end and has three extra bases (GAT) at the 3' end.
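The shift between two such reads can also be checked programmatically rather than by eye. The sketch below is a hypothetical helper (`find_shift` is not part of any tool mentioned in this thread), and the two sequences are shortened stand-ins for the reads quoted above.

```python
def find_shift(read_a, read_b, max_shift=10):
    """Return the smallest shift s such that read_a[s:] is a prefix of read_b.

    Returns None if no shift up to max_shift gives an exact match.
    """
    for shift in range(max_shift + 1):
        overlap = read_a[shift:]
        if overlap and read_b.startswith(overlap):
            return shift
    return None


# Shortened stand-ins: read_b drops "ACT" at the 5' end and gains "GAT" at the 3' end.
read_a = "ACTAGTATGGCC"
read_b = "AGTATGGCCGAT"
print(find_shift(read_a, read_b))  # 3
```

Only exact matches are considered, so a single sequencing error in the overlap would make the function return None.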

          Searching the web again, I found a document from Illumina, the Illumina Customer Sequence Letter. There I found some sequences that matched my reads, listed as: "Process Controls for TruSeq® Sample Preparation Kits Included in TruSeq DNA and RNA (v1/v2/LT/HT) and TruSeq Exome Kits"

          So it seems that these reads came in as part of the library control, and they were not filtered by the sequencing facility.

          I tested a couple of tools for removing the repeated reads. I used fastx_collapser, but it turns out that it produces FASTA files as output, not FASTQ files. Then I tested fastq-mcf, which filtered out the repeated reads, both the genuine duplicates and the control library reads.

          After filtering out the repeated reads, now I had some FASTQ files without kmer warnings. Yoo-hoo!

          Now I have to find another tool that removes only the control reads while keeping the valid duplicate reads. I was thinking of using prinseq for this.
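The goal stated here, dropping only reads that match a known control sequence while leaving genuine duplicates alone, can be illustrated with a simple kmer lookup. This is a simplified sketch, not how prinseq or any other tool mentioned in this thread is implemented: all function names are made up, read pairs are reduced to bare sequences, and the control sequence is the fragment assembled by hand earlier in the thread rather than the full sequence from Illumina's letter.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def kmer_set(sequences, k):
    """Collect all kmers of the sequences and of their reverse complements."""
    kmers = set()
    for seq in sequences:
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def matches_control(read, control_kmers, k):
    """True if any kmer of the read is also a control kmer."""
    return any(read[i:i + k] in control_kmers for i in range(len(read) - k + 1))


def filter_pairs(pairs, controls, k=15):
    """Keep a pair only if neither mate shares a kmer with a control.

    k must not exceed the length of the shortest control sequence.
    """
    control_kmers = kmer_set(controls, k)
    return [
        (r1, r2) for r1, r2 in pairs
        if not (matches_control(r1, control_kmers, k)
                or matches_control(r2, control_kmers, k))
    ]


control = "CTAGTATGGCCCGGGGGATCC"  # fragment assembled by hand in this thread
clean = ("TTGACCATGCAAGTCCGATTAGCATG", "GGATTACCGATTGCAAGGCTTAACGT")
pairs = [
    ("ACTAGTATGGCCCGGGGGATCCTACG", "TTGACCATGCAAGTCCGATTAGCATG"),  # control read
    clean,
    clean,  # a genuine duplicate, which should survive
]
kept = filter_pairs(pairs, [control], k=15)
print(len(kept))  # 2
```

Because pairs are removed only on a kmer match to the control, identical genomic pairs are untouched, unlike with a collapse-style duplicate filter.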

          Thanks for your help!



          • #6
            Hi gab0,

            I am facing exactly the same k-mer content issue, so I didn't create a separate thread when I came across yours. My question to you is: which tool did you use to retain the valid duplicate reads and remove only the control reads? Thanks in advance.



            • #7
              Originally posted by gauravdube
              Hi gab0,

              I am facing exactly the same k-mer content issue, so I didn't create a separate thread when I came across yours. My question to you is: which tool did you use to retain the valid duplicate reads and remove only the control reads? Thanks in advance.
              Hi gauravdube:

              I found and used tools from the BBMap package. Brian helped me out, guiding me on how to use the bbduk tool.

              I used the following command line: bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq

              The adapters file contains, in FASTA format, all the adapters that I could find for Illumina platforms, including the control sequences from the libraries. That worked for me; hopefully it will work for you too!

              Best regards,

              Gabriel



              • #8
                Originally posted by gab0
                Hi gauravdube:

                I found and used tools from the BBMap package. Brian helped me out, guiding me on how to use the bbduk tool.

                I used the following command line: bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq

                The adapters file contains, in FASTA format, all the adapters that I could find for Illumina platforms, including the control sequences from the libraries. That worked for me; hopefully it will work for you too!

                Best regards,

                Gabriel
                Dear Gabriel,

                Very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.

                Thanks a lot,
                nike00



                • #9
                  It looks like Nextera bias to me.



                  • #10
                    Originally posted by nike00
                    Dear Gabriel,

                    Very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.

                    Thanks a lot,
                    nike00
                    If you download the BBMap package, the adapters are in the resources directory - nextera.fa.gz, truseq.fa.gz, and truseq_rna.fa.gz. You can use all of them with the flag "ref=nextera.fa.gz,truseq.fa.gz,truseq_rna.fa.gz" (with the appropriate paths).
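As a sketch of how the comma-separated `ref=` flag with "the appropriate paths" might be assembled, the Python snippet below builds a bbduk argument list from the parameters that appear in this thread (written without the leading hyphens on `in=`, `in2=`, `out=` and `out2=`). The function name, the install directory `/opt/bbmap/resources` and the input/output file names are assumptions for illustration; the snippet only builds the command and does not run it.

```python
from pathlib import PurePosixPath


def bbduk_command(in1, in2, out1, out2, resources_dir,
                  refs=("nextera.fa.gz", "truseq.fa.gz", "truseq_rna.fa.gz"),
                  memory="4g"):
    """Build a bbduk.sh argument list with a comma-separated ref= flag."""
    ref_arg = ",".join(str(PurePosixPath(resources_dir) / name) for name in refs)
    return [
        "bbduk.sh",
        f"-Xmx{memory}",
        f"in={in1}",
        f"in2={in2}",
        f"ref={ref_arg}",
        f"out={out1}",
        f"out2={out2}",
    ]


# Hypothetical paths; adjust to wherever BBMap is unpacked.
cmd = bbduk_command("reads_1.fastq.gz", "reads_2.fastq.gz",
                    "out1.fastq", "out2.fastq", "/opt/bbmap/resources")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.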



                    • #11
                      Hi Gabriel,

                      Thank you so much. It worked for me.

                      Originally posted by gab0
                      Hi gauravdube:

                      I found and used tools from the BBMap package. Brian helped me out, guiding me on how to use the bbduk tool.

                      I used the following command line: bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq

                      The adapters file contains, in FASTA format, all the adapters that I could find for Illumina platforms, including the control sequences from the libraries. That worked for me; hopefully it will work for you too!

                      Best regards,

                      Gabriel



                      • #12
                        Originally posted by Brian Bushnell
                        If you download the BBMap package, the adapters are in the resources directory - nextera.fa.gz, truseq.fa.gz, and truseq_rna.fa.gz. You can use all of them with the flag "ref=nextera.fa.gz,truseq.fa.gz,truseq_rna.fa.gz" (with the appropriate paths).
                        Thank you very much!
