
  • Advice needed on De novo sequences Kmer content

    Good day,

    I need some advice on the Kmer content of my de novo project. I've sequenced the genome of a lovebird (parrot) species. Here are some details:

    - We sequenced the offspring at 100x coverage and its parents at 30x coverage on an Illumina HiSeq 2500
    - The offspring had 3 PE libraries of 300, 550 and 750 bp, the parents 2 PE libraries of 300 and 550bp
    - The offspring had 2 LJD MP libraries of 3 and 8 kb
    - The read lengths were 125bp but after trimming by the service providers they were 30-125bp long
    - The genome has a GC content of around 43%
    - Overall the FastQC files look good and the only problem is the Kmer content

    Here is the problem... There seems to be a Kmer bias around 42-54 bp in all 3 samples.

    It looks as if it is part of the Illumina TruSeq adapter, but it isn't reported as an overrepresented sequence. The adapter sequence is:
    5' GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-NNNNNN-ATCTCGTATGCCGTCTTCTGCTTG 3'

    I have attached two screenshots of the Kmer content plots. Most of the FastQC reports look like this, for all 3 birds.

    We have discussed it with the service provider, but they feel we don't have to worry at all.

    Has anybody experienced anything like this before? Can you offer some help please?

    Thank you in advance!
    Henriette

  • #2
    Since your reads are variable length, they should have already been trimmed. But perhaps the trimming did not work very well, due to (for example) low quality. If the reads have adapter contamination, you can find it using BBMerge:

    bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt outadapter=adapter.fa reads=4m

    The insert size histogram will also be informative (insert sizes shorter than read length indicate adapter contamination). Ideally (in this case), very few of your reads will even overlap, so they won't merge. Once this finishes, you can try trimming the sequences with BBDuk like this:

    bbduk.sh in1=read1.fq in2=read2.fq ref=adapter.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

    ...which will report the number of reads with adapter sequence. You can alternatively (or additionally) use the adapter sequence file distributed with BBDuk, since it has all standard Illumina adapters, but you never know what a random provider used.

    Since your target fragment lengths were, at a minimum, 300bp, there should be virtually zero adapter sequence present in 125bp reads. If there is, it indicates that your target insert sizes were probably not hit, or short fragments were not correctly removed. If you see adapter contamination in these trimmed reads, there was probably a serious problem upstream and you may need to request the sequencing to be redone, or else trim them correctly starting with the raw, untrimmed reads.
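
    The check described above (insert sizes shorter than the read length point to adapter contamination) can be sketched in Python. This is only an illustration: it assumes the ihist.txt written by BBMerge has header lines starting with "#" followed by two whitespace-separated columns, insert size and count, and the function name short_insert_fraction is made up for this example. Note that the histogram only covers pairs that actually merged, so the result is a fraction of merged pairs, not of all reads.

```python
def short_insert_fraction(ihist_text, read_length=125):
    """Fraction of merged pairs whose insert size is below read_length.

    Assumes lines starting with '#' are headers and every other line
    holds an insert size and a count (assumed BBMerge ihist layout).
    """
    short = 0
    total = 0
    for line in ihist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        size, count = int(fields[0]), int(fields[1])
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0


# Tiny made-up histogram, only to show the call.
example_ihist = "#InsertSize\tCount\n100\t10\n120\t30\n200\t60\n"
print(short_insert_fraction(example_ihist, read_length=125))
```

    For a real run, pass the contents of the file, e.g. short_insert_fraction(open("ihist.txt").read()). For 300bp+ libraries this fraction should be close to zero.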



    • #3
      Thanks for your reply Brian! I really appreciate your help.

      I forgot to add that the PE libraries all have a fixed read length of 125bp; only the LJD libraries are variable length and were trimmed by the service provider.

      The biggest problem we have is that the over represented Kmers are found in the middle of the read and it isn't the whole adapter sequence, but only a part of it. The regions before and after the over represented sequence are of good quality.

      Thanks!
      Henriette



      • #4
        Well, just because the over-represented kmers are reported as being shorter than the adapter sequence does not mean the entire adapter is not present. I suggest you try adapter-trimming the reads using the adapter set included with BBDuk and see if that resolves the problem.
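
        Before trimming, a rough way to see where adapter sequence starts inside the reads is to search each read for the first bases of the adapter quoted in post #1. The sketch below is an assumption-laden illustration, not what BBDuk does: it uses exact matching only (BBDuk tolerates mismatches via hdist), and the helper name adapter_start_positions is hypothetical. With ordinary read-through, an adapter starting at offset p would mean the insert was only about p bases long.

```python
from collections import Counter

# First 12 bases of the adapter quoted in post #1.
ADAPTER_PREFIX = "GATCGGAAGAGC"


def adapter_start_positions(reads, prefix=ADAPTER_PREFIX):
    """Count the 0-based offset at which `prefix` first appears in each read.

    Reads without an exact match are ignored.
    """
    positions = Counter()
    for read in reads:
        idx = read.find(prefix)
        if idx >= 0:
            positions[idx] += 1
    return positions


# Made-up reads, only to show the call.
example_reads = [
    "ACGTACGTAC" + ADAPTER_PREFIX + "ACACGTCTGA",  # adapter begins at offset 10
    "TTTTTTTTTT" + ADAPTER_PREFIX + "ACACGTCTGA",  # adapter begins at offset 10
    "ACGTACGTACGTACGTACGTACGTACGTACGT",            # no adapter
]
print(adapter_start_positions(example_reads))
```

        A pile-up of start offsets around 42-54 would be consistent with the Kmer plots described in the first post.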



        • #5
          Originally posted by henriettevdz
          The biggest problem we have is that the over represented Kmers are found in the middle of the read and it isn't the whole adapter sequence, but only a part of it.
          Henriette
          The FastQC Kmer plot shows only the top 6 most abundant kmers. It is very likely that all kmers from the full adapter are overrepresented; it just so happens that the ones from that spot in the middle are the most abundant. Examine the full FastQC report (fastqc_data.txt) and you will likely be able to reconstruct all/most of the adapter from the full list of abundant kmers.
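
          One way to try this is sketched below in Python. It is a minimal illustration under stated assumptions: that fastqc_data.txt contains a module opened by a line starting with ">>Kmer Content" and closed by ">>END_MODULE", with tab-separated rows whose first column is the k-mer sequence. The function names are hypothetical, and only the forward strand of the adapter portion quoted in post #1 (before the index) is checked.

```python
# Adapter portion before the index, as quoted in post #1.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def kmers_from_fastqc(data_text):
    """Return the k-mer sequences listed in the Kmer Content module."""
    kmers = []
    in_module = False
    for line in data_text.splitlines():
        if line.startswith(">>Kmer Content"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            kmers.append(line.split("\t")[0])
    return kmers


def adapter_coverage(kmers, adapter=ADAPTER):
    """Show adapter bases covered by at least one k-mer; '-' marks uncovered bases."""
    covered = [False] * len(adapter)
    for kmer in kmers:
        start = adapter.find(kmer)
        while start >= 0:
            for i in range(start, start + len(kmer)):
                covered[i] = True
            start = adapter.find(kmer, start + 1)
    return "".join(base if hit else "-" for base, hit in zip(adapter, covered))


# Made-up module contents, only to show the call.
example_data = (
    ">>Kmer Content\twarn\n"
    "#Sequence\tCount\tPValue\tObs/Exp Max\tMax Obs/Exp Position\n"
    "GATCGGA\t100\t0.0\t50.0\t42\n"
    "CGGAAGA\t90\t0.0\t45.0\t44\n"
    "AGAGCAC\t80\t0.0\t40.0\t48\n"
    "TTTTTTT\t10\t0.0\t5.0\t1\n"
    ">>END_MODULE\n"
)
print(adapter_coverage(kmers_from_fastqc(example_data)))
```

          If most of the adapter shows up as covered, the reported k-mers are probably just fragments of the full adapter rather than a separate problem.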



          • #6
            BBMerge trim

            Dear Brian,

            We have trimmed the adapter sequences and I've attached the two FastQC kmer content files of the same runs (only one sample, from the 300 and 550 libraries). The number of sequences was reduced from around 50 000 000 to 4 000 000. Is this what we could expect from the data?

            Thanks!
            Henriette


            • #7
              To be honest, I tend to find the overrepresented kmer graphs fairly confusing and rely more on the base frequency by position. If the total number of sequences was reduced from 50 million to 4 million then you have a major problem with the raw data and it needs to be re-run (possibly with higher molecular weight input DNA). Or did you mean 40 million?

              It would be helpful to see the mapping results to an assembly, but with 4m reads, you won't get much of an assembly. So...

              First off, can you post the stderr (console) output of BBDuk?

              Second, it would be useful if you could run BBMerge on the raw input and post the console output, and attach the insert size histogram, for the 300bp and 550bp libraries, like this:

              bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt reads=4m

              Also, the entire FastQC report from before trimming (in PDF, or, at least, the base composition histogram) would be useful. It seems like maybe you have a huge number of adapter-dimers, or very short inserts.
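
              To put the two possible read counts side by side, here is the arithmetic as a short Python sketch. The 50 million starting figure and the 4 million and 40 million outcomes come from the posts above; the function name retained_fraction is just illustrative.

```python
def retained_fraction(reads_before, reads_after):
    """Fraction of reads that survived trimming."""
    return reads_after / reads_before


READS_BEFORE = 50_000_000
for reads_after in (4_000_000, 40_000_000):
    kept = retained_fraction(READS_BEFORE, reads_after)
    print(f"{reads_after:,} of {READS_BEFORE:,} reads kept: "
          f"{kept:.0%} retained, {1 - kept:.0%} lost")
```

              Keeping 4 million reads means losing roughly 92% of the data, while keeping 40 million means losing about 20%, which is why the distinction matters here.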
