  • Trimming adapter sequence in the middle of a read

    I am analyzing microRNA sequencing data (50 bp reads, single-end, Illumina) and I have a read like this:

    TAGTAGGTTGCATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCAG

    The portion TGGAATTCTCGGGTGCCAAGG (underlined in the original post) is standard Illumina adapter sequence. I am pretty sure the rest of the 3' end is artifact, but the standard adapter trimming tool that I was using doesn't remove adapters that occur in the middle of the read. Are there any tools that can trim the adapter sequence and the junk after it? In this case, I only want to keep TAGTAGGTTGCATAGTT. I tried BLASTing the read sequences, and the 5' regions are indeed microRNA sequences, but the reads will not align properly because the adapter and the 3' regions are not microRNA sequence. Any help would be greatly appreciated. Thank you very much.

  • #2
    Cutadapt (or trim_galore, which I find to be a nice wrapper) can do that.
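
The operation being asked for, cutting the read at the adapter and discarding everything downstream, can be sketched in a few lines of Python. This is only an exact-match illustration, not how cutadapt or trim_galore work internally, and the function name `trim_at_adapter` is made up for the example.

```python
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"  # adapter sequence used in the commands in this thread


def trim_at_adapter(read: str, adapter: str = ADAPTER) -> str:
    """Return the part of `read` before the first exact adapter occurrence.

    Reads without an exact, full-length copy of the adapter come back unchanged.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


if __name__ == "__main__":
    # The example read from the first post
    print(trim_at_adapter("TAGTAGGTTGCATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCAG"))
```

Because it only finds exact, complete copies of the adapter, a sketch like this would miss truncated adapters at the end of a read and adapters containing sequencing errors, which is where a dedicated trimmer earns its keep.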



    • #3
      Originally posted by milesgr

      I am pretty sure the rest of the 3' end is artifact,
      I have not worked with Illumina miRNA reads, but in general the Illumina adapters are close to 60 bp long. After that you do indeed get unpredictable sequences from the flow cell or other 'artifacts', but what you are seeing is probably just more of the adapter, possibly including a multiplex barcode if one was used on your samples.

      See this webpage from U Texas at Austin:



      • #4
        Thanks for the info - it was very helpful. As a follow-up, I used the following command:
        cutadapt -e 0.05 -a TGGAATTCTCGGGTGCCAAGG 001.fastq > 001_CLIPPED.fastq

        I found the output still retained some adapters. For instance, this read was left with a partial adapter at its 3' end (TGGAATTCTCGGGTG, underlined in the original post):
        GAGACCGCCTGGGAATACCGGGTGCTGTAGGCTTTGGAATTCTCGGGTG

        This sequence is 49 bases (remember, read lengths were 50 bp), which makes me think the trimmer removed only the last base and missed the rest of the adapter. Here is another one, where a single-base deletion (the T missing between the bolded bases in the original post, so that TTCTCGG reads as TTCCGG) seems to have defeated the trimming:

        CCCCCCACTGCTAACTTTGACTGGCTTTGGAATTCCGGGTGCAAGGAAC

        I wanted to leave some error (0.05 error allows for one base error out of 21 total on the adapter sequence), but cutadapt seems to be missing a lot, reducing my miRNA coverage significantly. Any suggestions would be greatly appreciated. Thanks in advance.
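
The arithmetic behind that parenthesis can be checked quickly. The sketch below assumes the trimmer allows floor(error rate × matched length) errors, which is how the cutadapt documentation describes `-e`; the helper name `allowed_errors` is hypothetical.

```python
import math


def allowed_errors(matched_len: int, max_error_rate: float) -> int:
    """Errors tolerated for a match of `matched_len` adapter bases.

    Assumption: the cap is floor(rate * length), as described for cutadapt's -e.
    """
    return math.floor(matched_len * max_error_rate)


if __name__ == "__main__":
    print(allowed_errors(21, 0.05))  # full 21-base adapter at -e 0.05
    print(allowed_errors(15, 0.05))  # 15-base partial adapter at the read end
    print(allowed_errors(21, 0.10))  # full adapter at a looser -e 0.1
```

Under that assumption, `-e 0.05` tolerates one error across the full 21-base adapter but none in a partial match shorter than 20 bases, so truncated adapters at the 3' end would have to match exactly.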



        • #5
          I use Trimmomatic, but I find that it doesn't recognize adapters with indels either.

          If the value of -e that you are using allows for only 1 mismatch out of 21 bases, it's possible that the adapter sequence you are giving cutadapt is too short, and the score is not high enough for it to recognize adapters in your reads. Maybe you should try allowing a higher error level.
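
One way to judge how high the error level needs to be is to count the differences between the adapter and the adapter-like stretch left in a read. The Python sketch below uses a plain Levenshtein distance (a generic stand-in, not the scoring any particular trimmer uses) on the second leftover read from post #4; the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs for mismatch, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # base missing from b
                            curr[j - 1] + 1,             # extra base in b
                            prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = curr
    return prev[-1]


ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
# Adapter-like stretch in ...GGCTT|TGGAATTCCGGGTGCAAGG|AAC from post #4
OBSERVED = "TGGAATTCCGGGTGCAAGG"

if __name__ == "__main__":
    errors = edit_distance(ADAPTER, OBSERVED)
    print(errors, round(errors / len(ADAPTER), 3))
```

For this read the comparison suggests two missing bases rather than one (a T from TTCTCGG and a C from TGCCAAGG), so an error rate a little under 0.1 would be needed over the 21-base adapter, which is above the 0.05 that was used.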



          • #6
            Originally posted by mastal
            I use trimmomatic, but i find that it doesn't recognize adapters with indels either.
            You are correct: right now, Trimmomatic doesn't perform matching with indels, since they are relatively rare in standard Illumina datasets, and Trimmomatic was designed to meet our own requirements rather than to cover all possible tasks.

            That said, we are currently evaluating what additional alignment (or other) features are needed for more special case applications, so if anyone has any suggestions, please let us know (email on the trimmomatic web page).

            Thanks,

            Tony.



            • #7
              Hi Tony,

              I do find occasional 1-base insertions or deletions in the Illumina adapter sequences. In the case of insertions, it looks like a homopolymer effect: the inserted base is almost always the same as the previous base (on the 5' side) in the sequence.
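
The insertion pattern described here, an extra base that repeats the base on its 5' side, can be enumerated for a given adapter. The Python sketch below lists those variants for the small RNA adapter used earlier in the thread; the function name is invented for the example, and whether such a list is useful in practice (for instance to check reads for these variants) is an assumption.

```python
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def homopolymer_insertions(seq: str) -> list[str]:
    """All distinct sequences obtained by duplicating exactly one base of `seq`.

    Duplicating any base inside the same run gives the same string,
    so the result has one variant per homopolymer run.
    """
    variants = {seq[:i + 1] + seq[i] + seq[i + 1:] for i in range(len(seq))}
    return sorted(variants)


if __name__ == "__main__":
    for variant in homopolymer_insertions(ADAPTER):
        print(variant)
```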

              By the way, I think trimmomatic is great, even if it took me a while to understand how palindrome clipping works.

              Best wishes,
              Maria



              • #8
                Try skewer.

                If you put all the sequences in test.fasta as below:
                >1
                TAGTAGGTTGCATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCAG
                >2
                GAGACCGCCTGGGAATACCGGGTGCTGTAGGCTTTGGAATTCTCGGGTG
                >3
                CCCCCCACTGCTAACTTTGACTGGCTTTGGAATTCCGGGTGCAAGGAAC

                and use the following command:
                $ skewer -r 0.2 -d 0.06 -x TGGAATTCTCGGGTGCCAAGG test.fasta -1 -l 16 2>/dev/null

                you may get the following output:
                >1
                TAGTAGGTTGCATAGTT
                >2
                GAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT
                >3
                CCCCCCACTGCTAACTTTGACTGGCTT

                In your case the error rate is higher than usual, so I chose a higher overall error rate (-r 0.2) and a higher indel error rate (-d 0.06).

                BTW: indel errors do occur in Illumina reads, though they are pretty rare.
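
To make the idea of indel-tolerant 3' trimming concrete, here is a toy Python sketch. It is not skewer's (or cutadapt's) algorithm: it tries every start position, scores the adapter against the read with a plain edit-distance table, and keeps the candidate with the most matching adapter bases. The names `trim_3prime`, `max_rate` and `min_overlap`, and the floor(rate × length) error cap, are assumptions made for the illustration.

```python
def edit_table(a: str, b: str) -> list[list[int]]:
    """table[j][i] = edit distance between a[:j] and b[:i] (unit costs)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for j in range(len(a) + 1):
        table[j][0] = j
    for i in range(len(b) + 1):
        table[0][i] = i
    for j in range(1, len(a) + 1):
        for i in range(1, len(b) + 1):
            table[j][i] = min(table[j - 1][i] + 1,
                              table[j][i - 1] + 1,
                              table[j - 1][i - 1] + (a[j - 1] != b[i - 1]))
    return table


def trim_3prime(read: str, adapter: str, max_rate: float = 0.1,
                min_overlap: int = 3) -> str:
    """Cut `read` where the adapter (or an adapter prefix at the read end) begins."""
    m = len(adapter)
    slack = int(max_rate * m)            # errors allowed for a full-length match
    best_key, best_start = None, None
    for start in range(len(read)):
        window = read[start:start + m + slack]
        table = edit_table(adapter, window)
        candidates = []                  # (adapter bases aligned, errors)
        # Case 1: the whole adapter lies inside the read.
        full_errors = min(table[m])
        if full_errors <= slack:
            candidates.append((m, full_errors))
        # Case 2: the read ends part-way through the adapter.
        if start + len(window) == len(read):
            for j in range(min_overlap, m):
                errors = table[j][len(window)]
                if errors <= int(max_rate * j):
                    candidates.append((j, errors))
        for aligned, errors in candidates:
            # Prefer more matching bases, then fewer errors, then leftmost start.
            key = (aligned - errors, -errors, -start)
            if best_key is None or key > best_key:
                best_key, best_start = key, start
    return read if best_start is None else read[:best_start]


ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
READS = [
    "TAGTAGGTTGCATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCAG",
    "GAGACCGCCTGGGAATACCGGGTGCTGTAGGCTTTGGAATTCTCGGGTG",
    "CCCCCCACTGCTAACTTTGACTGGCTTTGGAATTCCGGGTGCAAGGAAC",
]

if __name__ == "__main__":
    for read in READS:
        print(trim_3prime(read, ADAPTER, max_rate=0.1))
```

With `max_rate=0.1` this sketch should cut the three test reads at the same positions as the skewer output above, while with `max_rate=0.05` the third read stays untrimmed, consistent with what was reported for cutadapt at `-e 0.05`. It runs a full table per start position, so it is far too slow for real datasets and is only meant to show the logic.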
