  • Trimming Illumina PE sequences with Trimmomatic

    Hi all,

    I'm a newbie to NGS work and have a few questions that I hope someone can help me with.

    I got strand-specific PE Illumina data (100 bp). The company already did a clean-up (adaptor filtering, etc.). I checked my two files with FastQC and quality-wise they look good (over Q30 on average), but there is a slight drop at the 3' end to about Q28, and the other graphs show a bit more variation in the first 5-7 bases. I did a test run with PrinSeq and Trimmomatic on a few hundred sequences, and Trimmomatic seems to give me the nicer output, one that Trinity might like better. PrinSeq repeats the sequence identifier on the quality header line (line 1: sequence identifier, line 2: the actual read, line 3: + followed by the sequence identifier again, line 4: quality information), whereas Trimmomatic writes only the + on line 3, which matches the input file.

    The two things I'm interested in doing are a HEADCROP of 7 nucleotides and using TRAILING to trim by quality at the 3' end. According to the manual, TRAILING seems to work with a quality score of 1, 2 or 3 (3 should be used). What does that mean? I'd like to cut anything below Q30 at my 3' end. Some posts here on other Trimmomatic questions seem to suggest I can write 30 as well. Is that true? Could I just say TRAILING:30, or does it have to be 3 (whatever that means)?

    Strand-specificity doesn't really matter here, does it (my data is RF directionality)? Can I still write the /1 file as my first input file and the /2 as my second input file or would I have to change that?

    I'm also struggling with phred33/phred64. I read that 33 is for Illumina version 1.8, and I also read the wiki post most people seem to refer to in that regard, but the entry my seq id matches doesn't clearly say what version it belongs to. It's very difficult to get information from my sequencing company, so I hope to figure out myself what version they might have used. My sequence id is like this:
    instrument:run id:1101:1374:1950#ATCAGAA/1
    Is there a way to figure out the Illumina version based on this?

    Thank you so much for your help and apologies for the lengthy post.

    Nicole

  • #2
    "TRAILING:30" will work
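For anyone unsure what the two steps actually do to a read, here is a rough Python sketch of the behaviour as I understand it from the Trimmomatic manual: HEADCROP removes a fixed number of bases from the start, and TRAILING removes bases from the 3' end for as long as their quality is below the threshold. The function name `trim_read`, the example read and the default `offset=33` are made up for illustration; this is not Trimmomatic's own code.

```python
def trim_read(seq, qual, headcrop=7, trailing=30, offset=33):
    """Illustrative only: apply a HEADCROP-like step, then a TRAILING-like step."""
    # HEADCROP: drop a fixed number of bases from the 5' end.
    seq, qual = seq[headcrop:], qual[headcrop:]
    # TRAILING: walk back from the 3' end while the base quality is below the threshold.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < trailing:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical Phred+33 read: 'I' = Q40, '5' = Q20, '+' = Q10, '#' = Q2.
seq = "ACGTACGTACGTACGT"
qual = "IIIIIIIIIIIII5+#"
print(trim_read(seq, qual))
```

Trimmomatic applies its steps in the order they are given on the command line, which is why the sketch crops the head first and trims the tail second.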
    However, Q20 is almost universally accepted as the minimum threshold (99% base call accuracy, if I remember correctly), although this probably stems from the wide use of 454 in the early days of NGS. Q30 (99.9%) is a good minimum for Illumina, but you could justify keeping any bases >Q20.
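The percentages follow from the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. A small sketch to check the numbers (the function name `phred_to_error` is just for illustration):

```python
def phred_to_error(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


for q in (20, 28, 30):
    p = phred_to_error(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```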

    Take a look here: http://en.wikipedia.org/wiki/FASTQ_format
    Were your samples run on a HiSeq?
    If there is any confusion (your seq ids may have been edited without you knowing, for example) you can figure out the phred encoding from the characters used in your quality data
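A minimal sketch of that check, assuming the usual ranges from the FASTQ wiki page: Phred+64 encodings never use characters below ';' (ASCII 59), and Phred+33 data from Illumina normally does not go above 'J' (ASCII 74). The function name `guess_phred_offset` and the example strings are hypothetical, and a small sample of high-quality reads can be genuinely ambiguous.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from quality strings; None if undecidable."""
    codes = [ord(c) for line in quality_lines for c in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:   # characters this low cannot occur in +64 encodings
        return 33
    if max(codes) > 74:   # characters this high are not expected in +33 data
        return 64
    return None           # every character falls in the overlapping range


print(guess_phred_offset(["BPPSSghhh"]))  # made-up quality string
print(guess_phred_offset(["##5IIIII"]))   # made-up quality string
```

Feeding it quality lines from many reads across the file gives a more reliable answer than the first few records.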


    • #3
      Thanks Jackie.

      Yes, my samples were run on a HiSeq 2000. The company just got back to me: supposedly the data was run through the Illumina pipeline v1.5, and the base quality values run from 2 to 41.

      I tried to figure out the ASCII codes. With v1.5 (so phred+64, if I found the right information) I'd have to start looking from 66, because 0 and 1 don't exist anymore and 2 is that weird "B", correct? That's where I stop understanding it. With v1.5, B is supposed to occur only at the end of a read and to mean Q<15 without a specific quality value attached, yet looking at the first 100 sequences, about 90% start with a BP/ or BS/. How can that be? Do you think they told me the wrong version number?
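To make the ASCII arithmetic concrete, here is a small sketch that decodes a quality string under the Phred+64 assumption. The function name `decode` and the example string are made up; the point is only that a quality range of 2 to 41 corresponds to the characters 'B' (ASCII 66) through 'i' (ASCII 105).

```python
OFFSET = 64  # assumption: Phred+64, as used by Illumina pipeline v1.5


def decode(qual, offset=OFFSET):
    """Turn a quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qual]


print(decode("BPSghi"))  # hypothetical quality string
```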

      Thanks


      • #4
        If it is just the 3rd line header that is the problem with Prinseq then you can use the command line flag

        -no_qual_header

        Your output should then just show a + in the 3rd line. This default in the prinseq output seems to be quite a common cause of confusion.


        • #5
          Nicole,

          The sequences are not listed randomly, and the first reads are usually low quality (i.e., lots of Bs). Check the quality scores of reads from the middle of the data set for a more accurate representation of the whole.
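One way to do that check, sketched in Python under the assumption that the file is a plain 4-line-per-record FASTQ (no wrapped sequence lines). The helper names `quality_lines` and `fraction_with_b` and the tiny in-memory data set are invented for the example.

```python
from itertools import islice


def quality_lines(fastq_lines, skip_reads, n_reads):
    """Yield the quality strings of n_reads records after skipping skip_reads records."""
    start = skip_reads * 4 + 3   # 4 lines per record; the quality string is the 4th
    stop = start + n_reads * 4
    for line in islice(fastq_lines, start, stop, 4):
        yield line.rstrip("\n")


def fraction_with_b(quals):
    """Fraction of quality strings that contain at least one 'B'."""
    quals = list(quals)
    if not quals:
        return 0.0
    return sum("B" in q for q in quals) / len(quals)


# Tiny made-up data set: two poor reads up front, four good ones after.
demo = []
for i, qual in enumerate(["BBBB", "hhBB", "hhhh", "hhhh", "gghh", "hhhh"]):
    demo += [f"@read{i}\n", "ACGT\n", "+\n", qual + "\n"]

print(fraction_with_b(quality_lines(demo, 0, 2)))  # first two reads
print(fraction_with_b(quality_lines(demo, 2, 4)))  # reads from further in
```

With a real file you would pass the open file handle instead of `demo` and skip a large number of records before sampling.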


          • #6
            Thanks HESmith and ddb!
