  • Remove the adapter sequence by fastx_clipper in fastq file

    I have a really bad pair of FASTQ files, so I am trying to use the FASTX-Toolkit for quality control.

    My main question is: how do I remove the adapter sequence if I do not know which adapter was used (I only know that some adapter sequence is present)? The default adapter option in fastx_clipper is 'CCTTAAGG'. And if I did know my adapters, for example two different ones, should I remove them one at a time, or is there a way to remove them all together?

    Thanks in advance.

  • #2
    Have you run FastQC? It will show you any over-represented sequences. You could then use something like cutadapt, which lets you supply a file containing multiple adapter sequences.



    • #3
      Don't use fastx_clipper, it's way too aggressive. Use cutadapt or something else instead.



      • #4
        Thanks, Jimmy.

        Yes, I have run FastQC. There are many over-represented sequences, but only one or two of them are annotated as 'TruSeq Adapter, Index 3 (**% over **)' in the 'Possible source' column, and most of those matches are not 100%. Should I give cutadapt the whole sequence or just part of it? And what about the other over-represented sequences? They should not be listed in the adapter file, right?





        • #5
          I forgot to mention that all my reads are about the same length, 50 or 51 bp. It seems cutadapt is designed to remove the adapter only when the read is longer than the molecule that was sequenced.



          • #6
            Thank you for your comments.

            Would you mind explaining in more detail why fastx_clipper is too aggressive?


            • #7
              If you look at the source code, it cuts adapters if:

              - the first base matches, and
              - more than 5 bases match through the adapter

              With long adapters and long reads, it tends to cut in too many places, and in general it is not suited to many analysis scenarios. Look for other threads on SEQanswers about fastx_clipper for the details.
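
To make the point concrete, here is a toy Python sketch of the rule as I have described it. It is only my reading of the two conditions, not fastx_clipper's actual source code, and the function name `toy_clip` and the example reads are made up for illustration.

```python
def toy_clip(read: str, adapter: str, min_matches: int = 6) -> str:
    """Clip `read` at the first position that satisfies the two conditions.

    Toy illustration only: this is NOT fastx_clipper's source code.
    """
    for start in range(len(read)):
        # Condition 1: the first adapter base must match the read here.
        if read[start] != adapter[0]:
            continue
        # Compare the adapter with the rest of the read, base by base.
        matches = sum(1 for r, a in zip(read[start:], adapter) if r == a)
        # Condition 2: more than 5 bases match (mismatches are tolerated).
        if matches >= min_matches:
            return read[:start]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # common Illumina adapter prefix, used as an example
    # A read that really ends in the adapter is clipped where we expect.
    print(toy_clip("ACGTACGTTTGCA" + adapter, adapter))
    # A read with no adapter, but 7 scattered matching bases, is clipped too.
    print(toy_clip("GGGG" + "ATACCTGCATATC", adapter))
```

The second call is the problem case: the stretch after GGGG matches the adapter at only 7 of 13 positions, yet the read is still cut down to GGGG.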



              • #8
                Not to dredge up old posts, but Jiafen, did you ever solve this problem?

                I am in exactly the same place: I have several libraries that FastQC reports as having over-represented TruSeq adapters.

                My instinctive reaction is to use the FASTX-Toolkit and have it simply discard the sequences that contain the adapters I have found.

                What worries me is keeping the two files in register, because these are paired-end runs. What would you all suggest?

                Should I try to get Trimmomatic installed and working?

                Thanks in advance!
                Last edited by Rzinna; 04-09-2013, 07:14 AM.



                • #9
                  Hi Rzinna,

                  I found two ways to solve the problem.

                  The first and easier way is to download Trimmomatic-0.22 from http://www.usadellab.org/cms/index.php?page=trimmomatic, and follow the examples on the page.

                  The second and more tedious way is to use the FASTX-Toolkit to remove poor-quality reads and then use fastqcombinepairedend_update.py from the Stanford Palumbi Lab to match up the paired-end reads afterwards (http://sfg.stanford.edu/scripts.html).

                  With method 2 I did not remove the adapters directly, and quite a few adapter sequences were left in the result.
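
For anyone wondering what the re-pairing step in method 2 amounts to, here is a minimal Python sketch. It is not the Palumbi Lab script, just an illustration under my own assumptions: the function names are hypothetical, read IDs are assumed to match apart from a trailing /1 or /2 (or a Casava 1.8 style comment after a space), and all of read 2 is held in memory.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # header, sequence, '+' line, quality


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an iterable of lines."""
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        yield tuple(stripped[i:i + 4])


def pair_key(header: str) -> str:
    """Read ID without the mate suffix ('/1', '/2') or trailing comment."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def resync(r1: List[Record], r2: List[Record]):
    """Split two filtered record lists into still-paired and orphaned reads."""
    r2_by_key: Dict[str, Record] = {pair_key(rec[0]): rec for rec in r2}
    paired1, paired2, single1 = [], [], []
    for rec in r1:
        mate = r2_by_key.pop(pair_key(rec[0]), None)
        if mate is None:
            single1.append(rec)      # mate was discarded by the filter
        else:
            paired1.append(rec)      # both mates survived, keep them in register
            paired2.append(mate)
    single2 = list(r2_by_key.values())  # read-2 records whose mate is gone
    return paired1, paired2, single1, single2
```

The two paired lists come back in the same order, so writing them out line by line gives files that stay in register. To use it on real files, pass open file handles to `read_fastq` and wrap the result in `list()`.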

                  Hope this helps,
                  Jiafen




                  • #10

                    Hi Jiafen,

                    Please, I need some help. I want to trim my reads and remove the adapter sequences.
                    Could you show me your command line? I could not understand the Trimmomatic command line. My reads are Illumina HiSeq 2000 paired-end reads.

                    Thanks



                    • #11
                      Hi aforntacc

                      It also took me a while to get Trimmomatic working. I don't know whether you made the same mistake I did: at first I downloaded only the source, but it is the binary that you need to download to run it.

                      I run the command in a terminal from the folder that contains trimmomatic-0.22.jar. If you are not in that folder, you need to give the path to the jar file. My command line:


                      java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog poole.adaptor.log fastq_afterQualityFilter/poole.1_filtered_stillpaired.fastq fastq_afterQualityFilter/poole.2_filtered_stillpaired.fastq poole.adaptor1.fastq poole.adaptor1.unpair.fastq poole.adaptor2.fastq poole.adaptor2.unpair.fastq ILLUMINACLIP:adaptor2:2:39:29

                      So -trimlog poole.adaptor.log specifies the log file. It is followed by the two paired FASTQ files, and then you specify the adaptors.

                      Hope it helps.





                      • #12
                        I'm sorry Jiafen,
                        but your explanation/command line is incorrect.

                        The correct version of a command line to use is on the trimmomatic web page.



                        Trimmomatic produces 4 output files with the adapter-trimmed sequences (in addition to the .log file), and you need to give Trimmomatic names for the 4 files, in the following order:
                        1. read1-paired: read1, where both reads of the pair survive the trimming;
                        2. read1-unpaired: read1, where only read1 of the pair survives the trimming;
                        3. read2-paired: read2, where both reads of the pair survive the trimming;
                        4. read2-unpaired: read2, where only read2 of the pair survives the trimming.

                        Then, in the ILLUMINACLIP part of the command, you need to specify the name of a FASTA file with the adapter sequences. The current version of Trimmomatic comes with files containing the sequences of the Illumina TruSeq adapters.
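
If it helps to see the argument order spelled out, here is a small Python sketch that only assembles the command (it does not run Trimmomatic). The function name and the output file naming scheme are my own assumptions; the java invocation follows the one posted earlier in this thread for version 0.22, and the three ILLUMINACLIP numbers are the seed mismatches, the palindrome clip threshold and the simple clip threshold.

```python
import shlex
from typing import List, Optional


def trimmomatic_pe_args(jar: str, read1: str, read2: str, out_prefix: str,
                        adapters_fasta: str, clip: str,
                        phred: str = "-phred33",
                        trimlog: Optional[str] = None) -> List[str]:
    """Build a TrimmomaticPE argument list: 2 inputs, 4 outputs, then steps."""
    args = ["java", "-classpath", jar,
            "org.usadellab.trimmomatic.TrimmomaticPE", phred]
    if trimlog:
        args += ["-trimlog", trimlog]
    args += [read1, read2]                    # the two input files
    args += [f"{out_prefix}_1_paired.fastq",      # 1. read1, both mates survive
             f"{out_prefix}_1_unpaired.fastq",    # 2. read1, only read1 survives
             f"{out_prefix}_2_paired.fastq",      # 3. read2, both mates survive
             f"{out_prefix}_2_unpaired.fastq"]    # 4. read2, only read2 survives
    # Adapter FASTA file, then seed mismatches:palindrome:simple thresholds.
    args.append(f"ILLUMINACLIP:{adapters_fasta}:{clip}")
    return args


if __name__ == "__main__":
    # Values copied from the command posted above; not a recommendation.
    cmd = trimmomatic_pe_args("trimmomatic-0.22.jar", "reads_1.fastq",
                              "reads_2.fastq", "trimmed", "adaptor2",
                              clip="2:39:29", trimlog="trim.log")
    print(" ".join(shlex.quote(a) for a in cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.run` once the file names are real. Check the quality encoding of your own data before keeping `-phred33`.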



                        • #13
                          You are right, Mastal. My explanation was wrong, although my command line is correct: after the two original FASTQ files, I did list four output files. Aforntacc, I am sorry for the confusion. The adaptor list is in the file adaptor2, right after ILLUMINACLIP.

                          Thank you, Mastal.
                          Jiafen



                          • #14
                            OK guys, I am very grateful for all your help. I trimmed successfully, but now I have another issue.
                            I want to know what value of -r/--mate-inner-dist to use for TopHat. I ran a subset of my reads through Bowtie 2 and looked at the output file (.sam). I understand that the 9th field gives the fragment length, from which I can subtract the read length twice. But I am seeing different values in the 9th field, so which fragment length should I choose? See the SAM alignment lines below.

                            [samopen] SAM header is present: 23720 sequences.
                            HWI-ST365_0157:7:1101:1975:2074#GCGGTC 77 * 0 0 * * 0 0 CCCANCTTCACACTCAAAATTTGTTCTGTAGTGTTTGACCATACACAACTTTTGTTTCTTTTGTTACAAAAGTATTGTATAATTGGAACTAAACAAAGGC _bbeBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP

                            HWI-ST365_0157:7:1101:1975:2074#GCGGTC 141 * 0 0 * * 0 0 TCAAAACACATGCACCCACTAGCTTCCTTGGAAAAANA a__ecceeeggggfffdgggfgeghdegfghffebgBB YT:Z:UP

                            HWI-ST365_0157:7:1101:1991:2120#GCGGTC 83 gi|359484478|ref|XM_002281860.2| 952 42 100M = 888 -164 TGCTTCACGACTGAGTAGTGACGACAAACATGCACGTTGCAGAATGAAATGACTTAAATTACCCTTGTTTTTATATGATTTGGAGTTAATGTAATGGGTG cccabcaab`dbdccdeeeegeggdgfbhihiihhihhiihhhihiihiihihggfhhhdciiihiiiiiiiiiiiiiihiiiiiiigggggeeeeebbb AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:73A26 YS:i:0 YT:Z:CP

                            HWI-ST365_0157:7:1101:1991:2120#GCGGTC 163 gi|359484478|ref|XM_002281860.2| 888 42 40M = 952 164 GTACTTCCTACCATATGCGATGGGCATTCTCGTCATTTTA aabeeeeegggggifhiiiiiihifgggfh[egfgfhiic AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:40 YS:i:-6 YT:Z:CP

                            HWI-ST365_0157:7:1101:2000:2177#GCGGTC 77 * 0 0 * * 0 0 CGAGAATCGGCTGGGGGCTTGGGTGATGGTCACTTGTTTCCAGACGTCTTGCATTTTGTCTGCCATGGTTTTGGCAGGGATGGCTTGGAGACGCCGTTGG ^__cccc]ce[ceefhW^^bU_R^UHONIOIONNaSM\bS_ecHLZ_N\``dHVZ_Y^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP

                            HWI-ST365_0157:7:1101:2000:2177#GCGGTC 141 * 0 0 * * 0 0 CGGATAAATCACATACAAAAGCAATGGCACAGCAATCAAGGNCNCTA a__ecccae]bg]c`dffha_dReed`R[bfH_aHPY^fgeBBBBBB YT:Z:UP

                            HWI-ST365_0157:7:1101:2242:2071#GCGGTC 83 gi|359490881|ref|XM_002277964.2| 1169 40 100M = 1075 -194 GCATCACAATCAAGCCCCAAACTGATCGATGGGTTTTCCCAGAAACCAACAGTGGCATCATTATTCTCGCTGAGGGACGACTGATGAACTTAGGANGTGC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBc___ AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:95T4 YS:i:-35 YT:Z:CP

                            HWI-ST365_0157:7:1101:2242:2071#GCGGTC 163 gi|359490881|ref|XM_002277964.2| 1075 40 100M = 1169 194 AAGCAGATGAAGAACAATGCAATTGTTTGTAACATTGGCCAGTTTGACAATGAGATTGGTATGGTTGGTTTTGAAACCTACCCTGGGGTTAAGGGCATCA a__`ccccagecgda`[`beccegdedfJ`dgf[d[I^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-35 XN:i:0 XM:i:7 XO:i:0 XG:i:0 NM:i:7 MD:Z:41C16A4C5C1G14T6C6 YS:i:-1 YT:Z:CP

                            HWI-ST365_0157:7:1101:2219:2076#GCGGGC 153 gi|225443436|ref|XM_002269813.1| 66 0 100M = 66 0 CGGGCCGGTCTGGACGTCCGCCCCCCCCCACGGCACCCAAAAGGAGCGCAACCAGGTCGATGAAGCTCGCCGGAGCTTCGCCGACCTCCGGTTCGAGAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBa`[_WZRffaf`ac`dd^_ihhgggeaecaee_a_ AS:i:-45 XN:i:0 XM:i:9 XO:i:0 XG:i:0 NM:i:9 MD:Z:1C3A7G0A2T2A5A2T10G59 YT:Z:UP

                            HWI-ST365_0157:7:1101:2219:2076#GCGGGC 69 gi|225443436|ref|XM_002269813.1| 66 0 * = 66 0 CAGANAGACAGTTGAGAGTTGAAACTAAATTGTATAATGTGGAAGCTGAAGGTGGCCGAAGGGGACGATCCCTGGCTCAGGACGTTGAACGGCCACGGCG _bbeBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
                            bilbo@ubuntu:/media/My Passport/Trimmomatic-0.30$
                            thanks in advance



                            • #15
                              There's going to be a range of values, since having all the fragments exactly the same size would be very unusual (just think about how the libraries are actually made to see why). I think one of the Picard tools can report the actual average fragment length (otherwise, that's trivial to script). Keep in mind that it's probably best to over-estimate the value a bit. I also recall that TopHat re-estimates this during alignment, so setting an exact value is probably not overly important (presumably the library was run on a Bioanalyzer at some point, so just use the appropriate value from that).
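
As a rough idea of what scripting the average would look like, here is a minimal Python sketch. The function names are hypothetical, it assumes Bowtie 2 style SAM where field 9 (TLEN) is positive for the leftmost mate and flag bit 0x2 marks properly paired reads, and it uses the "fragment length minus twice the read length" rule from the question above.

```python
import statistics
from typing import Iterable, List


def fragment_lengths(sam_lines: Iterable[str]) -> List[int]:
    """Collect positive TLEN values (SAM field 9) from properly paired reads."""
    lengths = []
    for line in sam_lines:
        fields = line.split()  # SAM is tab-separated; pasted text may use spaces
        # Skip header lines, log messages and anything that is not an alignment.
        if line.startswith("@") or len(fields) < 11 or not fields[1].isdigit():
            continue
        flag = int(fields[1])
        try:
            tlen = int(fields[8])
        except ValueError:
            continue
        # 0x2 = properly paired; TLEN > 0 counts each pair exactly once.
        if flag & 0x2 and tlen > 0:
            lengths.append(tlen)
    return lengths


def mate_inner_dist(mean_fragment: float, read_length: int) -> int:
    """Inner distance = fragment length minus the two read lengths."""
    return round(mean_fragment - 2 * read_length)


if __name__ == "__main__":
    import sys
    lengths = fragment_lengths(sys.stdin)
    if lengths:
        mean = statistics.mean(lengths)
        print("pairs:", len(lengths), "mean fragment length:", mean)
        if len(lengths) > 1:
            print("standard deviation:", statistics.stdev(lengths))
```

Something like `python fraglen.py < subset.sam` (the script and file names are placeholders) would print the mean, which can then go into `mate_inner_dist` together with your read length. With trimmed reads of varying length the result is only approximate, which is one more reason not to worry about an exact value.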

                              BTW, the "BBBBBB" stretches are very low-quality sequence; you should probably trim those.
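
A minimal sketch of that trimming in Python, in case it is useful. It assumes the older Illumina phred+64 quality encoding, where a run of 'B' at the 3' end of a read marks bases that should not be trusted, and it works on the sequence and quality strings of a FASTQ record (not on SAM lines, where reverse-strand reads are shown reverse-complemented). The function name is made up.

```python
from typing import Tuple


def trim_b_tail(seq: str, qual: str) -> Tuple[str, str]:
    """Drop the trailing run of 'B' quality characters and the matching bases."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    keep = len(qual.rstrip("B"))  # number of bases before the 'B' tail
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    seq, qual = trim_b_tail("ACGTACGTAC", "hhhgfeBBBB")
    print(seq, qual)
```

Reads that come back empty or very short should be dropped, and for paired-end data the mates then have to be kept in register again. A quality trimmer that handles pairs does all of this in one pass.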
