  • How to filter low-quality reads?

    Dear all,
    I read a paper in which they analyze small RNA-seq data.
    As a first step, they filter out low-quality reads.
    I know that we can use FastQC to evaluate the quality of RNA-seq data, but which tool can I use to remove low-quality reads? Which threshold should I use, and how do I choose one?
    Thanks in advance for your reply !

  • #2
    If you're using single-end reads, the fastx toolkit is simple and easy to use. You can filter on any quality score you want in several ways, e.g. the average quality score of the read must be at least X, or at least Y bases must be at or above quality Z. It is pretty friendly, but it doesn't handle paired reads well: it removes one read of a pair without flagging the surviving mate (whose quality is above the cutoff) as now being a single-end read. I've used sickle for trimming PE reads and highly recommend it. I believe the same group at UC Davis has a paired-read filtering script too.
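
The two filtering rules described above (average quality at least X, or at least Y bases at or above quality Z) can be sketched in a few lines of Python. This is only an illustration of the idea, not how the fastx toolkit is implemented; the function names are made up for this example, and a Phred+33 quality encoding is assumed.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality_line]


def passes_mean_quality(quality_line, min_mean, offset=33):
    """True if the average quality of the read is at least min_mean (the 'X' rule)."""
    scores = phred_scores(quality_line, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


def passes_min_good_bases(quality_line, min_bases, min_quality, offset=33):
    """True if at least min_bases bases have quality >= min_quality (the 'Y at Z' rule)."""
    scores = phred_scores(quality_line, offset)
    return sum(1 for q in scores if q >= min_quality) >= min_bases


def filter_fastq(lines, keep):
    """Yield 4-line FASTQ records (as lists) whose quality line satisfies keep()."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if keep(record[3]):
                yield record
            record = []
```

For example, `filter_fastq(open("reads.fastq"), lambda q: passes_mean_quality(q, 20))` would keep only reads with a mean quality of 20 or more (`reads.fastq` is a placeholder file name). This handles single-end data only.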

    • #3
      Check out PRINSEQ, it has a stand alone and web-based version.

      • #4
        I have some low quality PE fastq files (read 1 and read 2 separate files). Will fastx toolkit work for these samples?
        Thanks,

        • #5
          Read up on Trimmomatic before you start using fastx toolkit on PE sequences. fastx toolkit can be used with PE but you'll need to do more work in order to keep the reads matching between R1 and R2.

          • #6
            Hi,

            I tried Trimmomatic but could not succeed, I'm getting empty files. I used the following command:
            java -classpath trimmomatic-0.15.jar org.usadellab.trimmomatic.TrimmomaticPE read1.fastq.gz read2.fastq.gz read1_forward_paired.fastq.gz read1_forward_unpaired.fastq.gz read2_reverse_paired.fastq.gz read2_reverse_unpaired.fastq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

            I have some questions about the command I used:
            1) The input is read1 and read2, so why do we need unpaired output files (here read1_forward_unpaired.fastq.gz and read2_reverse_unpaired.fastq.gz)?
            2) I didn't use Illumina clipping, as I don't have this!
            3) Judging from the FastQC graphs, which quality threshold should I use?

            Where am I making the mistake?
            Thanks,

            • #7
              Originally posted by tahamasoodi

              Where am I making the mistake?
              If I had to make a single guess, I would bet that you should use the most recent version (0.22) instead of version 0.15.

              Are there no log files or error messages you can examine?

              As for the unpaired output files: they exist so that reads whose mate was dropped (for example, because it fell below MINLEN after trimming) have a place to go.
              Last edited by westerman; 11-27-2012, 11:18 AM. Reason: Added a question mark.
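
The role of the four output files can be illustrated with a small Python sketch. It is not Trimmomatic's code; the function name, the bucket names, and the tuple layout of a read are assumptions made for this example.

```python
def split_pairs(pairs, keep):
    """Route each (read1, read2) pair into paired or unpaired buckets.

    pairs: iterable of (read1, read2); each read is a (name, sequence, quality) tuple.
    keep:  function that takes one read and returns True if it survives filtering.
    """
    out = {"paired_1": [], "paired_2": [], "unpaired_1": [], "unpaired_2": []}
    for read1, read2 in pairs:
        ok1, ok2 = keep(read1), keep(read2)
        if ok1 and ok2:
            # Both mates survive: they stay in sync in the two paired outputs.
            out["paired_1"].append(read1)
            out["paired_2"].append(read2)
        elif ok1:
            # Only the forward read survives: it has lost its mate.
            out["unpaired_1"].append(read1)
        elif ok2:
            # Only the reverse read survives.
            out["unpaired_2"].append(read2)
        # If neither read survives, the whole pair is dropped.
    return out
```

With a length-based `keep` such as `lambda read: len(read[1]) >= 36` (mirroring MINLEN:36), the two paired buckets always hold the same number of reads in the same order, which is what paired-end aligners expect.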

              • #8
                I'm at home now; tomorrow I'll let you know the version. There are no error messages, just 4 empty output files after 5-8 minutes.

                Also, I'm using bwa for the alignment. If I pass -q 15 or -q 20, isn't that enough to keep low-quality sequences from being aligned to the reference genome?
                Thanks,

                • #9
                  Originally posted by tahamasoodi
                  Also, I'm using bwa for the alignment. If I pass -q 15 or -q 20, isn't that enough to keep low-quality sequences from being aligned to the reference genome?
                  That was not your original question, which was:

                  Will fastx toolkit work for these samples?
                  Yes, you could use BWA with a high -q value, or bowtie2 with or without the qseq quality filtering. In either case you could do further processing on the resulting BAM file to get rid of poorly mapping reads.

                  • #10
                    Because of fastx toolkit failure for PE data and some error in Trimmomatic, the question has changed to bwa. Is -q 15 ok for the attached data?

                    [Attached images (FastQC per-base quality plots): per_base_quality.png, per_base_quality1.png, per_base_quality2.png, per_base_quality3.png]

                    What further processing is needed for BAM files?
                    Thanks,

                    • #11
                      Sure. -q 15 will get rid of the bad parts of your reads. Personally I like a -q 20 cutoff but it does depend on what you want to retain. For mapping to a known reference (instead of de-novo work) a poor quality dataset can be tolerated.

                      The BAM file will have a MAPQ (mapping quality) score which can be used to get rid of reads that do not map very well either due to q-score, repeat region, etc.
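
As a rough illustration of MAPQ filtering, the sketch below works on SAM text (the plain-text form of a BAM file), where MAPQ is the fifth tab-separated column. The function name and the cutoff are choices made for this example; in practice `samtools view -q` applies the same kind of threshold directly to BAM files.

```python
def filter_sam_by_mapq(sam_lines, min_mapq):
    """Yield header lines and alignments whose MAPQ (5th column) is >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # A SAM alignment has 11 mandatory fields; skip anything shorter.
            continue
        if int(fields[4]) >= min_mapq:
            yield line
```

Which cutoff to use depends on the aligner and the analysis, so treat any particular value as something to check against your own data.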

                      • #12
                        Thanks a lot!
                        Another question: it is not related to this thread, but I thought this was the best time to ask. How can we calculate the following?

                        Reads mapped to the human genome
                        Reads mapped to the target regions (exome)
                        Coverage of target regions at 1x
                        Coverage of target regions at 10x
                        Coverage of target regions at 50x

                        Thanks again,
                        Thanks,
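
For orientation only: "coverage of target regions at Nx" is usually understood as the fraction of target bases covered by at least N reads. The sketch below assumes you already have one depth value per target base (for instance from a per-base depth tool restricted to the exome targets); the function name and default thresholds are made up for this example.

```python
def breadth_of_coverage(depths, thresholds=(1, 10, 50)):
    """Return {threshold: fraction of positions with depth >= threshold}."""
    depths = list(depths)
    total = len(depths)
    if total == 0:
        # No target positions: report zero coverage rather than dividing by zero.
        return {t: 0.0 for t in thresholds}
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}
```

Positions with no reads must be included as zeros in `depths`; otherwise the fractions are overestimated.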

                        • #13
                          Hi Westerman,

                          I'm using the latest version of Trimmomatic but still getting the empty output files. My syntax is:

                          root@taha-MacPro:~/Softwares/Trimmomatic-0.22# java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE /home/taha/Desktop/Sample_lane5/lane5_NoIndex_L005_R1_001.fastq.gz /home/taha/Desktop/Sample_lane5/lane5_NoIndex_L005_R2_001.fastq.gz /home/taha/Desktop/Sample_lane5/123_paired.fastq.gz /home/taha/Desktop/Sample_lane5/123_unpaired.fastq.gz /home/taha/Desktop/Sample_lane5/123_pairedR2.fastq.gz /home/taha/Desktop/Sample_lane5/123_unpairedR2.fastq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15

                          Final message is:
                          Input Read Pairs: 87970561 Both Surviving: 0 (0.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 87970561 (100.00%)
                          TrimmomaticPE: Completed successfully

                          How can I fix this?
                          Thanks,

                          • #14
                            Thanks for the complete report. I am guessing here but perhaps you need to indicate the proper quality scoring? I suspect that you are using the latest Illumina technology and thus should add '-phred33' to the command line.
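
One way to check which quality encoding a file uses is to look at the range of characters in its quality lines. The sketch below is a common heuristic, not part of Trimmomatic; the function name and the cutoffs (ASCII 59 and 74) are assumptions based on the usual Phred+33 and Phred+64 character ranges.

```python
def guess_phred_offset(quality_lines):
    """Guess the FASTQ quality offset (33 or 64) from observed quality characters.

    Returns 33 if any character is below ASCII 59 (too low for Phred+64 data),
    64 if any character is above ASCII 74 (unusual for Phred+33 Illumina data),
    and None if the input is empty or the range is ambiguous.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None
```

If Phred+33 data is decoded with an offset of 64, most scores come out far too low or even negative (for example `ord("5") - 64` is -11), which would be consistent with every read being dropped.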

                            • #15
                              Hi Westerman,

                              Fantastic, it is now running. I'll let you know again when it finishes.

                              Thanks,
                              Thanks,
