
  • fastq format

    Hi Everyone,
    I have metagenomics (16S) sequence reads, about 200 million in total. I want to filter out some reads by quality score, but I cannot work out how to do it. My reads look like this:

    @HWI-ST1035:115:C0RG7ACXX:5:1101:1481:2050 1:N:0:
    TACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGCAGGCGGGCAATTGAGTCAGGGGTGAAATGGTGCGGCTCAACCGTAGCACTGCCCTTGATACTGGTTGTCTTGAGTCATTGTGAAGTGGCCGGAATATGTAGG
    +
    B@CFFFFFHCCFHIJJJIJIIJJJJFHIJJJJJJJ75CGIIJII6=CHFFDDDDD557(9+>>A3>9?B9;?CCACD>>C>BBDDDCDAB?(<(+:@CDD849A:CCC>CC44<5@CAC:>:4:>(++:3(:+:>?<<B59?@DCE###
    @HWI-ST1035:115:C0RG7ACXX:5:1101:1363:2055 1:N:0:
    TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNGGN
    +
    @@@DFFFF?DFH<FHIIGIIIIIIIHEHGIIIIIIIIFFHHIII<CFH@EHI###################################################################################################
    @HWI-ST1035:115:C0RG7ACXX:5:1101:1497:2056 1:N:0:
    TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCGGCTGTGAAAACCAGCAGCTCAACTGTTGGCTTGCAGGCGATACGGGAAGACTTGAGTATTTCAGGGGGGACTGGAATTCCTGGG
    +
    ?@@D4ADDFA<DDFGIIGI6GHIIIIIIIICGIIIDFFFIII<FCCFF;EIFFFC>8;=@:8(85>7&08;?ADDB(9<2<?<@BBBBBB:A::AB:>?BBAA<.959>B<BB5&8+9ABBBB@###########################


    Could someone please help me calculate the quality score of each read? I have searched a lot but could not find any information. This is my first time dealing with such data.
    Thanks for any help!

  • #2
    There is more than one way to do this. Here are a couple of options.

    If you are comfortable on the command line in a UNIX environment, try the FASTX-Toolkit from the Hannon lab: http://hannonlab.cshl.edu/fastx_toolkit/

    If you would rather use a GUI, try Galaxy: https://main.g2.bx.psu.edu/ A tutorial with a metagenomics example is available here: https://main.g2.bx.psu.edu/u/james/p...e-metagenomics It is written for 454 data, but it will give you an idea of how to use the tools.

    There are also several video screencasts available for individual tools: http://wiki.g2.bx.psu.edu/Learn/Screencasts
    Last edited by GenoMax; 09-21-2012, 08:26 AM.



    • #3
      Thanks, GenoMax, for the help.
      I tried fastx-toolkit but I am getting an error. I used the command
      fastx_quality_stats -i trial.fastq -o abc.tx
      and the error I get is
      Invalid quality score value (char '#' ord 35 quality value -29) on line 4


      My input file looks like this:
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1216:2040 1:N:0:
      NACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
      +
      #1=DDDFFCADHHIIIIIIIIGIIIIIIIGIIIIIIIFHIIIIICFHH#######################################################################################################
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1208:2076 1:N:0:
      TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNN
      +
      @@CFFFFF2ACFDHIIIFHGIHHIIIIIIGHIIIIHIDHIHIII;FGI#######################################################################################################
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1168:2119 1:N:0:
      TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNN
      +
      @@@FADDFH<CFHBIGHEHHIIIJJ<@GIHHIIJJIJ;FGIHJJHHGF#######################################################################################################
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1173:2185 1:N:0:
      TACGTAGGGGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNN
      +
      ?@@D=BDD+@DFAFIIIEFIACGFIFGICIBFFIIEI;FFIIEFFBCB#######################################################################################################
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1155:2196 1:N:0:
      TACGTAGGGGGCAAGCGTTAATCGGAATTACTGGNCGNNNNNNNNGCGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNN
      +
      ?@@DDDDD=)@DFBFGICGFE?BGIFBFIFEFII#####################################################################################################################
      @HWI-ST1035:115:C0RG7ACXX:5:1101:1183:2201 1:N:0:
      TACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGNNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNN
      +
      @@@DDDDDF2AF1EGIIGIIIIGIIIII<FFFIIIIIAEFIFFFF?DD#######################################################################################################

      Please help, I have been trying to solve this for many days...



      • #4
        See this thread, where kmcarr suggests a newer tool for trimming: http://seqanswers.com/forums/showthr...tx+toolkit+Q33

        The thread also mentions the -Q 33 option, which tells fastx toolkit that your reads are in Sanger FASTQ quality format (ASCII offset 33).

        Your command would become: fastx_quality_stats -Q 33 -i trial.fastq -o abc.txt
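
To see why the error above reports a quality value of -29, here is a minimal Python sketch (not part of fastx toolkit) that decodes a quality character under both common ASCII offsets. The function names `decode_quality` and `guess_offset` are hypothetical helpers made up for this illustration. The heuristic in `guess_offset` relies on the fact that no offset-64 encoding uses characters below ASCII 59, so any such character implies offset 33.

```python
def decode_quality(qual, offset):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual_lines):
    """Return 33 if any character can only occur in an offset-33 file.

    Returns None when the lines seen so far are ambiguous.
    """
    for qual in qual_lines:
        if any(ord(ch) < 59 for ch in qual):
            return 33
    return None


# '#' is ASCII 35: read with offset 64 it gives the impossible value -29,
# read with offset 33 (what -Q 33 selects) it gives a quality of 2.
print(decode_quality("#", 64))  # [-29]
print(decode_quality("#", 33))  # [2]
print(guess_offset(["#1=DDDFFCADHHIIII###"]))  # 33
```

The first quality line in the posted input starts with '#', which is consistent with the error being raised on line 4 of the file.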
        Last edited by GenoMax; 09-27-2012, 08:34 AM.



        • #5
          Thanks, GenoMax.
          I used the command
          fastx_quality_stats -Q 33 -i lane5_NoIndex_L005_R1_001.fastq -o quality_score.txt
          It has been running for more than an hour now without any error, but the output file, quality_score.txt, is still empty. Is something wrong?
          Please help!



          • #6
            Hi,
            I finally got a result in my output file, but it has only 151 rows. I expected a quality score for each read, but I got just 151 results (my read length is 151). So it seems to report the quality score for each base position across all the reads. How can I use this result to filter out reads with poor quality scores?
            I am very confused.



            • #7
              fastx_quality_stats yields per-position metrics (i.e., aggregated across all reads at each base position) rather than the per-read scores you want. There may be pre-written scripts that do what you want (it would be easy to write one), but I don't know of any. One alternative is simply to quality trim and then discard reads below some length (12 bp, or whatever your aligner says is the minimum). At the end of the day, that's what you're really interested in anyway, since you have a bunch of Ns due to short transcripts.
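
As a rough illustration of the "easy to write" per-read script mentioned above, here is a minimal Python sketch that computes the mean quality of each read and keeps only reads at or above a cutoff. It assumes 4-line FASTQ records and ASCII offset 33; the function names and the cutoff of 20 are arbitrary choices for this example, not recommendations.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def mean_quality(qual, offset=33):
    """Arithmetic mean of the scores in one quality string."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_by_mean_quality(in_handle, out_handle, min_mean=20.0, offset=33):
    """Write reads whose mean quality is >= min_mean; return (kept, total)."""
    kept = total = 0
    for header, seq, plus, qual in read_fastq(in_handle):
        total += 1
        if mean_quality(qual, offset) >= min_mean:
            out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
    return kept, total


# Tiny in-memory demo: one high-quality read and one all-'#' read.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n")
out = io.StringIO()
kept, total = filter_by_mean_quality(demo, out, min_mean=20.0)
print(f"kept {kept} of {total} reads")  # kept 1 of 2 reads
```

For a real file, the same function can be given handles from `open("in.fastq")` and `open("out.fastq", "w")`. Because it streams one record at a time, memory use stays flat even for 200 million reads, though plain Python will be slow at that scale.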



              • #8
                You probably do not want to focus on the "average" score across an entire read, but rather look at individual base quality scores. As dpryan pointed out, you seem to have a number of Ns (perhaps because the snippet you posted is from the beginning of the file), so you would want to trim those bases out before trying alignments.

                You can do the trimming with the Trimmomatic tool (http://www.usadellab.org/cms/index.php?page=trimmomatic), suggested earlier by kmcarr, which can take quality into account.
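
To make the idea of trimming by individual base quality concrete, here is a simplified Python sketch. It is not Trimmomatic's algorithm, only an illustration in the same spirit: it removes bases from the 3' end while they are below a quality threshold or are N, then drops reads that end up too short. The function names, the threshold of 20, and the minimum length of 36 are assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases off the 3' end while they are low quality or N."""
    end = len(seq)
    while end > 0 and (ord(qual[end - 1]) - offset < min_q or seq[end - 1] == "N"):
        end -= 1
    return seq[:end], qual[:end]


def trim_and_filter(records, min_q=20, min_len=36, offset=33):
    """Yield trimmed (header, seq, qual) records that are still >= min_len long."""
    for header, seq, qual in records:
        tseq, tqual = trim_3prime(seq, qual, min_q, offset)
        if len(tseq) >= min_len:
            yield header, tseq, tqual


# A read with a good 5' end followed by N bases with '#' qualities.
print(trim_3prime("ACGTNN", "IIII##"))  # ('ACGT', 'IIII')
```

Applied to reads like the ones posted above, this would keep the informative 5' portion and discard the long run of N bases scored '#'.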

