
  • FASTQ manipulation in Galaxy

    The Galaxy team wanted to take a moment and highlight several of the FASTQ manipulation tools that are currently available in Galaxy (http://usegalaxy.org). Galaxy provides a Free & Open Environment for NGS analysis (previously announced at: http://seqanswers.com/forums/showthread.php?t=4441).

    As always, we encourage feature requests, comments/suggestions and bug reports ([email protected]). Additional information about the toolset, including a set of screencasts, can be viewed at: http://main.g2.bx.psu.edu/u/dan/p/fastq.

    All of the following tools, unless mentioned otherwise, are found under the NGS: QC and manipulation tool section within Galaxy:

    1. Make FASTQ from FASTA and Quality Score files
    Some sequencing technologies will produce separate files containing sequences and quality scores (e.g. 454). These two separate files can be merged together to create a single FASTQ file. Specifying a quality score file is optional and, when not specified, quality score values will be filled with the maximal allowed quality value.
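    As a rough illustration of the merge described above, here is a minimal Python sketch. It is not the Galaxy tool's code: the function name `make_fastq`, the Sanger-style offset of 33 and the fill value of 93 are assumptions made for this example.

```python
def make_fastq(fasta_records, qual_records=None, max_quality=93, offset=33):
    """Yield FASTQ entries as strings from (name, sequence) pairs.

    qual_records (hypothetical structure) maps each read name to a list of
    integer quality scores. When it is None, every base gets max_quality,
    which is assumed here to be 93, the top of the Sanger range.
    """
    for name, seq in fasta_records:
        if qual_records is None:
            scores = [max_quality] * len(seq)
        else:
            scores = qual_records[name]
            if len(scores) != len(seq):
                raise ValueError("sequence and quality lengths differ for " + name)
        # Encode each score as one ASCII character (score + offset).
        quality = "".join(chr(score + offset) for score in scores)
        yield "@{0}\n{1}\n+\n{2}\n".format(name, seq, quality)


if __name__ == "__main__":
    for entry in make_fastq([("r1", "ACGT")], {"r1": [40, 30, 20, 10]}):
        print(entry, end="")
```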

    2. FASTQ Groomer
    The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The data created by this tool is guaranteed to conform to the target variant specified by the user, including the enforcement of quality score minimums and maximums. After grooming, the user is presented with some information about the input such as ASCII character and decimal value ranges and a list of FASTQ variants for which the input data is actually valid. Although the output created by this tool is now completely valid, if the user has selected the wrong presumed input variant, it is possible for the resultant score values to not reflect the values intended by the sequencing technology. Users should utilize the provided summary information as a sanity check before continuing with their analysis.
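    To make the variant conversion and the sanity check concrete, here is a small Python sketch. It only assumes the commonly used ASCII offsets (64 for Illumina 1.3+ style input, 33 for Sanger output); the function names and the clamping range of 0 to 93 are illustrative and not taken from the Groomer itself.

```python
def groom_quality(quality, in_offset=64, out_offset=33, min_score=0, max_score=93):
    """Re-encode a quality string from one ASCII offset to another.

    Scores are clamped to [min_score, max_score] so the output is always
    valid for the target variant, even if the presumed input variant was
    wrong (in which case the scores themselves will be meaningless).
    """
    scores = [ord(char) - in_offset for char in quality]
    clamped = [min(max(score, min_score), max_score) for score in scores]
    return "".join(chr(score + out_offset) for score in clamped)


def ascii_range(quality_strings):
    """Return the (lowest, highest) ASCII code seen, as a sanity check."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    return min(codes), max(codes)


if __name__ == "__main__":
    reads = ["hhhh", "B@"]
    print(ascii_range(reads))
    print([groom_quality(quality) for quality in reads])
```

    An input whose lowest ASCII code is below 64 cannot be a valid offset-64 file, which is the kind of mismatch the summary information is meant to reveal.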

    3. Quality Statistics
    As quality scores can vary along the length of sequencing reads, determining how to trim and filter read data involves calculating summary statistics on a per column basis. The FASTQ Summary Statistics by column tool accomplishes this task. The output of this tool contains read counts, minimums, maximums, sums, means, quartiles with ranges, outliers and nucleotide counts for each base column in a FASTQ file. This statistical summary can be graphed by using the Boxplot tool, found under the Graph/Display Data tool section.
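    The per-column idea can be sketched in a few lines of Python. This is a simplified illustration, not the tool's implementation: it assumes Sanger-encoded quality strings (offset 33), reports only a subset of the statistics listed above, and the name `column_stats` is made up for the example.

```python
from statistics import mean, median


def column_stats(quality_strings, offset=33):
    """Summarize decoded quality scores for each base column.

    Reads may have different lengths, so later columns can have
    fewer observations than earlier ones.
    """
    columns = {}
    for quality in quality_strings:
        for index, char in enumerate(quality):
            columns.setdefault(index, []).append(ord(char) - offset)
    return [
        {
            "column": index + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for index, scores in sorted(columns.items())
    ]


if __name__ == "__main__":
    for row in column_stats(["II5", "I5"]):
        print(row)
```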

    4. Read Trimmer
    To prevent otherwise high quality reads from being rejected during quality filtering or from influencing the mapping or assembly process, it can be beneficial to trim bases from poor quality ends of reads. The FASTQ Trimmer by column tool allows trimming either end of a set of reads by using absolute offsets or by specifying percentage of read length based offsets. Offsets begin at 0 for each end and increase towards the opposing end of the read. For example, to trim the outer 3 bases from each end of a 36 length sequencing read, a user can specify absolute 5’ and 3’ offsets of 3 or percentage-based offsets of 8.33 (0.0833 * 36 = 2.9988, rounded to the nearest integer = 3).
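    The offset arithmetic in this example can be written out as a short Python sketch. The function name `trim_read` and its parameters are hypothetical, and Python's built-in `round` is used as a stand-in for whatever rounding the tool applies.

```python
def trim_read(seq, qual, left=0, right=0, percent=False):
    """Trim `left` bases from the 5' end and `right` bases from the 3' end.

    With percent=True the offsets are percentages of the read length and
    are rounded to the nearest integer number of bases.
    """
    length = len(seq)
    if percent:
        left = round(length * left / 100.0)
        right = round(length * right / 100.0)
    end = length - right
    # An empty result is returned if the offsets overlap.
    return seq[left:end], qual[left:end]


if __name__ == "__main__":
    seq = "AAA" + "C" * 30 + "GGG"
    qual = "I" * 36
    print(trim_read(seq, qual, 3, 3))
    print(trim_read(seq, qual, 8.33, 8.33, percent=True))
```

    Both calls remove the outer 3 bases from each end of the 36-base read, matching the arithmetic above.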

    5. Quality Filter
    The Filter FASTQ reads by quality score and length tool allows filtering by minimum and maximum read lengths and by minimum and maximum quality score values over the entire read while allowing a configurable number of deviant bases. Complex filters can also be constructed that allow the user to set offsets, just like with the trimmer tool, to use as bounds for performing a selected aggregation action that is compared to a user specified value. Any number of complex filters can be designed and applied to a set of sequencing reads. For example, to only include reads which have no quality score values less than 28 in the first half of a read, a user can use percentage-based offsets of 0 and 50, select the min score aggregation and the greater-than-or-equal to operator (>=) and set a quality score threshold of 28.
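    A complex filter of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's code: quality strings are taken to be Sanger-encoded (offset 33), the offsets are interpreted as in the trimmer (measured inward from each end), and the names `passes_filter`, `AGGREGATES` and `OPERATORS` are invented for the example.

```python
import operator

AGGREGATES = {
    "min": min,
    "max": max,
    "sum": sum,
    "mean": lambda scores: sum(scores) / float(len(scores)),
}
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def passes_filter(qual, left_pct, right_pct, aggregate, op, threshold, offset=33):
    """Apply one aggregation over a window bounded by percentage offsets."""
    length = len(qual)
    start = round(length * left_pct / 100.0)
    end = length - round(length * right_pct / 100.0)
    scores = [ord(char) - offset for char in qual[start:end]]
    if not scores:
        # Assumption for this sketch: an empty window fails the filter.
        return False
    return OPERATORS[op](AGGREGATES[aggregate](scores), threshold)


if __name__ == "__main__":
    # First half high quality, second half low quality.
    read = "I" * 18 + "#" * 18
    print(passes_filter(read, 0, 50, "min", ">=", 28))
```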

    6. FASTQ Manipulation
    Highly configurable complex manipulations can be performed on selected FASTQ reads by using the Manipulate FASTQ reads on various attributes tool. This tool allows the user to define a set of matching criteria to be used to select the reads in a FASTQ file on which to perform a set of manipulations; any number of match directives can be defined and a read must match each directive to be considered for manipulation. Matching is currently limited to user specified regular expressions on sequence identifier/name, sequence content and quality score strings, with defaults set to match all (.*); however, additional matching and manipulation options can be easily implemented as needed. When a read does not match, it will be transferred to the output in an unmodified fashion. Reads which pass all matching criteria are subjected to any number of user specified manipulations. Manipulations are available which act upon sequence identifier/name, sequence content or quality score strings. Beyond allowing the user to remove matching reads or to perform string translations on any of these attributes, additional manipulations are available for sequence content, including: reverse complementing, reversing (without complementing), complementing (without reversing), trimming, in silico transcription of DNA to RNA and vice-versa, as well as changing the adapter base within color space sequences. Additionally, separate tools exist which can convert FASTQ files to-and-from a tabular format; this allows FASTQ data to be modified using any of the powerful text manipulation tools which are prepackaged with Galaxy.
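    The match-then-manipulate flow can be illustrated with a short Python sketch that reverse complements only the reads matching every pattern. The function name `manipulate`, the use of `re.search` and the choice to reverse the quality string along with the sequence are assumptions for this example, not a description of the tool's internals.

```python
import re

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def manipulate(reads, name_pattern=".*", seq_pattern=".*", qual_pattern=".*"):
    """Reverse complement reads that match all three regular expressions.

    reads is an iterable of (name, sequence, quality) tuples. Reads that
    fail any pattern are passed through unmodified.
    """
    for name, seq, qual in reads:
        matched = (
            re.search(name_pattern, name)
            and re.search(seq_pattern, seq)
            and re.search(qual_pattern, qual)
        )
        if matched:
            # Keep qualities aligned with their bases after reversing.
            yield name, seq.translate(COMPLEMENT)[::-1], qual[::-1]
        else:
            yield name, seq, qual


if __name__ == "__main__":
    reads = [("r1/1", "AACG", "IIH#"), ("r1/2", "AACG", "IIH#")]
    for read in manipulate(reads, name_pattern=r"/2$"):
        print(read)
```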

    7. Paired-End Read Splitting and Joining
    FASTQ formatted paired-end sequencing data can come in two common forms: one uses a separate file for each paired-end component, and the other uses a single FASTQ file in which the two ends of each pair have been concatenated to form a single entry. Two tools exist to facilitate the use of this data: FASTQ Joiner on paired end reads and FASTQ Splitter on joined paired end reads. The Joiner tool takes two separate FASTQ files that contain paired end reads and creates a single file. The Splitter tool does the opposite: it takes a single FASTQ file and splits each read in half, creating two separate FASTQ files. When splitting, an identifier suffix is added to each paired end; when joining, these differences in identifiers are taken into account.
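    A minimal Python sketch of joining and splitting is given below. The `/1` and `/2` identifier suffixes, the plain concatenation of the two ends and the function names are assumptions for illustration; the actual tools may handle identifiers and orientation differently.

```python
def strip_suffix(name, suffix):
    """Remove a trailing pair suffix such as /1 or /2 if present."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


def join_pair(read1, read2):
    """Concatenate two (name, sequence, quality) reads into one entry."""
    name1, seq1, qual1 = read1
    name2, seq2, qual2 = read2
    base1 = strip_suffix(name1, "/1")
    base2 = strip_suffix(name2, "/2")
    if base1 != base2:
        raise ValueError("reads do not belong to the same pair")
    return base1, seq1 + seq2, qual1 + qual2


def split_joined(read):
    """Split a joined read in half; both ends must have equal length."""
    name, seq, qual = read
    if len(seq) % 2:
        raise ValueError("joined read has odd length and cannot be halved")
    half = len(seq) // 2
    return (name + "/1", seq[:half], qual[:half]), (name + "/2", seq[half:], qual[half:])


if __name__ == "__main__":
    joined = join_pair(("r1/1", "ACGT", "IIII"), ("r1/2", "TTGG", "HHHH"))
    print(joined)
    print(split_joined(joined))
```

    Because the split point is simply the midpoint, a sketch like this only round-trips correctly when both ends have the same length.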

  • #2
    Does "Paired-End Read Splitting and Joining" work on trimmed reads which may be of differing lengths?

    • #3
      Is there (could there be) a tool for scrubbing low-complexity or otherwise poor/low-information content sequence?

      Specifically, something that would ditch single end reads or both ends of a paired-end read if either end meets the following hypothetical criteria?

      1. >= X% of the read is a single nucleotide (80% of the read is As)
      2. More than X% of the read is Ns.
      3. Low complexity (ATATATATA...)

      Such reads slow down alignments and in many cases, are irrelevant to downstream analyses. Many aligners filter them inherently, but others don't.

      It seems that a Galaxy and command-line analog would benefit folks.

      My devalued 2 cents.
      Aaron

      • #4
        Originally posted by maubp:
        Does "Paired-End Read Splitting and Joining" work on trimmed reads which may be of differing lengths?
        No, currently, Splitting and Joining should only be performed on paired-end reads having equal lengths.

        Would it be useful to add an option to the Trimming tool to allow it to work directly on Joined paired-end reads? This could cause the joined reads to be split in half, where each half is trimmed according to the user specification and then the two trimmed halves are rejoined (similar to an option already available in the filter tool).

        • #5
          Originally posted by quinlana:
          Is there (could there be) a tool for scrubbing low-complexity or otherwise poor/low-information content sequence?

          Specifically, something that would ditch single end reads or both ends of a paired-end read if either end meets the following hypothetical criteria?

          1. >= X% of the read is a single nucleotide (80% of the read is As)
          2. More than X% of the read is Ns.
          3. Low complexity (ATATATATA...)

          Such reads slow down alignments and in many cases, are irrelevant to downstream analyses. Many aligners filter them inherently, but others don't.

          It seems that a Galaxy and command-line analog would benefit folks.

          My devalued 2 cents.
          Aaron
          I like the sound of this and I think it would be a good fit for the Manipulate FASTQ reads on various attributes tool. Match by attribute (e.g. 1-3) with the action of Remove. Let me think about a good way to do this (it can likely already be done by constructing a sufficiently complex regular expression).

          For now, if you are interested, much of this can be done using the various Text Manipulation and filter tools. First, convert the FASTQ data to tabular (this tool is not yet available on the main server, but is on our test server, and will be on the main server after it is next updated). Then use the "Compute an expression on every row" tool to compute the desired value (e.g. float( c2.count('A') ) / float( len( c2 ) ) ), and use the 'Filter data on any column using simple expressions' tool, found under the 'Filter and Sort' tool menu, to filter on the new column (c4). Finally, use the Tabular to FASTQ converter to convert back to FASTQ and Groom your filtered data. These steps could then be built into a workflow, so you wouldn't have to do each step manually each time. This approach is less than ideal, and I'll look into implementing the ability to do this directly on FASTQ files.
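          Outside Galaxy, the same per-read computation can be sketched directly in Python. This is only an illustration of the first two criteria from the quoted post; the function names and the default thresholds (0.8 and 0.1) are hypothetical, and the low-complexity criterion (3) is not covered.

```python
def max_base_fraction(seq):
    """Largest fraction of the read made up of a single nucleotide."""
    return max(float(seq.count(base)) / float(len(seq)) for base in "ACGT")


def n_fraction(seq):
    """Fraction of the read made up of Ns."""
    return float(seq.count("N")) / float(len(seq))


def keep_read(seq, max_single=0.8, max_n=0.1):
    """Return False for reads that meet either removal criterion.

    A read is dropped when one nucleotide makes up max_single or more of
    it, or when more than max_n of it is N. Empty reads are dropped.
    """
    if not seq:
        return False
    return max_base_fraction(seq) < max_single and n_fraction(seq) <= max_n


if __name__ == "__main__":
    for seq in ["AAAAAAAACG", "ACGTACGTAC", "ACGTNNACGT"]:
        print(seq, keep_read(seq))
```

          For paired-end data, both ends would be dropped when `keep_read` returns False for either end.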

          • #6
            Originally posted by blankenberg:
            No, currently, Splitting and Joining should only be performed on paired-end reads having equal lengths.

            Would it be useful to add an option to the Trimming tool to allow it to work directly on Joined paired-end reads? This could cause the joined reads to be split in half, where each half is trimmed according to the user specification and then the two trimmed halves are rejoined (similar to an option already available in the filter tool).
            As long as the documentation on splitting and joining is clear it should be fine. Personally I don't like the join/split approach since it requires all the reads to have the same lengths.

            I'm interested from the point of view of doing trimming and filtering on paired end data. Trimming means the reads will be of different lengths. Filtering may mean that half a pair is lost, making the remaining read effectively a single end read.

            This is further complicated by the fact you can have the forward/reverse pairs interleaved in a single FASTQ file, or in two separate files.

            • #7
              Trouble with FASTQ summary statistics

              I am learning to use Galaxy to analyze RNA-seq data. I ran into trouble when I used FASTQ Summary Statistics to check the data. I received a message like this:
              6: FASTQ Summary Statistics on data 3
              0 bytes
              An error occurred running this job: Traceback (most recent call last):
              File "/galaxy/home/g2main/galaxy_main/tools/fastq/fastq_stats.py", line 48, in <module>
              if __name__ == "__main__": main()
              File "/galaxy/home/g2main/galaxy_main/tools/fastq/fastq_stats.py", line 17, in main
              fo


              Please help me to understand and solve it!
              Thanks in advance,
              VH
              Last edited by vinhha; 11-21-2012, 01:04 AM.
