Help getting fastq_filter.py working on command line?

hlyates

Member

Join Date: Mar 2015
Posts: 29

03-27-2015, 06:11 AM

I want to use Galaxy's fastq_filter tool on the command line.

I already know which inputs fastq_filter.py requires, but I'm not sure how to generate two of them.

After reading the Python and XML files, it looks like the tool expects to be run with a line something like this:

Code:

fastq_filter.py $input_file $fastq_filter_file $output_file $output_file.files_path '${input_file.extension[len( 'fastq' ):]}'

$input_file
$fastq_filter_file: I don't know how to make this.
$output_file
$output_file.files_path: I don't know what this is or how to avoid it.
${input_file.extension[len( 'fastq' ):]}: This seems to be a check on the input file type? Not going to worry about this for now.
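For what it's worth, the last argument is just a Python-style slice: it drops the leading 'fastq' from the datatype name, so a Galaxy datatype such as fastqsanger becomes sanger. Below is a minimal sketch of that slice and of the argument order from the command above. The helper names and all file and directory names are hypothetical, and using a plain scratch directory in place of $output_file.files_path is an unverified assumption.

```python
def fastq_variant(extension):
    """Mimic the expression ${input_file.extension[len('fastq'):]}."""
    return extension[len("fastq"):]


def build_argv(input_path, filter_script_path, output_path, files_path, extension):
    """Assemble the arguments in the same order as the command shown above."""
    return [
        "fastq_filter.py",
        input_path,
        filter_script_path,   # the generated filter file ($fastq_filter_file)
        output_path,
        files_path,           # assumption: a scratch directory stands in here
        fastq_variant(extension),
    ]


# Hypothetical file names, for illustration only.
argv = build_argv("reads.fastq", "filter_function.py", "filtered.fastq",
                  "scratch_dir", "fastqsanger")
print(argv)
```

This only builds the argument list; it does not run the tool, and whether fastq_filter.py accepts a hand-made scratch directory would need to be checked against the script itself.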

The fastq_filter.py side of this is interesting. The tool contains something like this:

Code:

def fastq_read_pass_filter( fastq_read ):
     def mean( score_list ):
         return float( sum( score_list ) ) / float( len( score_list ) )
     if len( fastq_read ) < $min_size:
         return False
     if $max_size > 0 and len( fastq_read ) > $max_size:
         return False
     num_deviates = $max_num_deviants
     qual_scores = fastq_read.get_decimal_quality_scores()
     for qual_score in qual_scores:
         if qual_score < $min_quality or ( $max_quality > 0 and qual_score > $max_quality ):
             if num_deviates == 0:
                 return False
             else:
                 num_deviates -= 1
 #if not $paired_end:
     qual_scores_split = [ qual_scores ]
 #else:
     qual_scores_split = [ qual_scores[ 0:int( len( qual_scores ) / 2 ) ], qual_scores[ int( len( qual_scores ) / 2 ): ] ]
 #end if
 #for $fastq_filter in $fastq_filters:
     for split_scores in qual_scores_split:
         left_column_offset = $fastq_filter[ 'offset_type' ][ 'left_column_offset' ]
         right_column_offset = $fastq_filter[ 'offset_type' ][ 'right_column_offset' ]
 #if $fastq_filter[ 'offset_type' ]['base_offset_type'] == 'offsets_percent':
         left_column_offset = int( round( float( left_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
         right_column_offset = int( round( float( right_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
 #end if
         if right_column_offset > 0:
             split_scores = split_scores[ left_column_offset:-right_column_offset]
         else:
             split_scores = split_scores[ left_column_offset:]
         if split_scores: ##if a read doesn't have enough columns, it passes by default
             if not ( ${fastq_filter[ 'score_operation' ]}( split_scores ) $fastq_filter[ 'score_comparison' ] $fastq_filter[ 'score' ]  ):
                 return False
 #end for
     return True

Is that Python? Is this how the XML turns user input into a filter script? Someone suggested I use the Galaxy API for this, but that might be just as much work to set up as getting this script to run. I'm not opposed to it, but I want to take the easy way out, because I think this is the last Galaxy tool I have to run in my analysis before I move on to other things.

Any help and assistance would be appreciated.

Last edited by hlyates; 03-27-2015, 06:12 AM. Reason: Added tags

Tags: command line, command line tool, fastq, galaxy, scripts

maubp

Peter (Biopython etc)

Join Date: Jul 2009

Posts: 1543
#2

03-29-2015, 09:18 AM

The development repository is here:

https://github.com/galaxyproject/tools-devteam/tree/master/tool_collections/galaxy_sequence_utils/fastq_filter

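Correction: The code you quoted is from the <configfile> snippet in the tool's XML. It is not plain Python but a template written in Cheetah, a Python-like templating language. Galaxy fills in the $variables and evaluates the #if / #for directives to produce the filter file.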
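To make that concrete, here is a rough sketch of what the template might render to once the directives are evaluated and the placeholders are replaced. The parameter values are hypothetical (min_size=20, max_size=0, min_quality=0, max_quality=0, max_num_deviants=0, paired_end off, and one filter with absolute offsets of 0 requiring min(...) >= 20), and StubRead is a made-up stand-in for the real read object, included only so the function can be tried on its own.

```python
# Hand-rendered from the template under the hypothetical values named above.
def fastq_read_pass_filter(fastq_read):
    def mean(score_list):
        return float(sum(score_list)) / float(len(score_list))
    if len(fastq_read) < 20:                      # $min_size
        return False
    if 0 > 0 and len(fastq_read) > 0:             # $max_size of 0 disables this check
        return False
    num_deviates = 0                              # $max_num_deviants
    qual_scores = fastq_read.get_decimal_quality_scores()
    for qual_score in qual_scores:
        if qual_score < 0 or (0 > 0 and qual_score > 0):
            if num_deviates == 0:
                return False
            else:
                num_deviates -= 1
    qual_scores_split = [qual_scores]             # the "#if not $paired_end" branch
    for split_scores in qual_scores_split:        # one pass per entry in $fastq_filters
        left_column_offset = 0
        right_column_offset = 0
        if right_column_offset > 0:
            split_scores = split_scores[left_column_offset:-right_column_offset]
        else:
            split_scores = split_scores[left_column_offset:]
        if split_scores:
            if not (min(split_scores) >= 20):     # score_operation, comparison, score
                return False
    return True


class StubRead:
    """Hypothetical stand-in for the read object; not the real Galaxy class."""

    def __init__(self, scores):
        self.scores = list(scores)

    def __len__(self):
        return len(self.scores)

    def get_decimal_quality_scores(self):
        return self.scores


print(fastq_read_pass_filter(StubRead([30] * 25)))         # long enough, all scores high
print(fastq_read_pass_filter(StubRead([30] * 24 + [10])))  # one score below 20
```

Saving just the function to a file and passing that path as $fastq_filter_file might be enough for command-line use, but how fastq_filter.py actually loads the file should be confirmed in its source.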

Last edited by maubp; 03-29-2015, 09:20 AM. Reason: correction