Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • vinay427
    Member
    • Jul 2012
    • 10

    Useful bioinformatics tool ideas?

    Hi,

    I am working at a lab for the next several weeks, and as I have strong experience in the computer side of things, I wanted to develop a tool that would be useful. My initial idea is to compile various tools such as Trimmomatic, tophat, bowtie2, cufflinks, samtools, prinseq-lite, and prinseq-graph to make a sort of "one-stop shop" script for those assembling and visually representing a paired end FastQ file. I had issues learning about how these tools fit together, and figured that this script might help do so. I've attached it below, but it's very dysfunctional as it's not nearly finished. However, it gives you an idea of what it does. I'm not quite sure if this niche is already covered by another (GPL) tool? Do any of you have an idea for something that is greatly needed and would hopefully prove very helpful in this community?

    Code:
    #!/bin/bash
    
    ############################
    # TITLE
    clear
    echo "FastQ Paired/Single End Sequence Analyzer"
    echo "Licensed under GPL"
    echo
    
    # GET MAIN INFORMATION
    echo -n "Enter path of first FastQ file (without extension): "
    read -e FASTQ1
    echo -n "Enter path of second FastQ file (without extension): "
    read -e FASTQ2
    echo
    ../$FASTQ1=FASTQ_DIR
    echo -n "Enter path of desired paired read 1 file: "
    read -e P_R1
    echo -n "Enter path of desired UNpaired read 1 file: "
    read -e U_R1
    echo -n "Enter path of desired paired read 2 file: "
    read -e P_R2
    echo -n "Enter path of desired UNpaired read 2 file: "
    read -e U_R2
    echo -n "Enter path of desired trim log: "
    read -e TRIM_LOG
    echo -n "Enter number of threads to be used: "
    read -e THREADS
    
    # QUALITY MENU
    PS3='Pick a quality scale: '
    phred_options=("phred33" "phred64")
    select PHRED_CHOICE in "${phred_options[@]}"
    do
        case $PHRED_CHOICE in
        	"phred33")
    			PHRED=phred33
    			break
    			;;
    		"phred64")
    			PHRED=phred64
    			break
    			;;
    	esac
    done
    
    #################################
    # MAIN ACTION MENU FROM TRIMMOMATIC
    PS3='Pick a step: '
    select TRIM_CHOICE in ILLUMINACLIP SLIDINGWINDOW LEADING TRAILING CROP HEADCROP MINLENGTH TOPHRED33 TOPHRED64 SKIP/CONTINUE
    do
    	case "$TRIM_CHOICE" in
    		"ILLUMINACLIP")
    			echo "Starting"
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS $PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 ILLUMINACLIP
    			;;
    		"SLIDINGWINDOW")
    			echo -n "Specify window size (number of bases to average across): "
    			read -e SLIDINGWINDOW_WINDOW
    			echo -n "Specify required quality: "
    			read -e SLIDINGWINDOW_QUALITY
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 SLIDINGWINDOW:$SLIDINGWINDOW_WINDOW:$SLIDINGWINDOW_QUALITY
    			;;
    		"LEADING")
    			echo -n "Specify min quality: "
    			read -e LEADING_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 LEADING:$LEADING_VALUE
    			;;
    		"TRAILING")
    			echo -n "Specify min quality"
    			read -e TRAILING_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting"
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TRAILING:$TRAILING_VALUE
    			;;
    		"CROP")
    			echo -n "Specify number of bases to keep (from start of read): "
    			read -e CROP_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 CROP:$CROP_VALUE
    			;;
    		"HEADCROP")
    			echo -n "Specify number of bases to remove (from start of read): "
    			read -e HEADCROP_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 HEADCROP:$HEADCROP_VALUE
    			;;
    		"MINLENGTH")
    			echo -n "Specify min length of reads to be kept: "
    			read -e MINLENGTH_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 MINLENGTH:$MINLENGTH_VALUE
    			;;
    		"TOPHRED33")
    			echo "Starting..."
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TOPHRED33
    			;;
    		"TOPHRED64")
    			echo "Starting..."
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TOPHRED64
    			;;
            "GENERATE VISUAL REPORT")
                echo "Visual report will now be generated..."
                PS3='Pick a tool: '
                    phred_options=("prinseq-lite, prinseq-graphs" "fastqc")
                    select VISUAL_CHOICE in "${visual_options[@]}"
                    do
                    case $VISUAL_CHOICE in
    		            "prinseq-lite, prinseq-graphs")
                            echo "Generating graph file..."
                            prinseq-lite -fastq $FASTQ_DIR/$FASTQ1.fastq -out_format 4 -graph_data fastq1.gd
                            prinseq-lite -fastq $FASTQ_DIR/$FASTQ2.fastq -out_format 4 -graph_data fastq2.gd
                            echo "Generating html report..."
                            mkdir read1_report
                            mkdir read2_report
                            cd read1_report
                            prinseq-graphs ../fastq1.gd -html_all
                            cd ../read2_report
                            prinseq-graphs ../fastq2.gd -html_all
    	            		break
    		            	;;
                        "fastqc"
                            fastqc $P_R1 $P_R2
                            echo -n "Type command for web browser to view html results in: "
                            read -e BROWSER
                            $BROWSER $FASTQ_DIR/$FASTQ1"_fastqc"/fastqc_report.html $FASTQ_DIR/$FASTQ2"_fastqc"/fastqc_report.html
                            ;;
                	esac
                    done
                ;;
    		"SKIP/CONTINUE")
    			echo "Continuing..."
    			break
    			;;
    	esac
    done
    
    ##############################
    # BOWTIE AND TOPHAT
    echo -n "Enter path to reference FastA for index build: "
    read -e "REF_PATH"
    echo -n "If desired, enter manual flags to be used by tophat: "
    read -e TOPHAT_FLAGS
    bowtie/bowtie2-build $REF_PATH
    $TOPHAT_PATH $TOPHAT_FLAGS $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq
    tophat_out/accepted_hits.bam=$ACC_HITS
    samtools build $ACC_HITS
    
    # TABLET
    tablet $REF_PATH $ACC_HITS
    Thanks,
    vinay427
  • vinay427
    Member
    • Jul 2012
    • 10

    #2
    Bump, does anyone have any neat ideas or advice about a project that would be helpful to the community? Of course, it would be open-source licensed under GPL.

    Comment

    • pbluescript
      Senior Member
      • Nov 2009
      • 224

      #3
      It would be nice if there was a tool that could take a bam file aligned to a transcriptome and along with a gtf file convert the transcriptome coordinates to genome coordinates.

      I have asked if such a tool exists here before and done my own searching with no luck. I've cobbled together a solution to the problem, but it's not fast and it only works in my specific case.

      Comment

      • vinay427
        Member
        • Jul 2012
        • 10

        #4
        Anyone else feel that pbluescript's idea would be useful? I'm only trying to gain a general idea of people's opinions toward such a tool (converting .bam aligned to transcriptome and .gtf to genome coordinates).

        Comment

        • alec
          Member
          • Apr 2011
          • 18

          #5
          If you go with the conversion tool, it would also be useful if it could convert from, say, hg18 to hg19 using a liftover file.

          My vote would be for a Python wrapper to bowtie or bwa. For example, I often have barcoded data and would like to write a loop that reads a fastq file, removes the barcode, aligns the read, and then uses pysam to write it to a different bam file depending on the barcode. This would also give me more control over how multiply aligned or mispaired reads are handled, or let me store additional information with the alignment.

          Comment

          • A_Morozov
            Member
            • Feb 2011
            • 40

            #6
            vinay427, you can contribute quite a bit to BioPerl and BioPython modules if you know any of these languages. Just check out mailing lists: there always are some requests.

            Comment

            • pmiguel
              Senior Member
              • Aug 2008
              • 2328

              #7
              I would like a tool that visually displays the components of an RNA sample sequence by length and abundance. What I have in mind would be a 2 dimensional plot. X-axis would be "nucleotides" of length of a transcript. Y-axis would be read counts for the given transcript. The Y-axis might be usefully log-transformed.

              So, if you did no ribosomal depletion, and sequenced 30 million reads into a sample, somewhere on the order of 90% of the reads (27 million) would be 28S and 18S nuclear rRNA. So those might be plotted as being 4 thousand and 2 thousand nucleotides on the x-axis and 18 million reads "thick" and 9 million reads thick respectively. Below them would be the next rank of high expressers -- possibly mitochondrial rRNA, or some abundant tissue specific gene. Or some small transcriptome contaminant, like from bacteria contaminating a cell culture.

              Then subsequent rows of decreasing thickness would populated the lower part of the diagram.

              The purpose of this would be to give a general, visual, overview of the composition of an RNA sample. I don't think you get that using a full genome viewer.

              --
              Phillip

              Comment

              • Jean
                Member
                • Nov 2008
                • 37

                #8
                Originally posted by pmiguel View Post
                I would like a tool that visually displays the components of an RNA sample sequence by length and abundance. What I have in mind would be a 2 dimensional plot. X-axis would be "nucleotides" of length of a transcript. Y-axis would be read counts for the given transcript. The Y-axis might be usefully log-transformed.

                So, if you did no ribosomal depletion, and sequenced 30 million reads into a sample, somewhere on the order of 90% of the reads (27 million) would be 28S and 18S nuclear rRNA. So those might be plotted as being 4 thousand and 2 thousand nucleotides on the x-axis and 18 million reads "thick" and 9 million reads thick respectively. Below them would be the next rank of high expressers -- possibly mitochondrial rRNA, or some abundant tissue specific gene. Or some small transcriptome contaminant, like from bacteria contaminating a cell culture.

                Then subsequent rows of decreasing thickness would populated the lower part of the diagram.

                The purpose of this would be to give a general, visual, overview of the composition of an RNA sample. I don't think you get that using a full genome viewer.

                --
                Phillip

                Couldn't this be plotted in R with the mapping and transcriptome information?

                Comment

                • pmiguel
                  Senior Member
                  • Aug 2008
                  • 2328

                  #9
                  Originally posted by Jean View Post
                  Couldn't this be plotted in R with the mapping and transcriptome information?
                  That would be fine with me.

                  --
                  Phillip

                  Comment

                  • vinay427
                    Member
                    • Jul 2012
                    • 10

                    #10
                    Originally posted by alec View Post
                    My vote would be for a Python wrapper to bowtie or bwa. For example, I often have barcoded data and would like to write a loop that reads a fastq file, removes the barcode, aligns the read, and then uses pysam to write it to a different bam file depending on the barcode. This would also give me more control over how multiply aligned or mispaired reads are handled, or let me store additional information with the alignment.
                    What exactly do you mean by a wrapper? From what I've seen, would it basically pass through arguments to bowtie (or bowtie2) so that you can integrate it into your Python loop?

                    I found this example; is this essentially what you would want a Python wrapper for Bowtie to do, with different arguments?
                    Code:
                    input_filenames = ['BAM-1.bam','BAM-2.bam']
                    output_filename = 'all.bam'
                    
                    merge_parameters = [output_filename] + input_filenames
                    pysam.merge(*merge_parameters)

                    Comment

                    • severin
                      Genome Informatics Facility
                      • Sep 2009
                      • 105

                      #11
                      GBrowse2

                      I would like to see an extension to GBrowse2 that would permit for on the fly joining of scaffolds based on evidence in the tracks.

                      Comment

                      • alec
                        Member
                        • Apr 2011
                        • 18

                        #12
                        Originally posted by vinay427 View Post
                        What exactly do you mean by a wrapper? From what I've seen, would it basically pass through arguments to bowtie (or bowtie2) so that you can integrate it into your Python loop?
                        I'm thinking of being able to align sequences one by one. For example, my reads often have custom barcodes at the beginning that need to be removed before alignment and written to a different file for each barcode:

                        Code:
                        files = dict([
                            ["ATGCG", pysam.Samfile("sample1.bam", "wb")],
                            ["GACTA", pysam.Samfile("sample2.bam", "wb")]
                        ])
                        bwt = bowtie.load("/path/to/index")
                        for((seq, qual) in FastqReader("reads.fastq")):
                            barcode = seq[0:5]
                            if barcode in files:
                                files[barcode].write(bwt.align(seq[5:], qual[5:]))
                            else:
                                pass #handle nonmatching barcodes here
                        There are also a lot of cases where I would want to store additional information in the auxiliary data, set certain flags, or otherwise modify the alignment before writing it. Of course, this can already be done with pysam, the trick is getting the alignment in the first place.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Pathogen Surveillance with Advanced Genomic Tools
                          by seqadmin




                          The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                          03-24-2025, 11:48 AM
                        • seqadmin
                          New Genomics Tools and Methods Shared at AGBT 2025
                          by seqadmin


                          This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                          The Headliner
                          The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                          03-03-2025, 01:39 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 03-20-2025, 05:03 AM
                        0 responses
                        49 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-19-2025, 07:27 AM
                        0 responses
                        57 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-18-2025, 12:50 PM
                        0 responses
                        50 views
                        0 reactions
                        Last Post seqadmin  
                        Started by seqadmin, 03-03-2025, 01:15 PM
                        0 responses
                        201 views
                        0 reactions
                        Last Post seqadmin  
                        Working...