Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Useful bioinformatics tool ideas?

    Hi,

    I am working at a lab for the next several weeks, and as I have strong experience in the computer side of things, I wanted to develop a tool that would be useful. My initial idea is to compile various tools such as Trimmomatic, tophat, bowtie2, cufflinks, samtools, prinseq-lite, and prinseq-graph to make a sort of "one-stop shop" script for those assembling and visually representing a paired end FastQ file. I had issues learning about how these tools fit together, and figured that this script might help do so. I've attached it below, but it's very dysfunctional as it's not nearly finished. However, it gives you an idea of what it does. I'm not quite sure if this niche is already covered by another (GPL) tool? Do any of you have an idea for something that is greatly needed and would hopefully prove very helpful in this community?

    Code:
    #!/bin/bash
    
    ############################
    # TITLE
    clear
    echo "FastQ Paired/Single End Sequence Analyzer"
    echo "Licensed under GPL"
    echo
    
    # GET MAIN INFORMATION
    echo -n "Enter path of first FastQ file (without extension): "
    read -e FASTQ1
    echo -n "Enter path of second FastQ file (without extension): "
    read -e FASTQ2
    echo
    ../$FASTQ1=FASTQ_DIR
    echo -n "Enter path of desired paired read 1 file: "
    read -e P_R1
    echo -n "Enter path of desired UNpaired read 1 file: "
    read -e U_R1
    echo -n "Enter path of desired paired read 2 file: "
    read -e P_R2
    echo -n "Enter path of desired UNpaired read 2 file: "
    read -e U_R2
    echo -n "Enter path of desired trim log: "
    read -e TRIM_LOG
    echo -n "Enter number of threads to be used: "
    read -e THREADS
    
    # QUALITY MENU
    PS3='Pick a quality scale: '
    phred_options=("phred33" "phred64")
    select PHRED_CHOICE in "${phred_options[@]}"
    do
        case $PHRED_CHOICE in
        	"phred33")
    			PHRED=phred33
    			break
    			;;
    		"phred64")
    			PHRED=phred64
    			break
    			;;
    	esac
    done
    
    #################################
    # MAIN ACTION MENU FROM TRIMMOMATIC
    PS3='Pick a step: '
    select TRIM_CHOICE in ILLUMINACLIP SLIDINGWINDOW LEADING TRAILING CROP HEADCROP MINLENGTH TOPHRED33 TOPHRED64 SKIP/CONTINUE
    do
    	case "$TRIM_CHOICE" in
    		"ILLUMINACLIP")
    			echo "Starting"
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS $PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 ILLUMINACLIP
    			;;
    		"SLIDINGWINDOW")
    			echo -n "Specify window size (number of bases to average across): "
    			read -e SLIDINGWINDOW_WINDOW
    			echo -n "Specify required quality: "
    			read -e SLIDINGWINDOW_QUALITY
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 SLIDINGWINDOW:$SLIDINGWINDOW_WINDOW:$SLIDINGWINDOW_QUALITY
    			;;
    		"LEADING")
    			echo -n "Specify min quality: "
    			read -e LEADING_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 LEADING:$LEADING_VALUE
    			;;
    		"TRAILING")
    			echo -n "Specify min quality"
    			read -e TRAILING_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting"
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TRAILING:$TRAILING_VALUE
    			;;
    		"CROP")
    			echo -n "Specify number of bases to keep (from start of read): "
    			read -e CROP_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 CROP:$CROP_VALUE
    			;;
    		"HEADCROP")
    			echo -n "Specify number of bases to remove (from start of read): "
    			read -e HEADCROP_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 HEADCROP:$HEADCROP_VALUE
    			;;
    		"MINLENGTH")
    			echo -n "Specify min length of reads to be kept: "
    			read -e MINLENGTH_VALUE
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			echo "Starting..."
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 MINLENGTH:$MINLENGTH_VALUE
    			;;
    		"TOPHRED33")
    			echo "Starting..."
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TOPHRED33
    			;;
    		"TOPHRED64")
    			echo "Starting..."
    			echo -n "If desired, enter manual flags to be used: "
    			read -e TRIM_FLAGS
    			java -classpath trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticPE $TRIM_FLAGS -threads $THREADS -$PHRED -trimlog $TRIM_LOG $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq $P_R1 $U_R1 $P_R2 $U_R2 TOPHRED64
    			;;
            "GENERATE VISUAL REPORT")
                echo "Visual report will now be generated..."
                PS3='Pick a tool: '
                    phred_options=("prinseq-lite, prinseq-graphs" "fastqc")
                    select VISUAL_CHOICE in "${visual_options[@]}"
                    do
                    case $VISUAL_CHOICE in
    		            "prinseq-lite, prinseq-graphs")
                            echo "Generating graph file..."
                            prinseq-lite -fastq $FASTQ_DIR/$FASTQ1.fastq -out_format 4 -graph_data fastq1.gd
                            prinseq-lite -fastq $FASTQ_DIR/$FASTQ2.fastq -out_format 4 -graph_data fastq2.gd
                            echo "Generating html report..."
                            mkdir read1_report
                            mkdir read2_report
                            cd read1_report
                            prinseq-graphs ../fastq1.gd -html_all
                            cd ../read2_report
                            prinseq-graphs ../fastq2.gd -html_all
    	            		break
    		            	;;
                        "fastqc"
                            fastqc $P_R1 $P_R2
                            echo -n "Type command for web browser to view html results in: "
                            read -e BROWSER
                            $BROWSER $FASTQ_DIR/$FASTQ1"_fastqc"/fastqc_report.html $FASTQ_DIR/$FASTQ2"_fastqc"/fastqc_report.html
                            ;;
                	esac
                    done
                ;;
    		"SKIP/CONTINUE")
    			echo "Continuing..."
    			break
    			;;
    	esac
    done
    
    ##############################
    # BOWTIE AND TOPHAT
    echo -n "Enter path to reference FastA for index build: "
    read -e "REF_PATH"
    echo -n "If desired, enter manual flags to be used by tophat: "
    read -e TOPHAT_FLAGS
    bowtie/bowtie2-build $REF_PATH
    $TOPHAT_PATH $TOPHAT_FLAGS $FASTQ_DIR/$FASTQ1.fastq $FASTQ_DIR/$FASTQ2.fastq
    tophat_out/accepted_hits.bam=$ACC_HITS
    samtools build $ACC_HITS
    
    # TABLET
    tablet $REF_PATH $ACC_HITS
    Thanks,
    vinay427

  • #2
    Bump, does anyone have any neat ideas or advice about a project that would be helpful to the community? Of course, it would be open-source licensed under GPL.

    Comment


    • #3
      It would be nice if there was a tool that could take a bam file aligned to a transcriptome and along with a gtf file convert the transcriptome coordinates to genome coordinates.

      I have asked if such a tool exists here before and done my own searching with no luck. I've cobbled together a solution to the problem, but it's not fast and it only works in my specific case.

      Comment


      • #4
        Anyone else feel that pbluescript's idea would be useful? I'm only trying to gain a general idea of people's opinions toward such a tool (converting .bam aligned to transcriptome and .gtf to genome coordinates).

        Comment


        • #5
          If you go with the conversion tool, it would also be useful if it could convert from, say, hg18 to hg19 using a liftover file.

          My vote would be for a Python wrapper to bowtie or bwa. For example, I often have barcoded data and would like to write a loop that reads a fastq file, removes the barcode, aligns the read, and then uses pysam to write it to a different bam file depending on the barcode. This would also give me more control over how multiply aligned or mispaired reads are handled, or let me store additional information with the alignment.

          Comment


          • #6
            vinay427, you can contribute quite a bit to BioPerl and BioPython modules if you know any of these languages. Just check out mailing lists: there always are some requests.

            Comment


            • #7
              I would like a tool that visually displays the components of an RNA sample sequence by length and abundance. What I have in mind would be a 2 dimensional plot. X-axis would be "nucleotides" of length of a transcript. Y-axis would be read counts for the given transcript. The Y-axis might be usefully log-transformed.

              So, if you did no ribosomal depletion, and sequenced 30 million reads into a sample, somewhere on the order of 90% of the reads (27 million) would be 28S and 18S nuclear rRNA. So those might be plotted as being 4 thousand and 2 thousand nucleotides on the x-axis and 18 million reads "thick" and 9 million reads thick respectively. Below them would be the next rank of high expressers -- possibly mitochondrial rRNA, or some abundant tissue specific gene. Or some small transcriptome contaminant, like from bacteria contaminating a cell culture.

              Then subsequent rows of decreasing thickness would populated the lower part of the diagram.

              The purpose of this would be to give a general, visual, overview of the composition of an RNA sample. I don't think you get that using a full genome viewer.

              --
              Phillip

              Comment


              • #8
                Originally posted by pmiguel View Post
                I would like a tool that visually displays the components of an RNA sample sequence by length and abundance. What I have in mind would be a 2 dimensional plot. X-axis would be "nucleotides" of length of a transcript. Y-axis would be read counts for the given transcript. The Y-axis might be usefully log-transformed.

                So, if you did no ribosomal depletion, and sequenced 30 million reads into a sample, somewhere on the order of 90% of the reads (27 million) would be 28S and 18S nuclear rRNA. So those might be plotted as being 4 thousand and 2 thousand nucleotides on the x-axis and 18 million reads "thick" and 9 million reads thick respectively. Below them would be the next rank of high expressers -- possibly mitochondrial rRNA, or some abundant tissue specific gene. Or some small transcriptome contaminant, like from bacteria contaminating a cell culture.

                Then subsequent rows of decreasing thickness would populated the lower part of the diagram.

                The purpose of this would be to give a general, visual, overview of the composition of an RNA sample. I don't think you get that using a full genome viewer.

                --
                Phillip

                Couldn't this be plotted in R with the mapping and transcriptome information?

                Comment


                • #9
                  Originally posted by Jean View Post
                  Couldn't this be plotted in R with the mapping and transcriptome information?
                  That would be fine with me.

                  --
                  Phillip

                  Comment


                  • #10
                    Originally posted by alec View Post
                    My vote would be for a Python wrapper to bowtie or bwa. For example, I often have barcoded data and would like to write a loop that reads a fastq file, removes the barcode, aligns the read, and then uses pysam to write it to a different bam file depending on the barcode. This would also give me more control over how multiply aligned or mispaired reads are handled, or let me store additional information with the alignment.
                    What exactly do you mean by a wrapper? From what I've seen, would it basically pass through arguments to bowtie (or bowtie2) so that you can integrate it into your Python loop?

                    I found this example; is this essentially what you would want a Python wrapper for Bowtie to do, with different arguments?
                    Code:
                    input_filenames = ['BAM-1.bam','BAM-2.bam']
                    output_filename = 'all.bam'
                    
                    merge_parameters = [output_filename] + input_filenames
                    pysam.merge(*merge_parameters)

                    Comment


                    • #11
                      GBrowse2

                      I would like to see an extension to GBrowse2 that would permit for on the fly joining of scaffolds based on evidence in the tracks.

                      Comment


                      • #12
                        Originally posted by vinay427 View Post
                        What exactly do you mean by a wrapper? From what I've seen, would it basically pass through arguments to bowtie (or bowtie2) so that you can integrate it into your Python loop?
                        I'm thinking of being able to align sequences one by one. For example, my reads often have custom barcodes at the beginning that need to be removed before alignment and written to a different file for each barcode:

                        Code:
                        files = dict([
                            ["ATGCG", pysam.Samfile("sample1.bam", "wb")],
                            ["GACTA", pysam.Samfile("sample2.bam", "wb")]
                        ])
                        bwt = bowtie.load("/path/to/index")
                        for((seq, qual) in FastqReader("reads.fastq")):
                            barcode = seq[0:5]
                            if barcode in files:
                                files[barcode].write(bwt.align(seq[5:], qual[5:]))
                            else:
                                pass #handle nonmatching barcodes here
                        There are also a lot of cases where I would want to store additional information in the auxiliary data, set certain flags, or otherwise modify the alignment before writing it. Of course, this can already be done with pysam, the trick is getting the alignment in the first place.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        11 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        68 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X