Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I shorten a genome sequence to secure my workflow is properly functioning?

    I am Moritz from the University Heidelberg in Germany.

    For my bachelor thesis I have 20 large (25-30 GB) genome files (.txt.gz) by patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I have got to try out several approaches.

    Steps included are:

    Alignment (BWA (Transform sai and sam)) against hg19.fasta
    Transform (samtols)
    Dedupe
    The problem I have is that in order to try out my bpipe workflow, I have to take a whole sequence of 30 GB and start from the beginning. That takes a lot of time. So my questions are:

    How can I shorten one file?

    Where can I find a short sequence that I can use to test my pipeline?

  • #2
    You can use the suggestions in this thread to sample reads from your sequence files for testing small subsets: http://seqanswers.com/forums/showthread.php?t=16505

    Comment


    • #3
      If you want to have a little number of reads (suppose you have a .fastq file) in your "test sample", an easy way is :
      head -10000 your_big_file.fastq > sample_test_2500_reads.fastq
      One read = 4 rows in .fastq....so, in order to have (for example!!) 2500 reads in your "test sample", you have to set -10000 in head option.

      Changing this option, you can change the number of reads.

      This is a fisrt easy way.

      Comment


      • #4
        Like I wrote, I have several files which end on .txt.gz
        one example:
        D1n67_GDNA_12s00128-1-2_Hose_lane00128_1_2_sequence.txt.gz
        Is that an answer to the file format? (I'm totally new to Genetics)
        So can I take my file, gunzip it and then run your line?

        Comment


        • #5
          Yes, you can gunzip it and then do what mattia suggested.

          Keep in mind that the first lines of many fastq files will NOT have sequence data that is of similar quality to the majority of the file. If you solely want to make the pipeline work that may not be a problem (unless the reads are all "NNN...NNN"s and you're trying to align them), but if you want to see reads that are representative of the file as a whole then you should look in the thread that GenoMax linked too.

          Comment


          • #6
            Now I am stuck with this failure:

            Can anybody help me out if this? This is my Bpipe pipeline:
            REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
            PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
            PLATYPUS_HOME="~/bin/Platypus_0.1.9"
            STAMPY_HOME="~/bin/stampy-1.0.17"
            BWA_HOME="/home/trr/bwa-0.7.5a"

            seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads.txt.gz"

            //readgroup information:
            //rg_id="lane711s003155"
            rg_id="lane712s006433"
            rg_lb="nextera"
            rg_pl="ILLUMINA"
            rg_pu="flowcell-barcode.lane"
            rg_sm="GDNA"

            //##############################################################

            //Alignment

            //##############################################################
            //BWA
            @Transform("sai")
            align_bwa = {
            exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
            }

            @Transform("sam")
            sampe_bwa = {
            exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
            }

            //##############################################################

            //SAM to BAM

            //##############################################################

            //@Transform("bam")
            sort = {
            exec "samtools view -bSu $input | samtools sort - $output"
            }

            //##############################################################

            //Remove Duplicates

            //##############################################################

            @Filter("dedupe")
            dedupe = {
            exec """
            java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
            METRICS_FILE=out.metrics
            REMOVE_DUPLICATES=true
            ASSUME_SORTED=true
            VALIDATION_STRINGENCY=LENIENT
            INPUT=$input
            OUTPUT=$output
            """
            }

            //##############################################################

            //Run Pipeline

            //##############################################################

            Bpipe.run {
            "sample_test_10k_reads*" * [align_bwa] + sampe_bwa + sort + dedupe
            }
            and I run it with
            Code:
            bpipe run sample_test.pipe sample_test_10k_reads.txt.gz
            Last edited by moritz1; 07-12-2013, 08:40 AM.

            Comment


            • #7
              I may be wrong as I haven't used this, but it looks like your pipeline wants a $seq1 and a $seq2, but you've only supplied a $seq1. Is that possibly the problem?

              When I have issues with alignment commands, I try to do it with the bare necessities to make it run and then add in more complicated aspects of the command once I can make sure I'm typing in the basic command correctly.

              Comment


              • #8
                Oh yeah thanks a lot, I think that should be it. I'll edit later when I can try it out!

                Comment


                • #9
                  Now I am getting this error, although I have changed the input to two files:



                  Code:
                  REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
                  PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
                  PLATYPUS_HOME="~/bin/Platypus_0.1.9"
                  STAMPY_HOME="~/bin/stampy-1.0.17"
                  BWA_HOME="/home/trr/bwa-0.7.5a"
                  
                  seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads1.txt.gz"
                  seq2="/genedata/sample_test_10k_reads/sample_test_10k_reads2.txt.gz"
                   
                  //readgroup information:
                  rg_id="lane712s006433"
                  rg_lb="nextera"
                  rg_pl="ILLUMINA"
                  rg_pu="flowcell-barcode.lane"
                  rg_sm="GDNA"
                   
                  //#############################################################
                  //Alignment
                  //##############################################################
                  //BWA
                  @Transform("sai")
                  align_bwa = {
                        exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
                  }
                   
                  @Transform("sam")
                  sampe_bwa = {
                        exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
                  }
                   
                  //##############################################################
                  //SAM to BAM
                  //##############################################################
                   
                  //@Transform("bam")
                  sort = {
                          exec "samtools view -bSu $input  | samtools sort - $output"
                  }
                   
                  //##############################################################
                   
                  //Remove Duplicates
                   
                  //##############################################################
                   
                  @Filter("dedupe")
                  dedupe = {
                          exec """
                             java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                                             MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                                             METRICS_FILE=out.metrics
                                             REMOVE_DUPLICATES=true
                                             ASSUME_SORTED=true  
                                             VALIDATION_STRINGENCY=LENIENT
                                             INPUT=$input
                                             OUTPUT=$output
                         """
                  }
                   
                  //##############################################################
                  //Run Pipeline
                  //##############################################################
                   
                  Bpipe.run {
                      "sample_test_10k_reads%" * [align_bwa] + sampe_bwa + sort + dedupe
                  }
                  and I run it with:
                  Code:
                  bpipe run sample_test_pipeline.pipe sample_test_10k_reads*

                  Comment


                  • #10
                    Shouldn't that one line read:

                    Code:
                    exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE [B]$inputs1.sai $inputs2.sai[/B] $seq1 $seq2 > $output"

                    Comment


                    • #11
                      Yes, I got it working. Thanks a lot.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      9 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X