Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I shorten a genome sequence to secure my workflow is properly functioning?

    I am Moritz from the University Heidelberg in Germany.

    For my bachelor thesis I have 20 large (25-30 GB) genome files (.txt.gz) by patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I have got to try out several approaches.

    Steps included are:

    Alignment (BWA (Transform sai and sam)) against hg19.fasta
    Transform (samtols)
    Dedupe
    The problem I have is that in order to try out my bpipe workflow, I have to take a whole sequence of 30 GB and start from the beginning. That takes a lot of time. So my questions are:

    How can I shorten one file?

    Where can I find a short sequence that I can use to test my pipeline?

  • #2
    You can use the suggestions in this thread to sample reads from your sequence files for testing small subsets: http://seqanswers.com/forums/showthread.php?t=16505

    Comment


    • #3
      If you want to have a little number of reads (suppose you have a .fastq file) in your "test sample", an easy way is :
      head -10000 your_big_file.fastq > sample_test_2500_reads.fastq
      One read = 4 rows in .fastq....so, in order to have (for example!!) 2500 reads in your "test sample", you have to set -10000 in head option.

      Changing this option, you can change the number of reads.

      This is a fisrt easy way.

      Comment


      • #4
        Like I wrote, I have several files which end on .txt.gz
        one example:
        D1n67_GDNA_12s00128-1-2_Hose_lane00128_1_2_sequence.txt.gz
        Is that an answer to the file format? (I'm totally new to Genetics)
        So can I take my file, gunzip it and then run your line?

        Comment


        • #5
          Yes, you can gunzip it and then do what mattia suggested.

          Keep in mind that the first lines of many fastq files will NOT have sequence data that is of similar quality to the majority of the file. If you solely want to make the pipeline work that may not be a problem (unless the reads are all "NNN...NNN"s and you're trying to align them), but if you want to see reads that are representative of the file as a whole then you should look in the thread that GenoMax linked too.

          Comment


          • #6
            Now I am stuck with this failure:

            Can anybody help me out if this? This is my Bpipe pipeline:
            REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
            PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
            PLATYPUS_HOME="~/bin/Platypus_0.1.9"
            STAMPY_HOME="~/bin/stampy-1.0.17"
            BWA_HOME="/home/trr/bwa-0.7.5a"

            seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads.txt.gz"

            //readgroup information:
            //rg_id="lane711s003155"
            rg_id="lane712s006433"
            rg_lb="nextera"
            rg_pl="ILLUMINA"
            rg_pu="flowcell-barcode.lane"
            rg_sm="GDNA"

            //##############################################################

            //Alignment

            //##############################################################
            //BWA
            @Transform("sai")
            align_bwa = {
            exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
            }

            @Transform("sam")
            sampe_bwa = {
            exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
            }

            //##############################################################

            //SAM to BAM

            //##############################################################

            //@Transform("bam")
            sort = {
            exec "samtools view -bSu $input | samtools sort - $output"
            }

            //##############################################################

            //Remove Duplicates

            //##############################################################

            @Filter("dedupe")
            dedupe = {
            exec """
            java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
            METRICS_FILE=out.metrics
            REMOVE_DUPLICATES=true
            ASSUME_SORTED=true
            VALIDATION_STRINGENCY=LENIENT
            INPUT=$input
            OUTPUT=$output
            """
            }

            //##############################################################

            //Run Pipeline

            //##############################################################

            Bpipe.run {
            "sample_test_10k_reads*" * [align_bwa] + sampe_bwa + sort + dedupe
            }
            and I run it with
            Code:
            bpipe run sample_test.pipe sample_test_10k_reads.txt.gz
            Last edited by moritz1; 07-12-2013, 08:40 AM.

            Comment


            • #7
              I may be wrong as I haven't used this, but it looks like your pipeline wants a $seq1 and a $seq2, but you've only supplied a $seq1. Is that possibly the problem?

              When I have issues with alignment commands, I try to do it with the bare necessities to make it run and then add in more complicated aspects of the command once I can make sure I'm typing in the basic command correctly.

              Comment


              • #8
                Oh yeah thanks a lot, I think that should be it. I'll edit later when I can try it out!

                Comment


                • #9
                  Now I am getting this error, although I have changed the input to two files:



                  Code:
                  REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
                  PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
                  PLATYPUS_HOME="~/bin/Platypus_0.1.9"
                  STAMPY_HOME="~/bin/stampy-1.0.17"
                  BWA_HOME="/home/trr/bwa-0.7.5a"
                  
                  seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads1.txt.gz"
                  seq2="/genedata/sample_test_10k_reads/sample_test_10k_reads2.txt.gz"
                   
                  //readgroup information:
                  rg_id="lane712s006433"
                  rg_lb="nextera"
                  rg_pl="ILLUMINA"
                  rg_pu="flowcell-barcode.lane"
                  rg_sm="GDNA"
                   
                  //#############################################################
                  //Alignment
                  //##############################################################
                  //BWA
                  @Transform("sai")
                  align_bwa = {
                        exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
                  }
                   
                  @Transform("sam")
                  sampe_bwa = {
                        exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
                  }
                   
                  //##############################################################
                  //SAM to BAM
                  //##############################################################
                   
                  //@Transform("bam")
                  sort = {
                          exec "samtools view -bSu $input  | samtools sort - $output"
                  }
                   
                  //##############################################################
                   
                  //Remove Duplicates
                   
                  //##############################################################
                   
                  @Filter("dedupe")
                  dedupe = {
                          exec """
                             java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                                             MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                                             METRICS_FILE=out.metrics
                                             REMOVE_DUPLICATES=true
                                             ASSUME_SORTED=true  
                                             VALIDATION_STRINGENCY=LENIENT
                                             INPUT=$input
                                             OUTPUT=$output
                         """
                  }
                   
                  //##############################################################
                  //Run Pipeline
                  //##############################################################
                   
                  Bpipe.run {
                      "sample_test_10k_reads%" * [align_bwa] + sampe_bwa + sort + dedupe
                  }
                  and I run it with:
                  Code:
                  bpipe run sample_test_pipeline.pipe sample_test_10k_reads*

                  Comment


                  • #10
                    Shouldn't that one line read:

                    Code:
                    exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE [B]$inputs1.sai $inputs2.sai[/B] $seq1 $seq2 > $output"

                    Comment


                    • #11
                      Yes, I got it working. Thanks a lot.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Advancing Precision Medicine for Rare Diseases in Children
                        by seqadmin




                        Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                        12-16-2024, 07:57 AM
                      • seqadmin
                        Recent Advances in Sequencing Technologies
                        by seqadmin



                        Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                        Long-Read Sequencing
                        Long-read sequencing has seen remarkable advancements,...
                        12-02-2024, 01:49 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 12-17-2024, 10:28 AM
                      0 responses
                      25 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-13-2024, 08:24 AM
                      0 responses
                      42 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-12-2024, 07:41 AM
                      0 responses
                      28 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-11-2024, 07:45 AM
                      0 responses
                      42 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X