  • Newbler (GS de novo assembler) paired-end input

    Hello all.

    I am new to genome assembly and am using Newbler to assemble two paired-end FASTQ files that are 4.09 GB each. When you input these files into the program, you can mark them as paired-end. Does this mean it will recognize the two input files as a pair?

    I have also merged the two paired-end files into a single file using FLASH. When I input this merged file, should I still choose the paired-end option in Newbler?

    Also, I am getting an error that I have run out of memory, so I am splitting up my files using fastq-splitter.pl. How would I split the separate paired-end files for input into Newbler?

    Thanks in advance

  • #2
    You will not get an optimal assembly if you split reads into multiple subsets and assemble them independently. In fact, you'll get a mess. If you run out of memory, you need to use a computer that has more memory, or a different algorithm.


    • #3
      Okay, thanks for the important info.

      Do you know whether Newbler will recognize two input files as a pair if I use the paired-end option on both of them? Or is a merged file of the two paired-end reads a better approach?


      • #4
        Sorry, I have never used Newbler, so I don't know its idiosyncrasies... but hopefully someone else does!

        Typically, if you have overlapping reads, an OLC assembler will perform best with merged reads. Flash does not perform well in my tests, though. Bearing in mind that I am biased, being the developer, I recommend BBMerge for joining paired reads prior to assembly.

        What was your merge rate? The best procedure depends on that... if the insert size was too long to merge a substantial fraction of the reads, it's better to skip merging.


        • #5
          The max read length is 250 bp, which I used as the maxOverlap parameter in flash. The results of this merge:

          Code:
          [FLASH] Read combination statistics:
          [FLASH]  Total pairs: 8576138
          [FLASH]  Combined pairs:  6056207
          [FLASH]  Uncombined pairs: 2519931
          [FLASH]  Percent combined: 70.62%
          Note that when I set maxOverlap to 225, I got a warning that a high proportion of pairs overlapped by more than 225 bp, which is why I stuck with 250. Could this be a poor choice, given that my max read length is also 250 bp?

          I found the max read length by running this command, which prints each read with its length, and looking through the output:

          Code:
          awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' <read> > <output.txt>
          Last edited by ronaldrcutler; 07-14-2016, 03:32 AM.
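
Scanning the per-read output by eye works, but awk can also keep track of the longest read itself. A minimal sketch, assuming a standard FASTQ file with four lines per record (no wrapped sequence lines); the function name `max_read_length` and the file name in the usage comment are hypothetical, chosen only for illustration:

```shell
# Print the length of the longest read in a FASTQ file.
# Assumes four lines per record, so the sequence is line 2 of every 4.
max_read_length() {
  awk 'NR % 4 == 2 { n = length($0); if (n > max) max = n }
       END { print max + 0 }' "$1"
}

# Usage (hypothetical file name): max_read_length reads_1.fastq
```

This prints a single number instead of one line per read, which avoids writing a large intermediate file for multi-gigabyte inputs.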


          • #6
            Is this 454 data? Is that the reason for using newbler?


            • #7
              No, this is FASTQ data.


              • #8
                From which platform? How big is the genome expected to be? What is the read length?


                • #9
                  Illumina, I believe.

                  The merged paired-end file (using flash) has 6056207 sequences, 1451352720 bp
                  The mate1 paired-end file has 8576138 sequences, 1798377920 bp

                  The read lengths are 250 bp
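
Dividing total bases by sequence count in the figures above gives a mean length of roughly 239.6 bp for the merged reads and roughly 209.7 bp for mate 1. A minimal sketch for producing counts like these from a FASTQ file, assuming four lines per record; `fastq_stats` is a hypothetical helper name, not part of FLASH or Newbler:

```shell
# Print sequence count, total bases, and mean read length for a FASTQ file.
# Assumes four lines per record (sequence on line 2 of every 4).
fastq_stats() {
  awk 'NR % 4 == 2 { seqs++; bases += length($0) }
       END {
         mean = (seqs > 0) ? bases / seqs : 0
         printf "%d sequences, %d bp, mean %.1f bp\n", seqs, bases, mean
       }' "$1"
}

# Usage (hypothetical file name): fastq_stats merged.fastq
```

Running it on both the merged file and each mate file gives the numbers needed to compare read counts and lengths before and after merging.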
