  • Newbler (GS de novo assembler) paired end input

    Hello all.

    I am new to genome assembly and am using Newbler to assemble two paired-end FASTQ files that are 4.09 GB each. I see that when you input these files into the program, you can choose whether they are paired-end. Does this mean it will recognize both input files as being paired-end?

    I have also gone ahead and merged the two paired-end files into a single file using FLASH. When I input this merged file, should I still choose the paired-end option in Newbler?

    Also, I am getting an error that I have run out of memory, so I am splitting up my files using fastq-splitter.pl. However, how would I split up separate paired-end files for input into Newbler?

    Thanks in advance

  • #2
    You will not get an optimal assembly if you split reads into multiple subsets and assemble them independently. In fact, you'll get a mess. If you run out of memory, you need to use a computer that has more memory, or a different algorithm.


    • #3
      Okay thanks for the important info.

      Do you know whether Newbler will recognize two paired-end files as a pair if I input both and use the paired-end option on each of them? Or is using a merged file of the two paired-end reads a better approach?


      • #4
        Sorry, I have never used Newbler, so I don't know its idiosyncrasies... but hopefully someone else does!

        Typically, if you have overlapping reads, an OLC assembler will perform best with merged reads. Flash does not perform well in my tests, though. Bearing in mind that I am biased, being the developer, I recommend BBMerge for joining paired reads prior to assembly.

        What was your merge rate? The best procedure depends on that... if the insert size was too long to merge a substantial fraction of the reads, it's better to skip merging.


        • #5
          The max read length is 250 bp, which I used as the maxOverlap parameter in flash. The results of this merge:

          Code:
          [FLASH] Read combination statistics:
          [FLASH]  Total pairs: 8576138
          [FLASH]  Combined pairs:  6056207
          [FLASH]  Uncombined pairs: 2519931
          [FLASH]  Percent combined: 70.62%
          Note that when I set maxOverlap to 225, I got a warning that a high proportion of pairs overlapped by more than 225 bp, which is why I stuck with 250. This may not be the best option, though, since my max read length is 250 bp?
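          As a quick sanity check on the numbers above, here is a minimal Python sketch that parses the FLASH statistics block and recomputes the merge rate. The function names `parse_flash_stats` and `merge_rate` are made up for illustration and are not part of FLASH; the log text is copied from this post.

```python
import re

# Statistics block copied from the FLASH output quoted in this post.
FLASH_LOG = """\
[FLASH] Read combination statistics:
[FLASH]  Total pairs: 8576138
[FLASH]  Combined pairs:  6056207
[FLASH]  Uncombined pairs: 2519931
[FLASH]  Percent combined: 70.62%
"""


def parse_flash_stats(log_text):
    """Collect the integer counts ("Total pairs", etc.) from the log lines."""
    stats = {}
    for line in log_text.splitlines():
        # Only lines of the form "[FLASH]  <label>: <integer>" are kept;
        # the header line and the percentage line do not match.
        match = re.match(r"\[FLASH\]\s+(.+?):\s+(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def merge_rate(stats):
    """Percentage of read pairs that were combined."""
    return 100.0 * stats["Combined pairs"] / stats["Total pairs"]


stats = parse_flash_stats(FLASH_LOG)
print(f"Merge rate: {merge_rate(stats):.2f}%")
```

          The combined and uncombined counts add up to the total, so the reported percentage is consistent with the raw counts.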

          I found the max read length by running this command, which prints the length of every read, and looking through the output:

          Code:
          awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' <read> > <output.txt>
          Last edited by ronaldrcutler; 07-14-2016, 03:32 AM.
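          For anyone who would rather get the maximum directly instead of scanning a per-read listing, a rough Python equivalent is sketched below. Like the awk command, it assumes a plain four-line-per-record FASTQ (no wrapped sequence lines); `max_read_length` is a hypothetical helper name, and the tiny in-memory file is made-up example data.

```python
import io
from itertools import islice


def max_read_length(handle):
    """Return the longest sequence length in a 4-line-per-record FASTQ."""
    longest = 0
    # islice(handle, 1, None, 4) yields lines 2, 6, 10, ... of the file,
    # i.e. the sequence line of each record (same idea as NR%4==2 in awk).
    for seq in islice(handle, 1, None, 4):
        longest = max(longest, len(seq.rstrip("\r\n")))
    return longest


# Made-up two-record FASTQ used only to demonstrate the function.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTACG\n+\nIIIIIII\n"
)
print(max_read_length(example))
```

          For a real file, the same function can be called on an opened handle, for example `with open("reads.fastq") as fh: print(max_read_length(fh))`, where `reads.fastq` is a placeholder path.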


          • #6
            Is this 454 data? Is that the reason for using newbler?


            • #7
              No, this is FASTQ data.


              • #8
                From which platform? How big is the genome expected to be? What is the read length?


                • #9
                  Illumina, I believe.

                  The merged paired-end file (using flash) has 6056207 sequences, 1451352720 bp
                  The mate1 paired-end file has 8576138 sequences, 1798377920 bp

                  The read lengths are 250 bp
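                  The sequence and base counts above also give the average read length per file. A small Python sketch of that arithmetic follows; the counts are the ones quoted in this post, and `mean_length` is just an illustrative helper name.

```python
def mean_length(total_bases, num_sequences):
    """Average read length in bp, from total bases and sequence count."""
    return total_bases / num_sequences


# Counts as reported in this post.
merged_mean = mean_length(1451352720, 6056207)  # FLASH-merged reads
mate1_mean = mean_length(1798377920, 8576138)   # mate1 reads

print(f"merged: {merged_mean:.1f} bp")  # about 239.6 bp
print(f"mate1:  {mate1_mean:.1f} bp")   # about 209.7 bp
```

                  Both averages come out below 250 bp, so 250 bp looks like the maximum read length rather than the length of every read. That is an inference from the counts, not something stated in the thread.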
