  • Newbler (GS De Novo Assembler) paired-end input

    Hello all.

    I am new to genome assembly and am using Newbler to assemble two paired-end FASTQ files that are 4.09 GB each. When you input these files into the program, you can mark them as paired-end. If I select that option, will the program recognize both input files as being paired-end?

    I have also gone ahead and merged the two paired-end files into a single file using FLASH. When I input this merged file, should I still choose the paired-end option in Newbler?

    Also, I am getting an error that I have run out of memory, so I am splitting up my files using fastq-splitter.pl. However, how would I split up separate paired-end files for input into Newbler?

    Thanks in advance

  • #2
    You will not get an optimal assembly if you split reads into multiple subsets and assemble them independently. In fact, you'll get a mess. If you run out of memory, you need to use a computer that has more memory, or a different algorithm.


    • #3
      Okay thanks for the important info.

      Do you know whether, if I put in two paired-end files and use the paired-end option on both of them, Newbler will recognize them as paired-end files? Or is using a merged file of the two paired-end reads a better approach?


      • #4
        Sorry, I have never used Newbler, so I don't know its idiosyncrasies... but hopefully someone else does!

        Typically, if you have overlapping reads, an OLC assembler will perform best with merged reads. Flash does not perform well in my tests, though. Bearing in mind that I am biased, being the developer, I recommend BBMerge for joining paired reads prior to assembly.

        What was your merge rate? The best procedure depends on that... if the insert size was too long to merge a substantial fraction of the reads, it's better to skip merging.


        • #5
          The max read length is 250 bp, which I used as the maxOverlap parameter in flash. The results of this merge:

          Code:
          [FLASH] Read combination statistics:
          [FLASH]  Total pairs: 8576138
          [FLASH]  Combined pairs:  6056207
          [FLASH]  Uncombined pairs: 2519931
          [FLASH]  Percent combined: 70.62%
          Note that when I set maxOverlap to 225, I got a warning that a high proportion of pairs overlapped by more than 225 bp, which is why I stuck with 250. Is this perhaps not the best option, since my max read length is 250 bp?
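As a quick sanity check on the log above, the "Percent combined" figure is just combined pairs divided by total pairs. The sketch below recomputes it in Python from the three counts FLASH printed. The function name `percent_combined` is made up for illustration and is not part of FLASH.

```python
def percent_combined(combined_pairs: int, total_pairs: int) -> float:
    """Return the share of read pairs that were merged, as a percentage."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return 100.0 * combined_pairs / total_pairs


# Counts copied from the FLASH log quoted in this post.
total = 8576138
combined = 6056207
uncombined = 2519931

# Combined and uncombined pairs should account for every input pair.
assert combined + uncombined == total

print(f"Percent combined: {percent_combined(combined, total):.2f}%")
```

This is the merge rate asked about in #4.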

          I determined the max read length by running this command and looking through the lengths of all the reads:

          Code:
          awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' <read> > <output.txt>
          Last edited by ronaldrcutler; 07-14-2016, 03:32 AM.
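For anyone who prefers Python to awk, here is a rough equivalent that reports the read count, total bases, and longest read instead of printing every read. It assumes a plain 4-line-per-record FASTQ with no wrapped sequence lines. The function names and the tiny in-memory example are hypothetical and not from the thread.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each sequence line in a 4-line-per-record FASTQ."""
    for index, line in enumerate(lines):
        # Line 2 of every record (0-based index 1, 5, 9, ...) holds the bases,
        # the same lines the awk test NR%4==2 selects.
        if index % 4 == 1:
            yield len(line.rstrip("\r\n"))


def length_summary(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (number of reads, total bases, longest read length)."""
    count = total = longest = 0
    for length in read_lengths(lines):
        count += 1
        total += length
        longest = max(longest, length)
    return count, total, longest


# Made-up two-read example; for a real file, pass an open file handle instead.
toy_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(length_summary(toy_fastq))
```

For a real run, something like `with open(path) as handle: length_summary(handle)` streams the file line by line, so memory use stays small even for multi-gigabyte inputs.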


          • #6
            Is this 454 data? Is that the reason for using Newbler?


            • #7
              No, this is FASTQ data.


              • #8
                From which platform? How big is the genome expected to be? What is the read length?


                • #9
                  Illumina, I believe.

                  The merged paired-end file (using flash) has 6056207 sequences, 1451352720 bp
                  The mate1 paired-end file has 8576138 sequences, 1798377920 bp

                  The read lengths are 250 bp
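Dividing the base totals above by the sequence counts gives a mean length per read, which is a cheap way to check whether every read is really the full 250 bp. A small sketch using only the numbers quoted in this post; `mean_read_length` is an illustrative name, not a function from any tool mentioned here.

```python
def mean_read_length(total_bp: int, n_reads: int) -> float:
    """Return the average number of bases per read."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_bp / n_reads


# (sequences, bases) as reported in this post.
files = {
    "merged (FLASH)": (6056207, 1451352720),
    "mate1": (8576138, 1798377920),
}

for name, (n_reads, total_bp) in files.items():
    print(f"{name}: mean length {mean_read_length(total_bp, n_reads):.1f} bp")
```

If the mate1 mean comes out below 250 bp, that would suggest the reads are trimmed or of variable length, so 250 bp is the maximum and not the length of every read.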
