**Brian Bushnell** · 07-13-2016, 02:54 PM

You will not get an optimal assembly if you split reads into multiple subsets and assemble them independently. In fact, you'll get a mess. If you run out of memory, you need to use a computer that has more memory, or a different algorithm.

**ronaldrcutler** · 07-13-2016, 06:29 PM

Okay thanks for the important info.

Do you know whether Newbler will recognize two paired-end files as a pair if I pass both of them with the paired-end option? Or is it better to use a single merged file of the two paired-end reads?

**Brian Bushnell** · 07-13-2016, 06:44 PM

Sorry, I have never used Newbler, so I don't know its idiosyncrasies... but hopefully someone else does!

Typically, if you have overlapping reads, an OLC assembler will perform best with merged reads. Flash does not perform well in my tests, though. Bearing in mind that I am biased, being the developer, I recommend BBMerge for joining paired reads prior to assembly.

What was your merge rate? The best procedure depends on that... if the insert size was too long to merge a substantial fraction of the reads, it's better to skip merging.

**ronaldrcutler** · 07-14-2016, 03:27 AM

The max read length is 250 bp, which I used as the maxOverlap parameter in flash. The results of this merge:

Code:

[FLASH] Read combination statistics:
[FLASH]  Total pairs: 8576138
[FLASH]  Combined pairs:  6056207
[FLASH]  Uncombined pairs: 2519931
[FLASH]  Percent combined: 70.62%

Note that when I set maxOverlap to 225, I got a warning that a high proportion of pairs overlapped by more than 225 bp, which is why I stuck with 250. This may not be the best option, though, since my max read length is 250 bp?
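As a quick sanity check, the merge rate that FLASH reports can be recomputed from the two pair counts above. This is only a sketch; the function name `pct_combined` is made up for illustration and is not part of FLASH.

```shell
# Percent of pairs combined = 100 * combined pairs / total pairs.
# pct_combined is a hypothetical helper name, not a FLASH option.
pct_combined() {
    awk -v combined="$1" -v total="$2" \
        'BEGIN { printf "%.2f\n", 100 * combined / total }'
}

# Counts taken from the FLASH log above.
pct_combined 6056207 8576138
```

With the counts from the log this agrees with the 70.62% that FLASH printed.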

I found the max read length by running this command, which prints every read with its length, and then looking through the output for the longest one:

Code:

awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' <read> > <output.txt>
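The command above lists the length of every read, so the maximum still has to be picked out by eye. A possible shortcut, assuming a standard 4-line-per-record FASTQ file with unwrapped sequences, is to let awk track the maximum itself. The function name `max_read_len` and the file name in the usage comment are placeholders.

```shell
# Print the longest read length in a FASTQ file.
# Assumes 4 lines per record, so the sequence is always line 2 of each record.
# max_read_len is a hypothetical helper name used for illustration.
max_read_len() {
    awk 'NR % 4 == 2 { n = length($0); if (n > max) max = n }
         END { print max + 0 }' "$1"
}

# Usage (reads.fastq is a placeholder file name):
# max_read_len reads.fastq
```

The `max + 0` in the `END` block makes an empty input print 0 rather than an empty line.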

**GenoMax** · 07-14-2016, 03:47 AM

Is this 454 data? Is that the reason for using newbler?

**ronaldrcutler** · 07-14-2016, 03:51 AM

No, this is FASTQ data.

**GenoMax** · 07-14-2016, 03:58 AM

From which platform? How big is the genome expected to be? What is the read length?

**ronaldrcutler** · 07-14-2016, 07:47 AM

Illumina, I believe.

The merged paired-end file (using flash) has 6056207 sequences, 1451352720 bp
The mate1 paired-end file has 8576138 sequences, 1798377920 bp

The read lengths are 250 bp
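For context, the mean length per sequence can be worked out from the totals above by dividing base pairs by sequence count. The sketch below uses a made-up helper name, `mean_len`, and only the numbers already quoted in this post.

```shell
# Mean read length = total bp / number of sequences.
# mean_len is a hypothetical helper name used for illustration.
mean_len() {
    awk -v bp="$1" -v n="$2" 'BEGIN { printf "%.1f\n", bp / n }'
}

mean_len 1451352720 6056207   # merged file
mean_len 1798377920 8576138   # mate1 file
```

This gives roughly 239.6 bp for the merged file and 209.7 bp for mate1, so 250 bp looks like the maximum read length rather than the typical one.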

Newbler (GS de novo assembler) paired end input
