**Brian Bushnell** · 03-07-2014, 04:50 PM

Originally posted by megster

I had trimmed the data already, but what I really need is for someone to tell me how to randomly select ~2 million reads to throw out! Can I just delete the first 2 million in the fastq file? For some reason I haven't been able to find any info on how to do this, kinda feel like it's a dumb question haha.

I don't know about Ion Torrent, but some platforms have low-quality reads concentrated in one part of the file (for example, there might be a bubble on the Illumina platform), so I would recommend subsampling randomly, or normalizing rather than subsampling.

To subsample randomly or normalize, you can use BBTools:

reformat.sh in=reads.fq out=sampled.fq samplebases=100000000

That will sample exactly 100 megabases (plus at most 1 read length) randomly from the entire file. This requires reading the file twice. Alternatively, you can get an approximate sample like this:

reformat.sh in=reads.fq out=sampled.fq samplerate=0.25

...which will sample 25% of the reads, and only requires reading the file once. Well, either way it's very fast.
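To illustrate why rate-based sampling needs only one pass, here is a minimal Python sketch of the idea: each read is kept or dropped independently with a fixed probability, so nothing about the file has to be known up front. This is not how reformat.sh is implemented; the function name `sample_fastq` and the `seed` parameter are hypothetical, and it assumes plain 4-line FASTQ records with no line wrapping.

```python
import random
from itertools import islice


def sample_fastq(lines, rate, seed=None):
    """Yield 4-line FASTQ records, keeping each with probability `rate`.

    Illustrative sketch only: assumes every record is exactly 4 lines
    (header, sequence, '+', qualities). Not BBTools' implementation.
    """
    rng = random.Random(seed)
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated trailing record)
        # One independent coin flip per read: a single pass is enough,
        # but the number of reads kept is only approximately rate * total.
        if rng.random() < rate:
            yield record


if __name__ == "__main__":
    # Hypothetical file names, matching the commands above.
    with open("reads.fq") as fin, open("sampled.fq", "w") as fout:
        for rec in sample_fastq(fin, rate=0.25):
            fout.writelines(rec)
```

Hitting an exact target (like samplebases) is different: you need to know the total first, which is why that mode reads the file twice.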

To normalize the reads to some target coverage depth:

bbnorm.sh in=reads.fq out=normalized.fq target=20 min=2

...which will normalize to 20x, and throw away reads with under 2x depth (assuming them to be full of errors). This way, high peaks will go down, but areas with low coverage will not be reduced, which is better for assembly. This is a lot slower and requires more memory than sampling, but in my tests, it greatly improves Soap and Velvet assemblies over sampling or just using raw data.
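For anyone curious what depth normalization means in practice, here is a toy Python sketch of the general idea. It is not BBNorm's actual algorithm: I'm assuming exact k-mer counts held in memory, using the median k-mer count of a read as a stand-in for its depth, and the names `kmers`, `normalize` and `min_depth` are made up for illustration. Reads below the minimum depth are dropped, reads at or below the target are all kept, and reads above the target are kept with probability target/depth.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=5, target=20, min_depth=2, seed=None):
    """Toy depth normalization (illustrative assumption, not BBNorm).

    Depth of a read is approximated by the median count of its k-mers
    across the whole read set.
    """
    rng = random.Random(seed)
    # First pass: count every k-mer in the data set.
    counts = Counter(km for r in reads for km in kmers(r, k))
    kept = []
    # Second pass: decide read by read.
    for r in reads:
        kms = kmers(r, k)
        if not kms:
            continue  # read shorter than k, depth undefined
        depth = median(counts[km] for km in kms)
        if depth < min_depth:
            continue  # very low depth: treated as probable error
        # Low-coverage reads are always kept; high-coverage reads are
        # thinned so that roughly `target` depth survives.
        if depth <= target or rng.random() < target / depth:
            kept.append(r)
    return kept
```

Even in this toy form you can see where the extra cost comes from: the k-mer counts have to be built and held before any read can be judged, while plain sampling needs no such table.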

**andylemire** · 03-07-2014, 06:24 PM

Update your Geneious to r7.1 if you haven't already. The new Geneious de novo assembler handles Ion Torrent reads much, much better than previous versions, and really helps with the homopolymer errors in reducing the number of contigs. I re-ran a dozen plasmid assemblies just today and the results were incredible.

And, to answer your question, it has a check box at the top to downsample your reads. The quality trimming is nice too because it's an annotation instead of a clipping, so it's easy to re-run with different stringencies.

**megster** · 03-11-2014, 03:43 PM

Thanks! I missed that option at the top of the box, I ran it with the MIRA plugin and it worked beautifully. I'll also retry it with the new Geneious assembler program and see if it works any better for me.


Very high coverage w/Ion Torrent
