**westerman** · 08-19-2014, 07:04 AM

The quoted script will convert Ns into As. I doubt that this is what you really want to submit to the TSA, since at that point you would be submitting incorrect information.

I do not have a program to recommend, but I would simply throw away scaffolds/contigs that do not meet TSA's criteria.

**Brian Bushnell** · 08-19-2014, 09:33 AM

I would also recommend throwing away scaffolds that are more than 10% ambiguous. But for scaffolds with more than 14 consecutive Ns, you can either split them into two scaffolds at that point, or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). Otherwise you could lose a lot of useful information.

Unfortunately I don't have a tool that does this.
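
For readers who want to try the splitting option, here is a minimal Python sketch, not an existing tool. It assumes that only the letter N marks ambiguity and that a run of more than 14 Ns is a gap at which the scaffold may be cut. The function name `split_at_long_n_runs` and the `_partK` naming scheme are hypothetical.

```python
import re

def split_at_long_n_runs(seq, max_run=14):
    """Cut a sequence at every run of more than max_run consecutive Ns.

    The Ns in each long run are discarded; shorter runs stay in place.
    Empty pieces (from runs at either end) are dropped.
    """
    long_run = re.compile(r"[Nn]{%d,}" % (max_run + 1))
    return [piece for piece in long_run.split(seq) if piece]

def split_record(name, seq, max_run=14):
    """Return (name, sequence) pairs, naming the pieces name_part1, name_part2, ..."""
    pieces = split_at_long_n_runs(seq, max_run)
    if len(pieces) <= 1:
        return [(name, piece) for piece in pieces]
    return [(f"{name}_part{i}", piece) for i, piece in enumerate(pieces, start=1)]
```

Each resulting piece should still be checked against the 10% limit afterwards, since splitting only addresses the consecutive-N rule.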

**westerman** · 08-20-2014, 06:30 AM

Originally posted by Brian Bushnell

... or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). ...

I do not agree with Brian on this. A single N should mean a single base that cannot be resolved, often due to quality or other technical factors. It should not represent an unknown length. Multiple Ns, just like poly-A or other homopolymer tracts, often do represent unknown lengths because it is hard to accurately sequence and assemble long stretches of a single nucleotide.

**westerman** · 08-20-2014, 06:34 AM

And as reference to an authority (instead of my own personal opinion), NCBI says (I made the relevant text bold)

TSA does not accept assemblies which have Ns inserted to represent gaps of unknown length. Sequences containing Ns representing gaps of unknown length need to be split into individual assemblies. Internal Ns representing ambiguous bases or known length gaps can be submitted. If the Ns represent ambiguous bases they should not be more than 10% of the sequence length or more than 14 n's in a row. If the N's represent a known length gap then an assembly_gap feature must be used.

**Brian Bushnell** · 08-20-2014, 08:56 AM

OK, I will defer to that guidance, then. I interpret single N's as single unknown bases, but I know I have read alternate definitions of N as meaning unknown sequence of unknown length, though I couldn't find a reference to that when searching.

Note, though, that those guidelines are not necessarily ideal, and preclude the submission of scaffolded assemblies such as HG19.

**GenoMax** · 08-20-2014, 09:13 AM

@papori - What software were you using for the transcriptome assembly? In the example you posted, were there multiple reads with N's in those positions, or was there no consensus in the reads that spanned that region?

**papori** · 08-20-2014, 11:24 PM

I am using Trinity, but I just figured out that I didn't use it properly, and that is the reason for the Ns.
Now Trinity has finished running again, and I found that I don't have any Ns in the whole assembly.

So, it is still an interesting question:
How to filter out contigs with more than 10% Ns or more than 14 in a row?

But for me, the problem was solved by using different parameters in Trinity.
Thanks!

**dpryan** · 08-20-2014, 11:59 PM

For filtering, I would think BioPerl or Biopython would come in useful. Just read in the resulting FASTA files with one of those and then iterate over the contigs, calculating N content and the longest run of Ns. That should be a pretty straightforward program to write (assuming you can code; otherwise I imagine it would prove anything but straightforward).
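
As a rough illustration of that approach, here is a minimal sketch in plain Python. It uses a small hand-written FASTA reader instead of Biopython so that it has no dependencies. It assumes that only N counts as ambiguous (other IUPAC codes are ignored) and that the limits are "more than 10%" and "more than 14 in a row", so exactly 10% or exactly 14 still passes. All function names are hypothetical.

```python
import re

MAX_N_FRACTION = 0.10  # assumed reading of the "10% of sequence length" limit
MAX_N_RUN = 14         # assumed reading of the "14 Ns in a row" limit

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def passes_n_limits(seq, max_fraction=MAX_N_FRACTION, max_run=MAX_N_RUN):
    """True if seq is non-empty and within both the N-fraction and N-run limits."""
    if not seq:
        return False
    upper = seq.upper()
    if upper.count("N") / len(upper) > max_fraction:
        return False
    longest = max((len(m.group()) for m in re.finditer(r"N+", upper)), default=0)
    return longest <= max_run

def filter_contigs(in_handle, out_handle):
    """Write passing contigs to out_handle; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(in_handle):
        if passes_n_limits(seq):
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
    return kept
```

Usage would be along the lines of opening the assembly with `open("Trinity.fasta")` and an output file with `open("filtered.fasta", "w")`, then passing both handles to `filter_contigs`. The file names here are placeholders.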


Ambiguous bases should not be more than total 10% length or more than 14n's in a row.
