SEQanswers

Old 08-18-2014, 10:43 PM   #1
papori
Senior Member
 
Location: berd

Join Date: Dec 2010
Posts: 179
Default Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

Hi all,
I am trying to submit a transcriptome assembly to the TSA.
The format is like this:
>seq1234
TTTTTTTNNNTTTTTTTTTTTTGGTTTTCTTGAGTAAAGTAAAAAAACCTGAATGATG
GATGAGGCGAATGATGTGAGGATAAATNNNNAAACGANTNTTATAAGATGTAAAAGTT
GTCATTAACTTAGTAAAGGCCCTAATTATTGAAGTTAATTATTCCAATGGATAAAAAT
>seq1235
AGACACATCGTGTGTTTCTGGATCTTTTTCAGCTTCTTCCTTCAAATCTACTCTGGTT
GGTGCTGCTGTCAACTGCATCATTTTCGTTTGCTNNNNNCTTTTTGGCCGGAGCATCA
and so on...

The TSA is asking for these criteria:
Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

Does anyone know a quick Linux-based solution for this?
I googled it, but I found only solutions that replace the ambiguous bases, such as this:
https://github.com/jimhester/fasta_utilities
or this,
https://github.com/jimhester/fasta_utilities
but I have Perl issues with this.

Any Linux-based solution will be appreciated!
Thanks
Old 08-19-2014, 07:04 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

The quoted script will convert Ns into As. I doubt this is what you really want to submit to the TSA, since at that point you would be submitting incorrect information.

I do not have a program to recommend but just throwing away scaffolds/contigs that do not meet TSA's criteria would be what I would do.
Old 08-19-2014, 09:33 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I would also recommend throwing away scaffolds that are more than 10% ambiguous. But for scaffolds with more than 14 consecutive Ns, you can either split them into two scaffolds at that point, or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). Otherwise you could lose a lot of useful information.

Unfortunately I don't have a tool that does this.
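
As an illustration of the splitting option only, here is a minimal Python sketch. It is not an existing tool: the function name `split_at_long_n_runs` and the `_partN` naming of the pieces are made up for this example, and it assumes the sequence is already in memory as a plain string. It cuts a scaffold wherever more than 14 Ns occur in a row and leaves shorter runs of Ns untouched.

```python
import re

MAX_RUN = 14  # limit quoted in this thread: no more than 14 Ns in a row


def split_at_long_n_runs(name, seq, max_run=MAX_RUN):
    """Split seq at every run of more than max_run consecutive Ns.

    Returns a list of (name, sequence) tuples. The long N runs are
    dropped; shorter runs stay inside the pieces. The "_partN" suffix
    is a hypothetical naming scheme, not a TSA requirement.
    """
    # Matches max_run + 1 or more Ns (upper or lower case), greedily.
    long_run = re.compile(r"[Nn]{%d,}" % (max_run + 1))
    if not long_run.search(seq):
        return [(name, seq)]  # nothing to split, keep the original name
    # Discard empty pieces produced by runs at the start or end.
    pieces = [p for p in long_run.split(seq) if p]
    return [(f"{name}_part{i}", p) for i, p in enumerate(pieces, 1)]
```

The resulting pieces would still need to be checked against the 10% limit, and possibly against a minimum length, before submission.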
Old 08-20-2014, 06:30 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Brian Bushnell
... or change the Ns into a single N (which is still technically valid as N signifies an unknown sequence of unknown length). ...
I do not agree with Brian on this. A single N should mean a single base that cannot be resolved, often due to quality or other technical factors. It should not represent an unknown length. Multiple Ns, just like poly-A or other poly tracts, do often represent unknown lengths because it is hard to accurately sequence and assemble long stretches of a single nucleotide.
Old 08-20-2014, 06:34 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

And as a reference to an authority (instead of my own personal opinion), NCBI says (I made the relevant text bold):
Quote:
TSA does not accept assemblies which have Ns inserted to represent gaps of unknown length. Sequences containing Ns representing gaps of unknown length need to be split into individual assemblies. Internal Ns representing ambiguous bases or known length gaps can be submitted. If the Ns represent ambiguous bases they should not be more than 10% of the sequence length or more than 14 n's in a row. If the N's represent a known length gap then an assembly_gap feature must be used.
Old 08-20-2014, 08:56 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, I will defer to that guidance, then. I interpret single N's as single unknown bases, but I know I have read alternate definitions of N as meaning unknown sequence of unknown length, though I couldn't find a reference to that when searching.

Note, though, that those guidelines are not necessarily ideal, and preclude the submission of scaffolded assemblies such as HG19.
Old 08-20-2014, 09:13 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,881
Default

@papori - What software were you using for the transcriptome assembly? In the example you posted, were there multiple reads with Ns in those positions, or was there no consensus in the reads that spanned that region?
Old 08-20-2014, 11:24 PM   #8
papori
Senior Member
 
Location: berd

Join Date: Dec 2010
Posts: 179
Default

I am using Trinity, but I just figured out that I didn't use it properly, and that is the reason for the Ns.
Now Trinity has finished running again, and I found that I don't have any Ns in the whole assembly.

So it is still an interesting question:
How to filter out contigs with more than 10% Ns or more than 14 in a row?

But for me the problem was solved by using different parameters in Trinity.
Thanks!
Old 08-20-2014, 11:59 PM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

For filtering I would think bioperl or biopython would come in useful. Just read in the resulting fasta files with those and then iterate over the contigs, calculating N content and such. That should be a pretty straightforward program to write (assuming you can code, otherwise I imagine it'd prove anything but straightforward).
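
A minimal sketch of such a filter, written here in plain Python with a small hand-rolled FASTA reader rather than bioperl or biopython so that it has no dependencies. The function names and the two constants are assumptions made for this example. It reads the thresholds as "keep a contig only if Ns are at most 10% of its length and no run of Ns is longer than 14", and it counts only N, not other IUPAC ambiguity codes.

```python
import re

MAX_N_FRACTION = 0.10  # Ns may be at most 10% of the sequence length
MAX_N_RUN = 14         # no more than 14 Ns in a row


def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def passes_n_criteria(seq, max_fraction=MAX_N_FRACTION, max_run=MAX_N_RUN):
    """Return True if seq stays within both N limits."""
    if not seq:
        return False  # treat empty records as failing
    n_count = seq.upper().count("N")
    # Length of the longest uninterrupted stretch of Ns (0 if none).
    longest_run = max((len(run) for run in re.findall(r"[Nn]+", seq)), default=0)
    return n_count <= max_fraction * len(seq) and longest_run <= max_run


def filter_fasta(lines):
    """Yield only the records that pass the N criteria."""
    for name, seq in read_fasta(lines):
        if passes_n_criteria(seq):
            yield name, seq
```

To use it, pass an open file handle (for example a hypothetical `assembly.fasta`) to `filter_fasta` and write each returned record back out as a `>name` line followed by the sequence.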