  • Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

    Hi all,
    I am trying to submit a transcriptome assembly to the TSA.
    The format is like this:
    >seq1234
    TTTTTTTNNNTTTTTTTTTTTTGGTTTTCTTGAGTAAAGTAAAAAAACCTGAATGATG
    GATGAGGCGAATGATGTGAGGATAAATNNNNAAACGANTNTTATAAGATGTAAAAGTT
    GTCATTAACTTAGTAAAGGCCCTAATTATTGAAGTTAATTATTCCAATGGATAAAAAT
    >seq1235
    AGACACATCGTGTGTTTCTGGATCTTTTTCAGCTTCTTCCTTCAAATCTACTCTGGTT
    GGTGCTGCTGTCAACTGCATCATTTTCGTTTGCTNNNNNCTTTTTGGCCGGAGCATCA
    and so on...

    The TSA asks that sequences meet this criterion:
    Ambiguous bases should not be more than total 10% length or more than 14n's in a row.

    Does anyone know a quick Linux-based solution for this?
    I googled it, but I only found solutions that replace the ambiguous bases, like this:

    or this,

    but I have Perl issues with this.

    Any Linux-based solution will be appreciated!
    Thanks

  • #2
    The quoted script will convert Ns into As. I doubt this is what you really want to submit to the TSA, since at that point you would be submitting incorrect information.

    I do not have a program to recommend, but I would simply throw away the scaffolds/contigs that do not meet the TSA's criteria.



    • #3
      I would also recommend throwing away scaffolds that are more than 10% ambiguous. But for scaffolds with more than 14 consecutive Ns, you can either split them into two scaffolds at that point, or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). Otherwise you could lose a lot of useful information.

      Unfortunately I don't have a tool that does this.
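As a rough illustration of the splitting option described above, here is a minimal Python sketch. It assumes only the letter N (either case) marks an ambiguous base, and the `_partK` name suffix is a hypothetical convention, not something the TSA prescribes.

```python
import re

MAX_N_RUN = 14  # longest run of Ns allowed, per the TSA criterion quoted in this thread


def split_at_long_n_runs(seq, max_run=MAX_N_RUN):
    """Split seq wherever more than max_run consecutive N/n occur.

    The long N run itself is discarded; shorter runs stay in place.
    Empty pieces (e.g. from a leading or trailing long run) are dropped.
    """
    pattern = re.compile(r"[Nn]{%d,}" % (max_run + 1))
    return [piece for piece in pattern.split(seq) if piece]


def split_record(name, seq, max_run=MAX_N_RUN):
    """Return (name, sequence) pairs; pieces get a hypothetical _partK suffix."""
    pieces = split_at_long_n_runs(seq, max_run)
    if len(pieces) == 1:
        return [(name, pieces[0])]
    return [("%s_part%d" % (name, i), piece) for i, piece in enumerate(pieces, 1)]
```

Each resulting piece would still need to be checked against the 10% limit, and very short pieces may not be worth keeping.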



      • #4
        Originally posted by Brian Bushnell
        ... or change the Ns into a single N (which is still technically valid, as N signifies an unknown sequence of unknown length). ...
        I do not agree with Brian on this. A single N should mean a single base that cannot be resolved, often due to quality or other technical factors. It should not represent an unknown length. Multiple Ns, just like poly-A or other poly tracts, often do represent unknown lengths because it is hard to accurately sequence and assemble long stretches of a single nucleotide.



        • #5
          And as reference to an authority (instead of my own personal opinion), NCBI says (I made the relevant text bold)
          TSA does not accept assemblies which have Ns inserted to represent gaps of unknown length. Sequences containing Ns representing gaps of unknown length need to be split into individual assemblies. Internal Ns representing ambiguous bases or known length gaps can be submitted. If the Ns represent ambiguous bases they should not be more than 10% of the sequence length or more than 14 n's in a row. If the N's represent a known length gap then an assembly_gap feature must be used.
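To see which of the two limits in that guidance a given contig violates, a small diagnostic along these lines might help. It is only a sketch: the function names are made up for illustration, and it assumes that only N (either case) counts as an ambiguous base, ignoring other IUPAC ambiguity codes.

```python
from itertools import groupby


def n_stats(seq):
    """Return (fraction of Ns, longest run of Ns) for one sequence."""
    s = seq.upper()
    if not s:
        return 0.0, 0
    # groupby collapses the sequence into runs of identical characters
    longest = max((len(list(run)) for base, run in groupby(s) if base == "N"), default=0)
    return s.count("N") / len(s), longest


def tsa_n_problems(seq, max_fraction=0.10, max_run=14):
    """List which of the two N limits the sequence exceeds (empty list if none)."""
    fraction, longest = n_stats(seq)
    problems = []
    if fraction > max_fraction:
        problems.append("more than 10% Ns")
    if longest > max_run:
        problems.append("more than 14 Ns in a row")
    return problems
```

A contig flagged only for a long run might be a candidate for splitting, while one flagged for the 10% limit would more likely be discarded.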



          • #6
            OK, I will defer to that guidance, then. I interpret single N's as single unknown bases, but I know I have read alternate definitions of N as meaning unknown sequence of unknown length, though I couldn't find a reference to that when searching.

            Note, though, that those guidelines are not necessarily ideal, and preclude the submission of scaffolded assemblies such as HG19.



            • #7
              @papori - What software were you using for the transcriptome assembly? In the example you posted, were there multiple reads with Ns in those positions, or was there no consensus in the reads that spanned that region?



              • #8
                I am using Trinity, but I just figured out that I didn't use it properly, and that is the reason for the Ns.
                Now Trinity has finished running again, and I found that I don't have any Ns in the whole assembly.

                So it is still an interesting question:
                How to filter out contigs with more than 10% Ns or more than 14 in a row?

                But for me, the problem was solved by using different parameters in Trinity.
                Thanks!



                • #9
                  For filtering, I would think BioPerl or Biopython would come in useful. Just read in the resulting FASTA files with those and then iterate over the contigs, calculating N content and such. That should be a pretty straightforward program to write (assuming you can code; otherwise I imagine it would prove anything but straightforward).
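A minimal sketch of such a filter follows. To stay dependency-free it uses a small hand-written FASTA reader from the Python standard library rather than BioPerl or Biopython. All function names are illustrative, and it assumes only N (either case) counts as ambiguous.

```python
import re

MAX_N_FRACTION = 0.10                   # at most 10% of the sequence may be N
LONG_N_RUN = re.compile(r"[Nn]{15,}")   # 15 or more in a row means "more than 14"


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_n_limits(seq):
    """True if seq is non-empty and within both N limits."""
    if not seq:
        return False
    n_count = seq.count("N") + seq.count("n")
    return n_count / len(seq) <= MAX_N_FRACTION and not LONG_N_RUN.search(seq)


def filter_records(records):
    """Keep only the (header, sequence) pairs that pass both limits."""
    return ((header, seq) for header, seq in records if passes_n_limits(seq))


def write_fasta(records, handle, width=60):
    """Write records to handle, wrapping sequence lines at width characters."""
    for header, seq in records:
        handle.write(">%s\n" % header)
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def filter_file(in_path, out_path):
    """Filter one FASTA file into another; both paths are supplied by the caller."""
    with open(in_path) as src, open(out_path, "w") as dst:
        write_fasta(filter_records(read_fasta(src)), dst)
```

Calling `filter_file` with an input and an output path would write only the contigs that satisfy both limits. Contigs that fail are dropped silently in this sketch, so counting them first may be worthwhile.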
