  • removing N inserts

    I am trying to build a comprehensive database of prokaryotic (bacterial and archaeal) and fungal genomes to be used for screening ancient DNA reads for contamination. Unfortunately, I found that many of the genomes in the NCBI and EMBL databases contain a lot of poly-N inserts, which obviously need to be eliminated. This can be done either by removing the inserts from each FASTA record, which may be difficult, or by splitting records at the poly-N inserts and trimming Ns from the ends. Is there a tool or script to do this? Alternatively, I may have to abandon whole genomes and just concatenate the relevant GenBank records, but I will first have to extract FASTA from them. Any advice?

  • #2
    It's not obvious to me that they have to be taken out. Reads just won't align there, that's all.

    You could use sed to get rid of all the N's, or write a script in something to trim out sequences of Ns that are more than a certain length.
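One caveat with a blanket substitution is that it would also delete N's from header lines (organism names and so on), so the filter should skip lines that start with ">". A minimal Python sketch of the delete-everything variant follows, assuming plain FASTA input; the function name `strip_all_n` is made up for illustration. It works line by line, so it cannot apply a length threshold, because a run of N's may continue across a line break.

```python
def strip_all_n(lines):
    """Yield FASTA lines with every N/n deleted from sequence lines only."""
    drop_n = str.maketrans("", "", "Nn")
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: pass through untouched.
            yield line
        elif line:
            cleaned = line.translate(drop_n)
            if cleaned:
                # Lines that consisted only of N's are dropped entirely.
                yield cleaned
```

Usage could be as simple as looping over `strip_all_n(open("in.fna"))` and printing each line. Output lines keep whatever length is left after deletion, so they may be ragged.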

    • #3
      Well, the downloaded data files need to be preprocessed with BWA-SW to build databases for a local install of DeconSeq. The author removed the Ns by splitting, because BWA-SW replaces each N with a random A, G, C, or T (citing the paper). But the paper does not say how this was done...
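Since the paper does not describe the splitting step, the following is only one possible way to do it, not the DeconSeq author's method. It is a minimal Python sketch that splits each record at runs of N's and trims any remaining N's from the fragment ends. The names `read_fasta` (a plain Python helper here, unrelated to the Biopieces command), `split_at_n_runs`, `min_run`, and `min_len`, and the `name_start-end` header convention are all assumptions made for illustration.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_at_n_runs(header, seq, min_run=1, min_len=1):
    """Split seq at runs of >= min_run N's and yield (new_header, fragment).

    N runs shorter than min_run stay inside a fragment, but N's at the
    fragment ends are trimmed. Fragments shorter than min_len (>= 1) are
    dropped. Coordinates in the new header are 1-based and inclusive.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_run)
    name, _, desc = header.partition(" ")
    bounds, start = [], 0
    for match in gap.finditer(seq):
        bounds.append((start, match.start()))
        start = match.end()
    bounds.append((start, len(seq)))
    for begin, end in bounds:
        # Trim leftover N's (runs shorter than min_run) from both ends.
        while begin < end and seq[begin] in "Nn":
            begin += 1
        while end > begin and seq[end - 1] in "Nn":
            end -= 1
        if end - begin >= min_len:
            new_header = f"{name}_{begin + 1}-{end}"
            if desc:
                new_header += " " + desc
            yield new_header, seq[begin:end]
```

Keeping the original coordinates in the header makes it possible to map a hit on a fragment back to the source record. With the default `min_run=1`, no fragment contains any N at all.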

      • #4
        Why do you need to remove the poly-N regions? These are gap regions that could be useful

        • #5
          I am also not convinced that you should remove N's, but if you must, you can do it with Biopieces (www.biopieces.org):

          Code:
          read_fasta -i in.fna | transliterate_seq -d 'nN' | write_fasta -o out.fna -x

          • #6
            Wow, I did not think Biopieces could do this. Gotta try!

            • #7
              @yaximik

              For more fine-grained control you can use substitute_vals to remove blocks of 25 or more N's:

              Code:
              read_fasta -i in.fna | substitute_vals -k SEQ -s 'N{25,}' -r '' -ig | write_fasta -o out.fna -x
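For readers without Biopieces, the same idea can be sketched in plain Python. This is not the Biopieces implementation, only an approximation under stated assumptions: the regex `N{25,}` is applied case-insensitively to the whole sequence of each record (wrapped lines are joined first, since a run of N's may span line breaks), and the output is re-wrapped at a hypothetical default `width` of 60. The function name `remove_long_n_runs` is invented for this example.

```python
import re


def remove_long_n_runs(lines, min_run=25, width=60):
    """Yield FASTA lines with runs of >= min_run N's (any case) removed."""
    gap = re.compile(r"N{%d,}" % min_run, re.IGNORECASE)

    def flush(header, chunks):
        # Join wrapped lines so runs crossing line breaks are seen whole.
        seq = gap.sub("", "".join(chunks))
        yield header
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield from flush(header, chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield from flush(header, chunks)
```

Note that `{25,}` matches runs of exactly 25 as well as longer ones, so a run of 24 N's is left in place. Each record's sequence is held in memory, which is usually fine for prokaryotic and fungal genomes.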
