  • NCBI WGS submission: Need to trim sequences of various length from scaffolds

    The submission that will never die...

    I have a number of contigs that did not pass NCBI's contamination/adapter screen. I need to trim these, but the problem is that the regions to trim are all of varying lengths. Some are internal, but most are at either the 5' or 3' end of the scaffold.

    The only info provided by NCBI is the scaffold/contig name, length and the start and stop base # of what needs to be trimmed. If I had the sequence I could just manually ctrl-f and delete/substitute Ns.

    Most of these are too large to load into VectorNTI or a similar program.

    Any help would be appreciated. Thanks.

  • #2
    I presume that you have the original contigs in fasta or some other text format, yes? If so, you'll find biopython very useful (it won't complain about contig length, unless your computer is from the 80s). You can parse fasta files and subset sequences based on coordinates relatively easily with it. The general idea would be to store the coordinates to be trimmed in a text file and then write a little script to (1) read that into a hash, (2) open the file containing the contigs, and (3) iterate through the records, checking for the presence of each in the hash and then subsetting accordingly.
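    A minimal sketch of those three steps, written in plain Python so it runs without extra installs; Biopython's `SeqIO.parse` could stand in for the hand-rolled FASTA reader. The trim-table layout (name, start, stop, whitespace-separated, 1-based inclusive) and the choice to cut terminal spans but mask internal ones with Ns are assumptions, not something NCBI's report format or this thread specifies.

```python
# Sketch only. Assumed trim-table layout: name<whitespace>start<whitespace>stop,
# with 1-based inclusive coordinates. Adjust to whatever the NCBI report contains.

def read_spans(lines):
    """Step 1: map each contig name to a list of (start, stop) spans."""
    spans = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        name, start, stop = line.split()[:3]
        spans.setdefault(name, []).append((int(start), int(stop)))
    return spans

def read_fasta(lines):
    """Step 2: yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def trim(seq, spans):
    """Step 3: cut spans touching either end; mask internal spans with Ns."""
    length = len(seq)
    # Work from the 3' end backwards so earlier coordinates stay valid.
    for start, stop in sorted(spans, reverse=True):
        if stop >= length:      # span reaches the 3' end: cut it off
            seq = seq[:start - 1]
        elif start <= 1:        # span starts at the 5' end: cut it off
            seq = seq[stop:]
        else:                   # internal span: mask, length unchanged
            seq = seq[:start - 1] + "N" * (stop - start + 1) + seq[stop:]
    return seq

# Tiny in-memory example; real use would pass open file handles instead.
fasta = [">ctg1", "AAAACCCCGGGGTTTT", ">ctg2", "ACGT"]
table = ["ctg1\t1\t4", "ctg1\t7\t8", "ctg1\t13\t16"]
spans = read_spans(table)
for name, seq in read_fasta(fasta):
    print(f">{name}\n{trim(seq, spans.get(name, []))}")
```

    Contigs that are not in the table pass through unchanged, so the output can replace the original file directly.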

    I would be hesitant to hard mask internal sequences that are actually adapter contamination. It would seem more reasonable in those cases to simply break apart the contigs containing them (you really should remove all adapter sequence prior to assembly).
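    Breaking a contig at an internal span, rather than masking it, could look roughly like this. It is only a sketch: coordinates are assumed to be 1-based inclusive, and the `_partN` naming for the resulting pieces is made up for illustration.

```python
def split_at_spans(name, seq, spans):
    """Break a contig at each (start, stop) span, dropping the span itself.

    Coordinates are assumed 1-based inclusive. Returns (new_name, piece) pairs.
    """
    pieces, pos = [], 0
    for start, stop in sorted(spans):
        if start - 1 > pos:             # keep the sequence before this span
            pieces.append(seq[pos:start - 1])
        pos = max(pos, stop)            # skip past the flagged span
    if pos < len(seq):                  # keep whatever follows the last span
        pieces.append(seq[pos:])
    # Hypothetical naming scheme for the pieces.
    return [(f"{name}_part{i}", piece) for i, piece in enumerate(pieces, 1)]

for new_name, piece in split_at_spans("ctg1", "AAAACCCCGGGGTTTT", [(7, 8)]):
    print(f">{new_name}\n{piece}")
```

    A span at either end simply yields one shorter piece, so the same function also covers terminal trimming.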



    • #3
      Thanks for the quick reply. I will look into that.

      The adapter sequences were removed from the short reads for the initial contig assembly. I'm assuming that the jump libraries are the culprit here.

      In the end it's only 47 contigs/scaffolds out of 70k for a large eukaryotic genome.
