  • NCBI WGS submission: Need to trim sequences of various length from scaffolds

    The submission that will never die...

    I have a number of contigs that did not pass NCBI's contamination/adapter screen. I need to trim these, but the problem is that the flagged regions are all of varying lengths. Some are internal, but most are at either the 5' or 3' end of the scaffold.

    The only info provided by NCBI is the scaffold/contig name, length and the start and stop base # of what needs to be trimmed. If I had the sequence I could just manually ctrl-f and delete/substitute Ns.

    Most of these are too large to load into Vector NTI or a similar program.

    Any help would be appreciated. Thanks.

  • #2
    I presume that you have the original contigs in fasta or some other text format, yes? If so, you'll find biopython very useful (it won't complain about contig length, unless your computer is from the 80s). You can parse fasta files and subset sequences based on coordinates relatively easily with it. The general idea would be to store the coordinates to be trimmed in a text file and then write a little script to (1) read that into a hash, (2) open the file containing the contigs, and (3) iterate through the records, checking for the presence of each in the hash and then subsetting accordingly.
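
The three steps above can be sketched roughly as follows. This is only an illustration in plain Python (Biopython's SeqIO could do the FASTA parsing instead), and it rests on several assumptions that are not from NCBI's report: the coordinates are taken to be 1-based and inclusive, the coordinate file is assumed to hold whitespace-separated `name start stop` rows, spans on the same sequence are assumed not to overlap, and the helper names `read_fasta`, `read_spans` and `clean` are made up for this example. Spans touching either end are cut off; internal spans are replaced with Ns purely as a placeholder policy, and splitting the contig there could be substituted.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_spans(lines):
    """Read 'name start stop' rows into a dict: name -> list of (start, stop).

    Assumption: coordinates are 1-based and inclusive.
    """
    spans = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        spans[fields[0]].append((int(fields[1]), int(fields[2])))
    return spans


def clean(seq, spans):
    """Cut spans that touch either end; mask internal spans with N.

    Spans are processed from the 3' end backwards so that earlier
    coordinates stay valid after each cut. Assumes no overlapping spans.
    """
    original_length = len(seq)
    for start, stop in sorted(spans, reverse=True):
        if start <= 1:
            seq = seq[stop:]                      # 5' end: drop the prefix
        elif stop >= original_length:
            seq = seq[:start - 1]                 # 3' end: drop the suffix
        else:
            masked = "N" * (stop - start + 1)     # internal: hard mask
            seq = seq[:start - 1] + masked + seq[stop:]
    return seq


def clean_records(fasta_lines, span_lines):
    """Apply clean() to every record that has flagged spans."""
    spans = read_spans(span_lines)
    for name, seq in read_fasta(fasta_lines):
        yield name, clean(seq, spans.get(name, []))
```

Open file handles can be passed directly as `fasta_lines` and `span_lines`, since both readers only iterate over lines. Records without an entry in the coordinate table pass through unchanged.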

    I would be hesitant to hard mask internal sequences that are actually adapter contamination. It would seem more reasonable in those cases to simply break apart the contigs containing them (you really should remove all adapter sequence prior to assembly).

    • #3
      Thanks for the quick reply. I will look into that.

      The adapter sequences were removed from the short reads for the initial contig assembly. I'm assuming that the jump libraries are the culprit here.

      In the end it's only 47 contigs/scaffolds out of 70k for a large eukaryotic genome.
