Old 04-10-2014, 07:17 AM   #1
Location: DE

Join Date: Dec 2012
Posts: 65
NCBI WGS submission: Need to trim sequences of various lengths from scaffolds

The submission that will never die...

I have a number of contigs that did not pass NCBI's contamination/adapter screen. I need to trim these, but the problem is that the flagged regions are of varying lengths. Some are internal, but most are at either the 5' or 3' end of the scaffold.

The only info provided by NCBI is the scaffold/contig name, length and the start and stop base # of what needs to be trimmed. If I had the sequence I could just manually ctrl-f and delete/substitute Ns.

Most of these are too large to load into VectorNTI or a similar program.

Any help would be appreciated. Thanks.
lac302
Old 04-10-2014, 07:29 AM   #2
Devon Ryan
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

I presume that you have the original contigs in fasta or some other text format, yes? If so, you'll find biopython very useful (it won't complain about contig length, unless your computer is from the 80s). You can parse fasta files and subset sequences based on coordinates relatively easily with it. The general idea would be to store the coordinates to be trimmed in a text file and then write a little script to (1) read that into a hash, (2) open the file containing the contigs, and (3) iterate through the records, checking for the presence of each in the hash and then subsetting accordingly.

I would be hesitant to hard mask internal sequences that are actually adapter contamination. It would seem more reasonable in those cases to simply break apart the contigs containing them (you really should remove all adapter sequence prior to assembly).
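
A minimal sketch of the approach outlined above, under a few assumptions that are not from NCBI's report itself: the coordinates sit in a tab-separated text file with one `name<TAB>start<TAB>stop` line per flagged region, and the start/stop values are 1-based and inclusive. The function names (`read_spans`, `read_fasta`, `cut`, `trim_fasta`) and the `_partN` suffix are hypothetical. To stay self-contained it uses a small plain-Python fasta reader rather than biopython; `Bio.SeqIO.parse(handle, "fasta")` could take the place of `read_fasta`. Flagged regions at either end are simply trimmed off, and an internal region breaks the contig into separate pieces instead of being masked with Ns.

```python
from collections import defaultdict


def read_spans(handle):
    """Read 'name<TAB>start<TAB>stop' lines into {name: [(start, stop), ...]}.

    Assumed format: tab-separated, coordinates 1-based and inclusive.
    """
    spans = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, start, stop = line.split("\t")[:3]
        spans[name].append((int(start), int(stop)))
    return spans


def read_fasta(handle):
    """Yield (name, sequence) pairs; the name is the first word of the header."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cut(seq, spans):
    """Remove the flagged spans and return the pieces that remain, in order."""
    pieces, pos = [], 0
    for start, stop in sorted(spans):
        if start - 1 > pos:
            pieces.append(seq[pos:start - 1])
        pos = max(pos, stop)  # also handles overlapping spans
    if pos < len(seq):
        pieces.append(seq[pos:])
    return pieces


def trim_fasta(fasta_in, spans, fasta_out, width=60):
    """Write every record to fasta_out with its flagged spans removed."""
    for name, seq in read_fasta(fasta_in):
        pieces = cut(seq, spans[name]) if name in spans else [seq]
        # A contig that is flagged over its whole length yields no pieces
        # and is therefore dropped from the output.
        for i, piece in enumerate(pieces, start=1):
            # Hypothetical naming: add a suffix only when a contig was split.
            label = name if len(pieces) == 1 else f"{name}_part{i}"
            fasta_out.write(f">{label}\n")
            for j in range(0, len(piece), width):
                fasta_out.write(piece[j:j + width] + "\n")
```

To use it, open the coordinate file, the contig fasta, and an output file, then call `trim_fasta(contigs, read_spans(coords), out)`. Header descriptions after the first word are not carried over, and it would be worth spot-checking a few trimmed records against the reported coordinates before resubmitting.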
dpryan
Old 04-10-2014, 07:43 AM   #3
Location: DE

Join Date: Dec 2012
Posts: 65

Thanks for the quick reply. I will look into that.

The adapter sequences were removed from the short reads for the initial contig assembly. I'm assuming that the jump libraries are the culprit here.

In the end it's only 47 contigs/scaffolds out of 70k for a large eukaryotic genome.
lac302

All times are GMT -8.