I am trying to calculate the number of contigs in a scaffold file, i.e. a consensus sequence whose contigs are separated by runs of n's. I have been working on an assembly generated by Newbler and have closed some of the gaps computationally or experimentally. I need to know how many contigs are left in each scaffold. Could anyone point me in the right direction?
-
The file 454Scaffolds.txt generated by Newbler has the information you need.
See http://contig.wordpress.com/2010/03/...-file/#more-56 for more information.
-
I should clarify: the assembly was imported into gap4 and worked on by joining contigs to the scaffold consensus and closing gaps computationally or experimentally. I can save out the updated consensus files, but these will still contain n's because of the scaffold sequence that I joined in. I need a way of calculating the number of contigs, i.e. the number of sequences separated by n's, in this file.
-
Assuming you are on a Unix-type system, one answer is to use the 'tr' command along with 'sed' and 'wc'. First replace each FASTA header with a single 'n'. Then get rid of the newlines. Then squeeze every run of 'n's down to a single character. Finally delete all non-n's and count the remaining n's. Each scaffold contributes one 'n' for its header plus one per gap, i.e. its number of gaps plus one, which is its number of contigs.
sed -e 's/>.*/n/' scaffold.fasta | tr -d '\n' | tr -s 'n' | tr -d 'acgt' | wc -c
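The above assumes the sequence contains only a, c, g, t and n, all in lower case, and it gives one total for the whole file rather than a count per scaffold. I suspect there are as many other answers as there are people on this bulletin board.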
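Since the question asks for the count in each scaffold, here is one possible sketch in Python rather than shell. It is not the pipeline above but a different approach: it splits each scaffold's sequence on runs of n (upper or lower case) and counts the non-empty pieces. The function names `count_contigs` and `contigs_per_scaffold` are made up for illustration, and the sketch assumes that every gap is a run of n/N and that the scaffold name is the first word of the FASTA header.

```python
import re


def count_contigs(sequence):
    """Count the non-empty stretches left after splitting on runs of n/N."""
    # Leading or trailing n's produce empty pieces, which are skipped.
    return sum(1 for piece in re.split(r"[Nn]+", sequence) if piece)


def contigs_per_scaffold(lines):
    """Return {scaffold_name: contig_count} for an iterable of FASTA lines."""
    counts = {}
    name = None      # current scaffold name (assumed: first word of the header)
    chunks = []      # sequence lines collected for the current scaffold
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                counts[name] = count_contigs("".join(chunks))
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    # Close the last record in the file.
    if name is not None:
        counts[name] = count_contigs("".join(chunks))
    return counts
```

Passing an open file handle, for example `contigs_per_scaffold(open("scaffold.fasta"))`, should give one entry per scaffold. Because the sequence lines are joined before splitting, a gap that wraps across two lines is still counted once.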