
**flxlex** · 12-06-2010, 12:52 AM

First, have you read my blog entry on newbler cDNA output (pardon the shameless self-promotion)? http://contig.wordpress.com/2010/09/...-output-files/

Originally posted by WaltL:

Roche says you should use the 454Isotigs.fna file as your assembly "contig" file for downstream work, e.g. blast... which is what I did. There are only isotig seqs found in this file.

Are you sure? If there are isogroups that did not become isotigs, the contigs of these isogroups should be in the 454Isotigs.fna file...

The cDNA module is somewhat buggy, as noted in several posts at SeqAnswers.

I am a bit surprised that the large contigs in the ace file are missing from the other fna files. I did find an equal number of isotigs in the ace file and the 454Isotigs.fna file for the one assembly I checked:

grep -c isotig 454Isotigs.ace
32541
grep -c isotig 454Isotigs.fna
32541
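
A slightly stricter version of the same check, as a sketch: a plain `grep -c isotig` counts every line that mentions "isotig", so anchoring the pattern to the record header avoids counting anything else. This assumes FASTA headers of the form `>isotig00001 ...` or `>contig00001 ...` and ACE contig lines of the form `CO isotig00001 ...`; the function names are made up for illustration.

```shell
# Hypothetical helpers (not part of newbler): count records by name prefix.

# Count FASTA records whose header starts with the given prefix.
# usage: count_fasta_records PREFIX FILE.fna
count_fasta_records() {
    grep -c "^>$1" "$2" || true   # grep exits 1 on zero matches but still prints 0
}

# Count ACE contig blocks (CO lines) whose name starts with the given prefix.
# usage: count_ace_records PREFIX FILE.ace
count_ace_records() {
    grep -c "^CO $1" "$2" || true
}
```

For example, `count_fasta_records isotig 454Isotigs.fna` and `count_ace_records isotig 454Isotigs.ace` should agree, and `count_fasta_records contig 454Isotigs.fna` shows how many contigs ended up in the isotigs file.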

About the short contigs: these also exist in a de novo genome assembly, but they are not reported because, by default, the lower length limit for the 454AllContigs.fna and 454Contigs.ace files is 100 bp. These short ones are a result of the way newbler builds the contig graph (also explained on my blog). Some of them are repeats, others reflect small differences (indels) between transcript variants, etc.

On the metrics: the number of isogroups should, in principle, tell you how many 'genes' there are. Splice variants (the different isotigs) could actually be just small sequence variants. Collapsing (i.e. clustering) these with CD-HIT or a similar tool might help to recover the real splice variants and reduce the number of isotigs. Contigs not in isotigs are a bit of a problem, but if you have a reference genome, maybe you can deduce the real transcript by aligning the contigs (or reads) to the reference?
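
One way the clustering step might look, as a sketch rather than a recommendation: `cd-hit-est` (the nucleotide program in the CD-HIT package) takes an input FASTA with `-i`, an output name with `-o`, and an identity threshold with `-c`. The 0.95 threshold, the file names, and the helper function names below are assumptions for illustration, not values from this thread.

```shell
# Sketch only: the threshold and file names are illustrative assumptions.

# Cluster isotigs at a given identity threshold (requires cd-hit-est on PATH).
# usage: cluster_isotigs IN.fna OUT.fna [IDENTITY]
cluster_isotigs() {
    cd-hit-est -i "$1" -o "$2" -c "${3:-0.95}"
}

# Count clusters in the .clstr report that cd-hit-est writes next to OUT.fna.
# Each cluster starts with a line of the form ">Cluster N".
# usage: count_clusters OUT.fna.clstr
count_clusters() {
    grep -c '^>Cluster ' "$1" || true
}
```

Running `cluster_isotigs 454Isotigs.fna isotigs_c95.fna` and then `count_clusters isotigs_c95.fna.clstr` gives a collapsed isotig count to compare against the isogroup count.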

Hope this helps,

Lex

**WaltL** · 12-08-2010, 05:38 PM

Lex,

Thanks for your response. I went back and double-checked, and you are correct: there are contigs in the 454Isotigs.fna file. I grepped for contig and found 1,113 instances... the difference from the total of 48,882 being 47,729, which is the number of isotigs found in the ace file.

I still, however, do not understand why Roche chooses to write all the short (<100 bp) contigs to the ace file. I mean, this is just junk sequence. Since I am using someone else's scripts to parse the ace file into my database, I have no way to filter them out. Seems like writing these bits to a separate debris/boneyard file would be a smarter way to go. Oh well... maybe on the next version!
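
If the downstream parser cannot skip the short contigs, one option is to drop them from the ace file before loading it. A minimal sketch, assuming the standard ACE layout in which the file opens with an `AS <contigs> <reads>` header and each contig block starts with a `CO <name> <bases> <reads> <segments> <U|C>` line; `filter_ace_min_len` is a hypothetical helper name. Note that the base count on the `CO` line includes pad characters, and that everything up to the next `CO` line (including any trailing tag blocks) is treated as part of the current contig.

```shell
# Hypothetical helper: keep only ACE contig blocks with at least MINLEN bases.
# The input is read twice, so it must be a regular file, not a pipe.
# usage: filter_ace_min_len MINLEN IN.ace > OUT.ace
filter_ace_min_len() {
    awk -v min="$1" '
        BEGIN { keep = 1 }          # lines before the first CO are kept
        # Pass 1: tally the contigs and reads that survive the cutoff.
        NR == FNR {
            if ($1 == "CO" && $3 + 0 >= min) { contigs++; reads += $4 }
            next
        }
        # Pass 2: rewrite the AS header so its totals match the output.
        $1 == "AS" && !seen_as {
            print "AS", contigs + 0, reads + 0
            seen_as = 1
            next
        }
        # A CO line opens a new contig block; decide whether to keep it.
        $1 == "CO" { keep = ($3 + 0 >= min) }
        keep
    ' "$2" "$2"
}
```

For example, `filter_ace_min_len 100 454Isotigs.ace > 454Isotigs.min100.ace` would write a copy without the sub-100 bp contigs, leaving the original file untouched.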

Also, thank you for the suggestion on collapsing the assembly. I have tried running some of my miraEST assemblies from the same dataset (> 180K multi-read contigs) through CAP3, but that didn't help very much. The Newbler isogroup count is ~26K and, given that this particular conifer species has a genome 7X larger than human (no reference yet), it is actually the most collapsed assembly compared to the other assemblers I've used. Right now, I think it may be collapsing things too much.

Thanks again!

Walt


Why does Newbler do what it does with .ace files?
