**Brian Bushnell** · 06-09-2014, 02:18 PM

What kind of organism is it? Some are very repetitive. Also, highly heterozygous polyploid organisms can generate multiple redundant scaffolds.

It could be that you need to start over from the reads and do your own assembly, using an assembler (like dipSPAdes) or settings better suited to the input data.

**RNAddict** · 06-09-2014, 04:17 PM

Brian,
I appreciate your input. We have now gone through a few rounds of sequencing and assembly - our latest assembly is using Illumina 'longreads' >1500kb in addition to a number of SIMP PE libraries for scaffolding.

The organism has 5 chromosomes and is diploid. It's supposed to have a small ~70Mb genome, so I had originally suspected that most genes would occur in single copy.

Also, the organism in question is parthenogenic and the cultures used were supposedly generated from a single individual.

Maybe I'll rephrase my question as this: "is there some strategy or set of criteria for telling if two very similar scaffolds are the result of sequencing error (or some other artificial source) rather than alleles or gene duplications?"

**Brian Bushnell** · 06-09-2014, 04:39 PM

Unfortunately, the answer to that question is "no". At least, not objectively with complete confidence. But if you look at the rate at which exact (or almost-exact) repeats occur in related organisms, as a function of size, you should be able to get an idea of whether that's realistic. I would not expect there to be thousands of 6k+ exact repeats in any organism, for example, but there are several branches of life I have not worked on.

I think organisms with simpler expression control tend to require more copies of genes that need regulation (or need to be highly expressed), but that wouldn't explain replications of 6-14k sequences that are (presumably) substantially larger than single genes. I encourage you to generate and post a kmer frequency histogram; that will allow visual estimation of the genome size, and expected genome fraction with 1 copy, 2 copies, 3 copies, etc.
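
For anyone who wants to see what a kmer frequency histogram actually contains, here is a minimal Python sketch. It is only an illustration: the function names `reverse_complement` and `kmer_histogram` are made up for this example, it holds every kmer in memory, and a real read set would call for a dedicated kmer counter instead.

```python
from collections import Counter


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_histogram(reads, k):
    """Map depth -> number of distinct canonical kmers seen at that depth.

    Illustrative only: all kmers are kept in memory, so this suits
    small test inputs, not a full sequencing run.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip kmers containing N or other ambiguous bases.
            if set(kmer) - set("ACGT"):
                continue
            # Count a kmer and its reverse complement as the same kmer.
            counts[min(kmer, reverse_complement(kmer))] += 1
    return Counter(counts.values())


if __name__ == "__main__":
    hist = kmer_histogram(["ACGTAC", "ACGTAC"], k=3)
    for depth in sorted(hist):
        print(depth, hist[depth])
```

Plotting depth against the number of distinct kmers gives the histogram: the main peak approximates single-copy coverage, and peaks near 2x, 3x, and so on of that depth suggest sequence present in 2, 3, or more copies.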

By the way, I have a program called "dedupe" that's explicitly designed to remove redundant scaffolds (exact duplicates, containments, and optionally inexact containments down to some minimum percent identity) from assemblies, and we run it routinely. Why? Well... in a metagenome, for example, they don't add any useful information. And usually (with some of our protocols), such things are assembly artifacts. I don't know why your assembly is 3-4x your target size, but if it were around double, I would have suggested you simply remove all duplicates on the assumption that they are caused by the ploidy.
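
To make "exact duplicates and containments" concrete, here is a rough Python sketch of the idea. This is not how dedupe is implemented; it is a naive quadratic substring check under the assumption that the assembly fits in memory as a dict of name to sequence, and `remove_contained` is a hypothetical helper name. It does not handle inexact containments at all.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def remove_contained(scaffolds):
    """Drop scaffolds that duplicate, or are contained in, a longer one.

    scaffolds: dict of name -> sequence (hypothetical input layout).
    Both strands are checked. Naive O(n^2) substring search, meant
    only to illustrate the concept on small inputs.
    """
    kept = {}
    # Visit longest first so a contained scaffold always meets its container.
    by_length = sorted(scaffolds.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in other or rc in other for other in kept.values()):
            continue  # exact duplicate or full containment
        kept[name] = seq
    return kept


if __name__ == "__main__":
    example = {
        "a": "ACGTTGCAAG",
        "b": "ACGTTGCAAG",  # exact duplicate of a
        "c": "GTTGC",       # contained in a
        "d": "CTTGCAACGT",  # reverse complement of a
        "e": "TTTTT",       # unrelated
    }
    print(list(remove_contained(example)))  # ['a', 'e']
```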

**RNAddict** · 06-10-2014, 12:07 PM

Brian - thanks for your insightful response. I'll go ahead and try generating a kmer plot.

Your comment brings up a point that I am still a bit stuck on - namely, what to call a "duplicate." I was speaking with a genomics person at my institution whose opinion was that any large scaffold (6-14kb) that had >95% identity with another scaffold was probably a product of heterozygosity and that these scaffolds should be thrown out.

On the flip side of that coin, he thought anything with <95% identity could be a potential paralog and should be retained.

Since it sounds like you do this type of filtering a lot, I am curious whether these percentages sound good to you.

**Brian Bushnell** · 06-10-2014, 12:41 PM

95% sounds pretty low; I think we normally use a cutoff of around 99% (depending on the situation; other times we use 100%). For example, in transcriptome assembly, I think our cutoff is 99% identity and the shorter contig must be at least 80% of the length of the longer one to be discarded. But I don't know why those numbers were chosen or if they're optimal.
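
As a sketch, the transcriptome rule above (99% identity, shorter contig at least 80% of the longer one's length) could be written like this. The function name and parameter names are invented for illustration, and the identity value is assumed to come from whatever aligner you use.

```python
def should_discard(identity, len_a, len_b,
                   min_identity=0.99, min_length_ratio=0.80):
    """Decide whether the shorter of two contigs counts as redundant.

    identity: fractional identity between the contigs (0.0 to 1.0),
              assumed to be computed elsewhere by an aligner.
    len_a, len_b: contig lengths in bases, in either order.
    Defaults follow the 99% / 80% figures quoted in the post; they are
    starting points to tune, not validated thresholds.
    """
    shorter, longer = sorted((len_a, len_b))
    similar_enough = identity >= min_identity
    long_enough = shorter >= min_length_ratio * longer
    return similar_enough and long_enough


if __name__ == "__main__":
    print(should_discard(0.995, 900, 1000))  # True: similar and similar length
    print(should_discard(0.995, 700, 1000))  # False: shorter contig under 80%
    print(should_discard(0.950, 900, 1000))  # False: identity below 99%
```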

For genomes, it depends on the organism - humans have a SNP rate of around 1/1000, while some fungi can have a rate of over 1/50. I think one we sequenced even had a rate of 1/22. So, you could look at the heterozygosity of related organisms and establish your cutoff that way; if relatives tend to have chromosomes with 99.5% identity to each other, then tossing things with only 95% identity may not be wise.
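
Here is the arithmetic behind that, as a small Python sketch. It assumes SNPs are the only difference between haplotypes (indels and structural variants are ignored), so the numbers are an upper bound on the identity you would expect between two alleles. The function names are made up for this example.

```python
def expected_identity(snp_rate):
    """Expected fractional identity between two haplotypes.

    Assumes differences are SNPs only, occurring at snp_rate per base.
    """
    return 1.0 - snp_rate


def haplotypes_look_redundant(snp_rate, identity_cutoff):
    """True if two alleles would pass an identity cutoff and be merged."""
    return expected_identity(snp_rate) >= identity_cutoff


if __name__ == "__main__":
    # SNP rates are the ones quoted in the post.
    for label, rate in [("1/1000", 1 / 1000), ("1/50", 1 / 50), ("1/22", 1 / 22)]:
        ident = expected_identity(rate)
        print(f"SNP rate {label}: expected identity {ident:.2%}, "
              f"merged at 99% cutoff: {haplotypes_look_redundant(rate, 0.99)}, "
              f"merged at 95% cutoff: {haplotypes_look_redundant(rate, 0.95)}")
```

So a 1/1000 rate gives about 99.9% identity, 1/50 gives 98%, and 1/22 gives roughly 95.5%. A 99% cutoff would merge alleles only in the first case, while a 95% cutoff would merge them in all three.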

And as I mentioned, heterozygosity explains an assembly 2x the size you expected, but not 4x, so it's still a little mysterious.

Another thing you might consider - sometimes chromosomes get duplicated, and then the two copies evolve into different chromosomes. But for a while, the two copies will be very similar. So you could, for example, have an organism with 5 chromosomes, 2 of which are near-identical because the duplication was recent. That's another thing you may be able to check by looking at related species, and the numbers of chromosomes they have.


Gene duplication events vs. duplicate scaffolds
