Seqanswers Leaderboard Ad

**flxlex** · 09-14-2014, 11:41 PM

Two things you can do:

- rerun quiver (the final step in the pipeline, RS_Resequencing I believe), this may help
- polish with Illumina data, e.g. 50-100x MiSeq (different tools can do this, e.g. Pilon http://www.broadinstitute.org/software/pilon/)

**mbayer** · 09-15-2014, 12:55 AM

Hi VidJa,

try QUAST (http://bioinf.spbau.ru/quast). It's a pretty handy tool for doing both statistical and biological evaluation of assemblies. It has a web interface but alternatively you can also run it locally from the command line.

cheers

Micha

**rhall** · 09-15-2014, 09:19 AM

+1 for re-runing quiver, particularly if you have run any kind of circularization on the genome. I would also check the QV of the consensus calls that are problematic, note a lowercase call in the consensus is an indication that quiver was unable to call consensus for that region, e.g. no uniquely mapped reads.

**ebioman** · 10-07-2014, 01:33 AM

Another good (and quick) check can be the coverage.
Just re-map all your reads onto the obtained genome and look at the coverage (should be already in your final results as well).
If you look at the obtained coverage it should be very uniformly distributed among the entire genome. If you have sudden increase or
decrease of coverage it often points to miss-assemblies.

**bastianwur** · 10-10-2014, 01:41 AM

You need to be careful with that.
I had some spikes in my Illumina PE mapping, which looked weird, and I thought about collapsed repeats. But later I got PacBio data, and just for "fun" I mapped my PE data to them, and the same spikes appeared.

**sagarutturkar** · 10-10-2014, 01:14 PM

Testing circular genomes

Hi Rhall,

Can you please help with genome circularization analysis with Pacbio

1. I have 2 contigs assembly. C1 - 4.3 Mb C2 - 13KB (average coverage above 40X)
2. The C2 completely match with C1 -
C2_chunk1 - first 1-6243 at location between 3.45-3.46 MB
C2_chunk2 - Next 6216-13244 maps somewhere 2.49 MB
There is 29KB gap between C2_chunk1 and C2_chunk2
There are some mismatches in alignment.

Now, I have following questions:
1. Can C2 be a spurious contig as it is very small and completely contained withing C1? PacBio recommended to discard contigs below coverage 20X.
2. Also, I tried to run Gepard with C1 to test circularization. Can you provide some link/guide to some example dot plots of circular genomes.
3. Any other tools to check circularization? (Currently, I am trying to break the Contig1 and mapping the reads against it to check if reads support the overlap).

Thanks
sagar

**rhall** · 10-10-2014, 01:27 PM

1. It is possible that contig C2 actually represents a sub-population of the sample being sequenced. This can be difficult to validate, one option is to try running the bridgemapper protocol (after removing the small contig from the reference) in SMRT Analysis and looking at the region to which it maps to see is there is evidence in the raw reads that the sample is not clonal in the region. It can also be useful to just blast the extra contig, or region to which it maps, and see if it hits any phage, or common insertion elements.
2. http://files.pacb.com/Training/Circu...ard/story.html
https://github.com/PacificBioscience...g-and-trimming
http://omicsomics.blogspot.com/2013/...from-more.html
3. Again, the Bridgemapper protocol can be used to check for circularity, as the reads at the end of the contig will show a bridged mapping.

**EulnayM** · 11-11-2014, 09:30 PM

Originally posted by sagarutturkar View Post

Hi Rhall,

Can you please help with genome circularization analysis with Pacbio

1. I have 2 contigs assembly. C1 - 4.3 Mb C2 - 13KB (average coverage above 40X)
2. The C2 completely match with C1 -
C2_chunk1 - first 1-6243 at location between 3.45-3.46 MB
C2_chunk2 - Next 6216-13244 maps somewhere 2.49 MB
There is 29KB gap between C2_chunk1 and C2_chunk2
There are some mismatches in alignment.

Now, I have following questions:
1. Can C2 be a spurious contig as it is very small and completely contained withing C1? PacBio recommended to discard contigs below coverage 20X.
2. Also, I tried to run Gepard with C1 to test circularization. Can you provide some link/guide to some example dot plots of circular genomes.
3. Any other tools to check circularization? (Currently, I am trying to break the Contig1 and mapping the reads against it to check if reads support the overlap).

Thanks
sagar

Hi,

In gepard, check the top right and bottom left corner of your alignment to see if there is a diagonal line which indicates overlapping ends to find out if your contig is circular, just align your contig together on both x and y axis.

Besides gepard, there is also a perl script (check_circularity.pl) in a long reads error correction package named SPRAI (http://zombie.cb.k.u-tokyo.ac.jp/sprai/index.html) (you can easily download it by using local package manager (LPM), which claimed to be able to identify circular contigs and remove the redundant ends, you might like to give it a try although when I try it using my gepard-validated circular contig it keeps showing that my circular contigs are linear contigs.

I am also trying to find out a tool which can help to trim the redundant ends of a circular contig and I find the AMOS packages indicated in the circularizing and trimming tutorial to be hard to install, any suggestion or page which can give a better guideline in terms of installation (for beginner in bioinformatics).

Thanks.

**jnfass** · 12-05-2014, 09:49 AM

@EulnayM I see the same thing with Gepard and sprai's check_circularity.pl script. Gepard shows clear ~6kbp overlap (after zooming in on the 5'-end to 3'-end region of the dot plot) with ~100% identity, but check_circularity.pl reports it as linear. Has anyone debugged this script, or found a better replacement out there?

**rhall** · 12-05-2014, 10:03 AM

So long as you are comftable with using the command line, and have SMRT Analysis installed, https://github.com/PacificBioscience...-circularizemk can be used to check for circular contigs.

AMOS packages indicated in the circularizing and trimming tutorial to be hard to install

The executables that are needed for AMOS are now part of SMRT Analysis, if you have 2.3 installed.

**jnfass** · 12-09-2014, 11:27 AM

@rhall Interesting ... I haven't seen this smrtmake stuff before. I'd note that the default OVLSIZE (20000) for circularize.mk seems large; I'm looking at two circular bacterial genomes (~5Mbp size) produced by HGAP.3 that each have overlapping ends in the 6-8kbp range. Also the overlapping regions aren't identical - they've got some indels in them, and could be as low a 99 or 98% identical.

So it seems like there are two options. (1) if the assembler has left overlapping ends, detect those, trim one of them, maybe permute the sequence to start at some commonly used gene (origin of replication), then check that reads still align across the implied join that's now somewhere within the fasta sequence, maybe using QUIVER to also do some final polishing. (2) if there aren't overlapping ends, use bridgemapper to see if the ends are reasonable close to each other? Then, the scaffolding protocol? If there's some good reason (like some incredibly long repeats, or low coverage) that the assembly failed at the ends in the first place, this seems like it's bound to fail.

The problem is, all this stuff takes eyes-on time for each sequence, chromosome, plasmid, whatever. What we really need are some assemblers or scaffolders that are aware of circularity.

**rhall** · 12-09-2014, 11:54 AM

The overlap size is a maximum (overlap is a function of the size of the preassembled reads, overlaps approaching 20kb are not uncommon given really good libraries), overlaps in the range 6-8kb should be found, the default minimum identity is 98%.
To robustly detect circularity I would look at the assembly overlap graph stage. Using the Falcon toolbox, it should be relatively simple to label circular contigs when constructing them from the graph. It should also be possible to detect circularity at the same stage in Celera assembler, but this would likely be more difficult to implement.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

PacBio assembly evaluation, how to chose the 'best' one

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News