#1 |
Member
Location: Pennsylvania Join Date: Apr 2011
Posts: 27
We have Ion Torrent reads from retrovirus (transposon) integration sites in an unsequenced genome, and we need to cluster them by sequence identity. The first fifty bases of each read are always the transposon end, and the rest is an essentially random piece of genomic DNA flanking the insertion. We need to collapse, or cluster, the reads from each unique integration site together. Currently we use de novo assembly algorithms, but those perform poorly: we have to relax the alignment stringency because of sequencing errors, and the assembler then artificially joins separate clusters together. Our clusters should be only one read long.
Would anybody know of a suitable algorithm to create these single-read clusters?
#2 | |
Senior Member
Location: Vancouver, BC Join Date: Mar 2010
Posts: 275
Quote:
#3 |
Member
Location: Pennsylvania Join Date: Apr 2011
Posts: 27
Thanks for your response. The clusters should have a length of one read. A cluster can contain, for example, 50 reads, but all of them start at position 1 (the "left side" of the aligned cluster). The reads in a cluster might differ in length, depending on the initial fragmentation.
To make it more difficult, our reads come from a pool of animals, so in addition to sequencing errors we also see SNPs. That is why we cannot use assembly based on, let's say, 99% identity. The de novo algorithm then starts adding reads to our clusters that extend the cluster in length, mostly based on random inverted repeats in the genomic tags.
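To make the requirement concrete, here is a minimal sketch of left-anchored greedy clustering in Python. It is not how any particular tool works; it only illustrates the idea described above. The function names, the 0.9 identity threshold and the 20-base minimum flank are all assumptions chosen for illustration, and the comparison counts mismatches only, so it would not cope with the insertion/deletion errors that are common in Ion Torrent data (an alignment-based comparison would be needed for that).

```python
from typing import List

TRANSPOSON_LEN = 50  # from the post: the first fifty bases are the transposon end


def prefix_identity(a: str, b: str) -> float:
    """Fraction of matching bases over the shared, left-anchored prefix."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def cluster_reads(reads: List[str],
                  min_identity: float = 0.9,   # assumed threshold
                  min_flank: int = 20) -> List[List[str]]:  # assumed minimum
    """Greedily group genomic flanks that all start at position 1.

    Reads are never extended: a flank joins a cluster only if it matches
    the cluster seed from its first base, so a cluster stays one read long.
    """
    # Drop the transposon end and discard flanks too short to compare.
    flanks = [r[TRANSPOSON_LEN:] for r in reads
              if len(r) - TRANSPOSON_LEN >= min_flank]
    # Longest flanks first, so each cluster seed is its longest member.
    flanks.sort(key=len, reverse=True)

    clusters: List[List[str]] = []
    for flank in flanks:
        for cluster in clusters:
            # cluster[0] is the seed; compare from the left end only.
            if prefix_identity(cluster[0], flank) >= min_identity:
                cluster.append(flank)
                break
        else:
            clusters.append([flank])  # no seed matched: start a new cluster
    return clusters
```

Because every comparison is anchored at the first base of the flank, reads that only overlap somewhere in the middle (for example through an inverted repeat) cannot pull a cluster beyond the length of its seed.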
#4 |
Member
Location: Pennsylvania Join Date: Apr 2011
Posts: 27
OK, I finally found a great program, USEARCH (http://www.drive5.com/usearch/usearch_docs.html), that does exactly that.