**swbarnes2** · 01-29-2013, 01:20 PM

Originally posted by bye

Hi,

We are working on finding new transposon insertion sites using NGS data. Our candidate reads should contain part of the transposon sequence and part of the genome sequence around its insertion site. In other words, the reads we are interested in are those that cannot be perfectly aligned to the genome, so duplicate-removal tools that rely on alignment to a reference are not suitable for our project.

I'm just wondering if anyone knows of any tools that can remove duplicated sequences, and also condense shorter reads into longer ones, for reads that can't be aligned to the reference?

Thanks in advance!

bin

It's not computationally pretty, but you could try extracting all your unaligned reads and running them through cut | sort | uniq -c to get a list of all the sequences and how often each one comes up.

Maybe start with a grep to keep only the reads that begin with the edge of the transposon sequence; that will make the list more manageable.
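
For anyone who would rather do this in a script than in a shell pipeline, here is a minimal Python sketch of the same idea, with a naive "condense" step added for the second half of the question. The function names (`read_fastq_sequences`, `count_sequences`, `condense`) are made up for illustration, and several assumptions apply: the input is plain 4-line FASTQ, only exact matches are considered (no mismatches, no reverse complements), and the condense step is quadratic in the number of distinct sequences, so it is only practical after the list has been filtered down.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of a 4-line FASTQ."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_sequences(seqs, prefix=""):
    """Count identical sequences, like sort | uniq -c.

    If prefix is given, keep only reads starting with it
    (the 'grep for the transposon edge' step).
    """
    return Counter(s for s in seqs if s.startswith(prefix))


def condense(counts):
    """Fold each read that is an exact substring of a longer read into it.

    Longest reads are kept first; a shorter read is added to the count of
    the first kept read that contains it. Quadratic, exact-match only.
    """
    kept = {}
    for seq in sorted(counts, key=len, reverse=True):
        for longer in kept:
            if seq in longer:
                kept[longer] += counts[seq]
                break
        else:
            kept[seq] = counts[seq]
    return kept


# Toy example: r1 and r2 are duplicates, r3 is contained in them.
fastq = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "IIIIIIII",
    "@r3", "GTAC", "+", "IIII",
    "@r4", "TTTT", "+", "IIII",
]
counts = count_sequences(read_fastq_sequences(fastq))
condensed = condense(counts)
```

On a real file you would pass an open file handle instead of the toy list, and set `prefix` to the transposon edge sequence to shrink the list before condensing.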

**bye** · 01-29-2013, 01:36 PM

Originally posted by swbarnes2

It's not computationally pretty, but you could try extracting all your unaligned reads and running them through cut | sort | uniq -c to get a list of all the sequences and how often each one comes up.

Maybe start with a grep to keep only the reads that begin with the edge of the transposon sequence; that will make the list more manageable.

Thank you! This surely is a good starting point!

**HESmith** · 01-30-2013, 04:36 AM

We've used split-end alignment (described here) to map transposon insertions.

**bye** · 01-30-2013, 06:24 AM

Originally posted by HESmith

We've used split-end alignment (described here) to map transposon insertions.

This is a great idea! Have you ever applied this method to human data? May I ask which transposon reference databases were used?

**HESmith** · 01-30-2013, 08:48 AM

No, we have not screened human data, so I have no advice regarding reference databases.

**krobison** · 01-31-2013, 07:35 AM

I would use SMALT to align your reads to the transposon. SMALT is quite quick, particularly going against such a tiny reference. You can use the output from this to
(a) filter for the reads containing transposon ends and useful other sequence
(b) orient your reads relative to the transposon end
(c) extract the non-transposon portions of the reads

The data from (c) is what is then aligned to your genome of interest.
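
To make steps (a) to (c) concrete, here is a rough Python sketch. Note that it swaps SMALT out for plain exact string matching against a short transposon-end sequence, so it will miss reads with mismatches or indels in the transposon part; a real pipeline would take the match coordinates from the aligner output instead. The names `extract_flank`, `te_end` and `min_flank` are hypothetical, and `te_end` is assumed to be the last few bases of the transposon, oriented so that genomic sequence follows it.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_flank(read, te_end, min_flank=20):
    """Return the genomic flank following the transposon end, or None.

    (a) filter: reads without te_end, or with too little flank, give None
    (b) orient: the read is tried as-is, then reverse-complemented
    (c) extract: everything after te_end is returned
    """
    for candidate in (read, revcomp(read)):
        pos = candidate.find(te_end)
        if pos != -1:
            flank = candidate[pos + len(te_end):]
            if len(flank) >= min_flank:
                return flank
    return None


# Toy example with a made-up 6-base transposon end.
te_end = "GGCCTA"
read = "AAGGCCTAACGTACGTAC"
flank_fwd = extract_flank(read, te_end, min_flank=5)
flank_rev = extract_flank(revcomp(read), te_end, min_flank=5)
```

The returned flanks are what would be written out to FASTA/FASTQ and aligned to the genome. The other transposon end can be handled the same way by passing the reverse complement of its first bases as `te_end`.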

With paired-end Illumina data, life gets a little more interesting, as you will want to find cases in which one read maps entirely to the genome of interest and the other entirely (or nearly so) to the transposon. Merging reads with FLASH or similar will reduce many of these to the single-read case, but for the rest you'll need to make sure you identify these pairs.

NGS reads condensation