**chenyao** · 08-19-2011, 07:23 AM

Any help????

**swbarnes2** · 08-19-2011, 10:00 AM

Genome FASTA files are generally not annotated with which sequence is intron and which is exon. You need some other file that says where the introns and exons are, such as a GFF.

**westerman** · 08-19-2011, 12:18 PM

I do agree with swbarnes2 that genome FASTA files are not annotated and that you generally need a GFF file for proper, verified splice sites. But since you asked about TopHat, and presumably about de novo detection of junctions, I quote from the TopHat manual:

TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against this junction to confirm them.

Look at the manual for more help.

http://tophat.cbcb.umd.edu/manual.html

**chenyao** · 08-19-2011, 05:41 PM

Thank you. That's exactly my question. Why can reads that align contiguously to the genome define an exon? And how is "contiguous" defined?

**westerman** · 08-22-2011, 10:17 AM

I believe that you have the meaning of the word 'contiguous' correct -- that is, the reads have to match the genome exactly.

As I said, look at the manual for more help. The part I quoted was just a small introduction to how TopHat works. I am far from being a TopHat expert, but basically the idea is to:

a) Split up the reads into small segments ... say 40 bases.

b) Align these splits contiguously (i.e., exactly) to the genome; many will align, but many will not because they span junctions.

c) Where many reads align, consider the region an 'island' that represents correct alignments. An island will not contain a junction, because otherwise the split would not align.

d) Stitch these islands together to cover junctions. The strongest evidence of a junction is a read that has two different 'splits' in two different islands. In other words, the only way a read could be in two islands is if it spans a junction. There are other avenues of evidence as well (e.g., you can slowly build out from each island by adding parts of non-island reads to it until a junction border is reached).
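To make steps a) to d) concrete, here is a toy sketch in Python. It is not TopHat's code or its actual algorithm: the sequences, the 8-base segment length, and the function names (`split_into_segments`, `exact_hits`, `implied_gap`) are all made up for illustration, and plain string search stands in for a real aligner. It only shows why a read whose segments land in two different islands points to a junction between them.

```python
def split_into_segments(read, seg_len):
    """Step a): cut a read into consecutive, non-overlapping segments."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def exact_hits(segment, genome):
    """Step b): every 0-based genome position where the segment matches exactly."""
    hits = []
    start = genome.find(segment)
    while start != -1:
        hits.append(start)
        start = genome.find(segment, start + 1)
    return hits


def implied_gap(read, genome, seg_len):
    """Steps c) and d): compare where segments sit in the read and in the genome.

    Returns the extra genomic distance between the first and last uniquely
    placed segments (0 means the read is contiguous on the genome), or None
    if fewer than two segments could be placed.
    """
    placed = []  # (offset within read, position on genome)
    for i, seg in enumerate(split_into_segments(read, seg_len)):
        hits = exact_hits(seg, genome)
        if len(hits) == 1:  # keep unambiguous placements only
            placed.append((i * seg_len, hits[0]))
    if len(placed) < 2:
        return None
    (read_a, pos_a), (read_b, pos_b) = placed[0], placed[-1]
    return (pos_b - pos_a) - (read_b - read_a)


# Made-up toy genome: two exons separated by an intron.
exon1 = "ACGTTGCAAGGCTTAC"
intron = "GTAAGTCCCCTTTTCCCCTTTTCTCCAG"
exon2 = "TGGATCCGATAGCTAA"
genome = exon1 + intron + exon2

# A spliced read: the end of exon1 joined directly to the start of exon2.
read = exon1[4:] + exon2[:12]

print(implied_gap(read, genome, 8))   # gap implied by the spliced read
print(implied_gap(exon1, genome, 8))  # a read lying entirely inside exon1
```

In this toy case the middle segment of the spliced read straddles the junction and finds no exact match, while the first and last segments land in different islands. The extra genomic distance between them equals the intron length.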

As I said, I am far from a TopHat expert. If someone with a better understanding can chime in, that would be great. In the meantime, studying the manual is your (and my) best option.

**chenyao** · 08-22-2011, 04:15 PM

Thank you very much! That's pretty clear. I also have another question. For paired-end reads, what if one read maps to one exon and its mate maps to another exon? How is this kind of alignment defined: is it a proper pair or not?

**westerman** · 08-23-2011, 05:04 AM

Yes, the pairs would be correct. In fact, such information can be used to determine junctions. In other words, if the pairs are mapped to parts of the genome that are, say, 5 kb away from each other, but you know that the ends should be within 200 bases of each other, give or take 100 bases, then those pairs must be spanning a junction.

Once again, I am not a TopHat expert, nor do I know TopHat's internals, but I believe that TopHat uses the above reasoning as part of its junction-finding strategy.
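The distance argument can be written down as a small check. This is only a sketch of the reasoning, not how TopHat actually does it: the function name `pair_suggests_junction` and the coordinates are hypothetical, and the 200-base expected distance with a 100-base tolerance are simply the example numbers from this post.

```python
def pair_suggests_junction(left_start, right_end, expected_distance, tolerance):
    """Return True when a pair covers more genome than the fragment allows.

    left_start and right_end are the outer genomic coordinates of the two
    mates. If the genomic span is larger than the expected distance plus the
    tolerance, the extra distance is presumably an intron that was spliced
    out of the sequenced fragment.
    """
    genomic_span = right_end - left_start
    return genomic_span > expected_distance + tolerance


# Mates about 5 kb apart on the genome, expected 200 +/- 100 bases.
print(pair_suggests_junction(1000, 6000, 200, 100))

# Mates 250 bases apart: consistent with an unspliced fragment.
print(pair_suggests_junction(1000, 1250, 200, 100))
```

The first pair is flagged because 5000 bases is far beyond 300, while the second is not.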

**chenyao** · 08-23-2011, 05:43 AM

Thank you. One thing that confuses me is the definition of a proper pair. As I understand it, it is a read pair whose mates align within the defined distance (I believe that is the fragment size). However, if the mates align to two different exons, the distance between them also includes the intron length, so it must be much larger than the predefined fragment size. How does TopHat identify such reads as a proper pair?

**westerman** · 08-23-2011, 09:50 AM

Does TopHat use the term 'proper pair' anywhere? If so, could you please give a reference to its use?

In samtools there is "proper pair". If that is what you are talking about, then I suspect that TopHat marks reads as "proper pair" in the BAM file if the pairs do indeed span a junction. That is, pairs that contribute to a junction call are good and thus "proper".

As far as I know, there is no single definition of a "proper pair" in BAM/SAM. A pair is "proper" if the program that produces the BAM/SAM file deems it proper.

Once again I put my normal disclaimers about not being a Tophat expert.
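For reference, the "proper pair" status in SAM/BAM is a single bit in the FLAG field: 0x1 means the read is paired and 0x2 means the aligner considers the pair properly aligned. A minimal Python sketch for checking it is below; the helper name `is_proper_pair` is made up, and the two flag values are just common examples. It shows how the bit is read, not how any aligner decides to set it.

```python
PAIRED = 0x1       # read is part of a pair
PROPER_PAIR = 0x2  # aligner considers the pair properly aligned


def is_proper_pair(flag):
    """Return True if a SAM FLAG value marks a paired read as a proper pair."""
    return bool(flag & PAIRED) and bool(flag & PROPER_PAIR)


# 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair, mate reverse, first in pair.
print(is_proper_pair(99))

# 97 = 0x1 + 0x20 + 0x40: same read, but without the proper-pair bit.
print(is_proper_pair(97))
```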

Naive question about read mapping, where is intron in genome.fa data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News