  • Naive question about read mapping: where are the introns in genome.fa data?

    Dear all.

    I have a question. I found that the genome FASTA data contains only sequence ("ATCGG..."). How does mapping software, such as TopHat, decide where the introns and exons are?

  • #2
    Any help????

    • #3
      Genome FASTA files are generally not annotated with which sequence is intron and which is exon. You need a separate file that says where the introns and exons are, such as a GFF.
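
To make the role of the annotation file concrete, here is a minimal sketch of reading exon coordinates from GFF-style lines and deriving the introns as the gaps between consecutive exons. The records, the `example` source, and the `tx1` transcript ID are made up for illustration; a real GFF has many features and transcripts, which this sketch does not handle.

```python
# Hypothetical GFF3 excerpt: coordinates and IDs are invented for illustration.
# GFF columns: seqid, source, type, start, end, score, strand, phase, attributes.
GFF_TEXT = """\
chr1\texample\texon\t101\t200\t.\t+\t.\tParent=tx1
chr1\texample\texon\t501\t650\t.\t+\t.\tParent=tx1
chr1\texample\texon\t901\t1000\t.\t+\t.\tParent=tx1
"""


def exons_from_gff(text):
    """Return sorted (start, end) exon intervals, 1-based inclusive as in GFF."""
    exons = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if fields[2] == "exon":
            exons.append((int(fields[3]), int(fields[4])))
    return sorted(exons)


def introns_between(exons):
    """Introns are the gaps between consecutive exons of one transcript."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


exons = exons_from_gff(GFF_TEXT)
introns = introns_between(exons)
print("exons:  ", exons)
print("introns:", introns)
```

The genome file itself never enters into this: the intron and exon boundaries come entirely from the annotation.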

      • #4
        I agree with swbarnes2 that genome FASTA files are not annotated and that you generally need a GFF file for proper, verified splice sites. But since you asked about TopHat, and presumably de novo detection of junctions, I quote from the TopHat manual:

        TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against this junction to confirm them.
        Look at the manual for more help.

        • #5
          Originally posted by westerman
          Look at the manual for more help.
          http://tophat.cbcb.umd.edu/manual.html
          Thank you. That's exactly my question. Why can reads that align contiguously to the genome define an exon? How is "contiguous" defined?

          • #6
            I believe that you have the meaning of the word 'contiguous' correct -- that is, the reads have to match the genome exactly.

            As I said, look at the manual for more help. The part I quoted was just a small introduction to how TopHat works. I am far from being a TopHat expert, but basically the idea is to:

            a) Split up the reads into small segments ... say 40 bases.

            b) Align these segments contiguously (i.e., exactly) to the genome; many will align, but many will not because they span junctions.

            c) Where many reads align, consider this an 'island' that represents correct alignments. An island will not contain a junction, because otherwise the segment would not align.

            d) Stitch these islands together to cover junctions. The strongest evidence of a junction is a read that has two different segments in two different islands. In other words, the only way a read could be in two islands is if the read spans a junction. There are other avenues of evidence as well (e.g., you can slowly build out from each island by adding parts of non-island reads to the island until a junction border is reached).

            As I said, I am far from a TopHat expert. If someone with a better understanding can chime in, that would be great. In the meantime, studying the manual is your (and my) best option.
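
A toy sketch of steps a) to d) may help. This is not TopHat's actual code; it only illustrates the idea with Python's exact string search. The sequences, the segment size of 10, and all function names are made up, and the sketch assumes the junction falls exactly on a segment boundary, which a real aligner cannot assume.

```python
# Invented sequences: two exons separated by one intron.
EXON1 = "ACGTTGCAAGCTTAGGCTAC"
INTRON = "GTAAGT" + "C" * 10 + "T" * 10 + "CAG"
EXON2 = "TGGATCCGATTCGAACGTCA"
GENOME = EXON1 + INTRON + EXON2

# A read from the spliced transcript: end of exon 1 joined to start of exon 2.
READ = EXON1[10:] + EXON2[:10]


def split_into_segments(read, size):
    """Step a): cut the read into consecutive fixed-size segments."""
    return [read[i:i + size] for i in range(0, len(read), size)]


def find_junctions(genome, read, size):
    """Return implied introns as 0-based half-open (start, end) intervals.

    Steps b) to d) in miniature: align each segment exactly, then treat a
    gap between adjacent segments of the same read as a candidate junction.
    """
    if genome.find(read) != -1:
        return []  # the whole read aligns contiguously, so no junction
    segments = split_into_segments(read, size)
    positions = [genome.find(seg) for seg in segments]
    junctions = []
    for i in range(len(segments) - 1):
        pos, next_pos = positions[i], positions[i + 1]
        if pos == -1 or next_pos == -1:
            continue  # a segment that does not align gives no evidence here
        seg_end = pos + len(segments[i])
        if next_pos > seg_end:
            junctions.append((seg_end, next_pos))
    return junctions


junctions = find_junctions(GENOME, READ, 10)
print(junctions)
for start, end in junctions:
    print("implied intron:", GENOME[start:end])
```

The full read does not occur anywhere in the genome, but its two halves do, and the stretch of genome between them is the inferred intron.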

            • #7
              Thank you very much! It's pretty clear. I also have a question. For paired-end reads, what if one read maps to one exon and its mate maps to another exon? How is this kind of alignment defined: is it a properly paired mapping or not?

              • #8
                Yes, the pairs would be correct. In fact, such information can be used to determine junctions. In other words, if the mates are mapped to parts of the genome that are, say, 5 kb away from each other, but you know that the ends should be within 200 bases of each other, give or take 100 bases, then those pairs must be spanning a junction.

                Once again, I am not a TopHat expert, nor do I know the internals of TopHat, but I believe that TopHat uses the above reasoning as part of its junction-finding strategy.
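
The arithmetic behind that reasoning can be sketched as follows. The function names and coordinates are hypothetical, and the 200 and 100 base defaults are simply the example figures from this post, not TopHat settings.

```python
def genomic_span(left_start, right_end):
    """Distance on the genome from the start of the left mate to the end of the right mate."""
    return right_end - left_start


def likely_spans_intron(left_start, right_end, expected=200, tolerance=100):
    """True if the pair is farther apart on the genome than the fragment size allows."""
    return genomic_span(left_start, right_end) > expected + tolerance


def implied_gap(left_start, right_end, expected=200):
    """Rough estimate of the intronic sequence skipped between the two mates."""
    return max(0, genomic_span(left_start, right_end) - expected)


# Mates about 5 kb apart on the genome, from a fragment of roughly 200 bases.
print(likely_spans_intron(1000, 6000))  # far beyond 200 + 100
print(implied_gap(1000, 6000))          # about 4800 bases of intron

# Mates 220 bases apart: consistent with an unspliced fragment.
print(likely_spans_intron(1000, 1220))
```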

                • #9
                  Thank you. One thing that confuses me is the definition of a proper pair. It is a read pair whose mates align within the defined distance (I believe that is the fragment size). However, if the two mates align to two different exons, their distance on the genome is the fragment size plus the intron length, so it must be much larger than the predefined fragment size. How does TopHat identify such reads as a proper pair?

                  • #10
                    Does TopHat use the term 'proper pair' anywhere? If so, could you please give a reference to its use?

                    In samtools there is a "proper pair" flag. If that is what you mean, then I suspect that TopHat marks reads as "proper pair" in the BAM file if the pairs do indeed span a junction. That is, pairs that contribute to a junction call are good and thus "proper".

                    As far as I know, there is no single definition of a "proper pair" in BAM/SAM. A pair is "proper" if the program that writes the BAM/SAM file deems it proper.

                    Once again, my usual disclaimer about not being a TopHat expert applies.
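
For reference, in SAM/BAM the "proper pair" status is just bit 0x2 of the FLAG field, set by whichever aligner wrote the file. Here is a small sketch of checking it; the function name is mine, and the example flag values are common ones chosen for illustration rather than taken from any TopHat output.

```python
# Bits of the SAM FLAG field that matter here.
FLAG_PAIRED = 0x1        # read is part of a pair
FLAG_PROPER_PAIR = 0x2   # aligner considers the pair properly aligned


def is_proper_pair(flag):
    """True if the aligner marked this paired read as properly aligned."""
    return bool(flag & FLAG_PAIRED) and bool(flag & FLAG_PROPER_PAIR)


# 99 = 1 + 2 + 32 + 64: paired, proper pair, mate reversed, first in pair.
print(is_proper_pair(99))
# 97 = 1 + 32 + 64: the same read without the proper-pair bit.
print(is_proper_pair(97))
```

With samtools, `samtools view -f 2` keeps only the reads that have this bit set.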
