
  • Inner distance value for TopHat / Proper mapping with RNA-Seq PE data

    Dear forum,

    I have been using RNA-seq for some time now and have a rather basic question:

    With paired-end data, I have to give TopHat the inner distance between the mates (parameter: -r). But the reference I align to is genomic DNA, while my RNA-seq reads come from cDNA (so the introns are missing), so the inner distance should be biased. The question came up when I tried to find the best value for -r by running several plausible values and checking each result with samtools flagstat. With one value, flagstat reported around 30% properly paired reads (seems bad); with another, 70% (rather good). Given that the genome contains introns that are absent from the reads, which value fits better, and is it even useful to set this parameter? OK, TopHat forces me to do so ;-) and it does take splice junctions into account. But the procedure is not clear to me, nor is whether the properly paired statistic is useful in this context.

    I'm happy about any advice!

    Best,
    Oliver

  • #2
    FWIW, I don't really think the "properly paired" statistic is meaningful in this context, because of the intron issue that you discuss. I assume the mate inner distance is used in a meaningful way inside TopHat, but I don't care much about the percent properly paired I get from e.g. samtools flagstat. After all, a lot of the transcriptome is spliced.
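To make the intron issue concrete, here is a minimal sketch with made-up numbers. The function name and the lengths are hypothetical, and it says nothing about how TopHat itself decides on proper pairing; it only shows why a gap measured in genomic coordinates can be far larger than the gap on the cDNA fragment.

```python
def genomic_gap(transcript_gap, spanned_intron_lengths):
    """Gap between two mates in genome coordinates.

    transcript_gap: inner distance on the cDNA fragment (bp).
    spanned_intron_lengths: lengths of introns lying between the mates (bp).
    """
    return transcript_gap + sum(spanned_intron_lengths)


# Both mates in the same exon: genome and transcript agree.
print(genomic_gap(200, []))      # 200

# Hypothetical 1500 bp intron between the mates.
print(genomic_gap(200, [1500]))  # 1700
```

A tool that judges pairs only by genomic distance would see the second pair as far too wide, even though the fragment itself is perfectly ordinary.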


    • #3
      kopl-o's suggestion to check the percentage of proper pairs will help you find out whether your analysis is OK. It does help to take the fragment size distribution from the library prep and use it to compute the mean inner distance. In the library prep that we used (and for one sample), we found that a small distance (20-100) gave fewer properly paired reads, while the percentage reached close to 90% for longer distances (>100).
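As a rough sketch of that calculation: the inner distance is the fragment length minus the two read lengths. The fragment sizes below are invented for illustration, and the sketch assumes they exclude adapters and that both mates have the same length.

```python
from statistics import mean


def mean_inner_distance(fragment_lengths, read_length):
    """Mean inner distance = mean fragment length - 2 * read length."""
    return mean(fragment_lengths) - 2 * read_length


# Hypothetical size distribution from the library prep, 2 x 50 bp reads.
fragments = [280, 300, 320]
print(mean_inner_distance(fragments, 50))  # 200
```

A negative result simply means the two mates overlap on the fragment.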


      • #4
        related question

        Hello all,

        I understood that in TopHat 1.3.3 there is no need to set the inner distance.
        I used data from a recently published paper, where all reads in the repository were marked as properly paired (flags 99 or 147).
        However, when I re-aligned them using TopHat 1.3.3, they mapped to the same positions (hg19) but were not marked as properly paired (they got the flags 129 or 65).
        My question is: why don't I get the same flags?

        For example:
        The paper's alignment:
        HWUSI-EAS371_0021:3:28:19038:18734#0/ 147 chrM 16261 255 60M = 16162 -159 CCCCTCACCCACTAGGATATCAACAAACCTACCCACCCTTAACAGTACATAGCACATAAA EEEE?EAEEEFFFGGGGGGBGGFGGECEEC<CEE@GGGGGGGGGGGGGGGGGGGGGGGGG NM:i:2 XS:A:+

        My re-alignment (using tophat 1.3.3):
        HWUSI-EAS371_0021:3:28:19038:18734#0 129 chrM 16261 255 60M = 16162 -159 CCCCTCACCCACTAGGATATCAACAAACCTACCCACCCTTAACAGTACATAGCACATAAA EEEE?EAEEEFFFGGGGGGBGGFGGECEEC<CEE@GGGGGGGGGGGGGGGGGGGGGGGGG NM:i:2 NH:i:1

        Thanks in advance,
        Oz Solomon
        Last edited by ozs2006; 11-27-2011, 07:37 AM. Reason: mistake


        • #5
          Interesting. I believe many things could potentially cause this. I am a bit unclear on how you did the "re-alignment". Did you use the whole data set again, or only the reads with flags 99 or 147? If the latter, did you make sure to include the mate sequences as well? I think the flags 83, 99, 147, and 163 together give all the properly paired reads (each pair twice, actually).
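For reference, the flags mentioned in this thread can be decoded with a few lines of Python. This is a small sketch covering only the bits relevant here; the bit values come from the SAM format, while the short names are my own labels.

```python
# SAM FLAG bits relevant to paired-end reads (names are informal labels).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for flag in (83, 99, 147, 163, 65, 129):
    print(flag, decode_flag(flag))
```

Flags 83, 99, 147 and 163 all carry the proper-pair bit plus one strand bit. Flags 65 and 129 carry neither: the pair is not marked proper, and both the read and its mate are reported on the forward strand.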


          • #6
            Thanks for the quick reply.
            As you noted, it is very strange, because all the publicly available reads are flagged as 99 and 147, and I used all of them.
            Last edited by ozs2006; 11-27-2011, 10:44 AM. Reason: spelling


            • #7
              I created the FASTQ files from the SAM files of the publication (using awk) and then ran TopHat.

              1. awk:

              awk '{if($2==99) print "@" $1 "\n" $10 "\n" "+\n" $11}' > sample_1.fq
              awk '{if($2==147) print "@" $1 "\n" $10 "\n" "+\n" $11}' > sample_2.fq

              2. The TopHat command I used:

              /tophat-1.3.3.Linux_x86_64/tophat -p 8 --min-anchor-length 15 --splice-mismatches 0 --keep-tmp --GTF /data/pipeline_in/Genomes/Human_GRCh37/Homo_sapiens.GRCh37.64.gtf /data/pipeline_in/Genomes/Human_GRCh37/Index/hg19 sample_1.fq sample_2.fq
              Last edited by ozs2006; 11-27-2011, 11:12 AM. Reason: changes
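One thing that may be worth checking here, offered as a possibility and not as something established in this thread: the SAM format stores the sequence of a reverse-strand alignment reverse-complemented, with its qualities reversed. Reads with flag 147 have the reverse bit set, so printing columns 10 and 11 directly does not give the read in its original orientation. A minimal Python sketch of a conversion that restores the orientation (the function names and the example record are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def sam_line_to_fastq(line):
    """Convert one tab-separated SAM alignment line to a FASTQ record.

    If the reverse-strand bit (0x10) is set, SEQ is reverse-complemented
    and QUAL is reversed to recover the read as it was sequenced.
    """
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:
        seq = reverse_complement(seq)
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Made-up record with flag 147 (reverse strand, second in pair).
record = "read1\t147\tchrM\t100\t255\t4M\t=\t50\t-54\tAACG\tIIHG"
print(sam_line_to_fastq(record), end="")
```

The sketch handles one record at a time and does not check that the two output files list the mates in the same order, which paired-end aligners generally expect.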
