  • Tophat command line options

    Dear All,

    I have been playing with tophat but am not certain which combination of command line options I should use. I hope people with more experience can help me with that. The main purpose is expression analysis, and my reads are single-end, 80 bp.

    1. Is it better or not to provide an exon junction database? If such a database is provided, will tophat only map reads against the exon junction database, or will it also try the regular mapping if it fails to find a match in the database? If the answer is the latter, does that mean it is always better to have such a junction database?

    2. If it is better to provide such a database, where and which file to download?

    3. For expression analysis, should I use -g 1 to allow only unique mappings, or use other values? The default is 40 and I have seen many reads reported multiple times in the SAM output file. In this case, a read will be counted multiple times for different genes, and this shouldn't be right.

    4. The mapping quality scores reported in the SAM file by tophat include a lot of low values (e.g. >20% of reads have a score below 20). Should I use some criterion to filter reads with bad mapping quality scores? What are sensible numbers to use?

    5. What other sensible parameters should we be paying attention to? So far I have only used --microexon_search.

    Many many thanks!
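
For readers who want to try the filtering raised in question 4, here is a minimal sketch in Python. It relies only on the fact that MAPQ is the fifth tab-separated column of a SAM alignment line. The function name and the example records are hypothetical, and the cutoff of 20 simply echoes the number in the question; it is not a recommendation, since how TopHat assigns MAPQ values may make such a cutoff more or less meaningful.

```python
def filter_sam_by_mapq(lines, min_mapq):
    """Keep SAM header lines and alignments whose MAPQ is at least min_mapq."""
    kept = []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # MAPQ is the fifth SAM column
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.0",
    "read1\t0\tchr1\t100\t255\t80M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t500\t3\t80M\t*\t0\t0\t*\t*",
]
filtered = filter_sam_by_mapq(example, min_mapq=20)
```

With the made-up records above, the header and read1 are kept and read2 is dropped.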

  • #2
    Originally posted by ice
    Dear All,

    1. Is it better or not to provide an exon junction database? If such a database is provided, will tophat only map reads against the exon junction database, or will it also try the regular mapping if it fails to find a match in the database? If the answer is the latter, does that mean it is always better to have such a junction database?
    If you don't specifically turn off novel junction discovery (that's the --no-novel-juncs option), it should find those too.

    2. If it is better to provide such a database, where and which file to download?
    What species are you working with? If it's something relatively obscure, you may have to make your own junction file.

    3. For expression analysis, should I use -g 1 to allow only unique mappings, or use other values? The default is 40 and I have seen many reads reported multiple times in the SAM output file. In this case, a read will be counted multiple times for different genes, and this shouldn't be right.
    I would keep the multireads in; you will undercount members of gene families with high sequence similarity if you look at unique reads only. Of course, there's the processed pseudogene problem too, but that's a somewhat different matter.
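
The undercounting described above can be illustrated with a small sketch. It compares unique-only counting with fractional weighting, where a read mapping to n genes contributes 1/n to each. Fractional weighting is just one simple scheme chosen for illustration; it is not something the thread or TopHat prescribes, and the read and gene names are invented.

```python
from collections import defaultdict


def count_unique_only(read_to_genes):
    """Count only reads that map to exactly one gene."""
    counts = defaultdict(float)
    for genes in read_to_genes.values():
        if len(genes) == 1:
            counts[genes[0]] += 1.0
    return dict(counts)


def count_fractional(read_to_genes):
    """Give each read a total weight of 1, split evenly across its genes.

    Assumes every read maps to at least one gene.
    """
    counts = defaultdict(float)
    for genes in read_to_genes.values():
        weight = 1.0 / len(genes)
        for gene in genes:
            counts[gene] += weight
    return dict(counts)


# Invented example: geneA and geneB stand in for two similar family members.
reads = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneA", "geneB"],
    "r4": ["geneC"],
}
unique_counts = count_unique_only(reads)
fractional_counts = count_fractional(reads)
```

In this toy case geneB gets no count at all under unique-only counting, while fractional weighting keeps the total equal to the number of reads, so no read is counted more than once.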



    • #3
      Hi,

      I wanted to post a new thread, but can't find the "New Thread" button anywhere.

      My question is:

      Can Tophat handle csfasta files, which are in color space format? I think there is no command line option for this.
      To work with SOLiD csfasta files, do I first have to convert them into sequence space format?

      Could anyone also tell me how to post a new thread?

      Thanks,



      • #4


        Currently, TopHat does not allow short (fewer than a few nucleotides) insertions and deletions in the alignments it reports. Support for insertions and deletions will eventually be added. TopHat also does not natively support Applied Biosystems' Colorspace format.



        • #5
          Originally posted by sridharacharya
          Hi,

          I wanted to post a new thread, but can't find the "New Thread" button anywhere.

          My question is:

          Can Tophat handle csfasta files, which are in color space format? I think there is no command line option for this.
          To work with SOLiD csfasta files, do I first have to convert them into sequence space format?

          Could anyone also tell me how to post a new thread?

          Thanks,
          See screenshot!



          • #6
            Originally posted by GKM
            If you don't specifically turn off novel junction discovery (that's the --no-novel-juncs option), it should find those too.
            Then it is clear that providing the junction database is a win.


            Originally posted by GKM
            What species are you working with? If it's something relatively obscure, you may have to make your own junction file.
            My data is human RNA data. I can download the GTF file from UCSC and convert it to GFF3 format, which tophat can take. Is that the right choice?

            Originally posted by GKM
            I would keep the multireads in; you will undercount members of gene families with high sequence similarity if you look at unique reads only. Of course, there's the processed pseudogene problem too, but that's a somewhat different matter.
            This is very interesting. Actually, I am puzzled by the -g option. Suppose a read maps to 50 places on the genome and I specify -g 20. Will tophat (1) report only the first 20 alignments of this read, or (2) not report this read at all, since 50 is greater than 20? This is not clear from the manual. If (1) is correct, then specifying a small number is better, since multiple hits will over-count the expression. However, if (2) is right, a lot of reads will be missed due to similarities among genes in a gene family.



            • #7
              Originally posted by ice
              Then it is clear that providing the junction database is a win.

              My data is human RNA data. I can download the GTF file from UCSC and convert it to GFF3 format, which tophat can take. Is that the right choice?
              It depends. UCSC isn't the absolutely most comprehensive annotation, but I don't know how much you care about that. You can definitely use it. GFF3 will work, as will the simple junction format that you can supply to TopHat with the -j option (chr - tab - left - tab - right - tab - strand).
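
As a rough illustration of the simple junction format mentioned above, the sketch below turns the exons of one transcript into tab-separated chr, left, right, strand lines. The function name and the coordinates are hypothetical. The coordinate convention used here is an assumption: exons are taken as 1-based inclusive (as in GTF), and left and right are written as the 0-based positions of the last base of the upstream exon and the first base of the downstream exon. Check this against the TopHat manual before using the output.

```python
def exons_to_junction_lines(chrom, strand, exons):
    """Build 'chr<TAB>left<TAB>right<TAB>strand' lines from one transcript's exons.

    exons: list of (start, end) tuples, 1-based inclusive (GTF-style).
    The 0-based left/right convention below is an assumption to verify.
    """
    ordered = sorted(exons)
    lines = []
    # Pair each exon with the next one; each pair defines one junction.
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        left = left_end - 1      # last base of the upstream exon, 0-based
        right = right_start - 1  # first base of the downstream exon, 0-based
        lines.append(f"{chrom}\t{left}\t{right}\t{strand}")
    return lines


# Invented three-exon transcript.
junction_lines = exons_to_junction_lines(
    "chr1", "+", [(100, 200), (300, 400), (500, 600)]
)
```

A transcript with three exons yields two junction lines; a single-exon transcript yields none.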


              This is very interesting. Actually, I am puzzled by the -g option. Suppose a read maps to 50 places on the genome and I specify -g 20. Will tophat (1) report only the first 20 alignments of this read, or (2) not report this read at all, since 50 is greater than 20? This is not clear from the manual. If (1) is correct, then specifying a small number is better, since multiple hits will over-count the expression. However, if (2) is right, a lot of reads will be missed due to similarities among genes in a gene family.
              The -g option should be equivalent to the -m option in bowtie, i.e. if a read maps to more than N locations in the genome, it is discarded.
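
The two readings of -g discussed above can be sketched as follows. This is a toy model of the semantics only, not TopHat's code; the function and its discard_over_limit parameter are invented for illustration. The default branch models the behaviour described in the answer (reads over the limit are dropped), and the other branch models interpretation (1) from the question (keep the first N alignments).

```python
def apply_max_multihits(read_to_hits, max_hits, discard_over_limit=True):
    """Apply a multihit limit to a mapping of read name -> list of alignments.

    discard_over_limit=True:  reads with more than max_hits alignments are dropped.
    discard_over_limit=False: such reads are kept but truncated to max_hits alignments.
    """
    result = {}
    for read, hits in read_to_hits.items():
        if len(hits) <= max_hits:
            result[read] = list(hits)
        elif not discard_over_limit:
            result[read] = list(hits[:max_hits])
    return result


# readA has 50 placements, readB has 3 (integers stand in for alignments).
hits = {"readA": list(range(50)), "readB": list(range(3))}
discarded = apply_max_multihits(hits, max_hits=20)
truncated = apply_max_multihits(hits, max_hits=20, discard_over_limit=False)
```

Under the discard reading, readA disappears entirely with a limit of 20, which is why a very small limit loses reads from gene families rather than merely trimming their alignments.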
