  • Mapping to a transcriptome

    I'm working with some RNA-seq data (4 samples, 2 lanes on an Illumina HiSeq 2000, ~600 million 100 bp PE reads) from a vertebrate without a sequenced genome. I initially wanted both to assemble an accurate de novo transcriptome from my combined reads and to calculate differential expression for each transcript between the 4 samples. I've done the de novo assembly with Trinity, which looks great and is BLASTing well. However, now I am trying to map the reads back to this transcriptome using the Tuxedo suite so I can calculate accurate FPKM and fold changes.
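For readers less familiar with the quantities mentioned here, a minimal sketch of the standard FPKM definition and a log2 fold change. The function names and the pseudocount are illustrative assumptions, not part of any Tuxedo tool; Cufflinks and Cuffdiff estimate these values with their own statistical models.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(fpkm_a, fpkm_b, pseudocount=1.0):
    """log2 ratio of sample B over sample A.

    The pseudocount is an assumption made here to avoid dividing by zero.
    """
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))


# 500 fragments on a 2,000 bp transcript in a library of 10 million fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
print(example_fpkm)
```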

    Problem 1: Mapping to the transcriptome, which has ~450k transcripts, generates BAM/SAM files with massive headers that are difficult to manage and will not work with Cufflinks/Cuffcompare/Cuffdiff at all. I cannot modify the source to allow larger headers because I am running everything on servers with precompiled binaries that I cannot touch or replace.
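To put a rough number on the header problem: in the SAM format every reference sequence gets its own @SQ header line, so one line per transcript adds up quickly. The sketch below builds such a header for 450,000 made-up transcripts. The names and the uniform 1,000 bp length are placeholders, not taken from the poster's assembly.

```python
def sam_sq_header(references):
    """Build the @SQ portion of a SAM header from (name, length) pairs."""
    return "".join(f"@SQ\tSN:{name}\tLN:{length}\n" for name, length in references)


# Hypothetical transcript names and lengths, for illustration only.
per_transcript = [(f"comp{i}_c0_seq1", 1000) for i in range(450_000)]
header = sam_sq_header(per_transcript)

n_lines = header.count("\n")
size_mb = len(header) / 1e6
print(f"{n_lines} @SQ lines, about {size_mb:.1f} MB of header text")
```

Mapping against a few dozen scaffolds instead reduces the header to a few dozen @SQ lines, which is why the scaffolding workaround described next keeps the files manageable.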

    Problem 2: I've artificially scaffolded my transcripts into "scaffolds" of 20k transcripts each, with strings of 10 N's between transcripts. This gives me 21 scaffolds that I can map to and generates BAMs of workable size that go through Cufflinks without issue. What I can't do is prevent Tophat from integrating the N's into the mapped "exons". I've tried limiting the intron size, which prevents transcripts from spanning the N's, but Tophat will still map across them and integrate them into the exons.

    Does anyone have a solution that can prevent Tophat from mapping across these patches of N's? Or another way to limit my header size and still allow Cufflinks/Cuffcompare/Cuffdiff to run?

  • #2
    Hi Mike,
    I am currently trying to do the same thing (use a de novo transcriptome as a reference for Tophat and cufflinks). I am not sure about your 'N' problem.

    With your header issue, have you tried converting your BAM files to SAM files? I believe that Cufflinks does not have a size limitation for SAM headers. I tried that approach and it seemed to work the same as when I changed the Cufflinks max header size for BAM files (which you said you can't do).

    I have found some issues though with my cufflinks/cuffdiff output. It seems like some sequences (present in the reference and with good tophat coverage) do not get assembled by cufflinks. Thus, confirmed sequences are missing in downstream analysis (see: http://seqanswers.com/forums/showthread.php?t=17005).

    If you find the same is true in your analysis, please let us know. I asked about this via the Cufflinks technical help email, and the response was basically that Cufflinks wasn't designed to do this, so they are not sure what's going wrong or how to fix it.

    I don't really understand why it wouldn't work, but something appears to be amiss.

    Comment


    • #3
      An interesting question and one that I have not done myself, at least with Trinity/Tophat/Cufflinks. I do wonder why you have so many transcripts? 450K seems like a lot for a true transcriptome. Granted Trinity tends to produce a lot of contigs but, still, this seems to be 3-4 times greater than I would expect.

      The Trinity FAQ has some suggested downstream analyses which do not include Tophat/Cufflinks. I suspect that Tophat is not the tool you should use since, as far as I know, Tophat is meant for genome analysis. However, if you do insist on using Tophat, then I would separate the transcripts in your scaffolds by at least a read length (in your case, over 100 Ns). I am not positive that Tophat will need this separation, but other tools that I have used for this type of analysis get confused if they can map a read across a short interval of Ns.



      In any case, good luck. It is always interesting working with unknown species.

      Comment


      • #4
        I see that tboothby responded while I was writing my response. I wish to quote what I consider to be tboothby's pertinent point ...

        Originally posted by tboothby
        ... the response was basically that cufflinks wasn't designed to do this...
        I am not sure if the response from the Cufflinks group was meant to say that Cufflinks is not designed to handle de novo transcriptome mapping, or perhaps that it is not designed for the specific problem that tboothby found. In any case, I do suggest that you explore other tools aside from Tophat/Cufflinks.

        Comment


        • #5
          @Mike
          As Westerman points out, the Trinity website now has some basic procedures for aligning read fragments to a Trinity transcriptome.

          They suggest using bowtie to align read fragments and then using the BAM output for quantification with RSEM. I have tried this approach and it seems to work pretty well for quantifying expression.

          Question:
          I like the ability to align read fragments to a Trinity transcriptome, but can anyone suggest software for getting actual transcripts from those mapped reads?

          Comment


          • #6
            Update:

            I redid the scaffolding so that my patches of N's were 120bp long (20bp longer than my reads) instead of 10 and it seems to have mapped great. I'm still sifting through the results to make sure it didn't introduce any unanticipated problems, but so far it looks the same as if I had mapped to a genome (minus introns).
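A minimal sketch of the scaffolding idea described above: transcripts are joined with N spacers longer than the read length, and the position of each transcript is recorded so results can be traced back later. The function names, the coordinate table, and the lookup helper are assumptions for illustration; the poster's actual script was not shared.

```python
import bisect


def build_scaffold(transcripts, spacer_len=120):
    """Join (name, sequence) pairs with runs of N between them.

    Returns the scaffold sequence and a list of (name, start, end) spans,
    0-based and half-open, giving where each transcript landed.
    """
    spacer = "N" * spacer_len
    parts, spans, pos = [], [], 0
    for i, (name, seq) in enumerate(transcripts):
        if i > 0:
            parts.append(spacer)
            pos += spacer_len
        parts.append(seq)
        spans.append((name, pos, pos + len(seq)))
        pos += len(seq)
    return "".join(parts), spans


def chunks(items, size=20_000):
    """Yield successive groups of at most `size` transcripts, one per scaffold."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def transcript_at(spans, position):
    """Return the transcript covering a scaffold position, or None in a spacer."""
    starts = [start for _, start, _ in spans]
    i = bisect.bisect_right(starts, position) - 1
    if i < 0:
        return None
    name, _, end = spans[i]
    return name if position < end else None


# Toy example with a short spacer so the result is easy to read.
scaffold, spans = build_scaffold([("a", "ACGT"), ("b", "GG")], spacer_len=3)
print(scaffold, spans)
```

Keeping the span table next to each scaffold makes it straightforward to check whether any reported "exon" overlaps a spacer, which is the failure described in the first post.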

            Thanks for pointing me toward the downstream tools for working with Trinity output, they look great and I'm working on implementing them as a comparison to the tuxedo results.

            Comment


            • #7
              Mike,
              Have you tried using cufflinks to assemble transcripts with your mapped reads yet?

              If so, do you see any instances where transcripts from your de novo transcriptome have good mapping coverage but are not assembled by cufflinks?

              Comment


              • #8
                tbooth,

                I have just looked into this and you're right. There are things with very good coverage that are completely missing from the cufflinks output. It almost seems like the transcripts with the best coverage may even be excluded.

                Any ideas on why this is happening?

                Comment


                • #9
                  My initial thought is that abundant transcripts are generating a lot of sequence reads (obviously) and that the de novo assembler is making many (potentially erroneous) isoforms for those transcripts.

                  The reads are being mapped between multiple isoforms (or maybe other transcripts with similar conserved domains) and this is leading to good coverage but bad cufflinks assembly. Cufflinks splits 'counts' for mapped reads between multi-mapped transcripts.
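A simplified illustration of the splitting mentioned above, assuming each multi-mapped read is divided evenly among the transcripts it hits. This is a toy model with made-up read and isoform names, not Cufflinks' actual implementation.

```python
from collections import defaultdict


def split_multireads(read_hits):
    """Distribute each read evenly across the distinct transcripts it maps to.

    read_hits maps a read ID to the list of transcript IDs it aligned to.
    """
    counts = defaultdict(float)
    for hits in read_hits.values():
        targets = set(hits)
        if not targets:
            continue
        share = 1.0 / len(targets)
        for transcript in targets:
            counts[transcript] += share
    return dict(counts)


# Hypothetical reads: two of the three map equally well to both isoforms.
hits = {
    "read1": ["iso1"],
    "read2": ["iso1", "iso2"],
    "read3": ["iso1", "iso2"],
}
counts = split_multireads(hits)
print(counts)
```

With many near-identical isoforms per gene, each isoform receives only a fraction of every shared read, so a locus can look well covered in a genome browser while each individual isoform ends up with little support.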

                  We are looking into ways of compressing these isoforms into unigenes. We will test to see if this helps reduce the number of multi-mapped reads and helps with cufflinks assembly.
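One simple way to collapse isoforms, sketched under two assumptions: that transcript IDs follow a component-plus-isoform pattern such as comp12_c0_seq3 (an assumed naming scheme; check the IDs in your own assembly), and that keeping the longest isoform per component is an acceptable stand-in for a unigene. This is only one possible heuristic and not necessarily the approach the poster went on to test.

```python
import re

# Assumed ID layout: <component>_seq<N>; adjust the pattern to your assembly.
_ISOFORM_SUFFIX = re.compile(r"_seq\d+$")


def component_id(transcript_id):
    """Strip the isoform suffix to get the component (gene-like) ID."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_component(transcripts):
    """Map each component ID to the (transcript_id, sequence) of its longest isoform."""
    best = {}
    for tid, seq in transcripts.items():
        comp = component_id(tid)
        if comp not in best or len(seq) > len(best[comp][1]):
            best[comp] = (tid, seq)
    return best


# Hypothetical sequences for illustration.
toy = {
    "comp1_c0_seq1": "ACGT",
    "comp1_c0_seq2": "ACGTAC",
    "comp2_c0_seq1": "GG",
}
unigenes = longest_isoform_per_component(toy)
print(unigenes)
```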

                  If you have other ideas about how/why this is happening or how to fix/work around it feel free to share.

                  Comment


                  • #10
                    can I skip Tophat???

                    I want to quantify Refseq RNA based on RNAseq data and I am using bowtie-tophat-cufflinks algorithm for this. I have a doubt regarding the necessity of tophat.

                    If I have an index of the transcriptome (human RefSeq), can I skip tophat? (I don't intend to discover new transcripts.)

                    There is no problem of exon junctions because I am mapping it to the transcriptome. I save time by skipping two steps: tophat and getting annotations (if i were to align against genome). Also, genome index is a bigger file.

                    I generate a sam alignment file from bowtie and pass it to cufflinks.

                    I am curious whether this can be done or not. Most people I see use tophat nonetheless. Is it just a habit or a necessity?

                    Comment


                    • #11
                      Originally posted by bharat_iyengar
                      I want to quantify Refseq RNA based on RNAseq data and I am using bowtie-tophat-cufflinks algorithm for this. I have a doubt regarding the necessity of tophat.

                      ...

                      I am curious whether this can be done or not. Most people I see use tophat nonetheless. Is it just a habit or a necessity?
                      Most people use TopHat because that is the right tool for the job. When provided with a genome reference and annotation file, TopHat will first align full reads to transcripts and then split reads to the genome. You can tell TopHat not to search for new exons/junctions if you are not interested in that.

                      Many very smart people have thought long and hard about the best ways to properly analyze RNA-Seq data, and the overwhelming consensus is that if you have a reference genome, especially in a model organism, you should align to the full genome with the annotation provided.

                      Comment


                      • #12
                        RUM + htseq-count + samseq (if you have many biological replicates) else limma

                        in my hands RUM + htseq-count + samseq (samR) gave the best results (most spliced reads mapped and most significant called genes)...

                        I compared to tophat or STAR + htseq-count + DESeq, edgeR, BaySeq, NoiSeq, limma and the tophat-cuffdiff pipeline.

                        even the new pipeline: map only against the transcriptome (bowtie, allow for unlimited multi-mappings) and use eXpress followed by all statistical methods mentioned above was not so good...

                        dietmar

                        Comment


                        • #13
                          Originally posted by kmcarr
                          Most people use TopHat because that is the right tool for the job. When provided with a genome reference and annotation file, TopHat will first align full reads to transcripts and then split reads to the genome. You can tell TopHat not to search for new exons/junctions if you are not interested in that.

                          Many very smart people have thought long and hard about the best ways to properly analyze RNA-Seq data, and the overwhelming consensus is that if you have a reference genome, especially in a model organism, you should align to the full genome with the annotation provided.
                          Understood. But I would like to know the reasoning behind that consensus.

                          Why is it logically better to map the reads to genome and provide annotations rather than mapping to an already annotated transcriptome index?

                          Comment


                          • #14
                            one answer:

                            because the real transcriptome is much more complex (alternative splicing, exon skipping, exclusive exons, intron retention, alternative 5' and 3' splice sites, alternative TSS and poly-A sites, ...) than the annotated transcriptome, and it is tissue/disease specific... not to mention new lncRNAs, small regulatory RNAs and other transcripts

                            Comment


                            • #15
                              Originally posted by dietmar13
                              because the transcriptome is much more complex (alternative splicing, exon skipping, exclusive exons, intron retention, alternative 5' and 3' splice sites, alternative tss and poly A-sites, ...) as the annotated transcriptome and tissue/disease specific... not to mention new lncRNA, small regulative RNAs and other transcripts
                              but it's all annotated.. I know what the variants are..


                              most of the times the input RNA for seq is poly-A fractionated.
                              most regulatory/intermediate guys are already lost..

                              the transcriptome index occupies less space than the genome too..

                              Comment
