Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by alexdobin View Post
    In the Log.final.out, STAR counts the total number of "splices" - you can get it by counting the total number of N-operations in CIGARs of all unique alignments. Since some spliced reads can have more than one splice, the number of splices is bigger than the number of spliced reads, which is output by RSeQC, I guess.
    Thanks Alex! Seems logical.

    On a side note, do you know what is the best approach to generate the unique mappability of all genomic positions? Since I discard multi-mappers and consider only uniquely mapped reads, I want to exclude the positions in the genome (transcriptome) that are inherently unmappable and have a duplicate(s) somewhere else (e.g., paralogs) for a given read length.

    This is not trivial, mainly because I need to artificially synthesize the reads myself, if they span only one junction it is OK, but if they span multiple junctions, or the second exon undergoes A5SS, then this becomes a serious headache. Have you heard of any tools for this, or a readily-available map?

    GH

    PS. This is a MUCH harder task with Bowtie2 (or any splicing-ignorant mapper). STAR could map the read to any non-contiguous region of the genome, while one needs to make a library of all junctions with various exon arrangements for Bowtie2, but still forming the reads is not easy.

    Comment


    • #17
      Originally posted by dietmar13 View Post
      comparing different mapper with 3 mio reads 70 bases PE RNA-seq data:

      RUM (189 k) >> STAR (127 k) > tophat (124 k) > subread/subjunc (99 k) spliced reads mapped.

      RUM provides several output files for these spliced reads / junctions...
      dietmar13,

      Did you happen to evaluate GEM as well? It occurs to me that GEM and STAR are the two emerging mappers and will be very powerful contenders in near future. I have had very good experience with STAR, however, GEM seems to be not working very well. I have tried different parameter settings, but still not good.

      GH

      Comment


      • #18
        Originally posted by genomeHunter View Post
        Thanks Alex! Seems logical.

        On a side note, do you know what is the best approach to generate the unique mappability of all genomic positions? Since I discard multi-mappers and consider only uniquely mapped reads, I want to exclude the positions in the genome (transcriptome) that are inherently unmappable and have a duplicate(s) somewhere else (e.g., paralogs) for a given read length.

        This is not trivial, mainly because I need to artificially synthesize the reads myself, if they span only one junction it is OK, but if they span multiple junctions, or the second exon undergoes A5SS, then this becomes a serious headache. Have you heard of any tools for this, or a readily-available map?

        GH

        PS. This is a MUCH harder task with Bowtie2 (or any splicing-ignorant mapper). STAR could map the read to any non-contiguous region of the genome, while one needs to make a library of all junctions with various exon arrangements for Bowtie2, but still forming the reads is not easy.
        This is indeed a very non-trivial task for spliced reads, and I am not aware of any tool that can do it correctly, even though a number of attempts were made (here is a recent one). I could imagine that you are doing it for the transcriptome like this: generating all possible reads from the transcriptome, and mapping them to the genome with splice-enable aligner to determine whether they map uniquely or not. This should quite doable for single-end reads: e.g., Gencode 14 transcriptome contains 280M bases, i.e. 280M distinct start loci, so we would need to map 280M reads to the genome.

        For paired-end reads, however, we would have to consider all possible insert sizes in a range of the real libraries, which would increase the number of reads by a factor of several hundred - this still should be possible with STAR. Then the uniqueness for each position becomes a probabilistic quantity, depending on the actual insert size distribution in the libraries. I am planning to implement an algorithm like this for estimating FPKM within STAR.

        Comment


        • #19
          It makes sense to use uniquely mapped reads to assess alternative methods. What we did was that we remove repetitive 80bp sequences from the human genome and then created 100bp reads from the modified human genome (http://www.ncbi.nlm.nih.gov/pubmed/23558742).

          Cheers,
          Wei

          Comment


          • #20
            Final comparison RUM and STAR

            paired end 75 bp RNA-seq of clinical samples.

            Code:
            STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)
            Code:
            RUM 2.0.4 index generated with gencode14 annotation
            both: with standard parameters.

            Results (see below):
            57 more uniquely mapped reads for STAR
            3.4 more spliced reads for STAR

            --- B - U - T ---
            after counting (-unique) with htseq-count:
            RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
            especially:
            protein_coding RUM: 9357 STAR: 8962
            pseudogene RUM: 843 STAR: 94
            lincRNA RUM: 84 STAR: 64
            (see bottom of this posting)

            RUM (RSeQC):
            Total Records: 11685516

            QC failed: 0
            Optical/PCR duplicate: 0
            Non Primary Hits 0
            Unmapped reads: 2632714
            Multiple mapped reads: 7662018

            Uniquely mapped: 1390784
            Read-1: 696484
            Read-2: 694300
            Reads map to '+': 673096
            Reads map to '-': 717688
            Non-splice reads: 1204215
            Splice reads: 186569
            Reads mapped in proper pairs: 1200364
            STAR(RSeQC):
            Total Records: 2424478

            QC failed: 0
            Optical/PCR duplicate: 0
            Non Primary Hits 131750
            Unmapped reads: 0
            Multiple mapped reads: 110496

            Uniquely mapped: 2182232
            Read-1: 1091116
            Read-2: 1091116
            Reads map to '+': 1091116
            Reads map to '-': 1091116
            Non-splice reads: 1989333
            Splice reads: 192899
            Reads mapped in proper pairs: 2182232
            called genes (>= 3 reads per gene)
            RUM STAR
            3prime_overlapping_ncrna NA NA
            antisense 25 20
            IG_C_gene 10 8
            IG_C_pseudogene 2 NA
            IG_D_gene NA NA
            IG_J_gene 4 NA
            IG_J_pseudogene NA NA
            IG_V_gene 108 73
            IG_V_pseudogene 23 NA
            lincRNA 84 64
            miRNA 6 5
            misc_RNA 135 13
            Mt_rRNA 2 2
            Mt_tRNA 3 1
            non_coding NA NA
            polymorphic_pseudogene 4 3
            processed_transcript 40 32
            protein_coding 9357 8962
            pseudogene 843 94
            rRNA 3 2
            sense_intronic 1 1
            sense_overlapping NA NA
            snoRNA 12 10
            snRNA 36 5
            TR_C_gene 2 2
            TR_D_gene NA NA
            TR_J_gene 8 1
            TR_J_pseudogene NA NA
            TR_V_gene 36 29
            TR_V_pseudogene 3 1

            Comment


            • #21
              Originally posted by alexdobin View Post
              This is indeed a very non-trivial task for spliced reads, and I am not aware of any tool that can do it correctly, even though a number of attempts were made (here is a recent one). I could imagine that you are doing it for the transcriptome like this: generating all possible reads from the transcriptome, and mapping them to the genome with splice-enable aligner to determine whether they map uniquely or not. This should quite doable for single-end reads: e.g., Gencode 14 transcriptome contains 280M bases, i.e. 280M distinct start loci, so we would need to map 280M reads to the genome.

              For paired-end reads, however, we would have to consider all possible insert sizes in a range of the real libraries, which would increase the number of reads by a factor of several hundred - this still should be possible with STAR. Then the uniqueness for each position becomes a probabilistic quantity, depending on the actual insert size distribution in the libraries. I am planning to implement an algorithm like this for estimating FPKM within STAR.
              Thanks Alex!

              Wow...this seems to be prohibitively hard. Say fragments lengths are distributed from 180 to 220. Then, one needs to create 41 reads for each (valid) position in any of the transcripts (annotated and de novo), and that is for All read lengths (2x100bp, 2x50bp). There could be around 10 billion such reads...

              GH

              Comment


              • #22
                Originally posted by dietmar13 View Post
                paired end 75 bp RNA-seq of clinical samples.

                Code:
                STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)
                Code:
                RUM 2.0.4 index generated with gencode14 annotation
                both: with standard parameters.

                Results (see below):
                57 more uniquely mapped reads for STAR
                3.4 more spliced reads for STAR

                --- B - U - T ---
                after counting (-unique) with htseq-count:
                RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
                especially:
                protein_coding RUM: 9357 STAR: 8962
                pseudogene RUM: 843 STAR: 94
                lincRNA RUM: 84 STAR: 64
                (see bottom of this posting)

                RUM (RSeQC):


                STAR(RSeQC):


                called genes (>= 3 reads per gene)

                Have you got a chance to look at Subjunc?

                Cheers,
                Wei

                Comment


                • #23
                  Originally posted by dietmar13 View Post
                  paired end 75 bp RNA-seq of clinical samples.

                  Code:
                  STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)
                  Code:
                  RUM 2.0.4 index generated with gencode14 annotation
                  both: with standard parameters.

                  Results (see below):
                  57 more uniquely mapped reads for STAR
                  3.4 more spliced reads for STAR

                  --- B - U - T ---
                  after counting (-unique) with htseq-count:
                  RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
                  especially:
                  protein_coding RUM: 9357 STAR: 8962
                  pseudogene RUM: 843 STAR: 94
                  lincRNA RUM: 84 STAR: 64
                  (see bottom of this posting)
                  Hi Dietmar,

                  there is something strange about this result. STAR aligns many more unique mappers than RUM (2,182,232 vs 1,390,784), while RUM considers more reads to be multi-mappers. When I made the comparison in the STAR paper, RUM(1.11) yielded just ~10% fewer unique mappers.

                  In terms of gene detection, it would be interesting to compare the full curves: number of detected genes vs. min number of reads per gene. Requiring just 3 reads per gene is a bit low on my scale - it would likely be irreproducible, i.e. if you perform the same experiment agains (or look at bio-reps), there will be a high probability of not detecting again the genes supported by 3 reads. Also, it would be interesting to make this comparison for correctly paired alignments only

                  It's interesting that for the 5 aligners I compared (STAR, RUM, GSNAP, TopHat, and MapSplice - all run without annotations) on one of our cell line samples, the proportion of reads that overlap annotated exons was very close to the same value of ~75% for all aligners, while the mapping rates varied quite a lot, with STAR and GSNAP yielding the highest % of uniquely mapped reads. I guess it tells us that all aligners have the same specificity towards annotations and somewhat different sensitivities.

                  Cheers
                  Alex

                  Comment


                  • #24
                    Originally posted by genomeHunter View Post
                    Thanks Alex!

                    Wow...this seems to be prohibitively hard. Say fragments lengths are distributed from 180 to 220. Then, one needs to create 41 reads for each (valid) position in any of the transcripts (annotated and de novo), and that is for All read lengths (2x100bp, 2x50bp). There could be around 10 billion such reads...

                    GH
                    Mapping 10 billion reads would take ~30 hours on a decent server, so it's probably still OK. We would need to do it for each of the read length sets like 2x100, 2x50 etc. Also, we would probably need to do it for wider distirbution of insert sizes to accomodate for different libraries. On the positive side, the number of distinct read pairs would be smaller because there is a lot of overlap for the isoforms.

                    Comment


                    • #25
                      De novo splicing events

                      Hi Alex,

                      I read the paper you introduced and it was interesting. However, I have a major problem: I am interested in isoform quantification or gene expression levels. I am focused on de novo splicing events, i.e., those found by STAR not in the annotations.

                      I have not been able to find a solid way to find the uniquely mappable positions for a de novo splice junction.

                      GH

                      Comment


                      • #26
                        Originally posted by genomeHunter View Post
                        Hi Alex,

                        I read the paper you introduced and it was interesting. However, I have a major problem: I am interested in isoform quantification or gene expression levels. I am focused on de novo splicing events, i.e., those found by STAR not in the annotations.

                        I have not been able to find a solid way to find the uniquely mappable positions for a de novo splice junction.

                        GH
                        Hi GH,

                        I am not sure I can follow your question. If you are interested in novel splice junctions, you can extract their coordinates from uniquely aligned RNA-seq reads. For instance, STAR's SJ.out.tab gives you the start and end of the introns, and also the number of reads supporting each junction, unique (column 7) and multiple (column 8). Column 6 tells you whether the junction is annotated or novel (if you used annotations).

                        On the other hand, if you are interested in novel isoforms, you need to feed the alignments to a "transcript assembler" such as Cufflinks.

                        Cheers
                        Alex

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin


                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                          Today, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        37 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        39 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        35 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        54 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X