Seqanswers Leaderboard Ad

**genomeHunter** · 04-07-2013, 06:34 PM

Originally posted by alexdobin View Post

In the Log.final.out, STAR counts the total number of "splices" - you can get it by counting the total number of N-operations in CIGARs of all unique alignments. Since some spliced reads can have more than one splice, the number of splices is bigger than the number of spliced reads, which is output by RSeQC, I guess.

Thanks Alex! Seems logical.

On a side note, do you know what is the best approach to generate the unique mappability of all genomic positions? Since I discard multi-mappers and consider only uniquely mapped reads, I want to exclude the positions in the genome (transcriptome) that are inherently unmappable and have a duplicate(s) somewhere else (e.g., paralogs) for a given read length.

This is not trivial, mainly because I need to artificially synthesize the reads myself, if they span only one junction it is OK, but if they span multiple junctions, or the second exon undergoes A5SS, then this becomes a serious headache. Have you heard of any tools for this, or a readily-available map?

GH

PS. This is a MUCH harder task with Bowtie2 (or any splicing-ignorant mapper). STAR could map the read to any non-contiguous region of the genome, while one needs to make a library of all junctions with various exon arrangements for Bowtie2, but still forming the reads is not easy.

**genomeHunter** · 04-07-2013, 06:49 PM

Originally posted by dietmar13 View Post

comparing different mapper with 3 mio reads 70 bases PE RNA-seq data:

RUM (189 k) >> STAR (127 k) > tophat (124 k) > subread/subjunc (99 k) spliced reads mapped.

RUM provides several output files for these spliced reads / junctions...

dietmar13,

Did you happen to evaluate GEM as well? It occurs to me that GEM and STAR are the two emerging mappers and will be very powerful contenders in near future. I have had very good experience with STAR, however, GEM seems to be not working very well. I have tried different parameter settings, but still not good.

GH

**alexdobin** · 04-08-2013, 10:30 AM

Originally posted by genomeHunter View Post

Thanks Alex! Seems logical.

On a side note, do you know what is the best approach to generate the unique mappability of all genomic positions? Since I discard multi-mappers and consider only uniquely mapped reads, I want to exclude the positions in the genome (transcriptome) that are inherently unmappable and have a duplicate(s) somewhere else (e.g., paralogs) for a given read length.

This is not trivial, mainly because I need to artificially synthesize the reads myself, if they span only one junction it is OK, but if they span multiple junctions, or the second exon undergoes A5SS, then this becomes a serious headache. Have you heard of any tools for this, or a readily-available map?

GH

PS. This is a MUCH harder task with Bowtie2 (or any splicing-ignorant mapper). STAR could map the read to any non-contiguous region of the genome, while one needs to make a library of all junctions with various exon arrangements for Bowtie2, but still forming the reads is not easy.

This is indeed a very non-trivial task for spliced reads, and I am not aware of any tool that can do it correctly, even though a number of attempts were made (here is a recent one). I could imagine that you are doing it for the transcriptome like this: generating all possible reads from the transcriptome, and mapping them to the genome with splice-enable aligner to determine whether they map uniquely or not. This should quite doable for single-end reads: e.g., Gencode 14 transcriptome contains 280M bases, i.e. 280M distinct start loci, so we would need to map 280M reads to the genome.

For paired-end reads, however, we would have to consider all possible insert sizes in a range of the real libraries, which would increase the number of reads by a factor of several hundred - this still should be possible with STAR. Then the uniqueness for each position becomes a probabilistic quantity, depending on the actual insert size distribution in the libraries. I am planning to implement an algorithm like this for estimating FPKM within STAR.

**shi** · 04-08-2013, 02:42 PM

It makes sense to use uniquely mapped reads to assess alternative methods. What we did was that we remove repetitive 80bp sequences from the human genome and then created 100bp reads from the modified human genome (http://www.ncbi.nlm.nih.gov/pubmed/23558742).

Cheers,
Wei

**dietmar13** · 04-09-2013, 06:24 AM

Final comparison RUM and STAR

paired end 75 bp RNA-seq of clinical samples.

Code:

STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)

Code:

RUM 2.0.4 index generated with gencode14 annotation

both: with standard parameters.

Results (see below):
57 more uniquely mapped reads for STAR
3.4 more spliced reads for STAR

--- B - U - T ---
after counting (-unique) with htseq-count:
RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
especially:
protein_coding RUM: 9357 STAR: 8962
pseudogene RUM: 843 STAR: 94
lincRNA RUM: 84 STAR: 64
(see bottom of this posting)

RUM (RSeQC):

Total Records: 11685516

QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits 0
Unmapped reads: 2632714
Multiple mapped reads: 7662018

Uniquely mapped: 1390784
Read-1: 696484
Read-2: 694300
Reads map to '+': 673096
Reads map to '-': 717688
Non-splice reads: 1204215
Splice reads: 186569
Reads mapped in proper pairs: 1200364

STAR(RSeQC):

Total Records: 2424478

QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits 131750
Unmapped reads: 0
Multiple mapped reads: 110496

Uniquely mapped: 2182232
Read-1: 1091116
Read-2: 1091116
Reads map to '+': 1091116
Reads map to '-': 1091116
Non-splice reads: 1989333
Splice reads: 192899
Reads mapped in proper pairs: 2182232

called genes (>= 3 reads per gene)

RUM STAR
3prime_overlapping_ncrna NA NA
antisense 25 20
IG_C_gene 10 8
IG_C_pseudogene 2 NA
IG_D_gene NA NA
IG_J_gene 4 NA
IG_J_pseudogene NA NA
IG_V_gene 108 73
IG_V_pseudogene 23 NA
lincRNA 84 64
miRNA 6 5
misc_RNA 135 13
Mt_rRNA 2 2
Mt_tRNA 3 1
non_coding NA NA
polymorphic_pseudogene 4 3
processed_transcript 40 32
protein_coding 9357 8962
pseudogene 843 94
rRNA 3 2
sense_intronic 1 1
sense_overlapping NA NA
snoRNA 12 10
snRNA 36 5
TR_C_gene 2 2
TR_D_gene NA NA
TR_J_gene 8 1
TR_J_pseudogene NA NA
TR_V_gene 36 29
TR_V_pseudogene 3 1

**genomeHunter** · 04-09-2013, 06:28 AM

Originally posted by alexdobin View Post

This is indeed a very non-trivial task for spliced reads, and I am not aware of any tool that can do it correctly, even though a number of attempts were made (here is a recent one). I could imagine that you are doing it for the transcriptome like this: generating all possible reads from the transcriptome, and mapping them to the genome with splice-enable aligner to determine whether they map uniquely or not. This should quite doable for single-end reads: e.g., Gencode 14 transcriptome contains 280M bases, i.e. 280M distinct start loci, so we would need to map 280M reads to the genome.

For paired-end reads, however, we would have to consider all possible insert sizes in a range of the real libraries, which would increase the number of reads by a factor of several hundred - this still should be possible with STAR. Then the uniqueness for each position becomes a probabilistic quantity, depending on the actual insert size distribution in the libraries. I am planning to implement an algorithm like this for estimating FPKM within STAR.

Thanks Alex!

Wow...this seems to be prohibitively hard. Say fragments lengths are distributed from 180 to 220. Then, one needs to create 41 reads for each (valid) position in any of the transcripts (annotated and de novo), and that is for All read lengths (2x100bp, 2x50bp). There could be around 10 billion such reads...

GH

**shi** · 04-09-2013, 02:43 PM

Originally posted by dietmar13 View Post

paired end 75 bp RNA-seq of clinical samples.

Code:

STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)

Code:

RUM 2.0.4 index generated with gencode14 annotation

both: with standard parameters.

Results (see below):
57 more uniquely mapped reads for STAR
3.4 more spliced reads for STAR

--- B - U - T ---
after counting (-unique) with htseq-count:
RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
especially:
protein_coding RUM: 9357 STAR: 8962
pseudogene RUM: 843 STAR: 94
lincRNA RUM: 84 STAR: 64
(see bottom of this posting)

RUM (RSeQC):

STAR(RSeQC):

called genes (>= 3 reads per gene)

Have you got a chance to look at Subjunc?

Cheers,
Wei

**alexdobin** · 04-10-2013, 07:04 AM

Originally posted by dietmar13 View Post

paired end 75 bp RNA-seq of clinical samples.

Code:

STAR 2.3.0 with gencode14 annotation (index downloaded from STAR homepage)

Code:

RUM 2.0.4 index generated with gencode14 annotation

both: with standard parameters.

Results (see below):
57 more uniquely mapped reads for STAR
3.4 more spliced reads for STAR

--- B - U - T ---
after counting (-unique) with htseq-count:
RUM had 15.2% more called genes (10,747 vs. 9,328 with >=3 reads per gene)
especially:
protein_coding RUM: 9357 STAR: 8962
pseudogene RUM: 843 STAR: 94
lincRNA RUM: 84 STAR: 64
(see bottom of this posting)

Hi Dietmar,

there is something strange about this result. STAR aligns many more unique mappers than RUM (2,182,232 vs 1,390,784), while RUM considers more reads to be multi-mappers. When I made the comparison in the STAR paper, RUM(1.11) yielded just ~10% fewer unique mappers.

In terms of gene detection, it would be interesting to compare the full curves: number of detected genes vs. min number of reads per gene. Requiring just 3 reads per gene is a bit low on my scale - it would likely be irreproducible, i.e. if you perform the same experiment agains (or look at bio-reps), there will be a high probability of not detecting again the genes supported by 3 reads. Also, it would be interesting to make this comparison for correctly paired alignments only

It's interesting that for the 5 aligners I compared (STAR, RUM, GSNAP, TopHat, and MapSplice - all run without annotations) on one of our cell line samples, the proportion of reads that overlap annotated exons was very close to the same value of ~75% for all aligners, while the mapping rates varied quite a lot, with STAR and GSNAP yielding the highest % of uniquely mapped reads. I guess it tells us that all aligners have the same specificity towards annotations and somewhat different sensitivities.

Cheers
Alex

**alexdobin** · 04-10-2013, 12:56 PM

Originally posted by genomeHunter View Post

Thanks Alex!

Wow...this seems to be prohibitively hard. Say fragments lengths are distributed from 180 to 220. Then, one needs to create 41 reads for each (valid) position in any of the transcripts (annotated and de novo), and that is for All read lengths (2x100bp, 2x50bp). There could be around 10 billion such reads...

GH

Mapping 10 billion reads would take ~30 hours on a decent server, so it's probably still OK. We would need to do it for each of the read length sets like 2x100, 2x50 etc. Also, we would probably need to do it for wider distirbution of insert sizes to accomodate for different libraries. On the positive side, the number of distinct read pairs would be smaller because there is a lot of overlap for the isoforms.

**genomeHunter** · 04-16-2013, 06:10 AM

De novo splicing events

Hi Alex,

I read the paper you introduced and it was interesting. However, I have a major problem: I am interested in isoform quantification or gene expression levels. I am focused on de novo splicing events, i.e., those found by STAR not in the annotations.

I have not been able to find a solid way to find the uniquely mappable positions for a de novo splice junction.

GH

**alexdobin** · 04-17-2013, 10:39 AM

Originally posted by genomeHunter View Post

Hi Alex,

I read the paper you introduced and it was interesting. However, I have a major problem: I am interested in isoform quantification or gene expression levels. I am focused on de novo splicing events, i.e., those found by STAR not in the annotations.

I have not been able to find a solid way to find the uniquely mappable positions for a de novo splice junction.

GH

Hi GH,

I am not sure I can follow your question. If you are interested in novel splice junctions, you can extract their coordinates from uniquely aligned RNA-seq reads. For instance, STAR's SJ.out.tab gives you the start and end of the introns, and also the number of reads supporting each junction, unique (column 7) and multiple (column 8). Column 6 tells you whether the junction is annotated or novel (if you used annotations).

On the other hand, if you are interested in novel isoforms, you need to feed the alignments to a "transcript assembler" such as Cufflinks.

Cheers
Alex

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 37 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News