SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics
Similar Threads
Thread Thread Starter Forum Replies Last Post
who coined RNAseq? RNAseq as an alignment first approach only brachysclereid Bioinformatics 3 01-10-2012 12:17 PM
how do you find a possible intron on a gene? bbsinfo Illumina/Solexa 2 12-30-2011 10:05 AM
"coverage" of introns, intergenic regions for RNASEQ PFS Bioinformatics 2 09-07-2011 01:51 PM
coverage RNAseq Trudy Bioinformatics 1 06-22-2011 08:27 AM
Looking for an aligner allowing intron gap superligang Bioinformatics 6 04-12-2010 11:20 PM

Reply
 
Thread Tools
Old 05-17-2011, 04:56 AM   #1
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default Intron coverage in RNAseq

Hi All,

Two questions about RNAseq coverage:

1) Introns: I find that introns get quite a lot of reads mapping to them compared to the exons. Is this expected or is it a sign of something that went wrong? (I would expect introns to be almost invisible).

2) Read distribution: The read count varies a lot in the exons while it seems quite uniform in the introns. Is there any explanation for this?

I attach here a plot of read count vs position (mpileup) in two samples for a gene as an example of what I mean (at a glance, other genes show similar patterns).

Reads & alignments: c.ca 30 million reads, 35 bp pair-end aligned with Tophat.

I realize these issues must have been discussed a number of times but I couldn't find any threads/papers.

Any and all comments welcome!
Dario

Attached Images
File Type: jpg intron_exon_coverage.jpg (66.9 KB, 857 views)
dariober is offline   Reply With Quote
Old 05-17-2011, 05:48 AM   #2
wjeck
Member
 
Location: Chapel Hill, NC

Join Date: Mar 2009
Posts: 39
Default

I would guess (and this is a guess) that you are including the "gap" part of your spliced reads in the coverage. That is, splice reads are being counted as covering the intron, which is def not what you want. Have you looked at your mappings in something like IGV, and does it show a ton of reads in your intronic regions?
wjeck is offline   Reply With Quote
Old 05-17-2011, 06:36 AM   #3
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by wjeck View Post
I would guess (and this is a guess) that you are including the "gap" part of your spliced reads in the coverage. That is, splice reads are being counted as covering the intron, which is def not what you want. Have you looked at your mappings in something like IGV, and does it show a ton of reads in your intronic regions?
Well thought wjeck, thanks so much! Here's below the picture from IGV.

The plot I first showed is from mpileup. I thought samtools mpileup would have been aware of gaps (splice-junctions) as they are encoded in the CIGAR string, right? Instead it seems to count coverage where there should be gaps. This is the samtools command I executed. Am I doing something wrong? (Unless I'm missing something here, it surprises me that samtools counts reads where there are gaps!?)

samtools mpileup -r 7:27658519-27661277 accepted_hits.bam > tnf.pileup

Dario

Attached Images
File Type: jpg igv.jpg (34.1 KB, 801 views)
dariober is offline   Reply With Quote
Old 05-17-2011, 06:50 AM   #4
zorph
Member
 
Location: FL

Join Date: May 2010
Posts: 40
Default

I have have some RNA-seq data and i have been to talks involving RNA-seq data and it seems that introns seem to be retained.

When I talked to a guy from PacBio presenting his data (where he had retained introns) he stated that this could be due to pre-mRNA being included in your sample or it could be a signal that these introns might be doing more (i.e. playing a regulatory role).

It's unknown as of now but I've seen 3 different sets of data and they all have intron retention :-/.
zorph is offline   Reply With Quote
Old 05-17-2011, 06:57 AM   #5
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

It is a tricky question and it is something a lot of people are seeing, but you almost have to take it on a case-by-case basis to figure out what might be going on.
In addition to the pre-mRNA and retained intron possibilities, some of the reads may map to multiple places or map to repeat regions of the genome. I place less trust in those types of reads.
pbluescript is offline   Reply With Quote
Old 05-17-2011, 07:33 AM   #6
wjeck
Member
 
Location: Chapel Hill, NC

Join Date: Mar 2009
Posts: 39
Default

Quote:
Originally Posted by dariober View Post
Well thought wjeck, thanks so much! Here's below the picture from IGV.

The plot I first showed is from mpileup. I thought samtools mpileup would have been aware of gaps (splice-junctions) as they are encoded in the CIGAR string, right? Instead it seems to count coverage where there should be gaps. This is the samtools command I executed. Am I doing something wrong? (Unless I'm missing something here, it surprises me that samtools counts reads where there are gaps!?)

samtools mpileup -r 7:27658519-27661277 accepted_hits.bam > tnf.pileup

Dario


Yep, that's your problem. I don't know how to fix this because I haven't been working with RNA, but we had this problem when we were trying to use coverage to do indel detection (in a very naive way). We just ended up taking another route. Could you go through the SAM file and break the reads with introns into two separate records? It's a hack, but it might work.
--will
wjeck is offline   Reply With Quote
Old 05-17-2011, 07:50 AM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

This picture, via Cancer Genome Workbench (cgwb.nci.nih.gov) shows rnaseq reads hitting introns. The wiggle plot show logarithmic coverage for a TCGA rnaseq samples. It may be immature rna or intronic control RNAexpression as others have theorized. This intronic rna expression appears quite common in many samples and most of the reads, in this case, appear to be "uniquely" mappable.

Attached Images
File Type: jpg intron.jpg (33.1 KB, 179 views)

Last edited by Richard Finney; 05-17-2011 at 09:08 AM. Reason: gramerr
Richard Finney is offline   Reply With Quote
Old 05-17-2011, 09:06 AM   #8
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Could you show a SAM record that is being improperly counted? Curiosity here...
nilshomer is offline   Reply With Quote
Old 05-19-2011, 02:34 AM   #9
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

What I usually observe from public RNA-seq data is that about 10-15% of the non-spliced reads do overlap annotated introns. It can be much higher, depending on the library preparation. It also depends on the reference annotation (for instance Refseq is more restrictive than Ensembl).

Possible reasons:
- "background noise"
- "pervasive transcription"
- partially processed mRNAs
- genomic contamination
- functional non coding RNAs
- new transcripts / splicing isoforms
- degradation products (they can be polyAdenylated and even re-capped)

Last edited by steven; 05-19-2011 at 02:36 AM.
steven is offline   Reply With Quote
Old 05-19-2011, 06:01 AM   #10
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by nilshomer View Post
Could you show a SAM record that is being improperly counted? Curiosity here...
Hi nilshomer - Thank you and everyone else for discussion,

Here some lines from the SAM and pileup output just at the border between the first exon-intron. The CIGAR string of the reads spanning the intron have a 586 bp insertion, which is the size of the intron. These insertions are counted in the pileup file.

For the first time I realize that the insertion is coded as '>' or '<' in the pileup column reporting the bases. So a 'corrected' pileup should have at each position the read count minus the count of '>' and '<', right?

Any comments welcome!

All the best

Code:
EBRI093151_0001:7:65:657:1096#TCTCCC	73	7	27658852	255	35M	*	0	0	TGCACTTCGAGGTTATCGGCCCCCAGAAGGAAGAG	BCCCCBCBBCAAABCBA>@@BBCC@:<;A?0?@;>	NM:i:0	NH:i:1
EBRI093151_0001:7:10:722:420#TCTCTT	163	7	27658853	255	35M	=	27658864	0	GCACTTCGAGGTTATCGGCCCCCAGAAGGAAGAGT	AA+>BC@CACBBBBBBB;)@CBBAC>@1?&=;%=3	NM:i:1	NH:i:1
EBRI093151_0001:7:100:1536:385#CCCCCC	97	7	27658853	255	35M	=	27659701	0	GCACTTCGAGGTTATCGGCCCCCAGAAGGAAGAGT	BAAA5=>A9AB>?>-372?29BB?.':?A?)3)@2	NM:i:1	NH:i:1
EBRI093151_0001:7:105:1158:856#TTAACT	163	7	27658856	255	31M586N4M	=	27659472	0	CTTCGAGGTTATCGGCCCCCAGAAGGAAGAGTTTC	<C;?9?BCBC6B6BC>>CBACCCA8@1=@AC>C=*	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:72:996:1595#TCTCTA	83	7	27658856	255	35M	=	27658780	0	CTTCGAGGTTATCGGCCCCCAGAAGGAAGAGTTGC	C4BBBA;BBBABBBBB@6@BB?BBBBCCABBBC@B	NM:i:2	NH:i:1
EBRI093151_0001:7:21:730:103#ACCTCC	163	7	27658859	255	28M586N7M	=	27659486	0	CGAGGTTATCGGCCCCCAGAAGGAAGAGTTTCCAG	ACCCC2CBBBBA@ABB@1><A@A1>;%=>=B7;)7	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:42:94:608#TCCCCC	161	7	27658859	255	28M586N7M	=	27659700	0	CGAGGTTATCGGCCCCCAGAAGGAAGAGTTTCCAG	@B@CC?CBAAA?5>ACCA>9??A8@7;>3>B=?57	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:117:1539:645#CCTCTC	97	7	27658859	255	28M586N7M	=	27659706	0	CGAGGTTATCGGCCCCCAGAAGGAAGAGTTTCCAG	BCB@A=AA;5?@BBBB=49?<>>=4628,=<A3##	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:117:341:1130#CTTACC	147	7	27658862	255	25M586N10M	=	27658795	0	GGTTATCGGCCCCCAGAAGGAAGAGTTTCCAGCTG	C;=ABCC@@@>BCAAACC>??=BBC9>?ABBA=;@	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:86:263:1733#TTTCTA	73	7	27658863	255	24M586N11M	*	0	0	GTTATCGGCCCCCAGAAGGAAGAGTTTCCAGCTGG	BBAB?A;1<B?B=82:92;:79785=72;648594	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:10:722:420#TCTCTT	83	7	27658864	255	23M586N12M	=	27658853	0	TTATCGGCCCCCAGAAGGAAGAGTTTCCAGCTGGC	@C@ACCB?BA4ABCB@ABBBBBBBBAC;@BBBCCB	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:58:1684:1481#TATCTA	83	7	27658864	255	23M586N12M	=	27658781	0	TTATCGGCCCCCAGAAGGAAGAGTTTCCAGCTGGC	ABA8CBB:7<ABC>CBAA@BBBBA=ABBCACABCB	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:73:1052:643#CCCCCA	97	7	27658866	255	21M586N14M	=	27659703	0	ATCGGCCCCCAGAAGGAAGAGTTTCCAGCTGGCCC	BAA?A@A=BA?@>;=A=??=?4?=?:3;317>::5	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:89:1185:1493#CCCCCT	99	7	27658866	255	21M586N14M	=	27659486	0	ATCGGCCCCCAGAAGGAAGAGTTTCCAGCTGGCCC	@AA?=A==A@2=<+B8(<@1@1@9?>1>=4=-.=:	NM:i:0	XS:A:+	NH:i:1
EBRI093151_0001:7:65:504:840#CCTCTT	99	7	27658867	255	20M586N15M	=	27659469	0	TCGGCCCCCAGAAGGAAGAGTTTCCAGCTGGCCCC	BABA<B?BB*@0;>69A@=97=BAB<6A?=9>=7@	NM:i:0	XS:A:+	NH:i:1


7	27658852	N	75	TTtTtTtttttTtttTTTtttttttTTTTttttttttttttttttttttttTtttttttttttttttTTttTttt	4AA>A?>CABA>=AAA69BA>B@=94>A@A@CB:A8ABABBAB@BA>B@BBAA7B=B:A8B>9@B?=AB@@?CB>
7	27658853	N	76	GGgGgGgggggGgggGGGgggggggGGGGggggggggggggggggggggggGgggggggggggggggGGggGggg^~G	9AB=ABBCABCBABAC:@CB@ABBA:BACBB>C:A@BCBC=BBBA?@ABB@B?0>:B4C@B>>BB@?=B@:BBBAA
7	27658854	N	76	CCcCcCcccccCcccCCCcccccccCCCCccccccccccccccccccccccCcccccccccccccccCCccCcccC	1=@A@ABCABBAAAB;99BA>BB=B>?@BBCAC@CCACBC?BCB=A?BCBC<B8@AA?CBBB@@BBB=B;B?ABBA
7	27658855	N	76	A$AaAaAaaaaaAaaaAAAaaaaaaaAAAAaaaaaaaaaaaaaaaaaaaaaaGaaaaaaaaaaaaaaaAAaaAaaaA	3A>>3B<:=B<B49=0=0A?@@8-;;?@B7BA@;>?:AA>=?9:.:9=C?B=<657=8?<;?8;=676;64A8=?+
7	27658856	N	77	CcCcCcccccCcccCCCcccccccCCCCccccccccccccccccccccccCcccccccccccccccCCccCcccC^~C^~c	=@AAACB@BAB?BB;B/BB?BA=A:BBB@CCCAB:@CCC>CCC@A>CCBC4B5C>BABA:BBABBB<?@B?B?B><C
7	27658857	N	77	TtTtTtttttTtttTTTtttttttTTTTttttttttttttttttttttttTtttttttttttttttTTttTtttTTt	?B<A?@?ABCB9AB>A>BA;A@A<.@?C9BB>8A3?B?B;ABA>?@?CA>=A4>5A9@A>;?:@??>BAAA:?<BC4
7	27658858	N	77	T$t$TtTtttttTtttTTTtttttttTTTTttttttttttttttttttttttTtttttttttttttttTTttTtttTTt	>A:?B@@<BCB:ABBB?A28@@?@?AAB%;BC=A<3BB=;?>8=<;?C>A=?=>:@0>A@>A>=6?ABB?A:>9C;B
7	27658859	N	76	C$cCcccccCcccCCCcccccccCCCCccccccccccccccccccccccCcccccccccccccccCCccCcccCCc^~C	@AACCAB@@@BB=B@C<=BBAB=ABA=BCCBCA@BC??CBB>A:BACB<A;C<BBCBBBB5@BC7ABBA@BB@?BA
7	27658860	N	75	g$GgggggGgggGGGgggggggGGGGggggggggggggggggggggggGgggggggggggggggGGggGgggGGgG	BBB@BBA8@BB;BABB99BABAAB:B?CCACBBCB@BCBBAA>?BBB@B>B>@;BBB?AAC@@:CBABB>BC9BC
7	27658861	N	74	A$a$a$aaaAaaaAAAaaaaaaaAAAAaaaaaaaaaaaaaaaaaaaaaaAaaaaaaaaaaaaaaaAAaaAaaaAAaA	3BBBBB3=B@2@<BB>=AAA.?52A@BAACCBABCBCBBABA>@BB8B@BAB?;BBBBABBA(>B==C?BA?AC
7	27658862	N	72	g$g$g$G$g$ggGGGgggggggGGGGggggggggggggggggggggggGgggggggggggggggGGggGgggGGgG^~g	@B@9@AB>A@CBABCBBA?A:BA>BBCBACBCBCBA@BA@BAB9BBBAA?CBBAB=BBA<CBAACBACB;CC
7	27658863	N	67	g$g$G$GGgggggggGGGGggggggggggggggggggggggGgggggggggggggggGGggGgggGGgGg	@@ACABB74AAB?>B9B8CC=CB>CACBBA8AB?4CBABA>AAA2CBA;@ABA@=CA?BBB<BCBC;
7	27658864	N	66	TTtttttttTTTTttttttttttttttttttttttTtttttttttttttttCTttTtttTTtTt^~t^~t	?)CB:BBBB(3499ACC/CB@BBB@CCC>BBCCBA1B?C7B7BBAB@>AB>(>@B4CC>BBB2=@A
7	27658865	N	66	TTtttttttTTTTttttttttttttttttttttttTtttttttttttttttTTttTtttTTtTttt	@2CB=9B>B1ABBB=>B/C>?>B+2CBB@BCBCBC5?BB?ABBB@B>BB>>6@AABCBABCBCACB
7	27658866	N	67	A$A$a$aaaaaaAAAAaaaaaaaaaaaaaaaaaaaaaaAaaaaaaaaaaaaaaaAAaaAaaaAAaAaaa^~A	@ABBB>BAC:A>B@=CB;ACBCA>;CAC=BCABBA2?BCBA@B@AB@AAB@7=BBACBBB6ABB@A@
7	27658867	N	67	t$t$t$t$t$t$TTTTttttttttttttttttttttttTtttttttttttttttTTttTtttTTtTtttT^~T^~T^~t	BBBB@B5BAC>CCB?CABBBCA?@ABB@BCC>1BABBB7B@B@ABBBA.ABA9CA?BBBBCA8ABA>
7	27658868	N	63	C$C$C$C$c$cccccccccccccccccccccCcccccccccccccccCCccCcccCCcCcccCCCc^~C^~c	48>@AACCBCCACC>A?=ABBABAB@,B@BAAABABAABACB4?@B0C?BB6BBCCCAA@AA;
7	27658869	N	58	g$g$g$g$g$g$g$g$g$g$g$g$g$g$g$g$gggggGgggggggggggggggGGggGgggGGgGgggGGGgGg	BABBBBBAB?A?BBABABCC?2BABBA@BCBB>BCBB8BBB8CBABBBB@CB?BBA@=
7	27658870	N	45	g$g$g$g$g$GgagggggggggggggGGggGgggGGgGgggGGGgGg^~g^~g^~g	BBBBB5@>BABA@B?A@ACBB.@B:@BBC;CBA@BB=AB;A)6=:
7	27658871	N	40	C$c$c$cccccccccccccCCccCcccCCcCcccCCCcCcccc	.BBCBABBB8?BBCCB>:AA9CCB)>B@@?:A<9>A@7A?
7	27658872	N	38	c$c$c$c$c$c$c$ccccccCCccCcccCCcCcccCCCcCcccc^~c	BBABBB>AABCCB9@@C6CCB@>@A>B7=BA=BB&;18
7	27658873	N	34	c$a$c$c$ccCCccCcccCCcCcccCCCcCccccc^~C^~C^~c	BBBBCB-ABC=CCACC6BBA<=?B;B@6;>?BB#
7	27658874	N	30	c$a$CCccCcccCCcCcccCCCcCcccccCCc	BB8BBB=CC>BB@BC4AABBBA(<@?=CC#
7	27658875	N	29	C$C$ccCcccCCcCcccCCCcCcccccCCc^~C	::BC8CCCBAB@AAB@BA@B3>A==CB#B
7	27658876	N	27	g$a$AaaaAAaAaaaAAAaAaaaaaAAaA	BB.CCCACB1ABC2*+@@+ABBBCB#B
7	27658877	N	30	GgggGGgGgggGGGgGgggggGGgG^~g^~g^~g^~g^~g	1ACCCC?>AC>=@@=@,=B;BC@#B9A42:
7	27658878	N	32	AaaaAAaAaaaAAAaAaaaaaAAaAaaaaa^~a^~a	1CCC>CB<CBC<0AB3<>A=?A9#B>B:=;>#
7	27658879	N	37	AaaaAAaAaaaAAAaAaaaaaAAaAaaaaaaa^~a^~a^~a^~a^~a	1ACB@ABAC@B+;B@@*=A??B>#A?A9A<B62<+9>
7	27658880	N	42	G$g$g$gGGgGgggGGGgGgggggGGgGgggggggggggg^~g^~g^~g^~g^~g	/>BC18B@>AAB>B?B>A>A>BA#B=B@?A?7;;:8A2:;>?
7	27658881	N	39	gGGgGgggGGGgGgggggGGgGggggggggggggggggg	B?@BA?BA86C9A5=B1=B>;AAB<BAA3=<33<2A7;?
7	27658882	N	39	aAAaAaaaAAAaAaaaaaAAaAaaaaaaaaaaaaaaaaa	B&1C1?B@(9<B=@<AA?B=5@AC8A?@3=;;<B1@;@A
7	27658883	N	39	aAAaAaaaAAAaAaaaaaAAaAaaaaaaaaaaaaaaaaa	C==C>=BB<AB;B3?A?AA>3B@CB@AB97?:;37?<;@
7	27658884	N	39	g$GGgGgggGGGgGgggggGGgGggggggggggggggggg	B;@A;BBB@@@>?7AA3AC=/B>B?A@=05=1;=7=>=@
7	27658885	N	38	AAaAaaaAAAaAgaaaaAAaAaaaaaaaaaaaaaaaaa	%AB%BBB1=BAA(9AAAC=4A@B=B>A7>=6:@0?<=B
7	27658886	N	38	GGgGgggGGGgGgggggGGgGggggggggggggggggg	=CB=CBB@9B6A@8B;:C?4B?B8?;A555:8=-?5A<
7	27658887	N	38	T$>t><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	3>B>9BA17A0>7A?AAB?:.:@?B>@>8&;:96A??0
7	27658888	N	37	>t><<<>>><>t<<<<>><><<<<<<<<<<<<<<<<<	>C>9BA17A0>AA?AAB?:.:@?B>@>8&;:96A??0
7	27658889	N	37	>g><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	>@>9BA17A0>3A?AAB?:.:@?B>@>8&;:96A??0
7	27658890	N	37	>c$><<<>>><>a<<<<>><><<<<<<<<<<<<<<<<<	>B>9BA17A0><A?AAB?:.:@?B>@>8&;:96A??0
7	27658891	N	36	>><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658892	N	36	>><<<>>><>c<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658893	N	36	>><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658894	N	36	>><<<>>><>c<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658895	N	36	>><<<>>><>c<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658896	N	36	>><<<>>><>t<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658897	N	36	>><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658898	N	36	>><<<>>><>g<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658899	N	36	>><<<>>><>c<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658900	N	36	>><<<>>><>c<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658901	N	36	>><<<>>><>a<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>BA?AAB?:.:@?B>@>8&;:96A??0
7	27658902	N	36	>><<<>>><>g$<<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>AA?AAB?:.:@?B>@>8&;:96A??0
7	27658903	N	35	>><<<>>><><<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>A?AAB?:.:@?B>@>8&;:96A??0
7	27658904	N	35	>><<<>>><><<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>A?AAB?:.:@?B>@>8&;:96A??0
7	27658905	N	35	>><<<>>><><<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>A?AAB?:.:@?B>@>8&;:96A??0
7	27658906	N	35	>><<<>>><><<<<>><><<<<<<<<<<<<<<<<<	>>9BA17A0>A?AAB?:.:@?B>@>8&;:96A??0
dariober is offline   Reply With Quote
Old 09-01-2011, 09:50 AM   #11
arecht
Junior Member
 
Location: Santa Cruz

Join Date: Aug 2009
Posts: 3
Default

Hi dariober, I was wondering if this problem of removing the gaps in the pileup step has ever been resolved for you? I ran into the same issue. Any tip would be greatly appreciated!
arecht is offline   Reply With Quote
Old 09-14-2011, 01:16 PM   #12
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default

has anyone seen any discussion of the influence of intronic reads on transcript prediction? Or I suppose, mention of how people are dealing with this? I've raised this issue elsewhere and continue to be plagued by it... i really can't find any automated solution for dealing with this. it seems we have to get down to business and manually annotate genes... fun!
chrisbala is offline   Reply With Quote
Old 09-15-2011, 10:46 AM   #13
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117
Default

The ALEXA-seq and Trans-ABySS manuscripts discuss the implications of intronic mapping reads for measuring expression and transcript assembly respectively. Due to all of the possible sources of sequences that will map to introns (nicely summarized by Steven above), it is important to consider whether the signal you see for an exon is really above that of the surrounding areas. RNA-seq is not a zero noise data type. . You will always have some amount of contaminating genomic DNA and unprocessed RNA (hnRNA). There are molecular strategies for reducing these (e.g. DNAseI treatment, cytoplasmic RNA isolation, polyA+ selection, etc.) but there will always be some. So, you expect signal across the genome in an RNA-seq library for purely molecular reasons and analytical sources add on top of this (e.g. multi-mapping reads). You can actually see hints of the difference between intergenic noise and intronic noise by examining the data. Intronic noise is expected to be correlated with gene expression level (more expression results in more unprocessed RNA when you take a snapshot of the transcriptome). This is exactly what we see in our RNA-seq libraries. This complicates the interpretation of whether an intron is really being retained as part of an alternative isoform. As others have mentioned, local context is important when interpreting RNA-seq data.

For all of these reasons, one can argue that exon-exon junctions might be a useful metric for evaluating RNA-seq library quality. The reason being that exon-exon sequences from real transcripts for the most part do not occur in the genome (and therefore not in unprocessed transcripts either). We also routinely evaluate exon-to-intron and exon-to-intergenic signal, among many other metrics. For example here is a report for a:
Human Breast vHMECS RNA-seq library.

Full disclosure. I am an author on the two papers linked above. Perhaps others can link to others that discuss this topic!
malachig is offline   Reply With Quote
Old 11-25-2011, 04:22 AM   #14
cjp
Member
 
Location: Cambridge, United Kingdom

Join Date: Jun 2011
Posts: 58
Default

Maybe HTseq-count deals with counting spliced reads:

From this page:

http://www-huber.embl.de/users/ander...doc/count.html


"Make sure to use a splicing-aware aligner such as TopHat. HTSeq-count makes full use of the information in the CIGAR field."

Chris
cjp is offline   Reply With Quote
Old 11-28-2011, 01:29 PM   #15
adameur
Member
 
Location: Uppsala, Sweden

Join Date: Nov 2009
Posts: 23
Default

Here is an answer to why there are so many intronic reads: it's because RNA-seq measures ongoing transcription!

We have recently published a paper in Nat Struct Mol Biol where this is described. See also this thread for more info.
adameur is offline   Reply With Quote
Old 11-28-2011, 02:28 PM   #16
cjp
Member
 
Location: Cambridge, United Kingdom

Join Date: Jun 2011
Posts: 58
Default

Hi adameur,

I'd thought to mention your paper in my previous answer as well!

Do you know how much it may affect mRNA-seq studies - where people try to select processed RNA's from total RNA using poly-A and related lab methods?

Chris
cjp is offline   Reply With Quote
Old 11-28-2011, 11:06 PM   #17
adameur
Member
 
Location: Uppsala, Sweden

Join Date: Nov 2009
Posts: 23
Default

Hi Chris,

From what I have seen the choice of lab method makes a big difference.. PolyA+ gives a much higher proportion of reads mapping to exons, which is good if you're mainly interested in expression of mature mRNAs and splice variants.

Total RNA on the other hand makes it possible to also study other types of events, like nascent transcript formation and co-transcriptional splicing. The drawbacks are that you need quite a lot of reads to study the intronic expression and that there are no established analysis tools.

Adam
adameur is offline   Reply With Quote
Old 11-28-2011, 11:15 PM   #18
polsum
Member
 
Location: Texas

Join Date: May 2009
Posts: 32
Default

a dumb question. How did you make that figure in the first post of this thread? which program you used?
polsum is offline   Reply With Quote
Old 11-28-2011, 11:28 PM   #19
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Just to clarify: We have two things going on here.

(i) The coverage curve shown in post #1, produced with samtools, disagrees from the one shown in post #3, produced, for the same data, with IGC. This points to an interesting and worrying bug in samtools and explains most of the coverage in the introns.

(ii) There are still reads left in the introns. They are most likely really there, and many people have seen them in RNA-Seq data. The thread has probably mentioned most of the suggestions the literature has discussed recently about what this means.
Simon Anders is offline   Reply With Quote
Old 11-29-2011, 01:51 AM   #20
cjp
Member
 
Location: Cambridge, United Kingdom

Join Date: Jun 2011
Posts: 58
Default

Yes is a two-issue thread!

The samtools thing is that mpileup adds '>' and '<' symbols where the cigar part of an alignment is an N. You can use "samtools depth" and that gives me 0 over introns.
cjp is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off



All times are GMT -8. The time now is 07:54 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO