SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq Using Counts File From htseq-count FuzzyCoder Bioinformatics 20 01-04-2016 12:18 AM
Very low read numbers from haloplex libraries henry.wood Illumina/Solexa 7 10-16-2015 02:15 AM
Suspiciously low read numbers with shotgun pipeline compared to amplicon pipeline Krisztab 454 Pyrosequencing 0 03-27-2014 08:05 AM
DEXSEQ dexseq_count.py counts sszelinger Bioinformatics 1 02-11-2013 02:56 PM
DEXSeq gene level counts Julien Roux Bioinformatics 3 11-28-2012 01:31 AM

Reply
 
Thread Tools
Old 04-01-2014, 08:05 AM   #1
kajot
Member
 
Location: Germany

Join Date: Dec 2013
Posts: 16
Default DEXseq - very low numbers of counts

Hello,

I wanted to ask the community for help with my problem related to using DEXseq, particularily generating count tables with the Python script. My problem is as follows:

I have paired-end, strand specific (Illumina TruSeq stranded) RNA-seq data, 101bp read length, multiplexed. The rawdata was aligned using Tophat2 2.0.8 with Bowtie2, no novel junctions. The annotation file used was GRCm38 from Ensembl (mouse) - both the full genome fasta and .gtf annotation file.

The bam files are around 16GB each, the example .bam I am trying to find solution on has 176609632 reads. Bam files were sorted by name using samtools (samtools sort -n) As next, I flattened the .gtf file using the Python annotation preparation script from DEXSeq which produced a .gff flat. A quick inspection of the formed .gff shows it is "healthy".

Now the problem begins, even though my dataset is pretty large, most of the exons in the count table have 0 counts. Some of them have few, there are single occurences of exons with 2000 counts, but they are very rare. Naturally, DEXseq returned NA for all the p-values since it is impossible to do anything with just zeroes.

I loaded the .bam file into IGV to visually inspect if what the count table tells me is true, but it is absolutely not right. As an example of Rbm20 gene, in IGV I can see the exon14 having high coverage whereas it seems almost empty in the count file. As you can see it's not a matter of the script not working at all, it is working only partially.

The exon 14 of Rbm20 is exon bin 19 for HTSeq, here is the counting part fo this gene:

ENSMUSG00000043639:001 0
ENSMUSG00000043639:002 0
ENSMUSG00000043639:003 0
ENSMUSG00000043639:004 0
ENSMUSG00000043639:005 0
ENSMUSG00000043639:006 2
ENSMUSG00000043639:007 1
ENSMUSG00000043639:008 0
ENSMUSG00000043639:009 0
ENSMUSG00000043639:010 0
ENSMUSG00000043639:011 0
ENSMUSG00000043639:012 4
ENSMUSG00000043639:013 1
ENSMUSG00000043639:014 1
ENSMUSG00000043639:015 1
ENSMUSG00000043639:016 0
ENSMUSG00000043639:017 0
ENSMUSG00000043639:018 0
ENSMUSG00000043639:019 15

Now look at that last exon on my IGV screenshot in the attachment.

And here are small excerpts from my .gtf and my .bam:

.GTF:

19 protein_coding exon 53859392 53859513 . + . exon_id "ENSMUSE00001178201"; exon_number "13"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding CDS 53864080 53864187 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; protein_id "ENSMUSP00000129447"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 nonsense_mediated_decay exon 53864080 53867080 . + . exon_id "ENSMUSE00001167670"; exon_number "7"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P15978"; transcript_id "ENSMUST00000161856"; transcript_name "Rbm20-002"; tss_id "TSS56270";
19 protein_coding exon 53864080 53867080 . + . exon_id "ENSMUSE00001107658"; exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding stop_codon 53864188 53864190 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";

Part of flattened .gff:

1 dexseq_prepare_annotation.py aggregate_gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py exonic_part 3054233 3054733 . + . transcripts "ENSMUST00000160944"; exonic_part_number "001"; gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py aggregate_gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py exonic_part 3102016 3102125 . + . transcripts "ENSMUST00000082908"; exonic_part_number "001"; gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py aggregate_gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3205901 3206522 . - . transcripts "ENSMUST00000162897"; exonic_part_number "001"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3206523 3207317 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "002"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213439 3213608 . - . transcripts "ENSMUST00000159265"; exonic_part_number "003"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213609 3214481 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "004"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3214482 3215632 . - . transcripts "ENSMUST00000070533+ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "005"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3215633 3216344 . - . transcripts "ENSMUST00000162897+ENSMUST00000070533"; exonic_part_number "006"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3216345 3216968 . - . transcripts "ENSMUST00000070533"; exonic_part_number "007"; gene_id "ENSMUSG00000051951"


and my .bam (first few lines):

DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 163 MT 6628 50 101M = 6668 141 CAGGAATACCACGACGCTACTCAGACTACCCAGATGCTTACACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATC ?;=?D:BBDD3:CEAEEDE?3?<DD?D@DDCCDDII@DEADEBDBDII@@@DCC=CCDDDCACDD@ADDD;;@;ADDEDAA:>>AAAA>A?########## AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 83 MT 6668 50 101M = 6628 -141 CACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATCATGATNTTTATAATTTGAGAGGCCTTTGCTNCAAAACGAG B?BBABEDA?@>A@CCCCBB@;@;B=CE?3FCIEACCEFFGFFIIIGFFIEFCFBGFEFEEGDB:1#GCFFFIIIIIIFFGGCFEGFC<22#DD1;AB;@? AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66C24T9 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 163 13 55133172 50 101M = 55133267 196 GACCTCCTCAAGAAGATTCTGGGGGAACTAAAAAGGAATTTAGAACTGGTGCAGGTTCTACAGAGAGCAATAGGGTCCCCAGACCCTAACCTGAGAAATTC @@@DBDDEDDHHHEIGIGGIIIIIIBHGHIGIE@GIICGIIIGHIIAFIBCECGGCHGGIIHHGGEE=DCC@C@B>CABD?BDDBDDDCBBDDCDCDDCD@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 83 13 55133267 50 101M = 55133172 -196 AAATTCCTTCTGGGTTTGTGTCCTGGGATCCACTGCAGTGGGTTTATACATAACTGTGGATGGAGCNGGTCTTCCAGGAAGTATTCCTAAANCTTGTAGAG DCCCDB?DDDDD@DDDDDCDDDFEDDCEGHHEHGGIIGGGGCEIGGIHHICIHJJIEJGHCGFD?1#GHDIIGJJJIGJJJIGGJJIHFA2#HFDD?FC@@ AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66A24G9 YT:Z:UU XS:A:+ NH:i:1


and the tail of the .counts file:

_ambiguous 147010
_empty 90546556
_lowaqual 0
_notaligned 0


I searched a bit online, but it seemed like most people having problems with empty count tables had them completely empty. Any suggestions ? Thanks a lot in advance!
Attached Images
File Type: png Rbm20ForSEQans.resized.png (65.0 KB, 15 views)

Last edited by kajot; 04-01-2014 at 08:10 AM.
kajot is offline   Reply With Quote
Old 04-01-2014, 09:34 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Can you post a reproducible example? The simplest way to do this would be to:

Code:
samtools sort -@ 4 namesorted_sample.bam sorted_sample
samtools index sorted_sample.bam
samtools view sorted_sample.bam 19:53864080-53867080 > subset.sam
and then post "subset.sam" somewhere (an attachment here, copy.com, dropbox, where ever). That should correspond to exonic_bin number 19 (at least it does on GRCm38.71). Then we can look at exactly why things are wonky.
dpryan is offline   Reply With Quote
Old 04-02-2014, 12:44 AM   #3
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi kajot,

I have the feeling that could be related to the protocol that you are using. In the most recent protocol of the illumina runs (at least the latest used by the sequencing facility at EMBL), you get strand specific data, but the reads map opposite to the strand where they come from. For example, for a gene that is in the forward strand, you will get all the reads mapping to the reverse strand. This is a it confusing!

So, what I recommend is that in your IGV browser you could colour your reads depending on the strand where they map to, and then see if what I describe above is the case. If so, I think the parameter "-s reverse" parameter of the dexseq_count.py script should do the trick.

Alejandro
areyes is offline   Reply With Quote
Old 04-02-2014, 07:56 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by areyes View Post
...For example, for a gene that is in the forward strand, you will get all the reads mapping to the reverse strand. This is a [b]it confusing!

So, what I recommend is that in your IGV browser you could colour your reads depending on the strand where they map to, and then see if what I describe above is the case. If so, I think the parameter "-s reverse" parameter of the dexseq_count.py script should do the trick.

Alejandro
More accurately, for paired end reads, read 1 will map to the anti-sense strand while read 2 maps to the sense strand, so in this case you will get reads mapping to both strands. You need to distinguish both mapping strand and read (1 or 2). The attached image from IGV shows mapping of TruSeq Stranded, paired end reads. In this case the gene is transcribed from the forward strand of the reference (left to right). Reads are colored by strand, blue == minus, red == plus (in this highly zoomed in view the reads glyphs also have arrows indicating their direction). The alignments are then grouped by "first-in-pair". The top group is read 1 and the bottom group read 2.

Alejandro is correct that for TruSeq Stranded Kit libraries you need to specify "-s reverse" for dexseq_count.py or htseq-count if you are counting reads to genes for DESeq(2)
Attached Images
File Type: png igv_panel.png (18.1 KB, 21 views)
kmcarr is offline   Reply With Quote
Old 04-02-2014, 09:21 AM   #5
kajot
Member
 
Location: Germany

Join Date: Dec 2013
Posts: 16
Default

Thank you for all your rapid replies! I was swamped today with regular wet-lab work so I had little time to check all your suggestions, I will try to re-run counting script with -p reverse tomorrow morning.

I only managed to get the subset of reads mapping to aforementioned exon 14 of Rbm20. This was the exon that was covered by multiple reads in IGV and received only around 50 counts in DEXseq.

The subset for exon 14 is here:

https://www.dropbox.com/s/simbcsdzyvk8js0/subset.sam

I looked at IGV and displayed reads joined together when they are a pair. What I can see is that for Rbm20 which is on + strand, I have read 1 mapped to - strand, and read 2 mapped to + strand, pair orientation F2R1. As I understand this means I have the "Illumina protocol bug" and have to run counting with -s reverse option ?




--------------------

Update: I just ran the counting script with -s reverse and it all seems to work. I have now almost 6 million reads that are empty, 1.5 milion ambigous and the rest (around 72.5 million) counted. Since I specified fr-firststrand in Tophat, isn't it somehow affecting my alignment ? Shouldn't it be fr-secondstrand then ? Or am I mixing something up here now ?

Last edited by kajot; 04-02-2014 at 09:54 AM.
kajot is offline   Reply With Quote
Old 04-02-2014, 10:17 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by kajot View Post

Update: I just ran the counting script with -s reverse and it all seems to work. I have now almost 6 million reads that are empty, 1.5 milion ambigous and the rest (around 72.5 million) counted. Since I specified fr-firststrand in Tophat, isn't it somehow affecting my alignment ? Shouldn't it be fr-secondstrand then ? Or am I mixing something up here now ?
fr-firststrand is the correct setting to use in Tophat for TruSeq Stranded libraries so you are fine. What 'fr-firststrand' means is the the first read of the pair (or only read for single end reads) matches the 'first strand of cDNA synthesized', which is anti-sense to the RNA.
kmcarr is offline   Reply With Quote
Old 08-04-2018, 12:04 AM   #7
fchen
Junior Member
 
Location: Madison, USA

Join Date: Aug 2018
Posts: 1
Default

Quote:
Originally Posted by areyes View Post

I have the feeling that could be related to the protocol that you are using. In the most recent protocol of the illumina runs (at least the latest used by the sequencing facility at EMBL), you get strand specific data, but the reads map opposite to the strand where they come from. For example, for a gene that is in the forward strand, you will get all the reads mapping to the reverse strand. This is a it confusing!

So, what I recommend is that in your IGV browser you could colour your reads depending on the strand where they map to, and then see if what I describe above is the case. If so, I think the parameter "-s reverse" parameter of the dexseq_count.py script should do the trick.

Alejandro
Thank you!
fchen is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:43 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO