Hello,
I wanted to ask the community for help with my problem related to using DEXseq, particularily generating count tables with the Python script. My problem is as follows:
I have paired-end, strand specific (Illumina TruSeq stranded) RNA-seq data, 101bp read length, multiplexed. The rawdata was aligned using Tophat2 2.0.8 with Bowtie2, no novel junctions. The annotation file used was GRCm38 from Ensembl (mouse) - both the full genome fasta and .gtf annotation file.
The bam files are around 16GB each, the example .bam I am trying to find solution on has 176609632 reads. Bam files were sorted by name using samtools (samtools sort -n) As next, I flattened the .gtf file using the Python annotation preparation script from DEXSeq which produced a .gff flat. A quick inspection of the formed .gff shows it is "healthy".
Now the problem begins, even though my dataset is pretty large, most of the exons in the count table have 0 counts. Some of them have few, there are single occurences of exons with 2000 counts, but they are very rare. Naturally, DEXseq returned NA for all the p-values since it is impossible to do anything with just zeroes.
I loaded the .bam file into IGV to visually inspect if what the count table tells me is true, but it is absolutely not right. As an example of Rbm20 gene, in IGV I can see the exon14 having high coverage whereas it seems almost empty in the count file. As you can see it's not a matter of the script not working at all, it is working only partially.
The exon 14 of Rbm20 is exon bin 19 for HTSeq, here is the counting part fo this gene:
ENSMUSG00000043639:001 0
ENSMUSG00000043639:002 0
ENSMUSG00000043639:003 0
ENSMUSG00000043639:004 0
ENSMUSG00000043639:005 0
ENSMUSG00000043639:006 2
ENSMUSG00000043639:007 1
ENSMUSG00000043639:008 0
ENSMUSG00000043639:009 0
ENSMUSG00000043639:010 0
ENSMUSG00000043639:011 0
ENSMUSG00000043639:012 4
ENSMUSG00000043639:013 1
ENSMUSG00000043639:014 1
ENSMUSG00000043639:015 1
ENSMUSG00000043639:016 0
ENSMUSG00000043639:017 0
ENSMUSG00000043639:018 0
ENSMUSG00000043639:019 15
Now look at that last exon on my IGV screenshot in the attachment.
And here are small excerpts from my .gtf and my .bam:
.GTF:
19 protein_coding exon 53859392 53859513 . + . exon_id "ENSMUSE00001178201"; exon_number "13"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding CDS 53864080 53864187 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; protein_id "ENSMUSP00000129447"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 nonsense_mediated_decay exon 53864080 53867080 . + . exon_id "ENSMUSE00001167670"; exon_number "7"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P15978"; transcript_id "ENSMUST00000161856"; transcript_name "Rbm20-002"; tss_id "TSS56270";
19 protein_coding exon 53864080 53867080 . + . exon_id "ENSMUSE00001107658"; exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding stop_codon 53864188 53864190 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
Part of flattened .gff:
1 dexseq_prepare_annotation.py aggregate_gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py exonic_part 3054233 3054733 . + . transcripts "ENSMUST00000160944"; exonic_part_number "001"; gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py aggregate_gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py exonic_part 3102016 3102125 . + . transcripts "ENSMUST00000082908"; exonic_part_number "001"; gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py aggregate_gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3205901 3206522 . - . transcripts "ENSMUST00000162897"; exonic_part_number "001"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3206523 3207317 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "002"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213439 3213608 . - . transcripts "ENSMUST00000159265"; exonic_part_number "003"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213609 3214481 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "004"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3214482 3215632 . - . transcripts "ENSMUST00000070533+ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "005"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3215633 3216344 . - . transcripts "ENSMUST00000162897+ENSMUST00000070533"; exonic_part_number "006"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3216345 3216968 . - . transcripts "ENSMUST00000070533"; exonic_part_number "007"; gene_id "ENSMUSG00000051951"
and my .bam (first few lines):
DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 163 MT 6628 50 101M = 6668 141 CAGGAATACCACGACGCTACTCAGACTACCCAGATGCTTACACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATC ?;=?D:BBDD3:CEAEEDE?3?<DD?D@DDCCDDII@DEADEBDBDII@@@DCC=CCDDDCACDD@ADDD;;@;ADDEDAA:>>AAAA>A?########## AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 83 MT 6668 50 101M = 6628 -141 CACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATCATGATNTTTATAATTTGAGAGGCCTTTGCTNCAAAACGAG B?BBABEDA?@>A@CCCCBB@;@;B=CE?3FCIEACCEFFGFFIIIGFFIEFCFBGFEFEEGDB:1#GCFFFIIIIIIFFGGCFEGFC<22#DD1;AB;@? AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66C24T9 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 163 13 55133172 50 101M = 55133267 196 GACCTCCTCAAGAAGATTCTGGGGGAACTAAAAAGGAATTTAGAACTGGTGCAGGTTCTACAGAGAGCAATAGGGTCCCCAGACCCTAACCTGAGAAATTC @@@DBDDEDDHHHEIGIGGIIIIIIBHGHIGIE@GIICGIIIGHIIAFIBCECGGCHGGIIHHGGEE=DCC@C@B>CABD?BDDBDDDCBBDDCDCDDCD@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 83 13 55133267 50 101M = 55133172 -196 AAATTCCTTCTGGGTTTGTGTCCTGGGATCCACTGCAGTGGGTTTATACATAACTGTGGATGGAGCNGGTCTTCCAGGAAGTATTCCTAAANCTTGTAGAG DCCCDB?DDDDD@DDDDDCDDDFEDDCEGHHEHGGIIGGGGCEIGGIHHICIHJJIEJGHCGFD?1#GHDIIGJJJIGJJJIGGJJIHFA2#HFDD?FC@@ AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66A24G9 YT:Z:UU XS:A:+ NH:i:1
and the tail of the .counts file:
_ambiguous 147010
_empty 90546556
_lowaqual 0
_notaligned 0
I searched a bit online, but it seemed like most people having problems with empty count tables had them completely empty. Any suggestions ? Thanks a lot in advance!
I wanted to ask the community for help with my problem related to using DEXseq, particularily generating count tables with the Python script. My problem is as follows:
I have paired-end, strand specific (Illumina TruSeq stranded) RNA-seq data, 101bp read length, multiplexed. The rawdata was aligned using Tophat2 2.0.8 with Bowtie2, no novel junctions. The annotation file used was GRCm38 from Ensembl (mouse) - both the full genome fasta and .gtf annotation file.
The bam files are around 16GB each, the example .bam I am trying to find solution on has 176609632 reads. Bam files were sorted by name using samtools (samtools sort -n) As next, I flattened the .gtf file using the Python annotation preparation script from DEXSeq which produced a .gff flat. A quick inspection of the formed .gff shows it is "healthy".
Now the problem begins, even though my dataset is pretty large, most of the exons in the count table have 0 counts. Some of them have few, there are single occurences of exons with 2000 counts, but they are very rare. Naturally, DEXseq returned NA for all the p-values since it is impossible to do anything with just zeroes.
I loaded the .bam file into IGV to visually inspect if what the count table tells me is true, but it is absolutely not right. As an example of Rbm20 gene, in IGV I can see the exon14 having high coverage whereas it seems almost empty in the count file. As you can see it's not a matter of the script not working at all, it is working only partially.
The exon 14 of Rbm20 is exon bin 19 for HTSeq, here is the counting part fo this gene:
ENSMUSG00000043639:001 0
ENSMUSG00000043639:002 0
ENSMUSG00000043639:003 0
ENSMUSG00000043639:004 0
ENSMUSG00000043639:005 0
ENSMUSG00000043639:006 2
ENSMUSG00000043639:007 1
ENSMUSG00000043639:008 0
ENSMUSG00000043639:009 0
ENSMUSG00000043639:010 0
ENSMUSG00000043639:011 0
ENSMUSG00000043639:012 4
ENSMUSG00000043639:013 1
ENSMUSG00000043639:014 1
ENSMUSG00000043639:015 1
ENSMUSG00000043639:016 0
ENSMUSG00000043639:017 0
ENSMUSG00000043639:018 0
ENSMUSG00000043639:019 15
Now look at that last exon on my IGV screenshot in the attachment.
And here are small excerpts from my .gtf and my .bam:
.GTF:
19 protein_coding exon 53859392 53859513 . + . exon_id "ENSMUSE00001178201"; exon_number "13"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding CDS 53864080 53864187 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; protein_id "ENSMUSP00000129447"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 nonsense_mediated_decay exon 53864080 53867080 . + . exon_id "ENSMUSE00001167670"; exon_number "7"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P15978"; transcript_id "ENSMUST00000161856"; transcript_name "Rbm20-002"; tss_id "TSS56270";
19 protein_coding exon 53864080 53867080 . + . exon_id "ENSMUSE00001107658"; exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
19 protein_coding stop_codon 53864188 53864190 . + 0 exon_number "14"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000043639"; gene_name "Rbm20"; p_id "P13408"; transcript_id "ENSMUST00000164202"; transcript_name "Rbm20-201"; tss_id "TSS23511";
Part of flattened .gff:
1 dexseq_prepare_annotation.py aggregate_gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py exonic_part 3054233 3054733 . + . transcripts "ENSMUST00000160944"; exonic_part_number "001"; gene_id "ENSMUSG00000090025"
1 dexseq_prepare_annotation.py aggregate_gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py exonic_part 3102016 3102125 . + . transcripts "ENSMUST00000082908"; exonic_part_number "001"; gene_id "ENSMUSG00000064842"
1 dexseq_prepare_annotation.py aggregate_gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3205901 3206522 . - . transcripts "ENSMUST00000162897"; exonic_part_number "001"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3206523 3207317 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "002"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213439 3213608 . - . transcripts "ENSMUST00000159265"; exonic_part_number "003"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3213609 3214481 . - . transcripts "ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "004"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3214482 3215632 . - . transcripts "ENSMUST00000070533+ENSMUST00000162897+ENSMUST00000159265"; exonic_part_number "005"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3215633 3216344 . - . transcripts "ENSMUST00000162897+ENSMUST00000070533"; exonic_part_number "006"; gene_id "ENSMUSG00000051951"
1 dexseq_prepare_annotation.py exonic_part 3216345 3216968 . - . transcripts "ENSMUST00000070533"; exonic_part_number "007"; gene_id "ENSMUSG00000051951"
and my .bam (first few lines):
DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 163 MT 6628 50 101M = 6668 141 CAGGAATACCACGACGCTACTCAGACTACCCAGATGCTTACACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATC ?;=?D:BBDD3:CEAEEDE?3?<DD?D@DDCCDDII@DEADEBDBDII@@@DCC=CCDDDCACDD@ADDD;;@;ADDEDAA:>>AAAA>A?########## AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1206:88736 83 MT 6668 50 101M = 6628 -141 CACCACATGAAACACTGTCTCTTCTATAGGATCATTTATTTCACTAACAGCTGTTCTCATCATGATNTTTATAATTTGAGAGGCCTTTGCTNCAAAACGAG B?BBABEDA?@>A@CCCCBB@;@;B=CE?3FCIEACCEFFGFFIIIGFFIEFCFBGFEFEEGDB:1#GCFFFIIIIIIFFGGCFEGFC<22#DD1;AB;@? AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66C24T9 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 163 13 55133172 50 101M = 55133267 196 GACCTCCTCAAGAAGATTCTGGGGGAACTAAAAAGGAATTTAGAACTGGTGCAGGTTCTACAGAGAGCAATAGGGTCCCCAGACCCTAACCTGAGAAATTC @@@DBDDEDDHHHEIGIGGIIIIIIBHGHIGIE@GIICGIIIGHIIAFIBCECGGCHGGIIHHGGEE=DCC@C@B>CABD?BDDBDDDCBBDDCDCDDCD@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU XS:A:+ NH:i:1
DHCDZDN1:177:C31G2ACXX:3:1101:1207:49279 83 13 55133267 50 101M = 55133172 -196 AAATTCCTTCTGGGTTTGTGTCCTGGGATCCACTGCAGTGGGTTTATACATAACTGTGGATGGAGCNGGTCTTCCAGGAAGTATTCCTAAANCTTGTAGAG DCCCDB?DDDDD@DDDDDCDDDFEDDCEGHHEHGGIIGGGGCEIGGIHHICIHJJIEJGHCGFD?1#GHDIIGJJJIGJJJIGGJJIHFA2#HFDD?FC@@ AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:66A24G9 YT:Z:UU XS:A:+ NH:i:1
and the tail of the .counts file:
_ambiguous 147010
_empty 90546556
_lowaqual 0
_notaligned 0
I searched a bit online, but it seemed like most people having problems with empty count tables had them completely empty. Any suggestions ? Thanks a lot in advance!
Comment