Hi I'm working with the Pasilla and DEXSeq bioconductor packagse and I'm trying to replicate the example data, but the flattened .txt files processed using the provided dexseq_count.py script is not matching the example data provided in the source package. I think the suporting R documentation may contain an error (???) that I address in the text below. Thanks in advance for any help. I'm hoping to use these examples in order to understand how they are working so I can process my own data eventually, but want to make sure they are functioning properly with the provided example data as well.
Here is what I did:
- download copies of the source packages for pasilla and DEXSeq (and unpackaged them to folders in my directory)
- copied the python scripts to the directory containing example data for pasilla (/inst/extdata)
- went to http://www.embl.de/~reyes/Graveley/bam/
- downloaded the bam files for treated1.bam and treated2.bam (because the source packages didn't seem to have these, but the above link was reference in the R documentation for downloading the BAM files).
- used the .gff package already processed in the pasilla package
- did the following in terminal
#This is for an example for processing a single-read type alignment
#creates an indexed bam file
$ samtools index treated1.bam
#converts the bam file to a sam file
$ samtools view treated1.bam > treated1.sam
#calls the python script dexseq_count to count the reads in the non-overlapping exotic parts.
$ python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff treated1.sam treated1fb_TEST.txt
#This is an example for processing a paired-end type alignment
#creates an indexed bamfile
$ samtools index treated2.bam
#converts the nam file to a sam file
$ samtools view treated2.bam > treated2.sam
#sorts the sam file by position (-k option); and I'm assuming the bam in the tutorial is a typo here because in the R documentation the output is a sorted BAM file, but in the next line it is a sorted SAM file. I did try the leaving it has a bam file as well, but as expected received an error "no such file or directory"
$ sort -k 1,1 -k2,2n treated2.sam > treated2_sorted.sam
#runs the python script dexseq_count on paired-end data for the processed sam file treated_2_sorted in order to count thee reads in each non-overlapping exonic part.
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff treated2_sorted.sam treated2fb_TEST.txt
- I've provided the first 10 lines of both the example data and the processed output.
I also have a second question; since these python scripts are part of the HTSeq library why must the paired-end data be sorted by position and not by read name using samtools (since it was already called twice previously) as is conducted for the htseq-count script?
Thanks again
output of .txt (1st 10 lines)
SINGLE READ:
treated1fb_pasilla example.txt
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 1
FBgn0000008:003 3
FBgn0000008:004 2
FBgn0000008:005 8
FBgn0000008:006 0
FBgn0000008:007 17
FBgn0000008:008 4
FBgn0000008:009 35
treated1fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 0
FBgn0000008:004 1
FBgn0000008:005 4
FBgn0000008:006 1
FBgn0000008:007 18
FBgn0000008:008 4
FBgn0000008:009 16
PAIRED-END:
treated2fb_pasilla example.txt
FBgn0000003:001 1
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 2
FBgn0000008:006 1
FBgn0000008:007 22
FBgn0000008:008 7
FBgn0000008:009 46
treated2fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 1
FBgn0000008:006 0
FBgn0000008:007 8
FBgn0000008:008 1
FBgn0000008:009 17
Here is what I did:
- download copies of the source packages for pasilla and DEXSeq (and unpackaged them to folders in my directory)
- copied the python scripts to the directory containing example data for pasilla (/inst/extdata)
- went to http://www.embl.de/~reyes/Graveley/bam/
- downloaded the bam files for treated1.bam and treated2.bam (because the source packages didn't seem to have these, but the above link was reference in the R documentation for downloading the BAM files).
- used the .gff package already processed in the pasilla package
- did the following in terminal
#This is for an example for processing a single-read type alignment
#creates an indexed bam file
$ samtools index treated1.bam
#converts the bam file to a sam file
$ samtools view treated1.bam > treated1.sam
#calls the python script dexseq_count to count the reads in the non-overlapping exotic parts.
$ python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff treated1.sam treated1fb_TEST.txt
#This is an example for processing a paired-end type alignment
#creates an indexed bamfile
$ samtools index treated2.bam
#converts the nam file to a sam file
$ samtools view treated2.bam > treated2.sam
#sorts the sam file by position (-k option); and I'm assuming the bam in the tutorial is a typo here because in the R documentation the output is a sorted BAM file, but in the next line it is a sorted SAM file. I did try the leaving it has a bam file as well, but as expected received an error "no such file or directory"
$ sort -k 1,1 -k2,2n treated2.sam > treated2_sorted.sam
#runs the python script dexseq_count on paired-end data for the processed sam file treated_2_sorted in order to count thee reads in each non-overlapping exonic part.
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff treated2_sorted.sam treated2fb_TEST.txt
- I've provided the first 10 lines of both the example data and the processed output.
I also have a second question; since these python scripts are part of the HTSeq library why must the paired-end data be sorted by position and not by read name using samtools (since it was already called twice previously) as is conducted for the htseq-count script?
Thanks again
output of .txt (1st 10 lines)
SINGLE READ:
treated1fb_pasilla example.txt
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 1
FBgn0000008:003 3
FBgn0000008:004 2
FBgn0000008:005 8
FBgn0000008:006 0
FBgn0000008:007 17
FBgn0000008:008 4
FBgn0000008:009 35
treated1fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 0
FBgn0000008:004 1
FBgn0000008:005 4
FBgn0000008:006 1
FBgn0000008:007 18
FBgn0000008:008 4
FBgn0000008:009 16
PAIRED-END:
treated2fb_pasilla example.txt
FBgn0000003:001 1
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 2
FBgn0000008:006 1
FBgn0000008:007 22
FBgn0000008:008 7
FBgn0000008:009 46
treated2fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 1
FBgn0000008:006 0
FBgn0000008:007 8
FBgn0000008:008 1
FBgn0000008:009 17
Comment