Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem working with DEXSeq package

    Hi I'm new to SEQanswers, and fairly new to bioinformatics, but I'm trying to work with the python scripts for the DEXSeq R package through Bioconductor.

    We have paired-end data from human samples, (4 controls 4 treatments), using SOLID. I have access to the BAM files, that were aligned using TopHat (hg19).

    Whenever I get the text output I have nothing but 0 values for all the count values. I was able to work through some of the example data, but am having no luck with our actual data.

    Here are the steps I've taken (and admittedly I'm using the code(s) provided with the packages). I personally don't do a lot of programming but am actively learning, including from many resources I've found browsing the forums on SEQAnswers.

    -Went to http://useast.ensembl.org/info/data/ftp/index.html and obtained the gtf file for humans, and put it in a folder with the python scripts dexseq_prepare annotation.py & dexseq_count.py, and the human bam files.

    In bash terminal:
    - gunzip Homo_sapiens.GRCh37.67.gtf.gz
    - python dexseq_prepare_annotation.py Homo_sapiens.GRCh37.67.gtf Homo_sapiens.GRCh37.67_DEXSeq.gff

    - samtools index human1.bam
    - samtools view human1.bam > human1.sam
    - sort -k 1,1 -k2,2n human1.sam > human1_sorted.sam
    - python dexseq_count.py Homo_sapiens.GRCh37.67_DEXSeq.gff human1_sorted.sam human1_dexseq.txt

    The above executes, and I get a .txt file, but always gets 0 values for all of our samples. I originally thought it might be that we were using SOLID data; however, I also tried with some human Illumina data and had the same end result. Because the only entry not containing the a 0 value is the "empty" I'm thinking it is a problem I have either with the gtf/gff file or more likely not sorting the SAM file properly. Is the above correct? Or am I missing something? Also does the gff file need sorted as well?

    Any help is much appreciated. Thank you. Oh and if it helps I'm working on a Mac OS X 10.7.4.

  • #2
    Hi senkewiczs,

    Do you get a message when running "dexseq_count.py" saying: "X number of reads processed"?

    Could you include the first lines of your files Homo_sapiens.GRCh37.67_DEXSeq.gff and human1_sorted.sam ???

    Thanks,
    Alejandro

    Comment


    • #3
      Hi Alejandro,

      Yes, I was getting 'xxxx number of reads processed.'

      Here is the head of human1_sorted.sam:
      PHP Code:
      1000_1000_1096    73    chr2    110601762    0    50M    *    0    0    CTGTCCAAATGGAAAAATTATTAAACAAATCTTTTTTAAAATAAAATGCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII:    RG:Z:20110216225354574    NH:i:0    CM:i:0    SM:i:1    CQ:Z:BBABBBA?@ABBA@?@@A@BB?B@A>=@>A=AA??@@@><?@?==@;<6:    CS:Z:T22112010031020000303303001100322000003000330003132
      1000_1000_1096    133    *    0    0    *    chr2    108495505    0    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *    RG:Z:20110216225354574    NH:i:0    CQ:Z:BBA@BB@+<?B><>A>2'%@A?13:==3%@BA)5%    CS:Z:G00103210022022202223321212322331323
      1000_1000_1096    133    *    0    0    *    chr2    110601762    0    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *    RG:Z:20110216225354574    NH:i:0    CQ:Z:BBA@BB@+<?B><>A>2'%@A?13:==3%@BA)5%    CS:Z:G00103210022022202223321212322331323
      1000_1000_1587    81    chr3    96336408    0    25H25M    =    96336433    86    AGGCTTATGCGGAGGAGAATGTTTT    FIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:20110216225354574    NH:i:2    CM:i:0    SM:i:3    CQ:Z:@<@;<;>;>?>?==>@:5@=85:>-:>8:99=.?9><:?9>:1?69><93    CS:Z:T30001130222022033133023021331212230300011123012202
      1000_1000_1587    161    chr3    96336433    0    33M2H    =    96336408    -86    CATGTTACTTATACTAACATTAGTTCTTCTATA    IH8IHGII>.<H=7AIII=2?IE7.CD0+>II8    RG:Z:20110216225354574    NH:i:2    CM:i:0    SM:i:3    CQ:Z:B@)0A(@;9&)45)/3@>1-&:42&);*'%:@/*)    CS:Z:G31311031203331230113032102202233321
      1000_1000_1673    99    chr14    79703924    51    50M    =    79703974    84    AGAGACACTGGGCCTTTTCTCCTTTACTCCAAGAAGAAATGCTCTTTTTT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII=    RG:Z:20110216225354574    NH:i:1    CM:i:0    SM:i:100    CQ:Z:BBBBAA<BBA@ABBBBABBB@BAA>BAB?B@>BAAB?@B>;@A@<BB>==    CS:Z:T32222111210030200022202003122010220220031322200000
      1000_1000_1673    147    chr14    79703974    51    35M    =    79703924    -84    CATATATCTTTATTGAATACCTATTATGGTCCATG    @IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:20110216225354574    NH:i:1    CM:i:0    SM:i:65    CQ:Z:BBB?BBBBABB)BABBBAA=AA7AA=;A@AA??=@    CS:Z:G31310210133033201330210330022333331
      1000_1000_1921    83    chr10    15880069    45    6H44M    =    15879996    -116    TAATTTAAAGGCAAATTAATCCTGAAGCAATGGATTTAAAATTT    EIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:20110216225354574    NH:i:1    CM:i:0    SM:i:97    CQ:Z:BBBB=>@BB:A>BB@97BB;;=BB<;0BB@99>@<4>@@=03<A%%)%;8    CS:Z:T30030003003201301320212023030300130200300303333030
      1000_1000_1921    163    chr10    15879996    45    33M2H    =    15880069    116    TTAATGTTTTTGAAAAGCGTATCTGGGTAGTTA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC    RG:Z:20110216225354574    NH:i:1    CM:i:0    SM:i:58    CQ:Z:B6BBBB7ABB?9BABB)BBAB.@<B?4A5=;2<(B    CS:Z:G10303110000120002331332210013210333
      1000_1000_260    99    chr2    145882329    42    50M    =    145882410    112    AATGTATACAGGACTTATATGCTGAAAACATGTTGCTGAAAAAATCAAAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7AIIIIIIII1    RG:Z:20110216225354574    NH:i:1    CM:i:0    SM:i:100    CQ:Z:BBBA@BABA=7@@A<<ABA?@B@@<;4?@=B=>98B=@83%=A@=0?@?1    CS:Z:T30311333112021203333132120001131101321200000321002


      Here is the head of Homo_sapiens.GRCh37.67_DEXSeq.gff:

      PHP Code:
      1    Homo_sapiens.GRCh37.67.gtf    aggregate_gene    11869    14412    .    +    .    gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    11869    11871    .    +    .    transcripts "ENST00000456328"exonic_part_number "001"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    11872    11873    .    +    .    transcripts "ENST00000456328+ENST00000515242"exonic_part_number "002"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    11874    12009    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000518655"exonic_part_number "003"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12010    12057    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"exonic_part_number "004"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12058    12178    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000518655"exonic_part_number "005"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12179    12227    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"exonic_part_number "006"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12595    12612    .    +    .    transcripts "ENST00000518655"exonic_part_number "007"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12613    12697    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000450305+ENST00000518655"exonic_part_number "008"gene_id "ENSG00000223972"
      1    Homo_sapiens.GRCh37.67.gtf    exonic_part    12698    12721    .    +    .    transcripts "ENST00000456328+ENST00000515242+ENST00000518655"exonic_part_number "009"gene_id "ENSG00000223972" 
      It appears that part of the problem is differences in the annotation of chromosomes between the two files?

      Thanks in advance for any advice......

      Comment


      • #4
        Indeed, I believe that is the problem! The chromosome names should match

        Alejandro

        Comment


        • #5
          Hi there,

          I am new to DEXSeq. I want to know how to calculate the number of reads mapping to each of the exons of a genome. I mean how to prepare input for the DEXSeq analysis.

          Many thanks
          Ashwini

          Comment


          • #6
            Hi Ashwini,

            The pasilla package is an example dataset for the DEXSeq vignette and in the pasilla vignette you will find the steps to create an ExonCountSet object.

            Alejandro

            Comment


            • #7
              Hi Alejandro,

              Thanks for your quick response. OK I think in section 9 I can found this creating ExonCountSet objects. can you please also let me know that do I need to write any python script or I can found this in the DEXSeq package.

              Thanks again
              Ashwini
              Ashwini

              Comment


              • #8
                You will find these scripts in the DEXSeq package directory!
                Last edited by areyes; 08-21-2012, 07:08 AM.

                Comment


                • #9
                  Many Thanks!!!!!!!!!!!!!! Alejandro
                  You saved me

                  Comment


                  • #10
                    Hi Alejandro,

                    I have run both the python scripts-
                    1.[akumar@mars python_scripts]$ /apps/python262/bin/python dexseq_prepare_annotation.py /homes/akumar/PROJECT_FOLDER/Homo_sapiens.GRCh37.68.gtf Homo_sapiens.GRCh37.68.gff

                    2. [akumar@mars python_scripts]$ /apps/python262/bin/python dexseq_count.py Homo_sapiens.GRCh37.68.gff /homes/akumar/PROJECT_FOLDER/tophat_paired_sorted.sam Output.txt

                    Now the output is -48100000 reads processed.

                    [akumar@mars python_scripts]$ head Output.txt
                    ENSG00000000003:001 6
                    ENSG00000000003:002 0
                    ENSG00000000003:003 1
                    ENSG00000000003:004 1
                    ENSG00000000003:005 0
                    ENSG00000000003:006 0
                    ENSG00000000003:007 0
                    ENSG00000000003:008 0
                    ENSG00000000003:009 1
                    ENSG00000000003:010 0

                    So now what I have to do, how to create ExonCountSet and how to do the further analysis. I am stuck here, please help me.

                    Many thanks
                    Ashwini

                    Comment


                    • #11
                      Check the function read.HTSeqCounts from DEXSeq. This will read this files, parse relevant information for the package and return an ExonCountSet object in your R session.

                      Comment


                      • #12
                        Thanks Alejandro

                        Comment


                        • #13
                          Hi Alejandro,

                          I am wondering that when I am converting the bam file to sam file using samtools at that time is there any need to sort the sam file. If so then which command I have to use.

                          Thanks
                          Ashwini

                          Comment


                          • #14
                            Originally posted by Ashwini Kumar View Post
                            I am wondering that when I am converting the bam file to sam file using samtools at that time is there any need to sort the sam file. If so then which command I have to use.
                            Sorting is required only for paired-end data; for single-end data, there is no need for sorting. To sort by read name, use 'samtools sort -n'.

                            Comment


                            • #15
                              Originally posted by areyes View Post
                              Indeed, I believe that is the problem! The chromosome names should match

                              Alejandro
                              So should we add "chr" to the gff file?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:49 AM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              61 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X