  • Issue with Htseq-count on BAM files from Tophat2 using Galaxy

    Hello,

    I'm currently having trouble using Galaxy. I want to compare differentially expressed genes between two treatment groups. I have already mapped my reads to my reference genome (70% of reads mapped) and am now trying to obtain the count matrix for differential expression using htseq-count. (For information, my data are Illumina HiSeq 2500, paired-end, 125 bp.)

    The mapping was done with TopHat2, but when I ran htseq-count on the resulting BAM files, it returned this error message:

    Fatal error: Unknown error occured Error occured when processing GFF file (line 40 of file /opt/galaxy-dist/database/files/002/052/dataset_2052791.dat): Feature DS10_00012179-RA:exon:1059 does not contain a 'gene_id' attribute [Exception type: ValueError, raised in count.py:53]

    I thought the problem might come from my GFF3 file, so I tried to convert it to a GTF file using the GFF to GTF converter. But I got the following error message:

    Traceback (most recent call last):
      File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/gff_to_gtf.py", line 17, in <module>
        import GFFParser
      File "/opt/shed_tools/toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/6e589f267c14/fml_gff3togtf/GFFParser.py", line 20, in <module>
        import scipy.io as sio
    ImportError: No module named scipy.io

    I read that the cause could be that my BAM files were not sorted by gene ID. So I tried to sort my BAM files using the sort tool from the SAMtools suite, and got an error message again:

    Tool execution generated the following error message:
    Error running samtools sort.
    mv: cannot stat `foo.bam': No such file or directory
    The tool produced the following additional output:
    [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
    Usage: samtools sort [options...] [in.bam]
    Options:
      -l INT                      Set compression level, from 0 (uncompressed) to 9 (best)
      -m INT                      Set maximum memory per thread; suffix K/M/G recognized [768M]
      -n                          Sort by read name
      -o FILE                     Write final output to FILE rather than standard output
      -T PREFIX                   Write temporary files to PREFIX.nnnn.bam
      -@, --threads INT           Set number of sorting and compression threads [1]
      --input-fmt-option OPT[=VAL]
                                  Specify a single input file format option in the form of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                                  Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
                                  Specify a single output file format option in the form of OPTION or OPTION=VALUE
      --reference FILE            Reference sequence FASTA FILE [null]

    I do not understand why I am getting so many error messages. Has anyone faced a similar issue, or does anyone know where these problems come from?

    Thank you in advance

  • #2
    Hi Enriquez,

    You should not need to sort the bam file, so I don't think that's the problem.
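    For what it's worth, the usage text in your error suggests that the installed samtools expects the output file to be given with -o and the temporary prefix with -T, which may not be how the Galaxy tool is calling it. Below is a minimal Python sketch of a call in that style, using only the flags listed in your error output. The file names and the `build_sort_command` helper are made up for illustration; this is not the Galaxy wrapper's actual code.

```python
import shutil
import subprocess


def build_sort_command(in_bam, out_bam, tmp_prefix, by_name=False):
    """Build a samtools sort call using -T PREFIX and -o FILE (hypothetical helper)."""
    cmd = ["samtools", "sort"]
    if by_name:
        cmd.append("-n")  # -n: sort by read name instead of coordinate
    cmd += ["-T", tmp_prefix, "-o", out_bam, in_bam]
    return cmd


def run_sort(cmd):
    """Run the command, but only if samtools is actually on the PATH."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on the PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
cmd = build_sort_command(
    "accepted_hits.bam", "accepted_hits.sorted.bam", "tmp_sort", by_name=True
)
print(" ".join(cmd))
```

    The sketch only prints the command; calling `run_sort(cmd)` would execute it on a machine where samtools is installed.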

    Do you know if your GFF file contains the 'gene_id' attribute? You can open the file in a text editor and check whether it is listed in the last column of the exon lines. If it is not, you can tell htseq-count to use a different attribute as the feature ID with '--idattr'. This option should also be available in Galaxy.
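    If the file is too large to inspect by eye, a small script can list which attribute names actually occur on the exon lines. This is only a rough sketch: the example lines, the scaffold name and the source column are invented, and the parser assumes that no attribute value contains a semicolon.

```python
import re
from collections import Counter


def attribute_keys(attr_field):
    """Return the attribute names found in column 9 of a GFF3 or GTF line."""
    keys = []
    for part in attr_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # GFF3 writes key=value, GTF writes key "value"; take the text
        # before the first '=' or whitespace as the key.
        keys.append(re.split(r"[=\s]", part, maxsplit=1)[0])
    return keys


def count_attribute_keys(lines, feature_type="exon"):
    """Count how often each attribute name appears on lines of one feature type."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        counts.update(attribute_keys(cols[8]))
    return counts


# Invented example lines (not taken from the real annotation).
example = [
    "##gff-version 3\n",
    "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=DS10_00000001;Name=DS10_00000001\n",
    "scaffold1\tmaker\texon\t100\t300\t.\t+\t.\tID=DS10_00000001-RA:exon:1;Parent=DS10_00000001-RA\n",
    "scaffold1\tmaker\texon\t500\t900\t.\t+\t.\tID=DS10_00000001-RA:exon:2;Parent=DS10_00000001-RA\n",
]
key_counts = count_attribute_keys(example)
print(dict(key_counts))
```

    On a real file you would pass an open file handle instead of the `example` list. If 'gene_id' does not show up in the result, one of the names that does appear is a candidate for '--idattr'.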

    I think converting your file from GFF3 to GTF is also a pretty good idea. I think I've done this in the past and it worked. The error you are getting suggests that the Python library 'scipy' is not installed on your Galaxy server. Perhaps you can get the system administrators to install it for you?

    Best,

    Matt.

    • #3
      Hello,
      Sorry for the long delay in replying.
      I set up htseq-count to use "transcript_id" instead of gene_id, and I converted my GFF3 into GTF using gffread. htseq-count finished without reporting any error, but the output matrix contains only zeros:
      Geneid TopHat on data 69 data 16 and data 15: accepted_hits
      DS10_00000001 0
      DS10_00000002 0
      DS10_00000003 0
      DS10_00000004 0
      DS10_00000005 0
      DS10_00000006 0
      DS10_00000007 0
      DS10_00000008 0
      DS10_00000009 0

      Have you run into a similar problem?
      Thanks a lot for your help,
      Thomas

      • #4
        Hi Thomas, nice work figuring out those other problems!

        I'm not sure why HTSeq is outputting only 0s. You might just have to look very carefully at all of the input files. Make sure that the GTF is in the right format and that nothing weird has happened in the conversion. Also make sure that the sequence names in your reference genome (which you used to make the BAM files) match the sequence names in your new GTF.
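        One way to check the names is to compare the @SQ lines of the BAM header (for example, as printed by `samtools view -H`) with the first column of the annotation. Here is a rough sketch; the header and GTF lines are invented, so substitute the real ones.

```python
def sam_header_names(header_lines):
    """Collect reference sequence names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def annotation_names(gtf_lines):
    """Collect the sequence names from column 1 of a GTF/GFF file."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Invented inputs that deliberately do not match each other.
header = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:scaffold1\tLN:5000\n",
    "@SQ\tSN:scaffold2\tLN:3000\n",
]
gtf = [
    'Scaffold_1\tmaker\texon\t100\t300\t.\t+\t.\ttranscript_id "DS10_00000001-RA";\n',
]

bam_names = sam_header_names(header)
gtf_names = annotation_names(gtf)
shared = bam_names & gtf_names
only_in_annotation = gtf_names - bam_names
print("shared:", sorted(shared))
print("only in annotation:", sorted(only_in_annotation))
```

        If the shared set is empty, as in this made-up case, no read can overlap any feature, which would be consistent with a matrix of zeros.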

        Also, by default in HTSeq, multi-mapping reads will not be counted. Perhaps using 'transcript_id' makes reads look as if they map to several features of the same gene, so that they are not counted?

        These are my guesses anyway!

        Good luck,

        Matt.
