Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issue with htseq-count

    Hello,
    I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:

    Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.

    This occurs for each read and the job stops running.

    Here is my command line for htseq-count:
    module load python/2.7.3
    htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out

    I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5’ UTR and 3’ UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.

    My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?

    I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.

    Thanks!

  • #2
    There are good reasons to align against the genome, not the transcriptome (see various previous threads here). Please don't deviate from the accepted best practise unless you know what you are doing. Especially, using TopHat and htseq-count when aligning against a transcriptome does not make any sense. (Read up on what these tools do precisely to se why.)

    Comment


    • #3
      Simon,
      Thank you for your response. I re-did the alignment using the complete nucleotide sequence (Gmax_109.fa) instead of the transcriptome using TopHat. I then sorted the accepted_hits.bam file by read name (-n) using samtools and then converted this sorted file into a SAM file. When I loaded this file into htseq-count I am getting this error repeatedly:
      Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:1886:109580', because chromosome 'scaffold_785', to which it has been aligned, did not appear in the GFF file.

      I am still using the same htseq-count code as before the same GFF3 file. I am not really sure how to tackle this error.

      Thanks!

      Comment


      • #4
        So, does the scaffold appear in your GFF file?

        Have you checked whether the chromosome names in the genome fasta file and in the gff file match?

        Comment


        • #5
          Simon,
          It looks like my gff file does not contain information about scaffold_785 and that is why reads mapping to scaffold_785 are skipped for counting. I'm assuming then that there is most likely there is no annotation information available for scaffold_785 and so is not present in gff file. There are only a handful of scaffolds that are being skipped. My chromosome names are the same in both files.

          I re-ran the analysis and got an output file with:
          no_feature 2421689
          ambiguous 345554
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 19096489

          I started with 44,613,416 reads so I'm assuming this output is "normal" (?).

          Thanks for the help!

          Comment


          • #6
            Should be okay. Only the fraction of reads with ambiguous alignment is quite high. Transcribed sequence is usually not that repetitive that you would have so many reads mapping to more than one location. Maybe somethig went wrong in your assembly, and some contigs appear in several scaffolds.

            Comment


            • #7
              Would a large number of isoforms not account for the high number of ambiguous alignments? I was also going to try running my htseq-count alignment with a different mode. Currently I am using union, but I was also going to try running it with intersection_nonempty to see what kind of output I get.

              If it is a problem with my alignment how would I troubleshoot that? Thanks!

              Comment


              • #8
                Did you align against the genome or the transcriptome? If the latter, using htseq-count is pointless, of course.

                Comment


                • #9
                  I re-did the alignment against the genome.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Advancing Precision Medicine for Rare Diseases in Children
                    by seqadmin




                    Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                    12-16-2024, 07:57 AM
                  • seqadmin
                    Recent Advances in Sequencing Technologies
                    by seqadmin



                    Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                    Long-Read Sequencing
                    Long-read sequencing has seen remarkable advancements,...
                    12-02-2024, 01:49 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 12-17-2024, 10:28 AM
                  0 responses
                  33 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-13-2024, 08:24 AM
                  0 responses
                  48 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-12-2024, 07:41 AM
                  0 responses
                  34 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-11-2024, 07:45 AM
                  0 responses
                  46 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X