Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issue with htseq-count

    Hello,
    I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:

    Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.

    This occurs for each read and the job stops running.

    Here is my command line for htseq-count:
    module load python/2.7.3
    htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out

    I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5’ UTR and 3’ UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.

    My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?

    I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.

    Thanks!

  • #2
    There are good reasons to align against the genome, not the transcriptome (see various previous threads here). Please don't deviate from the accepted best practise unless you know what you are doing. Especially, using TopHat and htseq-count when aligning against a transcriptome does not make any sense. (Read up on what these tools do precisely to se why.)

    Comment


    • #3
      Simon,
      Thank you for your response. I re-did the alignment using the complete nucleotide sequence (Gmax_109.fa) instead of the transcriptome using TopHat. I then sorted the accepted_hits.bam file by read name (-n) using samtools and then converted this sorted file into a SAM file. When I loaded this file into htseq-count I am getting this error repeatedly:
      Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:1886:109580', because chromosome 'scaffold_785', to which it has been aligned, did not appear in the GFF file.

      I am still using the same htseq-count code as before the same GFF3 file. I am not really sure how to tackle this error.

      Thanks!

      Comment


      • #4
        So, does the scaffold appear in your GFF file?

        Have you checked whether the chromosome names in the genome fasta file and in the gff file match?

        Comment


        • #5
          Simon,
          It looks like my gff file does not contain information about scaffold_785 and that is why reads mapping to scaffold_785 are skipped for counting. I'm assuming then that there is most likely there is no annotation information available for scaffold_785 and so is not present in gff file. There are only a handful of scaffolds that are being skipped. My chromosome names are the same in both files.

          I re-ran the analysis and got an output file with:
          no_feature 2421689
          ambiguous 345554
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 19096489

          I started with 44,613,416 reads so I'm assuming this output is "normal" (?).

          Thanks for the help!

          Comment


          • #6
            Should be okay. Only the fraction of reads with ambiguous alignment is quite high. Transcribed sequence is usually not that repetitive that you would have so many reads mapping to more than one location. Maybe somethig went wrong in your assembly, and some contigs appear in several scaffolds.

            Comment


            • #7
              Would a large number of isoforms not account for the high number of ambiguous alignments? I was also going to try running my htseq-count alignment with a different mode. Currently I am using union, but I was also going to try running it with intersection_nonempty to see what kind of output I get.

              If it is a problem with my alignment how would I troubleshoot that? Thanks!

              Comment


              • #8
                Did you align against the genome or the transcriptome? If the latter, using htseq-count is pointless, of course.

                Comment


                • #9
                  I re-did the alignment against the genome.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-27-2024, 06:37 PM
                  0 responses
                  12 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-27-2024, 06:07 PM
                  0 responses
                  11 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  53 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  69 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X