Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • IGV sees many counts, but HTSeq-count counts 0

    I appreciate any help to figure out this puzzle.

    What I did:

    1. I first used tophat2 to map the reads to the genome, get accepted.bam. This file is ~900 MB. Here is the alignment summary:
    ------------------------------------------------------------------------------------
    Reads:
    Input : 45593954
    Mapped : 44546323 (97.7% of input)
    of these: 1496005 ( 3.4%) have multiple alignments (7 have >20)
    97.7% overall read mapping rate.
    ------------------------------------------------------------------------------------

    2. I then used samtools to convert the bam file to sam file for htseq-count to handle:

    samtools view -h -o WT3.sam WT3_th2out/accepted_hits.bam

    The output sam file is quite large, is ~11 GB.

    3. Finally I passed the sam file to htseq-count to generate a count table:

    htseq-count --stranded=no WT3.sam genes.gtf > WT3_counts

    The first line of the output is just one number followed by a tab:
    " 1851236"
    I assume this is the total counts for all genes (Is this right?).

    At the end of the file it says:
    ------------------------------------------------------------------------------------
    __no_feature 5542374
    __ambiguous 5280500
    __too_low_aQual 0
    __not_aligned 0
    __alignment_not_unique 3266554
    ------------------------------------------------------------------------------------

    The questions:

    I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?

    II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.

    What I have checked:

    The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).

    The genes of interest I was looking at are not duplicated genes, so the mapped reads inside them should be mostly unique.

    I am desparate to get some explanation for this.

    Thanks!
    Shaohe

  • #2
    The first line of the output is just one number followed by a tab:
    " 1851236"
    That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.

    I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?
    Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.

    II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.
    Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).

    [QUOTE] The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).[QUOTE]

    Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.

    Comment


    • #3
      Hi Shaohe,

      could you please provide an image of one of these cases, including the gtf-annotation in the IGV?

      It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.

      Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
      In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.

      BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.

      Cheers,
      Michael

      Comment


      • #4
        Hi Ryan,

        Thank you very much for your quick response.

        Originally posted by dpryan View Post
        That shouldn't happen, there's probably an error in the GTF file. BTW, just give htseq-count the BAM file.
        Thanks for the heads up. I pulled out the numbers that were assigned to genes and they sum up to be ~30 million, much closer to the mapped reads number (43 million). However, it is quite unsatisfying to have this weird empty head thing that takes up 1.8 million counts.

        It'll be really nice to bypass the large sam files. However, when I tried to use BAM file directly with "-f bam", an error occured saying I don't have pysam module. So I tried to install pysam by "pip install pysam", which failed with an error (downloading the pysam tarball and install from there has a similar error):

        Code:
        Command "/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c "import setuptools, tokenize;__file__='/private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-xfBXDS-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/k0/1s_c443x4sz_k0xlt15g2z_r0000gn/T/pip-build-LF1tZX/pysam
        Please help me if you know how to fix this problem!

        Just because an alignment is "unique" to the genome doesn't mean that (A) it overlaps a gene or (B) that it overlaps only 1 gene.
        The genes I was looking at are not that complicated, so I didn't except to get "0" for them. In fact, when I use cufflinks I got 96 fkpm for one gene -- a snapshot from IGV for this example gene is attached.

        Perhaps they're multimappers. You can always use the -o option and grep for a few read names to see why they're not being counted (it's incredibly unlikely that htseq-count is doing something wrong).
        The first five lines of the -o output is:

        Code:
        ['HWI-D00611:139:C6VKWANXX:8:2203:1197:2158\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1314:14632:94054\t272\tchrI\t2241\t0\t50M\t*\t0\t0\tCGCCCACGAAACCGTGCCGCACGTGTGGGTTTACGAGCTGAATATTTTCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:9569203\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1101:12688:99667\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:1108:19834:88650\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\tFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n', 'HWI-D00611:139:C6VKWANXX:8:2111:11430:84665\t272\tchrI\t2523\t0\t50M\t*\t0\t0\tATATTTATCAATTTTCTTCTGAGAGTCTCATTGAGACTCTTATTTACGCC\t<FBFFF/FFBBFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFBBBB/\tAS:i:0\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\tNH:i:7\tCC:Z:=\tCP:i:5594774\tHI:i:0\tXF:Z:__alignment_not_unique\n']
        I have no idea what these are, but all five have the feature "__alignment_not_unique". I don't know how or if there is method to pinpoint the reads that aligned to particular genes.

        Avoid using GTF files from UCSC, they tend to have a variety of problems. Stick to Ensembl for everything.
        I'll try the Ensembl genome assembly and GTF files. BTW I am using C. elegans rather than human genome and I heard UCSC is pretty good. Apparently this is not the case since they have some empty gene IDs and names for lots of entries (see my reply to Michael below).

        Many thanks!
        Shaohe
        Attached Files
        Last edited by snownontrace; 08-25-2015, 06:58 PM.

        Comment


        • #5
          Hi Michael,
          Also thank you for your quick and kind reply!

          Originally posted by Michael.Ante View Post
          could you please provide an image of one of these cases, including the gtf-annotation in the IGV?
          Does the image in my last post do the job? This is one of the genes I was curious about, which has many reads in the coding region, but the count table says "0" for it.

          It seems to me, that the first line in your result file is due to the fact, that some gene_ids are ether empty or not not set properly (e.g. special chars?). Thus, your last gene with this issue counted 1.8 M reads.
          That's likely the reason! When I inspect the gtf file, this is the example gene entries I posted the IGV picture for:
          Code:
          chrIII	unknown	CDS	10808247	10808360	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10808247	10808360	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	start_codon	10808247	10808249	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	CDS	10808501	10808717	.	+	0	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	exon	10808501	10808717	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	CDS	10808766	10809423	.	+	2	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	exon	10808766	10809423	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	CDS	10809474	10809935	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	exon	10809474	10809935	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	CDS	10809986	10810297	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	exon	10809986	10810297	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	CDS	10810345	10810648	.	+	1	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	exon	10810345	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id ""; gene_name ""; p_id "P6745"; transcript_id "NM_001027506"; tss_id "TSS5767";
          chrIII	unknown	stop_codon	10810649	10810651	.	+	.	gene_id "klp-7"; gene_name "klp-7"; p_id "P25071"; transcript_id "NM_001027507"; tss_id "TSS33773";
          As you may notice, for many exons there is a duplicated entry with EMPTY gene_name and gene_id, likely explaining the 1.8 million counts assigned to "EMPTY" in line 1 of the output.

          Regarding your first question, you have to account the no_feature reads (intronic or inter-genic mapped), as well as the ambiguous reads (reads mapping in overlapping features). These would sum up to 10.8 M reads.
          In the alignments_not_unique the number of alignments (not reads) are counted, you can leave this out.
          Thanks for pointing that out. I totally thought the first line of the output is the sum, so I just added that with the special features. In fact if I add all counts including the ones assigned to genes, the special features and 1.8 million from line 1, they add up to 43 million.

          BTW: With the new version of HTSeq-count, you are using, you don't need to convert your bam file.
          Thanks for the tip. As I was saying in the reply to Ryan, I got error when installing the pysam module, which prevented me from using bam file directly. I'd appreciate it if you have some insights into that.

          Thanks!
          Shaohe
          Last edited by snownontrace; 08-25-2015, 06:56 PM.

          Comment


          • #6
            The example read was marked "not unique" because it was a multimapper (it had 7 alignments). Those should rightfully be excluded.

            BTW, you might just use featureCounts instead. It's MUCH faster than HTSeq-count and you can just avoid the pysam installation problem (no clue what's going on there).

            Comment


            • #7
              Having a look at your GTF explains, why you have a quite high number of ambiguous reads:
              You have in your example a lot of exons with the same positions, but different gene_ids. Each read falling into those exons cannot be assigned unambiguously and is therefore not counted for your genes.

              If you want to proceed with this data, you could use an awk command to clean the GTF:
              Code:
              awk 'length($10)>3{print $0}'  genes.gtf > clean_genes.gtf
              This will discard al gene_ids containing only ""; .

              Comment


              • #8
                Thanks! I'll try featureCounts. From the paper it looks decent!
                Shaohe

                Comment


                • #9
                  Thanks Michael. I decided to stay away from this problematic gtf file and instead use the one from Ensembl. I wrote a python script to convert the chromosome names from (I, II, III, IV, V, X, MtDNA) to (chrI, chrII, chrIII, chrIV, chrV, chrX and chrM) to make it compatible with the sam files I already have.

                  Here is the script if someone happens to run into the same problem as mine and do not want to spend another day to re-do the mapping.

                  Code:
                  import re
                  
                  folder = '/Users/snownontrace/Desktop/RNAseq/'
                  datafile = folder + "genes_Ensembl_WBcel235.gtf"
                  
                  romanNumeral = re.compile('^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$') #this is strict Roman numerals regex that check order
                  
                  newGTFfile = open(datafile.split('.')[0]+'_converted_chromosome.gtf','w')
                  
                  with open(datafile) as f:
                      for r in f:
                          if romanNumeral.match(r.split()[0]):
                              newGTFfile.write('chr'+r)
                          else:
                              newGTFfile.write('chrM'+'\t'+'\t'.join(r.split()[1:])+'\n')
                  Shaohe

                  Comment


                  • #10
                    Hi Shaohe, featureCounts can automatically match up chr names with or without the 'chr' prefix in the annotation and bam/sam files (eg. 'chrI' is matched with 'I'). This wont work for your 'MtDNA' chr though, but you can use the '-A' option to provide chr aliases to the program. You do not need to redo the mapping.

                    Wei

                    Comment


                    • #11
                      Hi Wei,

                      The script I posted did the conversion of chromosome names and I finished my analysis. I am still very interested in trying featureCounts and will post questions here to get help if stuck.

                      Thanks!
                      Shaohe

                      Comment


                      • #12
                        FYI, if you ever need to convert more complicated/annoying chromosome names, I have a github repo that has some of the more common mappings.

                        Comment


                        • #13
                          Thanks. Will certainly be useful someday!
                          Shaohe

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM
                          • seqadmin
                            Techniques and Challenges in Conservation Genomics
                            by seqadmin



                            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                            Avian Conservation
                            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                            03-08-2024, 10:41 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Yesterday, 06:37 PM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Yesterday, 06:07 PM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-22-2024, 10:03 AM
                          0 responses
                          51 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 03-21-2024, 07:32 AM
                          0 responses
                          67 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X