Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq, human genomes and low read counts: am I doing anything wrong?

    This is my basic workflow:

    DATA and ALIGNMENT:

    50SE Illumina reads mapped with tophat2 (v 2.0.11; default setting) to a bowtie2 index of hg38 (all chromosomes plus mitochondria, but no alternate chromosomal assemblies)

    outfile reports as follows:
    Reads:
    Input : 29397870
    Mapped : 28589633 (97.3% of input)
    of these: 5179700 (18.1%) have multiple alignments (612 have >20)
    97.3% overall read mapping rate.
    READ COUNTING

    1) Generated a "corrected" gtf file from UCSC (i.e., geneID and transcript IDs are different) using the method described here.
    2) converted tophat alignment bam to sam (samtools)
    3) counted reads with HTSeq using the general command:
    htseq-count --stranded=no accepted_hits.sam hg38_UCSC_alt.gtf >HTSeq.count
    My HTSeq output reports as follows:

    no_feature 13107880
    ambiguous 5784082
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 25653910
    The result is that I have ~4.5 million unique alignments as counted by HTSeq. That seems low and when I query the bam file,

    samtools view accepted_hits.bam | grep "NH:i:1" -c
    the output reports 25707742 alignments as unique (NH:i:1)

    Can anyone offer a simple explanation of what is happening here? The hand-wavy explanation that I have at the moment is that HTSeq count is very conservative, and if a read cannot be assigned unambiguously, and exclusively to a gene, it is not counted. However, the HTSeq output is secretly making me question the integrity of my pipeline and/or my input files.

    Thank you for any advice or reassurance.

    David

    EDIT: and it occurs to me that it might be relevant that the library prep was RiboMinus. Since I'm used to working with poly-A purified samples, my expectations might be too high in this case.
    Last edited by ExMachina; 07-30-2014, 10:42 AM.

  • #2
    Suresect adapter sequence

    what is sureselect adapter sequences? indexing strand of Sureselect is different from indexing strand of Truseq LT, can anyone give some information?

    Comment


    • #3
      Love the location. You should sign your name the same way.

      I have not done the 'correction' method you have referenced. I normally download the reference and annotation file from igenomes in tophat ready format. Can you post a snippet from your corrected gtf file?

      I believe HTSeq will now output a samfile with a field indicating what the read contributed to , have you looked at that and looked to see where they map?

      -o <samout>, --samout=<samout>
      write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)

      Comment


      • #4
        One other change I just tried was the HTSeq option "--mode=intersection-nonempty "

        I was hoping to get more information, but this option only decreased my "no_feature" count by about 5000. So I'm still left with the question: is it reasonable to have only ~15% of my mapped reads associate with annotated genes?

        Comment


        • #5
          No, that is not reasonable. Can you post a few lines of your corrected gtf?

          Comment


          • #6
            Originally posted by bioBob View Post
            I have not done the 'correction' method you have referenced. I normally download the reference and annotation file from igenomes in tophat ready format. Can you post a snippet from your corrected gtf file?
            Here's a small chunk from chr1. As you can see some genes have corrected formatting but others (generally non-coding, I think) remain uncorrected:

            chr1 knownGene exon 69091 70008 . + . gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
            chr1 knownGene CDS 69091 70005 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
            chr1 knownGene start_codon 69091 69093 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
            chr1 knownGene stop_codon 70006 70008 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
            chr1 knownGene exon 134773 139696 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
            chr1 knownGene CDS 138533 139696 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
            chr1 knownGene exon 139790 139847 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "2"; exon_id "uc021oeg.2.2"; gene_name "B4DF06";
            chr1 knownGene CDS 139790 139792 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "2"; exon_id "uc021oeg.2.2"; gene_name "B4DF06";
            chr1 knownGene exon 140075 140566 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "3"; exon_id "uc021oeg.2.3"; gene_name "B4DF06";
            chr1 knownGene start_codon 139790 139792 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
            chr1 knownGene stop_codon 138530 138532 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
            chr1 knownGene exon 182393 182746 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "1"; exon_id "uc031tlc.1.1";
            chr1 knownGene exon 183132 183240 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "2"; exon_id "uc031tlc.1.2";
            chr1 knownGene exon 183740 184878 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "3"; exon_id "uc031tlc.1.3";
            I'm also wondering if we're jumping the gun on trying to use the newer hg38 assembly and annotation? As you have alluded, there are more refined input files available for hg19/37.

            Currently I am waiting on the imminent ENSEMBL release of 38 to run the same pipeline on an compare results to the UCSC-based results.


            I believe HTSeq will now output a samfile with a field indicating what the read contributed to , have you looked at that and looked to see where they map?

            -o <samout>, --samout=<samout>
            write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)
            Wow, that's a great tip. That's exactly what I need to have to see what's going on. Thanks!

            Comment


            • #7
              If it were me, I would align and count to the previous release just to see what the count assignments look like in terms of the number of reads in no_feature etc.

              Another thing, are you sure the chromosome id's are all identical in the fasta and gtf? In the build I have, the chromosomes are labeled as 1, 2 etc instead of chr1, chr2 etc.

              Comment


              • #8
                Give "featureCounts" a try, while you are at it.

                As bioBio pointed out, check to make sure your chrom ID's match in your ref/BAM/GTF files.

                Comment


                • #9
                  Oh, and make sure the reads are sorted in the bam file. HTSeq now has a flag for which method the sorting was performed, ie name or position.

                  Comment


                  • #10
                    If it were me, I would align and count to the previous release just to see what the count assignments look like in terms of the number of reads in no_feature etc.
                    Good idea and I already tried that
                    Same general results

                    Another thing, are you sure the chromosome id's are all identical in the fasta and gtf? In the build I have, the chromosomes are labeled as 1, 2 etc instead of chr1, chr2 etc.
                    The "1, 2..." IDs are ENSEMBL while UCSC uses "chr1, chr2..."

                    And good point on the sorting bioBob. I checked and all my bam files are sorted ( SO:coordinate)

                    Comment


                    • #11
                      Right, so you need to set that flag as the default is name. You probably had a lot of messages in the console about mate not found.

                      -r <order>, --order=<order>¶
                      For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

                      Comment


                      • #12
                        Originally posted by bioBob View Post
                        Right, so you need to set that flag as the default is name. You probably had a lot of messages in the console about mate not found.

                        -r <order>, --order=<order>¶
                        For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.
                        Interesting. It looks like "-r" and "--order" are not options in my version (0.5.4). I wonder if that's the problem right there?!

                        EDIT: actually, I don't think this is a problem for me here, since all I have are SE reads.
                        Last edited by ExMachina; 07-31-2014, 11:22 AM.

                        Comment


                        • #13
                          Thanks for the input on this. As an update, here's what I've figured out:

                          The UCSC gtf file should not be made with the "knownGenes" table but should instead be made from the "refGene" table--the "knownGenes" table has too many overlapping coordinates.

                          Using the refGene annotations, I now get ~33% of my mapped reads mapping (uniquely) to known genes. I feel a little more confident with this result, especially given the RiboMinus library prep.

                          More comments are always welcome and I will report back here once the new ENSEMBL annotation is released.

                          Thanks for the help!

                          -David

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          18 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          17 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          49 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X