Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Zero counts in htseq-count output

    I am pretty desperate, so if anyone has any advice, I will be forever grateful.

    Basically, I am using the following command:

    htseq-count -s no -a 0 FourA.sam hg19.gtf > FourA.count

    and getting zero counts in the output. I have tried -s as yes/no (I am almost sure it should be "yes" based off the site I got my files), and -a between 0 and 10. No matter what parameters, I get zero counts.

    This is for a quick project, so I am not going for accuracy, I just want any results (of any quality)!

    Here is the background of my data
    1) Downloaded .sra file.
    2) Used fastq-dump to convert .sra file to .fastq file
    3) Only kept 1/10 sequences from .fastq file (it was too large).
    4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
    5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
    6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file to get the SAM file

    I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

    Any advice is greatly appreciated. I don't think I have time to rerun/remap. If there is any faster solution for where I went wrong in this pipeline, I would need to do that. I just need something to present, regardless of the quality, and will admit that to my class!

    (Also, I am getting an warning "Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set", although I read in other forums that is usually ignorable).
    Last edited by lanner; 05-04-2014, 07:16 PM. Reason: To add detail of the warning

  • #2
    This is head output of some of my files:

    hg19.gtf:

    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";


    mrna.fasta:

    >AF001540 1
    ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
    ctgatcccgctgctattagaatgcattgtgaaacgactggagtatgatta
    aaagttgtgttccccaatgcttggagtagtgattgttgaaggaaaaaatc
    cagctgagtgataaggctgagtgttgaggaaatttctgcagttttaagca
    gtcgtatttgtgattgaagctgagtacatttgctggtgtatttttaggta
    aaatgcttttttgttcatttctgggtggtgggaggggactgaagccttta
    gtcttttccagatgcaaccttaaaatcagtgacaagaaacattccaaaca
    agcaacagtcttcaagaaattaaactggcaagtggaaatgtttaaacagt
    tcagtgatctttagtgcattgtttatgtgtgggtttctctctcccctccc


    FourA.sam:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    FourA.count (head and tail)

    head:
    NM_000014 0
    NM_000015 0
    NM_000016 0
    NM_000017 0
    NM_000018 0
    NM_000019 0
    NM_000020 0
    NM_000021 0
    NM_000022 0
    NM_000023 0

    tail:
    NR_046431 0
    NR_046432 0
    NR_046433 0
    NR_046435 0
    NR_046436 0
    no_feature 257
    ambiguous 0
    too_low_aQual 0
    not_aligned 6366159
    alignment_not_unique 0

    Comment


    • #3
      Since you mapped against the transcriptome, htseq-count (or any tool for that matter) won't be able to give you counts. The reason for this is that these tools look at the coordinates in the SAM file (e.g., "AF001540:100-300") and then the GTF file and try to see if they overlap (and if the read overlaps multiple things). Since AF001540 (or any other transcript ID) doesn't exist as a chromosome or contig in your annotation file, this will never work.

      You have two possibilities for moving forward. (1) Map to the genome with STAR or tophat2 or a similar tool. The resulting alignments will work fine with htseq-counts (or featureCounts, which is faster). (2) Realize that the counts are equivalent to the number of reads mapping to each contig/transcript and that you can get this information with "samtools idxstats sorted.and.indexed.alignments.bam". Option (2) is the less ideal of the two, since you can't filter by MAPQ (well you can, but you need to just create a new BAM file to do so) and even if you could it would be misleading since you'd have reads that look like they map to multiple places but actually don't since it's likely you have many transcripts that overlap. I would strongly encourage you to just remap things.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      30 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      32 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      28 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      53 views
      0 likes
      Last Post seqadmin  
      Working...
      X