Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Inconsistent results between mapping and read summarization

    Hi everyone =) I present you a question that has been bothering me for a while now and, unfortunately, none of my colleagues have been able to help me.

    I've finally got my biological replicates data back from the sequencing platforms so I can move into the DE analysis and make some stats and fancy graphs to show my boss =) but I'm having some problems right at the start of the pipeline.

    I'll put you into context regarding my data: I'm sequencing little amounts of RNA, therefore no ribodepletion or polyA pull-down was possible. The whole transcriptome is being sequenced (kind of this single-cell sequencing using SPIA sequencing). The quality of the sequences is top-notch (no base with score under 40 in the fastqc report). I can post you the fastqc reports if you wanna look at them. Three biological replicates from two different zones of a single tissue. The project aims to detect the DE genes between those two zones of the tissue. I'm working with sheep.

    I've mapped my reads using STAR with Ensembl sheep genome fasta and GTF annotation files, and this program told me that, on average, around 60% were uniquely mapped reads, around 38% were multimappers and around 2% were deemed as unmapped for all my files. So far so good 'cause I think they are expected percentages.

    The problem comes when I used featureCounts, because when I read summarization it tells me that, on average, only around 10% of my reads were successfully assigned to a feature for all my files.

    I've been thinking for a while, but to me it doesn't make any sense. If the sequence could not be properly mapped to any feature, they would have been deemed as unmapped by STAR in the first place.

    I have no interest at the moment at multimappers, so I left them out in featureCounts, but anyway I was expecting to have more or less the same percentages than the STAR output, because If the reads were successfully mapped with STAR it means they recognize the feature inside the GTF file so, how can featureCounts not assign it to the feature if they are using the same reference files?

    These are the codes used for:

    Creating the genomic index:

    module load STAR

    STAR --runThreadN 16 \
    --runMode genomeGenerate \
    --genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
    --genomeFastaFiles ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_ensembl.fa \
    --sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
    --sjdbOverhang 49


    Mapping:

    module load STAR

    fqFiles=`find $1 -name '*.gz' -type f`

    for fqFile in $fqFiles;do
    STAR --runThreadN 16 \
    --genomeDir ~/RNASeqSQ/GenomeDir_ensembl/ \
    --readFilesIn $fqFile \
    --readFilesCommand zcat \
    --outFileNamePrefix ~/RNASeqSQ/mapped_data_ensembl/$(basename $fqFile .fastq.gz)_ensembl_ \
    --outSAMstrandField intronMotif \
    --sjdbGTFfile ~/RNASeqSQ/ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outFilterIntronMotifs RemoveNoncanonical
    done


    Reading summarization:

    module load subread

    featureCounts -a ../ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf \
    -F GTF \
    -o GvS_counts_full_ensembl.txt \
    *_Aligned.out.bam


    Another colleague of mine told me that I should go around this by making a de novo assembly, but, wouldn't that be too biased given that my data comes only from one tissue (brain) and only from one cell type from that tissue (cortical neurons). I'm trying to address if the cortical neurons transcriptome profiles from one side of the brain differs from another side of it.

    Thanks for your support =)

  • #2
    Hi Sebastian,

    You should have a look at the rRNA content. Download just the sheep's rRNA sequences from NCBI as fasta, create a bowtie2 index, and map the reads with bowtie2 against them. The mapping rate is equal to the rRNA content in your sample. The remaining reads can be used for your STAR mapping.

    Cheers,

    Michael

    Comment


    • #3
      What featureCounts does is it counts the number of reads overlapping with features included in the annotation. Reads that were reported as mapped by aligner but not overlapping any features will not be counted by featureCounts. For an RNA-seq dataset, typically you will see around 30-50 percent of reads that were mapped but not counted.

      featureCounts outputs a counting summary file (called "GvS_counts_full_ensembl.txt.summary" for your data), which should be helpful for you to understand why some reads were not counted.

      Comment


      • #4
        Originally posted by Michael.Ante View Post
        Hi Sebastian,

        You should have a look at the rRNA content. Download just the sheep's rRNA sequences from NCBI as fasta, create a bowtie2 index, and map the reads with bowtie2 against them. The mapping rate is equal to the rRNA content in your sample. The remaining reads can be used for your STAR mapping.

        Cheers,

        Michael
        Thanks Michael, I'll try that and see what I can get.

        Originally posted by shi View Post
        What featureCounts does is it counts the number of reads overlapping with features included in the annotation. Reads that were reported as mapped by aligner but not overlapping any features will not be counted by featureCounts. For an RNA-seq dataset, typically you will see around 30-50 percent of reads that were mapped but not counted.

        featureCounts outputs a counting summary file (called "GvS_counts_full_ensembl.txt.summary" for your data), which should be helpful for you to understand why some reads were not counted.
        I just checked that file and it tells me the following stats:

        Status v |Sample > 1251.1G 1251.1S 1251.2G 1251.2S 658.1G 658.1S
        1 Assigned 21298779 19203813 23927215 20277649 21917507 27799505
        2 Unassigned_Ambiguity 231677 212367 246618 210001 292053 256487
        3 Unassigned_MultiMapping 146004332 162822044 159017518 147923500 116264627 154408557
        4 Unassigned_NoFeatures 33888112 33895039 35942035 29565235 24630110 39141330
        5 Unassigned_Unmapped 6568710 4145532 5041164 5343282 37451751 4849700
        6 Unassigned_MappingQuality 0 0 0 0 0 0
        7 Unassigned_FragmentLength 0 0 0 0 0 0
        8 Unassigned_Chimera 0 0 0 0 0 0
        9 Unassigned_Secondary 0 0 0 0 0 0
        10 Unassigned_Nonjunction 0 0 0 0 0 0
        11 Unassigned_Duplicate 0 0 0 0 0 0


        So most of my reads were assigned as multimappers, whereas STAR did not count them as such =/.

        Thanks a lot for your help.

        Do you think the de novo assembly pipeline would be useful to go around this problem as a backup plan?
        Last edited by Sebastian_Quezada_R; 08-23-2016, 09:01 PM.

        Comment


        • #5
          So most of my reads were assigned as multimappers, whereas STAR did not count them as such =/.
          featureCounts uses 'NH' tag to identify multi-mapping reads. So you have a lot of reads with this tag having a value greater than 1 in your mapping result.

          Comment


          • #6
            Originally posted by shi View Post
            featureCounts uses 'NH' tag to identify multi-mapping reads. So you have a lot of reads with this tag having a value greater than 1 in your mapping result.
            Thanks, Shi. I finally got it properly. I'm going to start another thread regarding the issues I'm having with my data, so I can see if I can get some insight that my bioinformatician colleagues might have missed.

            Thanks again =)

            Comment


            • #7
              Hi Sebastian,
              the latest versions of STAR report the NH field. In older versions you should be able to report the number of hits with an according parameters setting.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              51 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X