
  • Help with mapping RNA-seq reads from Illumina HiSeq

    I have been trying to process and analyze several RNA-seq data sets, but am having trouble with the mapping process. The data are from total RNA (not just mRNA) because we are interested in looking at non-coding RNAs in these samples. I've noticed that many of the tools out there, as well as a majority of the published analyses, are biased towards investigation of mRNA levels and/or differential expression.

    I am using the Galaxy platform to process and analyze these data, but a surprisingly low number of the reads are being mapped to my reference genome. For example: I have approximately 48 million reads in one sample, of which only ~500,000 are being mapped with Bowtie or BWA. Looking at the read quality statistics boxplots, only the first 3 bases of the reads have "low" scores, the rest are in the high 30's, with some as high as 41 (these are from Illumina sequencing using the 1.8 version of Casava, so they are back to the original Sanger quality scale).

    I thought that such high quality scores at nearly every position would allow a majority of the reads to be mapped. I trimmed the reads, removing the first 3 bases, and then tried to do the alignments with those two tools. I used their default parameters, which I believe include allowing up to two base mismatches. I am not sure what to look at in the data to determine the cause of the low mapping percentage. I'm using a "custom" reference genome, the most current version of the C. elegans genome (WS231), because it is not provided in Galaxy. I'm looking for suggestions on how to troubleshoot this problem, and possibly some links/references to help me figure out how to alter the default parameters in Bowtie or BWA (if they are causing my problem). As it is, having ~1% of the data mapped doesn't allow me to do any analysis.
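For reference, the trimming step described above amounts to something like the following. This is only a minimal sketch that assumes plain four-line FASTQ records; the function name `trim_fastq_5prime` is made up for illustration and is not a Galaxy tool.

```python
def trim_fastq_5prime(lines, n_bases=3):
    """Yield FASTQ lines with the first n_bases removed from each read
    and from its matching quality string (four-line records assumed)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Within each record, line 2 is the sequence and line 4 the qualities.
        if i % 4 in (1, 3):
            line = line[n_bases:]
        yield line


# Toy record (hypothetical data): three low-quality leading bases.
record = ["@read1", "NNNACGTACGT", "+", "###IIIIIIII"]
for out_line in trim_fastq_5prime(record):
    print(out_line)
```

The sequence and quality lines are trimmed by the same amount so that they stay the same length, which aligners require.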

  • #2
    Hard to say what's wrong. You could start by randomly selecting some reads and BLASTing them, just to make sure you are dealing with C. elegans sequences and not some kind of massive contamination from another organism. A variation on this theme could be "adapter contamination", where you get lots of adapter sequence in the reads; this could happen if you for some reason have sequenced very short fragments.
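One way to do both checks is sketched below: draw a random subset of reads as FASTA for BLAST, and count how many reads contain an adapter sequence. The helper names are hypothetical, four-line FASTQ is assumed, and the default adapter string is the common Illumina TruSeq adapter prefix, which is an assumption about how the library was prepared.

```python
import random


def read_fastq(lines):
    """Yield (name, sequence) pairs from four-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups lines into records.
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip()


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text for pasting into BLAST."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def adapter_fraction(records, adapter="AGATCGGAAGAGC"):
    """Fraction of reads containing the adapter string (assumed TruSeq prefix)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(adapter in seq for _, seq in records) / len(records)


# Toy input (hypothetical); in practice pass an open FASTQ file handle instead.
demo = ["@a", "ACGTACGT", "+", "IIIIIIII",
        "@b", "AGATCGGAAGAGCTT", "+", "IIIIIIIIIIIIIII"]
reads = list(read_fastq(demo))
print(to_fasta(reservoir_sample(reads, 1)))
print(adapter_fraction(reads))
```

A few hundred sampled reads are usually enough to see whether the hits are dominated by another organism.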

    If this check doesn't indicate anything unusual, I would check that the correct quality scale is being used in the BWA and Bowtie settings. Actually this would probably only matter for Bowtie.
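A quick way to check the quality scale is to look at the range of characters in the quality strings. The sketch below is a heuristic only, and the function name is hypothetical: characters below ASCII 59 can only occur with a +33 offset (Sanger / Casava 1.8), and characters above 'J' (ASCII 74) suggest the older +64 offset.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns None when the observed characters fit either encoding.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Too low for any +64-based encoding, so this must be +33.
        return 33
    if highest > 74:
        # Above 'J' (Q41 at +33), so +64 is the likely encoding.
        return 64
    return None


print(guess_phred_offset(["###IIIIJ"]))
print(guess_phred_offset(["hhhhggff"]))
```

If the guess disagrees with what the aligner was told to expect, the quality setting is the first thing to change.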



    • #3
      Did you try mapping to the E. coli genome? A good chunk of the reads should map to it, since C. elegans eat E. coli, but only 1% mapping to C. elegans is odd.



      • #4
        I actually did that just recently and it turned out that about 80% of my sequences are from E. coli. I was told that the bacteria should have been mostly removed during the sample prep, but apparently not. Would it make sense that only 1% of the data maps to C. elegans and over 80% maps to E. coli?



        • #5
          The amount of E. coli contamination of C. elegans libraries varies by user (although 80% is pretty high). If the remaining 20% are derived from C. elegans total RNA, then the bulk (~90%) should be ribosomal RNA. Most aligners report only the uniquely aligned reads (not possible for the ribosomal gene clusters), so you'd expect a low % alignment.
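As a back-of-the-envelope check, here is that arithmetic using the numbers quoted in this thread (48 million reads, about 80% E. coli, roughly 90% of the rest rRNA). It assumes that every read not from E. coli comes from C. elegans, which is a simplification.

```python
total_reads = 48_000_000   # reads in the sample, from the first post
ecoli_pct = 80             # approximate share mapping to E. coli
rrna_pct = 90              # rough rRNA share of C. elegans total RNA

# Reads left over after removing the E. coli fraction.
worm_reads = total_reads * (100 - ecoli_pct) // 100
# Of those, the reads that are not ribosomal RNA.
non_rrna_worm_reads = worm_reads * (100 - rrna_pct) // 100
pct_of_total = 100 * non_rrna_worm_reads / total_reads

print(worm_reads)            # 9600000
print(non_rrna_worm_reads)   # 960000
print(pct_of_total)          # 2.0
```

So under these assumptions only about 2% of all reads are non-ribosomal C. elegans reads, which is the same order of magnitude as the ~1% (about 500,000 reads) that actually mapped.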



          • #6
            Help with mapping RNA-seq reads from Illumina HiSeq - was the ribosomal RNA removed?

            I am assuming that a ribosomal depletion step was performed during your library prep?



            • #7
              Hi,
              I suggest you read the TopHat paper for your questions.

              WT mapping using BWA needs some sort of trimming in order to map a good percentage of reads. Even then you will miss spliced reads and many more, simply because such mappers are not designed for WT mapping. You could certainly map WT reads to a WT reference, but you will miss novel regions as well.
              I use TopHat for mapping and use its output files for further analysis.

              Best,
              Aparna



              • #8
                Aparna, I'm assuming that by wt you mean WHOLE TRANSCRIPTOME?

                I am analyzing some Illumina libraries that appear to have a lot of ribosomal RNA contamination.

                I'm using Bowtie to align the reads only to a specific set of sequences, and because of the differing amount of rRNA contamination in each sample, each of them maps a different percentage of reads to the dataset (some half of what others map), ranging from 1% to 0.3%.

                I wonder if the amount of rRNA contamination in the preparation of a library can have an impact on the apparent expression level of a gene, even though one normalizes its counts against the total number of reads that mapped.

                What's your opinion on this subject?
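To make the question concrete, here is a toy calculation with made-up numbers (not my real data): two samples with the same sequencing depth, where sample B maps half as many reads to the target set as sample A, and the gene of interest gets half the reads too. The function name and all counts are hypothetical.

```python
def counts_per_million(gene_count, denominator):
    """Scale a raw read count to counts per million of the chosen denominator."""
    return 1e6 * gene_count / denominator


# Hypothetical samples: B has more rRNA, so fewer reads hit the target set.
samples = {
    "A": {"total_reads": 10_000_000, "target_mapped": 100_000, "gene": 1_000},
    "B": {"total_reads": 10_000_000, "target_mapped": 50_000, "gene": 500},
}

for name, s in samples.items():
    by_total = counts_per_million(s["gene"], s["total_reads"])
    by_mapped = counts_per_million(s["gene"], s["target_mapped"])
    print(name, by_total, by_mapped)
```

In this toy case the gene looks two-fold lower in B when scaled by all sequenced reads, but identical when scaled by the reads that mapped to the target set, so the choice of denominator matters.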

                Carmen
