Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combined mapping of RNA-Seq reads originating from multiple species

    Dear SEQanswers community,

    I am working on Illumina RNA-Seq data from human samples. While it is quite clear to me how to map the reads against the human reference under the consideration of splice junctions with tools such as Tophat or MapSplice, I have an additional requirement that makes things more inteersting.

    In my data, the sampled tissue is infected by one or many different viral species that may induce the expression of viral genes. I now would like to identify these expressed viral transcripts together with the human transcripts. The idea is that viral transcripts can be detected by finding all reads that cannot be mapped to the human reference and subsequently re-mapping these reads against a collection of known viral genomes (perhaps with BLAST or even Smith-Waterman since viral genomes tend to be short and highly variant).

    However, this strikes me as inelegant and it also requires post-processing of the main mapping to identify unaligned reads. Therefore my question is: are there is any mapping tools geared towards RNA-Seq that allows multiple reference genomes and map each read to the most likely genome while also considering splice junctions? This would be a special case of RNA-Seq for metagenomics data with one large mammal reference genome and many very small reference genomes. Bonus points if the mapper can deal with a higher number of alignment errors (viral transcripts can be quite variant). Thanks a bunch for any leads.

  • #2
    MapSplice outputs unaligned reads to the output sam file, the unaligned reads' cigar tag is '*'. The unaligned reads can be converted to fastq or fasta format easily.

    To allow more mismatch, you can run mapsplice with --map-segments-directly option on and --full-running option off, segment mismatch -E 3(because bowtie allow 3 mismatches at most), splice mismatch -m is not restricted and can be set to a higher value. For example, if the read length is 100bp, segment length is 25bp, -E 3 and -m 3, it can allow 12 mismatches in an alignment

    Comment


    • #3
      Thanks greenshell, the ability of junction-aware mappers like MapSplice to allow for more mismatches is indeed useful. After some researching I think there is no golden bullet for this problem yet. I am looking for viral metagenomics approaches for read mapping, although it is doubtful I find anything useful since these are very much de-novo oriented.

      In the end I will most probably either append the viral genomes as artificial chromosomes to the human reference and keep ambiguously aligned reads, or I extract all unmapped reads as you proposed and map them against all viral genomes via BLAT with short seeds to allow for viral variance. In the latter case I would have to transform the per-genome BLAT hits back into a format that Cufflinks or DeSeq can utilize for calculation of the expression values, but that should be doable.

      Comment


      • #4
        Does MapSplice automatically report unaligned reads to the output? I didn't see any options about the unmapped reads?

        Also, can MapSplice run for the whole genome per time, not just one chromosome per time?


        Originally posted by greenshell View Post
        MapSplice outputs unaligned reads to the output sam file, the unaligned reads' cigar tag is '*'. The unaligned reads can be converted to fastq or fasta format easily.

        To allow more mismatch, you can run mapsplice with --map-segments-directly option on and --full-running option off, segment mismatch -E 3(because bowtie allow 3 mismatches at most), splice mismatch -m is not restricted and can be set to a higher value. For example, if the read length is 100bp, segment length is 25bp, -E 3 and -m 3, it can allow 12 mismatches in an alignment

        Comment


        • #5
          Originally posted by zchou View Post
          Does MapSplice automatically report unaligned reads to the output? I didn't see any options about the unmapped reads?

          Yes MapSplice automatically report unaligned reads to the output

          Also, can MapSplice run for the whole genome per time, not just one chromosome per time?
          MapSplice takes a directory of chromosome files as input. It can run multiple chromosomes per time.

          In the internal step, MapSplice processes chromosome files one by one

          Comment


          • #6
            The best would be to avoid this two-steps processing and simultaneously align on all genomes, keeping for each read the best hit only. What if you put human + viral chromosomes in the same directory before aligning? Too big? It doesn't allow for setting differential alignment parameters depending on the chromosome though.

            Comment


            • #7
              Alright, thanks a lot for the ongoing discussion. We decided to map in parallel against the human and viral genomes with MapSplice and will compare the results against the serial mapping approach described in the OP. However, I still have a couple of questions regarding the use of MapSplice. Perhaps greenshell would be kind enough to answer them.

              First, we have 2x36bp Illumina reads as primary sequence source. Consequently, we plan to use the MapPER extension of MapSplice to incoprorate the linkage information of the paired reads. However, it is still unclear to me which MapSplice fragment size to choose for reads as short as 36bp - is 25bp fragment size still recommended? In this way MapSplice will not find splice junctions within one of the paired end reads, but perhaps this is the only way to go since shorter fragments would map ambiguously to the reference. Also, MapPER requires the specification of mean and SD of the insert size (or library fragment size, I am not sure right now). While I can obtain the mean insert size from the library generation protocol, I am not sure how to generate a good estimate for the SD of the insert size distribution without doing the mapping first. Would this mean that I should map twice and use the first run to estimate the SD?

              Second, MapSplice allows to specify two .sff files to define genomic regions that should be either avoided or preferred for the mapping. My question is, how the precedence of the regions specified in these files is, and how this interplays with soft-masked (lower-cased) regions in the reference sequence. Does MapSplice consider the soft-masked regions in the reference sequence automatically, and have preferred regions contained in avoided regions precedence or vice versa?

              Comment


              • #8
                Please use 18bp fragment size(segment length) for 36bp reads

                MapPER: The SD reflects the quality of the data, or the deviation from the mean distance (library fragment size - 2x36bp in this case). But the mean is the most important. You may simply use a loose SD here, for example, about 1/3 of the mean.

                Lower-cased regions and upper-cases regions are treated in the same way. The two gff files don't interplay with the soft-masked regions.
                Bowtie mapped candidates are filtered by the two gff files at beginning, in the following steps, MapSplice won't try to find alignments not in the regions in the two gff files. So the regions specified in these files have highest precedence

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 03-27-2024, 06:37 PM
                0 responses
                12 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-27-2024, 06:07 PM
                0 responses
                11 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                68 views
                0 likes
                Last Post seqadmin  
                Working...
                X