SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
some odd out put in combined GTF RNA-seq Cufflink honey Bioinformatics 0 07-30-2011 08:37 AM
Can BLAST do the RNA-seq reads mapping efficiently? nivea Bioinformatics 10 05-02-2011 08:33 AM
RNA-Seq: MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Newsbot! Literature Watch 2 10-14-2010 08:35 AM
PubMed: A combined massively parallel sequencing - indicator species approach reveale Newsbot! Literature Watch 0 08-24-2010 02:40 AM
RNA-seq reads mapping to coding regions m!x RNA Sequencing 0 02-17-2010 12:04 PM

Reply
 
Thread Tools
Old 10-02-2010, 05:16 AM   #1
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default Combined mapping of RNA-Seq reads originating from multiple species

Dear SEQanswers community,

I am working on Illumina RNA-Seq data from human samples. While it is quite clear to me how to map the reads against the human reference under the consideration of splice junctions with tools such as Tophat or MapSplice, I have an additional requirement that makes things more inteersting.

In my data, the sampled tissue is infected by one or many different viral species that may induce the expression of viral genes. I now would like to identify these expressed viral transcripts together with the human transcripts. The idea is that viral transcripts can be detected by finding all reads that cannot be mapped to the human reference and subsequently re-mapping these reads against a collection of known viral genomes (perhaps with BLAST or even Smith-Waterman since viral genomes tend to be short and highly variant).

However, this strikes me as inelegant and it also requires post-processing of the main mapping to identify unaligned reads. Therefore my question is: are there is any mapping tools geared towards RNA-Seq that allows multiple reference genomes and map each read to the most likely genome while also considering splice junctions? This would be a special case of RNA-Seq for metagenomics data with one large mammal reference genome and many very small reference genomes. Bonus points if the mapper can deal with a higher number of alignment errors (viral transcripts can be quite variant). Thanks a bunch for any leads.
schelhorn is offline   Reply With Quote
Old 10-05-2010, 12:23 PM   #2
greenshell
Member
 
Location: Lexington

Join Date: Aug 2010
Posts: 20
Default

MapSplice outputs unaligned reads to the output sam file, the unaligned reads' cigar tag is '*'. The unaligned reads can be converted to fastq or fasta format easily.

To allow more mismatch, you can run mapsplice with --map-segments-directly option on and --full-running option off, segment mismatch -E 3(because bowtie allow 3 mismatches at most), splice mismatch -m is not restricted and can be set to a higher value. For example, if the read length is 100bp, segment length is 25bp, -E 3 and -m 3, it can allow 12 mismatches in an alignment
greenshell is offline   Reply With Quote
Old 10-06-2010, 04:27 AM   #3
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default

Thanks greenshell, the ability of junction-aware mappers like MapSplice to allow for more mismatches is indeed useful. After some researching I think there is no golden bullet for this problem yet. I am looking for viral metagenomics approaches for read mapping, although it is doubtful I find anything useful since these are very much de-novo oriented.

In the end I will most probably either append the viral genomes as artificial chromosomes to the human reference and keep ambiguously aligned reads, or I extract all unmapped reads as you proposed and map them against all viral genomes via BLAT with short seeds to allow for viral variance. In the latter case I would have to transform the per-genome BLAT hits back into a format that Cufflinks or DeSeq can utilize for calculation of the expression values, but that should be doable.
schelhorn is offline   Reply With Quote
Old 10-22-2010, 10:53 AM   #4
zchou
Member
 
Location: detroit, US

Join Date: May 2009
Posts: 13
Default

Does MapSplice automatically report unaligned reads to the output? I didn't see any options about the unmapped reads?

Also, can MapSplice run for the whole genome per time, not just one chromosome per time?


Quote:
Originally Posted by greenshell View Post
MapSplice outputs unaligned reads to the output sam file, the unaligned reads' cigar tag is '*'. The unaligned reads can be converted to fastq or fasta format easily.

To allow more mismatch, you can run mapsplice with --map-segments-directly option on and --full-running option off, segment mismatch -E 3(because bowtie allow 3 mismatches at most), splice mismatch -m is not restricted and can be set to a higher value. For example, if the read length is 100bp, segment length is 25bp, -E 3 and -m 3, it can allow 12 mismatches in an alignment
zchou is offline   Reply With Quote
Old 10-22-2010, 11:39 AM   #5
greenshell
Member
 
Location: Lexington

Join Date: Aug 2010
Posts: 20
Default

Quote:
Originally Posted by zchou View Post
Does MapSplice automatically report unaligned reads to the output? I didn't see any options about the unmapped reads?

Yes MapSplice automatically report unaligned reads to the output

Also, can MapSplice run for the whole genome per time, not just one chromosome per time?
MapSplice takes a directory of chromosome files as input. It can run multiple chromosomes per time.

In the internal step, MapSplice processes chromosome files one by one
greenshell is offline   Reply With Quote
Old 10-23-2010, 05:46 AM   #6
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

The best would be to avoid this two-steps processing and simultaneously align on all genomes, keeping for each read the best hit only. What if you put human + viral chromosomes in the same directory before aligning? Too big? It doesn't allow for setting differential alignment parameters depending on the chromosome though.
steven is offline   Reply With Quote
Old 11-05-2010, 07:01 AM   #7
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default

Alright, thanks a lot for the ongoing discussion. We decided to map in parallel against the human and viral genomes with MapSplice and will compare the results against the serial mapping approach described in the OP. However, I still have a couple of questions regarding the use of MapSplice. Perhaps greenshell would be kind enough to answer them.

First, we have 2x36bp Illumina reads as primary sequence source. Consequently, we plan to use the MapPER extension of MapSplice to incoprorate the linkage information of the paired reads. However, it is still unclear to me which MapSplice fragment size to choose for reads as short as 36bp - is 25bp fragment size still recommended? In this way MapSplice will not find splice junctions within one of the paired end reads, but perhaps this is the only way to go since shorter fragments would map ambiguously to the reference. Also, MapPER requires the specification of mean and SD of the insert size (or library fragment size, I am not sure right now). While I can obtain the mean insert size from the library generation protocol, I am not sure how to generate a good estimate for the SD of the insert size distribution without doing the mapping first. Would this mean that I should map twice and use the first run to estimate the SD?

Second, MapSplice allows to specify two .sff files to define genomic regions that should be either avoided or preferred for the mapping. My question is, how the precedence of the regions specified in these files is, and how this interplays with soft-masked (lower-cased) regions in the reference sequence. Does MapSplice consider the soft-masked regions in the reference sequence automatically, and have preferred regions contained in avoided regions precedence or vice versa?
schelhorn is offline   Reply With Quote
Old 11-05-2010, 08:55 AM   #8
greenshell
Member
 
Location: Lexington

Join Date: Aug 2010
Posts: 20
Default

Please use 18bp fragment size(segment length) for 36bp reads

MapPER: The SD reflects the quality of the data, or the deviation from the mean distance (library fragment size - 2x36bp in this case). But the mean is the most important. You may simply use a loose SD here, for example, about 1/3 of the mean.

Lower-cased regions and upper-cases regions are treated in the same way. The two gff files don't interplay with the soft-masked regions.
Bowtie mapped candidates are filtered by the two gff files at beginning, in the following steps, MapSplice won't try to find alignments not in the regions in the two gff files. So the regions specified in these files have highest precedence
greenshell is offline   Reply With Quote
Reply

Tags
human, mapping, rna-seq, splice-junction, virus

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:52 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO