SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
mapping strand-specific RNA-seq reads using TopHat caswater Bioinformatics 2 03-17-2015 09:27 AM
Does Strand-specific RNA-seq mean reads from one genome strand? shangzhong0619 RNA Sequencing 4 06-16-2014 01:56 AM
Mixing strand-specific and non strand-specific reads in RNA-seq alexanderxcy Bioinformatics 2 08-27-2013 02:39 AM
Mapping strand-specific data with TopHat Comosus RNA Sequencing 6 08-13-2012 02:25 AM
strand specific RNA-seq mapping qqtwee RNA Sequencing 1 07-10-2012 10:23 PM

Reply
 
Thread Tools
Old 10-31-2015, 06:51 PM   #1
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default Help with RNA-seq and strand specific mapping

I don't have much of a background in bioinformatics so any help that you can give me in figuring this out would be much appreciated.
I have paired end strand specific RNA seq data (sequenced with Illumina) from a non-model organism (Tasmanian Devil if you want to know), and I have a reference genome against which I can map the reads to. What I want to be able to do with this data is look for novel transcripts and non-coding RNAs, and visualise the data using IGV.

This is what I've been doing thus far:

I indexed the ref genome and used trimmomatic to trim the RNA reads, then I used tophat as follows:

tophat2 -r 60 -p 12 -G /path/to/GFF_file/Fgenesh_EST_aligned.gff -o /path/to/output/ --library-type fr-firststrand /path/to/ref_genome/421K17 /path/to/Trimmed/OP1_1_L4.fastq /path/to/Trimmed/OP1_2_L4.fastq

Then I used samtools to sort the accepted hits file and index the sorted file:

samtools sort accepted_hits.bam accepted_hits.sort

samtools index accepted_hits.sort.bam

However when I look at the data on IGV this is what I see (http://imgur.com/mnzm5hn and http://imgur.com/FymviY7)

(settings: view as pairs; color alignments by first of pair strand; sort by mapping quality)

I have also tried using flags with samtools to try and get the strand specific transcripts as follows

samtools view -b -f 99 accepted_hits.bam > acc_hits1.bam
samtools view -b -f 147 accepted_hits.bam > acc_hits2.bam
samtools view -b -f 83 accepted_hits.bam > acc_hits3.bam
samtools view -b -f 163 accepted_hits.bam > acc_hits4.bam

samtools merge file_1.bam acc_hits1.bam acc_hits2.bam
samtools merge file_2.bam acc_hits3.bam acc_hits4.bam

I converted these to the sam format, indexed it and viewed it with IGV and this is what i got (http://imgur.com/6bCWFGq). Essentially file_1.sam and file_2.sam look almost the same, just that one has more reads than the other, but they appear to me mapping to essentially the same place.

Is there something obvious that I have missed, either in the way I have prepared the data, or perhaps the way in which I'm viewing it on IGV?? Any help that you can give me in this would be much appreciated.

Thanks!!

Last edited by weirdkid; 10-31-2015 at 06:52 PM. Reason: images didn't show up
weirdkid is offline   Reply With Quote
Old 11-01-2015, 01:14 AM   #2
Jim Robinson
Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 75
Default

Hi, what protocol was used to prepare the library?
Jim Robinson is offline   Reply With Quote
Old 11-01-2015, 01:32 AM   #3
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

Hi Jim,

do you mean the library for the reference genome? If so this is what I used:

bowtie2-build /path/to/421K17.fa /path/to/421K17/421K17

thanks
weirdkid is offline   Reply With Quote
Old 11-01-2015, 01:48 AM   #4
Jim Robinson
Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 75
Default

No, I mean the library prep for the RNA, e.g. dUTP.
Jim Robinson is offline   Reply With Quote
Old 11-01-2015, 01:17 AM   #5
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

In that case I'm not sure. I got this data from another lab and I don't think they passed that information on to me.
weirdkid is offline   Reply With Quote
Old 11-01-2015, 08:04 AM   #6
Jim Robinson
Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 75
Default

I ask because it doesn't appear to be strand preserving. I would verify that with the lab you received it from.
Jim Robinson is offline   Reply With Quote
Old 11-01-2015, 02:13 PM   #7
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

I'll check this with the lab and come back to you on that then. But what makes you think that it is not strand preserving?
weirdkid is offline   Reply With Quote
Old 11-01-2015, 02:56 PM   #8
Jim Robinson
Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 75
Default

The reads look more or less evenly distributed between strands, when colored by first-in-pair strand. Also, the junctions look more or less even between strands. It just has the look of an un-stranded library, if it is no amount of informatics will recover the strand information. I thought that was the gist of your original post, sorry if I misunderstood.
Jim Robinson is offline   Reply With Quote
Old 11-01-2015, 04:27 PM   #9
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

So I asked and from what I was told the samples were sequenced with Illumina's TrueSeq RNA kit (total RNA) and was strand specific.
weirdkid is offline   Reply With Quote
Old 11-02-2015, 01:27 AM   #10
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

What is the difference between the two subsets in your IGV example?
In terms of strand-preservation Illumina's TruSeq stranded achieves 90 % - 97 %; so if you see in the correct strand-orientation 950 reads and in the wrong orientation 50 your result is acceptable.
For a sample-wide anaylsis, you can use RSeQC's infer_experiment function on the complete data set (accepted_hits.sort).
Michael.Ante is offline   Reply With Quote
Old 11-02-2015, 10:03 PM   #11
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

I've tried using RseQC's infer experiment script, but I'm not sure what it means when it asks for the reference genome in bed format. My reference genome is in fasta format, and converting from fasta to bed doesn't make sense to me...
weirdkid is offline   Reply With Quote
Old 11-03-2015, 12:17 AM   #12
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

RSeQc needs gene/transcript information. Thus, you need to convert your gtf/gff (Fgenesh_EST_aligned.gff) into a 12-column bed-file. Either you script it yourself or use for instance a tool like this.
Michael.Ante is offline   Reply With Quote
Old 11-03-2015, 10:44 PM   #13
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

Finally managed to do it. I couldn't quite get the gff/gtf to convert to bed so I just did it manually.
Anyway, here is what I got:

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.1667
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8333

Therefore the data is strand specific paired end data. But I'm not quite sure how to interpret the differences between the two. If the majority of the reads are explained by ""1+-,1-+,2++,2--": 0.8333", does this mean that I should have analyzed the data in tophat with fr-secondstrand instead of fr-firststrand??
weirdkid is offline   Reply With Quote
Old 11-04-2015, 01:46 AM   #14
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

According to this post, you should use fr-firststrand. When I analysed TruSeq stranded libraries, I yielded the same orientation, but higher percentages. For non-model organism it might be different due to incomplete annotation.
Michael.Ante is offline   Reply With Quote
Old 11-04-2015, 04:20 PM   #15
weirdkid
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10
Default

Hmmm, well I used fr-firststrand so I guess that can't be it.

I'll just keep trying a few different things and see where it gets me.
weirdkid is offline   Reply With Quote
Old 11-05-2015, 01:16 AM   #16
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

What is your read-length distribution? The older versions of TopHat had problems, in case of too high length differences. Additionally, reads below 20 bp are very hard to map; I discard all reads below that length (denote you need to control for the pairing information).
Otherwise, you can try a different aligner (e.g. STAR or bbmap).
Michael.Ante is offline   Reply With Quote
Old 11-24-2015, 04:35 AM   #17
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415
Default

This was troublesome for me as well.

Using the program bamtools helped no end ! Detailed in this post:

https://www.biostars.org/p/88582/

It looks arcane but it's worked for me on thousands of samples. Hope that helps.
colindaven is offline   Reply With Quote
Reply

Tags
illumina, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO