![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
mapping strand-specific RNA-seq reads using TopHat | caswater | Bioinformatics | 2 | 03-17-2015 09:27 AM |
Does Strand-specific RNA-seq mean reads from one genome strand? | shangzhong0619 | RNA Sequencing | 4 | 06-16-2014 01:56 AM |
Mixing strand-specific and non strand-specific reads in RNA-seq | alexanderxcy | Bioinformatics | 2 | 08-27-2013 02:39 AM |
Mapping strand-specific data with TopHat | Comosus | RNA Sequencing | 6 | 08-13-2012 02:25 AM |
strand specific RNA-seq mapping | qqtwee | RNA Sequencing | 1 | 07-10-2012 10:23 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
I don't have much of a background in bioinformatics so any help that you can give me in figuring this out would be much appreciated.
I have paired end strand specific RNA seq data (sequenced with Illumina) from a non-model organism (Tasmanian Devil if you want to know), and I have a reference genome against which I can map the reads to. What I want to be able to do with this data is look for novel transcripts and non-coding RNAs, and visualise the data using IGV. This is what I've been doing thus far: I indexed the ref genome and used trimmomatic to trim the RNA reads, then I used tophat as follows: tophat2 -r 60 -p 12 -G /path/to/GFF_file/Fgenesh_EST_aligned.gff -o /path/to/output/ --library-type fr-firststrand /path/to/ref_genome/421K17 /path/to/Trimmed/OP1_1_L4.fastq /path/to/Trimmed/OP1_2_L4.fastq Then I used samtools to sort the accepted hits file and index the sorted file: samtools sort accepted_hits.bam accepted_hits.sort samtools index accepted_hits.sort.bam However when I look at the data on IGV this is what I see (http://imgur.com/mnzm5hn and http://imgur.com/FymviY7) (settings: view as pairs; color alignments by first of pair strand; sort by mapping quality) I have also tried using flags with samtools to try and get the strand specific transcripts as follows samtools view -b -f 99 accepted_hits.bam > acc_hits1.bam samtools view -b -f 147 accepted_hits.bam > acc_hits2.bam samtools view -b -f 83 accepted_hits.bam > acc_hits3.bam samtools view -b -f 163 accepted_hits.bam > acc_hits4.bam samtools merge file_1.bam acc_hits1.bam acc_hits2.bam samtools merge file_2.bam acc_hits3.bam acc_hits4.bam I converted these to the sam format, indexed it and viewed it with IGV and this is what i got (http://imgur.com/6bCWFGq). Essentially file_1.sam and file_2.sam look almost the same, just that one has more reads than the other, but they appear to me mapping to essentially the same place. Is there something obvious that I have missed, either in the way I have prepared the data, or perhaps the way in which I'm viewing it on IGV?? Any help that you can give me in this would be much appreciated. Thanks!! Last edited by weirdkid; 10-31-2015 at 06:52 PM. Reason: images didn't show up |
![]() |
![]() |
![]() |
#2 |
Member
Location: Boston, MA Join Date: May 2009
Posts: 75
|
![]()
Hi, what protocol was used to prepare the library?
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
Hi Jim,
do you mean the library for the reference genome? If so this is what I used: bowtie2-build /path/to/421K17.fa /path/to/421K17/421K17 thanks |
![]() |
![]() |
![]() |
#4 |
Member
Location: Boston, MA Join Date: May 2009
Posts: 75
|
![]()
No, I mean the library prep for the RNA, e.g. dUTP.
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
In that case I'm not sure. I got this data from another lab and I don't think they passed that information on to me.
|
![]() |
![]() |
![]() |
#6 |
Member
Location: Boston, MA Join Date: May 2009
Posts: 75
|
![]()
I ask because it doesn't appear to be strand preserving. I would verify that with the lab you received it from.
|
![]() |
![]() |
![]() |
#7 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
I'll check this with the lab and come back to you on that then. But what makes you think that it is not strand preserving?
|
![]() |
![]() |
![]() |
#8 |
Member
Location: Boston, MA Join Date: May 2009
Posts: 75
|
![]()
The reads look more or less evenly distributed between strands, when colored by first-in-pair strand. Also, the junctions look more or less even between strands. It just has the look of an un-stranded library, if it is no amount of informatics will recover the strand information. I thought that was the gist of your original post, sorry if I misunderstood.
|
![]() |
![]() |
![]() |
#9 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
So I asked and from what I was told the samples were sequenced with Illumina's TrueSeq RNA kit (total RNA) and was strand specific.
|
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Vienna Join Date: Oct 2011
Posts: 123
|
![]()
What is the difference between the two subsets in your IGV example?
In terms of strand-preservation Illumina's TruSeq stranded achieves 90 % - 97 %; so if you see in the correct strand-orientation 950 reads and in the wrong orientation 50 your result is acceptable. For a sample-wide anaylsis, you can use RSeQC's infer_experiment function on the complete data set (accepted_hits.sort). |
![]() |
![]() |
![]() |
#11 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
I've tried using RseQC's infer experiment script, but I'm not sure what it means when it asks for the reference genome in bed format. My reference genome is in fasta format, and converting from fasta to bed doesn't make sense to me...
|
![]() |
![]() |
![]() |
#13 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
Finally managed to do it. I couldn't quite get the gff/gtf to convert to bed so I just did it manually.
Anyway, here is what I got: This is PairEnd Data Fraction of reads failed to determine: 0.0000 Fraction of reads explained by "1++,1--,2+-,2-+": 0.1667 Fraction of reads explained by "1+-,1-+,2++,2--": 0.8333 Therefore the data is strand specific paired end data. But I'm not quite sure how to interpret the differences between the two. If the majority of the reads are explained by ""1+-,1-+,2++,2--": 0.8333", does this mean that I should have analyzed the data in tophat with fr-secondstrand instead of fr-firststrand?? |
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: Vienna Join Date: Oct 2011
Posts: 123
|
![]()
According to this post, you should use fr-firststrand. When I analysed TruSeq stranded libraries, I yielded the same orientation, but higher percentages. For non-model organism it might be different due to incomplete annotation.
|
![]() |
![]() |
![]() |
#15 |
Member
Location: Earth Join Date: Jun 2013
Posts: 10
|
![]()
Hmmm, well I used fr-firststrand so I guess that can't be it.
I'll just keep trying a few different things and see where it gets me. |
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: Vienna Join Date: Oct 2011
Posts: 123
|
![]()
What is your read-length distribution? The older versions of TopHat had problems, in case of too high length differences. Additionally, reads below 20 bp are very hard to map; I discard all reads below that length (denote you need to control for the pairing information).
Otherwise, you can try a different aligner (e.g. STAR or bbmap). |
![]() |
![]() |
![]() |
#17 |
Senior Member
Location: Germany Join Date: Oct 2008
Posts: 415
|
![]()
This was troublesome for me as well.
Using the program bamtools helped no end ! Detailed in this post: https://www.biostars.org/p/88582/ It looks arcane but it's worked for me on thousands of samples. Hope that helps. |
![]() |
![]() |
![]() |
Tags |
illumina, rna-seq |
Thread Tools | |
|
|