SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA-seq for Variant calling runnerBio88 RNA Sequencing 3 03-20-2016 10:34 PM
Analysis of RNA-seq data without replicate. Sora Yoon RNA Sequencing 3 04-22-2015 06:30 PM
RNA-Seq Variant Calling Softwares whijae Bioinformatics 2 12-07-2012 11:35 AM
how to Use RNA-seq data of samples in different condition without replicate zinky Bioinformatics 7 12-04-2012 03:58 AM

Reply
 
Thread Tools
Old 02-07-2016, 10:06 PM   #1
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default RNA-seq variant calling and merging replicate data

I recently started to do a bit of RNA-seq variant calling (using the GATK Best Practices pipeline), but I'm wondering as to how I should handle replicate data. If I have three replicates for a sample, for example, should I somehow merge that data to get better quality variant calls, and if so, at what level should the merging be done?

I can imagine merging the FASTQ files before alignment, for example, but I'm not sure if that would adversely affect the variant calling step in the end. Or should merging be done after alignment, on the resulting BAM files? I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... Or at some other level?

I imagine that merging the data at some level would increase the statistical power of the variant calls because of the added depth of additional data from several replicates, but maybe that's wrong? Is merging of data for variant calling using RNA-seq data something that could or should be done, and if so, at what level?
ErikFas is offline   Reply With Quote
Old 02-08-2016, 12:12 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.
dpryan is offline   Reply With Quote
Old 02-08-2016, 01:43 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

People rarely do variant calling on RNAseq data, since it's somewhat problematic (e.g., your calls will be wrong in regions of allele-specific expression). That's why tophat2 doesn't have a convenient read group option. If STAR or hisat(2) do (I suspect STAR does, it has a million options) then I'd just use one of those...they're much faster anyway. Aside from that, yeah, picard or samtools (I think 1.3 added an "addreplacerg" command) are the typical tools.
dpryan is offline   Reply With Quote
Old 02-08-2016, 02:25 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Tophat2 is so slow that it's faster to modify a pipeline for STAR than to run the pipeline once. Having said that, I question the usefulness of a lot of these filtering steps that you posted. Just some examples:

Code:
samtools view -h -F 1024 $f | awk 'substr($0,1,1)=="@"||match($0, /NH:i:1\t/) {print $0}' | samtools view -hSb - > ${fname}_unique.bam
The alignments have MAPQ scores, so:
Code:
 samtools view -b -F 1024 -q 5 $f > ${fname}_unique.bam
That'll save you some significant time (tophat2/STAR/etc. give MAPQ <=3 for multimappers the last time I looked). BTW, at least the haplotype caller already ignores secondary alignments, so you really don't need to exclude them.

I would also question the utility of writing the names of reads that overlap a repetitive element to a file (please don't tell me that there's some sort of grepping going on with those later on!). If you really want those excluded then just directly intersect the BAM file and write a BAM file as output. Having said that, if the alignments have a reasonable MAPQ then there's no reason to exclude them.

So, in short, if the additional steps that your mentor has been doing might not work properly or would be slow, then start by thinking about whether those steps are even that useful.
dpryan is offline   Reply With Quote
Old 02-08-2016, 02:55 AM   #5
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Quote:
Originally Posted by dpryan View Post
Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.
Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?
ErikFas is offline   Reply With Quote
Old 02-08-2016, 03:02 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by ErikFas View Post
Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?
Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.
dpryan is offline   Reply With Quote
Old 02-08-2016, 03:07 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by umn_bist View Post
I am fairly new but I'll do my best to explain. Our project involves specifically looking at repetitive elements and direct binding between repetitive elements and p53 genes. I think what's going on here (based on your observations as I have no idea what the original code was doing) is that she is actually only interested in the overlaps (?)

So after that coding segment, we normalize and group by element/family/class and then...
Code:
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1] = k[$1] ? k[$1] FS $3 FS $5 FS $6 : $3 FS $5 FS $6}'  *_class.norm | sort -k 1,1 > all_class.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2] = k[$1OFS$2] ? k[$1OFS$2] FS $4 FS $6 FS $7 : $4 FS $6 FS $7}'  *_family.norm | sort -k 1,1 > all_family.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3] = k[$1OFS$2OFS$3] ? k[$1OFS$2OFS$3] FS $5 FS $7 FS $8 : $5 FS $7 FS $8}'  *_subfamily.norm | sort -k 1,1 > all_subfamily.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3OFS$4] = k[$1OFS$2OFS$3OFS$4] ? k[$1OFS$2OFS$3OFS$4] FS $6 FS $8 FS $9 : $6 FS $8 FS $9}'  *_element.norm | sort -k 1,1 > all_element.norm;
Might I suggest replacing everything with a single python script? If you use pysam and pybedtools then (A) the performance will be decent and (B) it'd be a heck of a lot easier to debug and update
dpryan is offline   Reply With Quote
Old 02-08-2016, 03:11 AM   #8
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Quote:
Originally Posted by dpryan View Post
Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.
I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?
ErikFas is offline   Reply With Quote
Old 02-08-2016, 03:14 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by ErikFas View Post
I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?
Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.
dpryan is offline   Reply With Quote
Old 02-08-2016, 03:34 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by umn_bist View Post
Thank you for the suggestion! Any recommendation is very much appreciated as my first 200+ hrs with NGS has been, for the most part, debugging.
There comes a time when you need to sit down with whoever did things previously and have the, "right, so tell me what you're trying to accomplish with this so I can make a more polished/maintainable version for future use" conversation.
dpryan is offline   Reply With Quote
Old 02-08-2016, 04:21 AM   #11
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Quote:
Originally Posted by dpryan View Post
Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.
Replicates for DE-analysis and other downstream analyses is of course important, but I was told that for variant calling it might be statistically helpful to regard them as the same sample (i.e. merging their data). What makes this argument bad, then?
ErikFas is offline   Reply With Quote
Old 02-08-2016, 10:59 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It depends on your coverage. If your per-sample coverage isn't that high then you might be better off merging them before variant calling. Then, of course, there was no point in sequencing multiple samples, you could have saved a few hundred euro and just sequenced one sample deeper.
dpryan is offline   Reply With Quote
Old 02-08-2016, 09:21 PM   #13
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

I have around 30 million reads per sample, so I'd think that's coverage enough, at least for most other RNA-seq applications... And probably for variants calls as well, or?
ErikFas is offline   Reply With Quote
Old 02-08-2016, 11:35 PM   #14
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

It'll depend on the gene. RNAseq has very uneven coverage (for obvious reasons), so you have excellent power on some genes and now power on others. It's better to do exome sequencing unless you really really really need the expression information and can't spare getting DNA and RNA from the same sample.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:56 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO