Seqanswers Leaderboard Ad

**dpryan** · 02-08-2016, 01:12 AM

Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.

**dpryan** · 02-08-2016, 02:43 AM

People rarely do variant calling on RNAseq data, since it's somewhat problematic (e.g., your calls will be wrong in regions of allele-specific expression). That's why tophat2 doesn't have a convenient read group option. If STAR or hisat(2) do (I suspect STAR does, it has a million options) then I'd just use one of those...they're much faster anyway. Aside from that, yeah, picard or samtools (I think 1.3 added an "addreplacerg" command) are the typical tools.

**dpryan** · 02-08-2016, 03:25 AM

Tophat2 is so slow that it's faster to modify a pipeline for STAR than to run the pipeline once. Having said that, I question the usefulness of a lot of these filtering steps that you posted. Just some examples:

Code:

samtools view -h -F 1024 $f | awk 'substr($0,1,1)=="@"||match($0, /NH:i:1\t/) {print $0}' | samtools view -hSb - > ${fname}_unique.bam

The alignments have MAPQ scores, so:

Code:

 samtools view -b -F 1024 -q 5 $f > ${fname}_unique.bam

That'll save you some significant time (tophat2/STAR/etc. give MAPQ <=3 for multimappers the last time I looked). BTW, at least the haplotype caller already ignores secondary alignments, so you really don't need to exclude them.

I would also question the utility of writing the names of reads that overlap a repetitive element to a file (please don't tell me that there's some sort of grepping going on with those later on!). If you really want those excluded then just directly intersect the BAM file and write a BAM file as output. Having said that, if the alignments have a reasonable MAPQ then there's no reason to exclude them.

So, in short, if the additional steps that your mentor has been doing might not work properly or would be slow, then start by thinking about whether those steps are even that useful.

**ErikFas** · 02-08-2016, 03:55 AM

Originally posted by dpryan View Post

Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.

Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?

**dpryan** · 02-08-2016, 04:02 AM

Originally posted by ErikFas View Post

Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?

Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.

**dpryan** · 02-08-2016, 04:07 AM

Originally posted by umn_bist

I am fairly new but I'll do my best to explain. Our project involves specifically looking at repetitive elements and direct binding between repetitive elements and p53 genes. I think what's going on here (based on your observations as I have no idea what the original code was doing) is that she is actually only interested in the overlaps (?)

So after that coding segment, we normalize and group by element/family/class and then...

Code:

awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1] = k[$1] ? k[$1] FS $3 FS $5 FS $6 : $3 FS $5 FS $6}'  *_class.norm | sort -k 1,1 > all_class.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2] = k[$1OFS$2] ? k[$1OFS$2] FS $4 FS $6 FS $7 : $4 FS $6 FS $7}'  *_family.norm | sort -k 1,1 > all_family.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3] = k[$1OFS$2OFS$3] ? k[$1OFS$2OFS$3] FS $5 FS $7 FS $8 : $5 FS $7 FS $8}'  *_subfamily.norm | sort -k 1,1 > all_subfamily.norm;
awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3OFS$4] = k[$1OFS$2OFS$3OFS$4] ? k[$1OFS$2OFS$3OFS$4] FS $6 FS $8 FS $9 : $6 FS $8 FS $9}'  *_element.norm | sort -k 1,1 > all_element.norm;

Might I suggest replacing everything with a single python script? If you use pysam and pybedtools then (A) the performance will be decent and (B) it'd be a heck of a lot easier to debug and update

**ErikFas** · 02-08-2016, 04:11 AM

Originally posted by dpryan View Post

Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.

I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?

**dpryan** · 02-08-2016, 04:14 AM

Originally posted by ErikFas View Post

I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?

Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.

**dpryan** · 02-08-2016, 04:34 AM

Originally posted by umn_bist

Thank you for the suggestion! Any recommendation is very much appreciated as my first 200+ hrs with NGS has been, for the most part, debugging.

There comes a time when you need to sit down with whoever did things previously and have the, "right, so tell me what you're trying to accomplish with this so I can make a more polished/maintainable version for future use" conversation.

**ErikFas** · 02-08-2016, 05:21 AM

Originally posted by dpryan View Post

Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.

Replicates for DE-analysis and other downstream analyses is of course important, but I was told that for variant calling it might be statistically helpful to regard them as the same sample (i.e. merging their data). What makes this argument bad, then?

**dpryan** · 02-08-2016, 11:59 AM

It depends on your coverage. If your per-sample coverage isn't that high then you might be better off merging them before variant calling. Then, of course, there was no point in sequencing multiple samples, you could have saved a few hundred euro and just sequenced one sample deeper.

**ErikFas** · 02-08-2016, 10:21 PM

I have around 30 million reads per sample, so I'd think that's coverage enough, at least for most other RNA-seq applications... And probably for variants calls as well, or?

**dpryan** · 02-09-2016, 12:35 AM

It'll depend on the gene. RNAseq has very uneven coverage (for obvious reasons), so you have excellent power on some genes and now power on others. It's better to do exome sequencing unless you really really really need the expression information and can't spare getting DNA and RNA from the same sample.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

RNA-seq variant calling and merging replicate data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News