Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNA-seq variant calling and merging replicate data

    I recently started to do a bit of RNA-seq variant calling (using the GATK Best Practices pipeline), but I'm wondering as to how I should handle replicate data. If I have three replicates for a sample, for example, should I somehow merge that data to get better quality variant calls, and if so, at what level should the merging be done?

    I can imagine merging the FASTQ files before alignment, for example, but I'm not sure if that would adversely affect the variant calling step in the end. Or should merging be done after alignment, on the resulting BAM files? I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... Or at some other level?

    I imagine that merging the data at some level would increase the statistical power of the variant calls because of the added depth of additional data from several replicates, but maybe that's wrong? Is merging of data for variant calling using RNA-seq data something that could or should be done, and if so, at what level?

  • #2
    Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.

    Comment


    • #3
      People rarely do variant calling on RNAseq data, since it's somewhat problematic (e.g., your calls will be wrong in regions of allele-specific expression). That's why tophat2 doesn't have a convenient read group option. If STAR or hisat(2) do (I suspect STAR does, it has a million options) then I'd just use one of those...they're much faster anyway. Aside from that, yeah, picard or samtools (I think 1.3 added an "addreplacerg" command) are the typical tools.

      Comment


      • #4
        Tophat2 is so slow that it's faster to modify a pipeline for STAR than to run the pipeline once. Having said that, I question the usefulness of a lot of these filtering steps that you posted. Just some examples:

        Code:
        samtools view -h -F 1024 $f | awk 'substr($0,1,1)=="@"||match($0, /NH:i:1\t/) {print $0}' | samtools view -hSb - > ${fname}_unique.bam
        The alignments have MAPQ scores, so:
        Code:
         samtools view -b -F 1024 -q 5 $f > ${fname}_unique.bam
        That'll save you some significant time (tophat2/STAR/etc. give MAPQ <=3 for multimappers the last time I looked). BTW, at least the haplotype caller already ignores secondary alignments, so you really don't need to exclude them.

        I would also question the utility of writing the names of reads that overlap a repetitive element to a file (please don't tell me that there's some sort of grepping going on with those later on!). If you really want those excluded then just directly intersect the BAM file and write a BAM file as output. Having said that, if the alignments have a reasonable MAPQ then there's no reason to exclude them.

        So, in short, if the additional steps that your mentor has been doing might not work properly or would be slow, then start by thinking about whether those steps are even that useful.

        Comment


        • #5
          Originally posted by dpryan View Post
          Whether and when to merge depends on the exact nature of the technical replicates. If you just divided a single library over multiple lanes then go ahead and merge the fastq files (or the BAM files after alignment). If you made multiple libraries from a single sample, then you need to indicate that in the read group information in the BAM headers and only merge the BAM files after having done so. What I wrote is true regardless of aligner or experiment type.
          Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?

          Comment


          • #6
            Originally posted by ErikFas View Post
            Thanks for your answer! I have biological replicates from cell lines (so normal considerations as to what extent biological replicates for cell lines are "biological" apply), with a single library for each replicate. What would be approriate for my experiment, then?
            Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.

            Comment


            • #7
              Originally posted by umn_bist
              I am fairly new but I'll do my best to explain. Our project involves specifically looking at repetitive elements and direct binding between repetitive elements and p53 genes. I think what's going on here (based on your observations as I have no idea what the original code was doing) is that she is actually only interested in the overlaps (?)

              So after that coding segment, we normalize and group by element/family/class and then...
              Code:
              awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1] = k[$1] ? k[$1] FS $3 FS $5 FS $6 : $3 FS $5 FS $6}'  *_class.norm | sort -k 1,1 > all_class.norm;
              awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2] = k[$1OFS$2] ? k[$1OFS$2] FS $4 FS $6 FS $7 : $4 FS $6 FS $7}'  *_family.norm | sort -k 1,1 > all_family.norm;
              awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3] = k[$1OFS$2OFS$3] ? k[$1OFS$2OFS$3] FS $5 FS $7 FS $8 : $5 FS $7 FS $8}'  *_subfamily.norm | sort -k 1,1 > all_subfamily.norm;
              awk -F '\t' 'BEGIN{OFS="\t";} END {for (K in k) print K, k[K]}{k[$1OFS$2OFS$3OFS$4] = k[$1OFS$2OFS$3OFS$4] ? k[$1OFS$2OFS$3OFS$4] FS $6 FS $8 FS $9 : $6 FS $8 FS $9}'  *_element.norm | sort -k 1,1 > all_element.norm;
              Might I suggest replacing everything with a single python script? If you use pysam and pybedtools then (A) the performance will be decent and (B) it'd be a heck of a lot easier to debug and update

              Comment


              • #8
                Originally posted by dpryan View Post
                Either don't merge them at all, or, if you need to due to using the unified genotyper, merge the BAM files after adding appropriate read groups.
                I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?

                Comment


                • #9
                  Originally posted by ErikFas View Post
                  I'm using the HaplotypeCaller as per the Best Practices. So, you're saying that I shouldn't merge my replicate data, at all? If so, how would you handle different variant calls between the replicates in the resulting VCFs (e.g. when a variant is called in 2/3 replicates but not in the third), i.e. how would you get a "consensus" call across replicates?
                  Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.

                  Comment


                  • #10
                    Originally posted by umn_bist
                    Thank you for the suggestion! Any recommendation is very much appreciated as my first 200+ hrs with NGS has been, for the most part, debugging.
                    There comes a time when you need to sit down with whoever did things previously and have the, "right, so tell me what you're trying to accomplish with this so I can make a more polished/maintainable version for future use" conversation.

                    Comment


                    • #11
                      Originally posted by dpryan View Post
                      Just use the major allele. The point, however, of multiple samples is exactly to find out where there might be variability in the population, so you'd ideally take this into account with whatever you're doing down-stream.
                      Replicates for DE-analysis and other downstream analyses is of course important, but I was told that for variant calling it might be statistically helpful to regard them as the same sample (i.e. merging their data). What makes this argument bad, then?

                      Comment


                      • #12
                        It depends on your coverage. If your per-sample coverage isn't that high then you might be better off merging them before variant calling. Then, of course, there was no point in sequencing multiple samples, you could have saved a few hundred euro and just sequenced one sample deeper.

                        Comment


                        • #13
                          I have around 30 million reads per sample, so I'd think that's coverage enough, at least for most other RNA-seq applications... And probably for variants calls as well, or?

                          Comment


                          • #14
                            It'll depend on the gene. RNAseq has very uneven coverage (for obvious reasons), so you have excellent power on some genes and now power on others. It's better to do exome sequencing unless you really really really need the expression information and can't spare getting DNA and RNA from the same sample.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            22 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            24 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            19 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X