  • Revealing batch effects in RNA-seq data

    Dear all,

    Revealing batch effects in sequencing data can be difficult.

    We have designed a new R package (seqTools) that may be able to detect batch effects in compressed Fastq files (and performs some other standard QC tasks). The package is currently available on R-Forge:



    from where the source and a Windows binary can be installed using the standard mechanism:

    install.packages("seqTools", repos="http://R-Forge.R-project.org")

    We analysed 61 RNA-seq samples and found a remarkable prevalence of batch effects. A preprint of the results is available from

    Batch effects, artificial sources of variation due to experimental design, are a widespread phenomenon in high throughput data. Therefore, mechanisms for detection of batch effects are needed requiring comparison of multiple samples. We apply hierarchical clustering (HC) on DNA k-mer counts of multiple RNA-seq derived Fastq files. Ideally, HC generated trees reflect experimental treatment groups and thus may indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. In order to provide a simple applicable tool we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. DNA k-mer counts were analysed on 61 Fastq files containing RNA-seq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced on 8 different Illumina Flowcells.

    Results: Pairwise comparison of all Flowcells with hierarchical clustering revealed strong Flowcell based tree separation in 6 (21 %) and detectable Flowcell based clustering in 17 (60.7 %) of 28 Flowcell comparisons. In our samples, batch effects were also present in reads mapped to the human genome. Filtering reads for high quality (Phred >30) did not remove the batch effects.

    Conclusions: Hierarchical clustering of DNA k-mer counts provides a quality criterion and an unspecific diagnostic tool for RNA-seq experiments.
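
    To make the idea in the abstract concrete, here is a minimal Python sketch of clustering samples by their k-mer counts. It is not seqTools code. The Canberra distance, the average-linkage rule, k=3, the function names and the toy reads are all assumptions chosen for illustration; the abstract does not say which distance or linkage was used.

```python
from itertools import combinations, product

def kmer_counts(reads, k=3):
    """Count every DNA k-mer (A/C/G/T only) over an iterable of read strings."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:  # k-mers containing N are skipped
                counts[kmer] += 1
    return counts

def canberra(a, b):
    """Canberra distance between two count dicts, on relative frequencies (assumed measure)."""
    total_a, total_b = sum(a.values()) or 1, sum(b.values()) or 1
    d = 0.0
    for kmer in a:
        x, y = a[kmer] / total_a, b[kmer] / total_b
        if x + y > 0:
            d += abs(x - y) / (x + y)
    return d

def average_linkage(names, dist):
    """Naive agglomerative clustering; returns the tree as nested tuples."""
    clusters = [(n, [n]) for n in names]  # (subtree, leaf members)
    while len(clusters) > 1:
        def avg(i, j):
            a, b = clusters[i][1], clusters[j][1]
            return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))
        i, j = min(combinations(range(len(clusters)), 2), key=lambda p: avg(*p))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical toy data: two samples per "flowcell", with different base composition.
samples = {
    "fc1_s1": ["ATATATATAT", "AATTAATTAA"],
    "fc1_s2": ["ATATATTATA", "TTAATTAATT"],
    "fc2_s1": ["GCGCGCGCGC", "GGCCGGCCGG"],
    "fc2_s2": ["GCGCGGCGCG", "CCGGCCGGCC"],
}
counts = {name: kmer_counts(reads, k=3) for name, reads in samples.items()}
dist = {frozenset(p): canberra(counts[p[0]], counts[p[1]])
        for p in combinations(counts, 2)}
tree = average_linkage(list(counts), dist)
print(tree)
```

    In this toy case the tree splits by flowcell label, which is the pattern the abstract reads as a batch effect. With 8 flowcells there are 28 possible flowcell pairs, which matches the denominator quoted in the results.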


    It would be great to get feedback or suggestions from anyone who might find this useful.

    Thanks

  • #2
    Is there anything in particular gained from performing the clustering on the k-mers rather than on the per-gene (or whatever) counts/RPKM/TPM/etc.? The latter is pretty computationally simple (you have to calculate the various metrics anyway, so performing the clustering is just a couple commands in R) and would seem to yield more directly usable results (after all, you mention that the batch effects are amplified in the aligned reads).

    • #3
      Good question.
      k-mer clusters and gene expression clusters do not seem to behave in the same way. So they might give some (additional) information on how to weight data from certain samples.

      The other part of your question (why take a detour when the standard route is direct) concerns the value of that additional information. We can't give a conclusive answer yet because our data set is not large enough.

      You might want to test the method and share your experience. Any result would be valuable (and I think there's still some space in the project...).
      Wolfgang

      • #4
        Hi,
        When I think batch effects, I think in terms of the entire experiment from sample collection, RNA extraction, lib prep and sequencing. I do not see any information about upstream batching, i.e. lib preps are generally done in batches of 8-48 samples. The sequencing portion is generally the least likely to introduce a batch effect. If you were to add the pre-sequencing batching information, do you get a different picture? Do you have data for aliquoted replicates of the same cell culture in large enough N to be able to measure the batch effect from processing to sequencing?

        I do see a potential value in this to get an early read on the project, particularly for projects that might need a lengthy assembly. You could do something like this while the assembly is going to get some information to share with the PI.

        Additionally, I am left wondering whether the processing was done on the actual raw fastq files or on mapped+unmapped data converted from bam to fastq. I have not done this, but I am always a bit wary of supposedly equivalent data files.

        Finally, why not trim the reads for quality and adaptor prior to doing the clustering? I am generally less concerned with how things cluster with the raw data vs the cleaned data.


        Bob

        • #5
          One last thought. From the GA days, there should be a lot of phiX technical replicates across multiple lanes and flow cells that might be interesting to push through your package.

          Hmm, actually, many sequencing centers spike phiX into samples, so one could extract the phiX reads across lanes/flow cells and cluster those. Too bad no one tracks phiX lots, as it would be interesting to see whether lot or lane/flow cell is the larger discriminator.

          • #6
            Dear Bob,
            thanks for your reply.

            I am quite sure that our samples are otherwise unrelated because there were long intervals (several weeks or months) between the different flowcells. It is not unlikely that the "flowcell" batch effects actually arise during library preparation.

            The software actually reads compressed Fastq files (the way you usually get them delivered). You don't have to unpack them. The major programming effort was needed for parsing Fastq format in C.
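
            As an illustration of reading compressed Fastq record by record with low memory use, here is a minimal Python sketch. It is not how seqTools does it (the package parses the format in C); the function name and the assumption of strict four-line records are mine.

```python
import gzip

def read_fastq(path):
    """Yield (name, sequence, quality) one record at a time.

    Assumes four-line FASTQ records (no wrapped sequences); reads .gz or plain files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            plus = handle.readline().rstrip("\n")
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            if len(seq) != len(qual):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            yield header[1:], seq, qual
```

            Because the function is a generator, only one record is held in memory at a time, so counts can be accumulated while iterating over a file of any size.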

            In order to compare mapped and unmapped reads, I extracted data from BAM using another package of mine (rbamtools). BAM stores the whole information of your reads and therefore, Fastq can 'easily' be restored from BAM. That's what I did. The rest of the analysis followed the standard Fastq procedure.
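
            For readers wondering what restoring Fastq from BAM involves, the sketch below works on a single SAM text line (as printed by a tool such as samtools view) rather than on BAM, and does not use rbamtools. The function name is hypothetical. The one non-obvious step is that reads aligned to the reverse strand (FLAG bit 0x10) are stored reverse-complemented, so sequence and qualities have to be flipped back.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_line_to_fastq(line):
    """Convert one SAM alignment line to a FASTQ record string.

    Returns None for secondary (0x100) and supplementary (0x800) alignments,
    so that each read is emitted at most once per primary record.
    """
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x900:
        return None
    if flag & 0x10:  # read was aligned to the reverse strand
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"

# Hypothetical reverse-strand record: stored as AACG, originally read as CGTT.
print(sam_line_to_fastq("r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#"))
```

            One caveat to the word 'easily': if the aligner hard-clipped a read, the clipped bases are no longer in that record, so the original read cannot be recovered from it.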

            I had included trimming functionality in the package in order to find out whether quality-based trimming might remove the artificial clusters, but it didn't. It actually looks as if a high percentage of low-quality positions indicates a problem that is also present in the high-quality reads of the same sample (this is described in the manuscript).

            PhiX was abandoned in our samples since there is no need to include these lanes for calibration in human samples.
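
            A rough Python sketch of what quality-based trimming can look like is given below. It is not the package's trimming routine. The Phred+33 offset, the threshold of 30 and the function names are assumptions; older Illumina files use a +64 offset.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores (offset 33 assumed)."""
    return [ord(c) - offset for c in qual]

def trim_3prime(seq, qual, min_q=30, offset=33):
    """Cut trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def low_quality_fraction(qual, min_q=30, offset=33):
    """Fraction of positions in a read with Phred score below min_q."""
    scores = phred_scores(qual, offset)
    return sum(s < min_q for s in scores) / len(scores) if scores else 0.0

# 'I' encodes Phred 40 and '#' encodes Phred 2 under the +33 offset.
print(trim_3prime("ACGTACGT", "IIIII###"))
print(low_quality_fraction("IIIII###"))
```

            Comparing the per-sample low-quality fraction with the cluster a sample ends up in is one way to look for the association described above.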

            Wolfgang

            • #7
              I guess I am not getting it. For the study with 50+ samples, were the samples processed, i.e. lib prepped, in batches corresponding to the flow cells? We avoid this whenever possible, even if samples need to stay in the freezer for a prolonged time. Are you also stating that the samples were run 1 per lane?

              We see some sequencing centers still require a phiX spike for clinical samples.

              • #8
                Yes, Bob, each flowcell represents a single library preparation. The fibroblasts were cultivated short-term, and incoming samples were sequenced once 8 samples (enough for one flowcell) had accumulated.

                There was one sample per lane (no multiplexing).

                Our first two flowcells also included PhiX; those flowcells were excluded because of severe quality issues, and PhiX was dropped afterwards. All flowcells mentioned were run without PhiX.
                Wolfgang

                • #9
                  Aren't k-mers indicative of adapter artifacts (dimers mostly)?

                  You can still have decent data from your library if these are filtered out.

                  • #10
                    We haven't looked for adapters in our data, and I have not found good documentation about adapters that would let me give a proper answer (do you know of any?).

                    A crude way to check for k-mer contamination is to look at the distribution of the k-mers (plotKmerCount). We see AAAAAAAAA and TTTTTTTTT (poly-A) at roughly 10 times the mean k-mer count.

                    Another noticeable effect is random hexamer priming, which can be seen using plotNucFreq( [object] , [i] ,maxx=15). This effect is described in:
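
                    To show what "a factor relative to the mean k-mer count" means, here is a small Python sketch. It is not what plotKmerCount computes internally. Taking the mean over all 4^k possible k-mers (rather than only the observed ones) is an assumption, as are the function name and the toy reads.

```python
from collections import Counter

def kmer_enrichment(reads, k=9):
    """Return {k-mer: count / mean count}, with the mean taken over all 4**k k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N
                counts[kmer] += 1
    mean = sum(counts.values()) / 4 ** k
    return {kmer: n / mean for kmer, n in counts.items()} if mean else {}

# Toy example with k=2 so the numbers are easy to check by hand:
# AA occurs 5 times, AC, CG and GT once each; mean = 8 / 16 = 0.5.
ratios = kmer_enrichment(["AAAAAA", "ACGT"], k=2)
print(sorted(ratios.items(), key=lambda kv: -kv[1]))
```

                    A homopolymer run such as poly-A stands out because every window inside it contributes the same k-mer.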

                    Biases in Illumina transcriptome sequencing caused by random hexamer priming
                    Nucleic Acids Research, 2010, Vol. 38, No. 12 e131; doi:10.1093/nar/gkq224
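
                    A plot of this kind is built from per-position nucleotide frequencies. The Python sketch below computes them for the first positions of each read. It is not the plotNucFreq implementation; the function name and the toy reads are hypothetical.

```python
def position_nuc_freq(reads, max_pos=15):
    """Relative A/C/G/T frequency at each of the first max_pos read positions."""
    counts = [dict.fromkeys("ACGT", 0) for _ in range(max_pos)]
    for read in reads:
        for pos, base in enumerate(read[:max_pos]):
            if base in counts[pos]:  # ignore N and other symbols
                counts[pos][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (n / total if total else 0.0) for b, n in c.items()})
    return freqs

freqs = position_nuc_freq(["ACGT", "AGGT", "ACTT", "TCGA"], max_pos=4)
for pos, f in enumerate(freqs, start=1):
    print(pos, f)
```

                    Without positional bias the four frequencies would be roughly constant along the read; a priming bias shows up as a pattern confined to the first few positions.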

                    Besides this I did not notice (but also haven't searched extensively for) other artifacts (except the batch effects...).

                    The adapters are presumably removed during alignment, since the aligned reads match nearly perfectly.

                    Wolfgang
