SEQanswers

Old 05-15-2014, 12:29 AM   #1
wokai001
Member
 
Location: Düsseldorf

Join Date: Nov 2010
Posts: 20
Revealing batch effects in RNA-seq data

Dear all,

revealing batch effects in sequencing data may be a difficult task.

We have designed a new R-package (seqTools) which may be able to detect batch effects in compressed Fastq files (and does some other standard QC tasks). The package is currently available on R-forge:

https://r-forge.r-project.org/R/?group_id=1889

from where the source and a Windows binary can be installed using the standard mechanism:

install.packages("seqTools", repos="http://R-Forge.R-project.org")
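
For readers who want to see what k-mer counting on a compressed Fastq file involves, here is a minimal Python sketch. It is not the seqTools implementation; the function name `count_kmers` and the default k=6 are assumptions made for illustration, and the code assumes plain 4-line Fastq records (no wrapped sequence lines).

```python
import gzip
from collections import Counter
from itertools import islice


def count_kmers(fastq_gz_path, k=6):
    """Count all k-mers in the sequence lines of a gzip-compressed FASTQ file.

    Hypothetical helper for illustration; assumes 4 lines per record.
    """
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            seq = line.strip().upper()
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if "N" not in kmer:  # skip k-mers with ambiguous bases
                    counts[kmer] += 1
    return counts
```

Reading through `gzip.open` means the file never has to be unpacked on disk. One count table per sample could then be compared between samples with a distance measure of your choice.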

We analysed 61 RNA-seq samples and found a remarkable prevalence of batch effects. A preprint of the results is available from

http://arxiv.org/abs/1405.0114

It would be great to get feedback or suggestions from anyone who might find this useful.

Thanks
Old 05-15-2014, 01:35 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Is there anything in particular gained from performing the clustering on the k-mers rather than on the per-gene (or whatever) counts/RPKM/TPM/etc.? The latter is pretty computationally simple (you have to calculate the various metrics anyway, so performing the clustering is just a couple commands in R) and would seem to yield more directly usable results (after all, you mention that the batch effects are amplified in the aligned reads).
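
The per-gene alternative could be sketched roughly as follows (in Python here rather than R). All names (`log_cpm`, `pearson`, `correlation_distances`) are hypothetical, and the choices of log2 counts-per-million with a pseudocount and 1 - Pearson correlation as the distance are assumptions, not something specified in this thread.

```python
import math


def log_cpm(counts, pseudo=1.0):
    """Convert one sample's raw gene counts to log2 counts-per-million."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + pseudo) for c in counts]


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distances(samples):
    """samples: dict name -> raw counts (same gene order in every list).

    Returns dict (name_a, name_b) -> 1 - r for every unordered sample pair.
    """
    profiles = {name: log_cpm(c) for name, c in samples.items()}
    names = sorted(profiles)
    return {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

The resulting pairwise distances are what a hierarchical clustering step would take as input; samples that group by flowcell or prep date rather than by biology would hint at a batch effect.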
Old 05-16-2014, 01:32 AM   #3
wokai001
Member
 
Location: Düsseldorf

Join Date: Nov 2010
Posts: 20

Good question.
k-mer clusters and gene expression clusters do not seem to behave in the same way, so they might give some additional information on how to weight data from certain samples.
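
One way to put a number on "do not behave in the same way" is to compare the two sets of cluster labels directly. The sketch below computes the adjusted Rand index between two label vectors; using this particular measure is an assumption made for illustration, not something taken from the preprint, and `adjusted_rand_index` is a hypothetical helper name.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same samples.

    1.0 means identical partitions (label names are irrelevant);
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    n = len(labels_a)
    # Number of sample pairs placed together in both / in each clustering.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case: both partitions trivial
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)
```

For example, `adjusted_rand_index(kmer_clusters, expression_clusters)` close to 1 would mean the k-mer view adds little beyond the expression view, while a low value would support the idea that it carries separate information.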

The other part of your question (why take an indirect route when the standard one is straightforward) concerns the value of that additional information. We can't give a conclusive answer yet, since our data volumes are not extensive enough.

You might want to test the method and share your experience. Any result would be valuable (and I think there's still some space in the project...).
Wolfgang
Old 05-16-2014, 05:43 AM   #4
bioBob
Member
 
Location: Virginia

Join Date: Mar 2011
Posts: 72

Hi,
When I think batch effects, I think in terms of the entire experiment, from sample collection and RNA extraction to lib prep and sequencing. I do not see any information about upstream batching, i.e., lib preps are generally done in batches of 8-48 samples. The sequencing portion is generally the least likely to introduce a batch effect. If you were to add the pre-sequencing batching information, do you get a different picture? Do you have data for aliquoted replicates of the same cell culture in large enough N to be able to measure the batch effect from processing to sequencing?

I do see a potential value in this to get an early read on the project, particularly for projects that might need a lengthy assembly. You could do something like this while the assembly is going to get some information to share with the PI.

Additionally, I am left wondering whether the processing was done on the actual raw fastq files or on fastq data converted back from BAM (mapped + unmapped reads). I have not done this, but I am always a bit wary of supposedly equivalent data files.

Finally, why not trim the reads for quality and adaptor prior to doing the clustering? I am generally less concerned with how things cluster with the raw data vs the cleaned data.


Bob
Old 05-16-2014, 05:48 AM   #5
bioBob
Member
 
Location: Virginia

Join Date: Mar 2011
Posts: 72

One last thought. From the GA days, there should be a lot of phiX technical replicates across multiple lanes and flow cells that might be interesting to push through your package.

Hmm, actually, many sequencing centers spike phiX into samples; one could extract the phiX reads across lanes/flow cells and cluster those. Too bad no one tracks phiX lots, as it would be interesting to see whether lot or lane/flow cell is the larger discriminator.
Old 05-16-2014, 06:40 AM   #6
wokai001
Member
 
Location: Düsseldorf

Join Date: Nov 2010
Posts: 20

Dear Bob,
thanks for your reply.

I am quite sure that our samples are otherwise unrelated, because there were quite long intervals between the different flowcells (several weeks or months). It is not unlikely that the "flowcell" batch effects actually arise in library preparation.

The software actually reads compressed Fastq files (the way you usually get them delivered). You don't have to unpack them. The major programming effort was needed for parsing Fastq format in C.

In order to compare mapped and unmapped reads, I extracted data from BAM using another package of mine (rbamtools). BAM stores the whole information of your reads and therefore, Fastq can 'easily' be restored from BAM. That's what I did. The rest of the analysis followed the standard Fastq procedure.
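
The 'easily' comes with a few caveats worth spelling out. In SAM/BAM, reads aligned to the reverse strand are stored reverse-complemented, and secondary or supplementary records repeat a read that already appears elsewhere. A minimal sketch of the conversion, working on SAM text lines (e.g. the output of `samtools view`) rather than through rbamtools, might look like this; `sam_line_to_fastq` is a hypothetical helper and mate suffixes for paired-end data are deliberately ignored.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(sam_line):
    """Return a 4-line FASTQ record for one SAM alignment line, or None.

    Hypothetical helper: skips secondary (0x100) and supplementary (0x800)
    records and records without stored sequence or qualities.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    if flag & 0x100 or flag & 0x800:
        return None
    if seq == "*" or qual == "*":
        return None
    if flag & 0x10:
        # Reverse-strand alignment: restore the original read orientation.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Unmapped reads (flag 0x4) pass through unchanged, so mapped and unmapped subsets can be written to separate Fastq files by testing that bit before calling the function.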

I had included a trim functionality in the package in order to find out whether quality-based trimming might remove artificial clusters, but it didn't. The data look as if a high percentage of low-quality positions indicates a problem that is also present in the high-quality reads of the same sample (this is described in the manuscript).
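
To make "quality-based trimming" concrete, here is a minimal sketch of one common variant: cutting low-quality bases from the 3' end of a read. This is not the trimming routine from the package; the function name, the threshold of 20 and the Phred+33 encoding are assumptions (older Illumina data may use an offset of 64).

```python
def trim_3prime(seq, qual, min_phred=20, offset=33):
    """Remove trailing bases whose Phred score is below min_phred.

    seq and qual are the sequence and quality strings of one FASTQ record.
    Low-quality bases inside the read are left untouched.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]
```

Re-running the k-mer counting on reads trimmed this way and checking whether the same clusters reappear is the kind of comparison described above.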

PhiX was dropped for our samples since there is no need to include these lanes for calibration with human samples.

Wolfgang
Old 05-16-2014, 06:53 AM   #7
bioBob
Member
 
Location: Virginia

Join Date: Mar 2011
Posts: 72

I guess I am not getting it. For the study with 50+ samples, were the samples processed, i.e., lib prepped, in batches corresponding to the flow cells? We avoid this whenever possible, even if samples need to stay in the freezer for a prolonged time. Are you also saying that the samples were run one per lane?

We see some sequencing centers still require a phiX spike for clinical samples.
Old 05-16-2014, 08:27 AM   #8
wokai001
Member
 
Location: Düsseldorf

Join Date: Nov 2010
Posts: 20

Yes, Bob, each flowcell represents a single library preparation. The fibroblasts were cultivated short-term, and incoming samples were sequenced once 8 samples (enough for one flowcell) had accumulated.

There was one sample per lane (no multiplexing).

Our first two flowcells also included PhiX (those flowcells were excluded due to severe quality issues); PhiX was dropped later on. All flowcells mentioned here were run without PhiX.
Wolfgang
Old 05-20-2014, 07:56 AM   #9
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482

Aren't k-mers indicative of adapter artifacts (dimers mostly)?

You can still have decent data from your library if these are filtered out.
Old 05-26-2014, 11:18 PM   #10
wokai001
Member
 
Location: Düsseldorf

Join Date: Nov 2010
Posts: 20

We haven't looked for adapters in our data, and I have not found any substantial documentation about adapters that would let me give a sufficient answer (do you know of any?).

A crude way to view k-mer contamination is to look at the distribution of the k-mers (plotKmerCount). We see AAAAAAAAA and TTTTTTTTT at roughly 10 times the mean k-mer count (poly-A).
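
The same check can be done numerically instead of by eye. The following sketch flags k-mers whose count is at least a given multiple of the mean count; it is a stand-in for inspecting the plot, not what plotKmerCount does internally. The helper name and the default factor of 10 are assumptions, and the mean is taken over observed k-mers only.

```python
def overrepresented_kmers(counts, factor=10.0):
    """Return (kmer, count / mean) pairs with count >= factor * mean count.

    counts: dict mapping k-mer -> count for one sample.
    Result is sorted by ratio, largest first.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    hits = [(kmer, n / mean) for kmer, n in counts.items() if n >= factor * mean]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Homopolymers such as poly-A showing up at the top of this list would match the observation above; adapter-derived k-mers, if present in large numbers, would be expected to appear there as well.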

Another noticeable effect is the bias from random hexamer priming, which can be seen using plotNucFreq( [object] , [i] ,maxx=15). This effect is described in:

Biases in Illumina transcriptome sequencing caused by random hexamer priming
Nucleic Acids Research, 2010, Vol. 38, No. 12 e131; doi:10.1093/nar/gkq224
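
For anyone wanting to check this bias without the package, the underlying quantity is just the base composition per read position. A minimal sketch, assuming the reads are already available as strings and using a hypothetical helper name; the default of 15 positions mirrors the maxx=15 used above:

```python
from collections import Counter


def positional_base_freqs(reads, max_pos=15):
    """Relative frequency of A, C, G, T at each of the first max_pos positions.

    reads: iterable of sequence strings. Returns a list of dicts, one per
    position; ambiguous bases such as N are ignored.
    """
    tallies = [Counter() for _ in range(max_pos)]
    for read in reads:
        for pos, base in enumerate(read[:max_pos].upper()):
            if base in "ACGT":
                tallies[pos][base] += 1
    freqs = []
    for tally in tallies:
        total = sum(tally.values())
        freqs.append({b: (tally[b] / total if total else 0.0) for b in "ACGT"})
    return freqs
```

With random hexamer priming, the frequencies at the first dozen or so positions typically deviate from the flat profile seen further into the read.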

Besides this, I did not notice (but also haven't searched extensively for) other artifacts (except the batch effects...).

The adapters are presumably removed during alignment, since the aligned reads match nearly perfectly.

Wolfgang

Tags
rna-seq batch effect
