**dpryan** · 05-15-2014, 12:35 AM

Is there anything in particular gained from performing the clustering on the k-mers rather than on the per-gene (or whatever) counts/RPKM/TPM/etc.? The latter is pretty computationally simple (you have to calculate the various metrics anyway, so performing the clustering is just a couple commands in R) and would seem to yield more directly usable results (after all, you mention that the batch effects are amplified in the aligned reads).

**wokai001** · 05-16-2014, 12:32 AM

Good question.
k-mer clusters and gene expression clusters do not seem to behave in the same way, so they might give some (additional) information on how to weight data from certain samples.
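
For readers who want to see what clustering on k-mers involves, here is a minimal Python sketch, not the package's own code. It counts k-mers per sample and compares two samples by the Canberra distance on relative counts. The function names and the choice of Canberra distance are assumptions made for illustration; the thread does not say which metric is used.

```python
from collections import Counter
from itertools import product


def kmer_counts(reads, k):
    """Count all k-mers over A/C/G/T in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N
                counts[kmer] += 1
    return counts


def canberra(counts_a, counts_b, k):
    """Canberra distance between two relative k-mer count profiles.

    Assumption: this metric is chosen for illustration only.
    """
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    dist = 0.0
    for kmer in map("".join, product("ACGT", repeat=k)):
        a = counts_a[kmer] / total_a
        b = counts_b[kmer] / total_b
        if a + b > 0:  # k-mers absent from both samples contribute nothing
            dist += abs(a - b) / (a + b)
    return dist
```

A pairwise distance matrix built this way over all samples can be fed to any hierarchical clustering routine, just as one would do with a matrix derived from per-gene counts.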

The other part of your question (why take a detour when the standard route is straight) concerns the value of that additional information. We cannot give a conclusive answer yet because our data volumes are not large enough.

You might want to test the method and share your experience. Any result would be valuable (and I think there's still some space in the project...).
Wolfgang

**bioBob** · 05-16-2014, 04:43 AM

Hi,
When I think batch effects, I think in terms of the entire experiment from sample collection, RNA extraction, lib prep and sequencing. I do not see any information about upstream batching, i.e. lib preps are generally done in batches of 8-48 samples per batch. The sequencing portion is generally the least likely to introduce a batch effect. If you were to add the pre-sequencing batching information, do you get a different picture? Do you have data for aliquoted replicates of the same cell culture in large enough N to be able to measure the batch effect from processing to sequencing?

I do see a potential value in this to get an early read on the project, particularly for projects that might need a lengthy assembly. You could do something like this while the assembly is going to get some information to share with the PI.

Additionally, I am left wondering if the processing was done on the actual raw fastq files or on fastq data converted back from the mapped+unmapped bam (bam->fastq)? I have not done this, but am always a bit wary of supposedly equivalent data files.

Finally, why not trim the reads for quality and adaptor prior to doing the clustering? I am generally less concerned with how things cluster with the raw data vs the cleaned data.

Bob

**bioBob** · 05-16-2014, 04:48 AM

One last thought. From the GA days, there should be a lot of phiX technical replicates across multiple lanes and flow cells that might be interesting to push through your package.

Hmm, actually, many sequencing centers spike phiX into their samples; one could extract the phiX reads across lanes/flow cells and cluster those. Too bad no one tracks phiX lots, as it would be interesting to see whether lot or lane/flow cell is the larger discriminator.

**wokai001** · 05-16-2014, 05:40 AM

Dear Bob,
thanks for your reply.

I am quite sure that our samples are otherwise unrelated because there were quite long intervals between the different flowcells (several weeks or months). It is not unlikely that the "flowcell" batch effects actually arise in library preparation.

The software actually reads compressed Fastq files (the way you usually get them delivered). You don't have to unpack them. The major programming effort was needed for parsing Fastq format in C.
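
As a rough illustration of reading compressed Fastq without unpacking it first (in Python rather than the package's C code), the sketch below streams records straight from a `.gz` file using the standard `gzip` module. The function name `read_fastq` is hypothetical, and the parser assumes the common four-lines-per-record layout.

```python
import gzip


def read_fastq(path):
    """Yield (name, sequence, quality) from a plain or gzip-compressed FASTQ file.

    Assumes four lines per record (no wrapped sequence or quality lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual  # drop the leading '@'
```

Because the function is a generator, only one record is held in memory at a time, which matters for files with many millions of reads.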

In order to compare mapped and unmapped reads, I extracted data from BAM using another package of mine (rbamtools). BAM stores the whole information of your reads and therefore, Fastq can 'easily' be restored from BAM. That's what I did. The rest of the analysis followed the standard Fastq procedure.
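
For anyone wary of the round trip, the one detail that matters is strand: reads aligned to the reverse strand (SAM flag bit 0x10) are stored reverse-complemented, with their qualities reversed. The sketch below is not rbamtools; it is a hypothetical Python helper that restores one Fastq record from a tab-separated SAM text line. It assumes primary alignments without hard clipping, so secondary and supplementary records would need to be filtered out beforehand.

```python
# Complement table for reverse-complementing stored sequences
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(sam_line):
    """Rebuild a FASTQ record from one SAM alignment line.

    Assumes a primary alignment with no hard clipping.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:  # read was aligned to the reverse strand
        seq = seq.translate(_COMPLEMENT)[::-1]  # undo reverse complement
        qual = qual[::-1]  # qualities follow the sequence orientation
    return f"@{name}\n{seq}\n+\n{qual}\n"
```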

I had included a trim functionality in the package in order to find out whether quality-based trimming might remove artificial clusters, but it didn't. The data actually look as if a high percentage of low-quality positions indicates a problem that is also present in the high-quality reads of the same sample (this is described in the manuscript).
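
To show what quality-based trimming means here, a minimal Python sketch follows. It is not the package's trim function, whose exact rule is not given in this thread; it simply removes bases from the 3' end while their Phred score is below a threshold. The function name, the threshold of 20 and the Phred+33 encoding are all assumptions.

```python
def trim_tail(seq, qual, min_phred=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Assumes Phred+33 encoded qualities; min_phred is an arbitrary default.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]
```

Re-running the k-mer counting on trimmed reads and comparing the resulting clusters with the untrimmed ones is the kind of check described above.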

PhiX was abandoned in our samples since there is no need to include these lanes for calibration in human samples.

Wolfgang

**bioBob** · 05-16-2014, 05:53 AM

I guess I am not getting it. For the study with 50+ samples, were the samples processed, i.e. lib prepped, in batches corresponding to the flow cells? We avoid this when at all possible, even if samples need to stay in the freezer for a prolonged time. Are you also stating that the samples were run 1 per lane?

We see some sequencing centers still require a phiX spike for clinical samples.

**wokai001** · 05-16-2014, 07:27 AM

Yes, Bob, the flowcells represent single library preparations. The fibroblasts were cultivated short term, and the incoming samples were sequenced once 8 samples (enough for one flowcell) had accumulated.

There was one sample per lane (no multiplexing).

Our first two flowcells also included PhiX (they were excluded due to severe quality issues); PhiX was dropped later on. All flowcells mentioned here were run without PhiX.
Wolfgang

**NextGenSeq** · 05-20-2014, 06:56 AM

Aren't k-mers indicative of adapter artifacts (dimers mostly)?

You can still have decent data from your library if these are filtered out.

**wokai001** · 05-26-2014, 10:18 PM

We haven't looked for adapters in our data, and I have not found substantial documentation about adapters that would let me give a sufficient answer (do you know of any?).

A crude way to view k-mer contamination is to look at the distribution of the k-mers (plotKmerCount). We see an abundance of AAAAAAAAA and TTTTTTTTT at about 10 times the mean k-mer count (poly-A).
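
The same kind of check can be sketched outside the package in a few lines of Python. This is not plotKmerCount; it is a hypothetical helper that flags k-mers whose count is at least a given factor above the mean. One assumption to note: the mean is taken over the k-mers actually observed, not over all 4^k possible ones.

```python
from collections import Counter


def overrepresented_kmers(reads, k, factor=10.0):
    """Return {kmer: count} for k-mers at or above factor times the mean count.

    The mean is computed over observed k-mers only (an assumption).
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {kmer: n for kmer, n in counts.items() if n >= factor * mean}
```

With k=9 and factor=10, poly-A stretches like the ones described above would be expected to show up as AAAAAAAAA and TTTTTTTTT in the returned dictionary.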

Another noticeable effect is random hexamer priming, which can be seen using plotNucFreq([object], [i], maxx=15). This effect is described in:

Biases in Illumina transcriptome sequencing caused by random hexamer priming
Nucleic Acids Research, 2010, Vol. 38, No. 12 e131; doi:10.1093/nar/gkq224
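
To illustrate what such a plot is based on, here is a minimal Python sketch that computes the relative nucleotide frequency at each of the first read positions. It is not plotNucFreq itself; the function name and the `max_pos` parameter are hypothetical and only mirror the idea of restricting the view to the first 15 positions.

```python
def position_nuc_freq(reads, max_pos=15):
    """Relative A/C/G/T frequency at each of the first max_pos read positions."""
    counts = [dict.fromkeys("ACGT", 0) for _ in range(max_pos)]
    for read in reads:
        for pos, base in enumerate(read[:max_pos]):
            if base in counts[pos]:  # ignore N and other symbols
                counts[pos][base] += 1
    freqs = []
    for col in counts:
        total = sum(col.values())
        freqs.append({b: (n / total if total else 0.0) for b, n in col.items()})
    return freqs
```

With unbiased priming the four frequencies should be roughly flat across positions; a bias from random hexamer priming would appear as deviations in the first positions that level off further into the read.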

Besides this I did not notice (but also haven't extensively searched for) other artifacts (except the batch effects...).

The adapters are presumably removed during alignment since the aligned reads match nearly perfectly.

Wolfgang

Revealing batch effects in RNA-seq data