**Brian Bushnell** · 08-26-2015, 10:44 AM

Since your reads are variable length, they should have already been trimmed. But perhaps the trimming did not work very well, due to (for example) low quality. If the reads have adapter contamination, you can find it using BBMerge:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt outadapter=adapter.fa reads=4m

The insert size histogram will also be informative (insert sizes shorter than read length indicate adapter contamination). Ideally (in this case), very few of your reads will even overlap, so they won't merge. Once this finishes, you can try trimming the sequences with BBDuk like this:

bbduk.sh in1=read1.fq in2=read2.fq ref=adapter.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

...which will report the number of reads with adapter sequence. You can alternatively (or additionally) use the adapter sequence file distributed with BBDuk, since it contains all standard Illumina adapters, but you never know what a random provider used.

Since your target fragment lengths were, at a minimum, 300bp, there should be virtually zero adapter sequence present in 125bp reads. If there is, it indicates that your target insert sizes were probably not hit, or short fragments were not correctly removed. If you see adapter contamination in these trimmed reads, there was probably a serious problem upstream and you may need to request the sequencing to be redone, or else trim them correctly starting with the raw, untrimmed reads.
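A minimal sketch of the insert-size check described above: it reads the histogram written by `ihist=ihist.txt` and reports what fraction of the merged pairs have an insert shorter than the read length. The file layout is an assumption here (header and summary lines starting with `#`, then one whitespace-separated `size count` pair per line), and the function name `fraction_shorter_than` is made up for illustration. Note that the histogram only covers pairs that actually merged, so the result is a fraction of merged pairs, not of all pairs.

```python
def fraction_shorter_than(ihist_text, read_length):
    """Return the fraction of merged pairs whose insert size is below read_length."""
    short = 0
    total = 0
    for line in ihist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/summary lines (assumed '#' prefix)
        fields = line.split()
        size, count = int(fields[0]), int(fields[1])
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0


# Toy histogram with invented counts, only to show the assumed layout.
example = "#Mean\t200.0\n#InsertSize\tCount\n80\t10\n120\t30\n200\t40\n300\t20\n"
print(fraction_shorter_than(example, 125))
```

For a real run you would pass the contents of `ihist.txt`, for example `fraction_shorter_than(open("ihist.txt").read(), 125)`. With 125bp reads and a 300bp target, a value far above zero would point to the short-insert problem discussed above.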

**henriettevdz** · 08-26-2015, 09:33 PM

Thanks for your reply Brian! I really appreciate your help.

I forgot to add that the PE libraries all have a fixed read length of 125bp; only the LJD libraries are variable length and were trimmed by the service provider.

The biggest problem we have is that the over-represented kmers are found in the middle of the read, and they cover only part of the adapter sequence rather than all of it. The regions before and after the over-represented sequence are of good quality.

Thanks!
Henriette

**Brian Bushnell** · 08-27-2015, 09:28 AM

Well, just because the over-represented kmers are reported as being shorter than the adapter sequence does not mean the entire adapter is not present. I suggest you try adapter-trimming the reads using the adapter set included with BBDuk and see if that resolves the problem.

**kmcarr** · 08-28-2015, 04:28 AM

Originally posted by henriettevdz:

"The biggest problem we have is that the over represented Kmers are found in the middle of the read and it isn't the whole adapter sequence, but only a part of it."

The FastQC Kmer plot shows only the 6 most abundant kmers. It is very likely that all kmers for the full adapter are over-represented; it just so happens that the ones from that spot in the middle are the most abundant. Examine the full FastQC report (fastqc_data.txt) and you will likely be able to reconstruct all or most of the adapter from the full list of abundant kmers.
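A rough sketch of that reconstruction, under stated assumptions: the Kmer module in fastqc_data.txt is taken to start with a `>>Kmer Content` line, end with `>>END_MODULE`, and list one tab-separated kmer per row after a `#` header line. Both function names are hypothetical, and the greedy chaining (extend by any kmer that overlaps the current sequence by k-1 bases) is a simple stand-in for proper assembly, so it can go wrong on repetitive sequence.

```python
def parse_kmer_module(fastqc_text):
    """Collect kmer sequences from the '>>Kmer Content' module (assumed layout)."""
    kmers = []
    in_module = False
    for line in fastqc_text.splitlines():
        if line.startswith(">>Kmer Content"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            kmers.append(line.split("\t")[0])  # first column is the kmer
    return kmers


def extend_kmers(kmers):
    """Greedily chain equal-length kmers that overlap by k-1 bases, starting from the first."""
    remaining = set(kmers)
    contig = kmers[0]
    remaining.discard(contig)
    grew = True
    while grew:
        grew = False
        for kmer in sorted(remaining):
            if contig.endswith(kmer[:-1]):      # kmer extends the contig to the right
                contig += kmer[-1]
            elif contig.startswith(kmer[1:]):   # kmer extends the contig to the left
                contig = kmer[0] + contig
            else:
                continue
            remaining.discard(kmer)
            grew = True
            break
    return contig


# Toy input: 7-mers from the start of the standard Illumina adapter, in
# arbitrary order, with a kmer from the middle listed first.
toy = ["TCGGAAG", "AGATCGG", "GAAGAGC", "GATCGGA", "CGGAAGA", "ATCGGAA", "GGAAGAG"]
print(extend_kmers(toy))
```

On real data you would feed `parse_kmer_module(open("fastqc_data.txt").read())` into `extend_kmers` and compare the result against known adapter sequences.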

**henriettevdz** · 08-30-2015, 08:23 AM

BBMerge trim

Dear Brian,

We have trimmed the adapter sequences and I've attached the two FastQC kmer content files from the same runs (one sample only, from the 300 and 550 libraries). The number of sequences was reduced from around 50 000 000 to 4 000 000. Is this what we could expect from the data?

Thanks!
Henriette

**Brian Bushnell** · 08-31-2015, 08:35 AM

To be honest, I tend to find the overrepresented kmer graphs fairly confusing and rely more on the base frequency by position. If the total number of sequences was reduced from 50 million to 4 million then you have a major problem with the raw data and it needs to be re-run (possibly with higher molecular weight input DNA). Or did you mean 40 million?

It would be helpful to see the mapping results to an assembly, but with 4m reads, you won't get much of an assembly. So...

First off, can you post the stderr (console) output of BBDuk?

Second, it would be useful if you could run BBMerge on the raw input and post the console output, and attach the insert size histogram, for the 300bp and 550bp libraries, like this:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt reads=4m

Also, the entire FastQC report from before trimming (in PDF, or, at least, the base composition histogram) would be useful. It seems like maybe you have a huge number of adapter-dimers, or very short inserts.

Advice needed on De novo sequences Kmer content