**GenoMax** · 08-14-2014, 11:17 AM

In general, adapters should not be present in your reads unless you have a poor-quality library or adapter dimers. But I suppose you have determined from FastQC analysis that adapters are present in your reads.

http://seqanswers.com/forums/showthread.php?t=42776 describes just about the simplest tool you can use to trim adapters. Trimmomatic and cutadapt (or Trim Galore, a wrapper around cutadapt) are other good options but come with a bit of a learning curve for the command-line parameters. There are separate threads for those tools.

**ronton** · 08-15-2014, 11:21 AM

I guess I should try to rephrase my question. Filtering out and/or trimming as much as possible that is not sample DNA would be a logical first step with the files from the sequencer, wouldn't it?

Would e.g. Trim galore or BBDuk be a good way to accomplish this?

You said that in general adapters should not be present. What would you recommend, a size selection step to get rid of the short fragments? The actual fragment size going into the sequencer has a peak right around 350bp and doesn't appear to be 'too broad,' using 100bp paired end Illumina rapid runs.

I posted examples of the FastQC Adapter and Kmer graphs in a FastQC thread. Your advice is appreciated.

**Brian Bushnell** · 08-15-2014, 11:47 AM

BBDuk can run in trimming mode or filtering mode. Adapters should be trimmed, while other artifacts such as spike-ins should be filtered.

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r

...will trim adapters to the right (3' end), while

bbduk.sh in=trimmed.fq out=filtered.fq ref=contam.fa stats=statistics.txt

...will filter out sequences that share kmers with that reference, and write a file "statistics.txt" telling you what was detected. For greater sensitivity you can add 'hdist=1' to allow up to 1 mismatch (or a higher value, if you want). Normally I trim adapters from fragment libraries like this:

bbduk.sh -Xmx1g in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe

The extra flags adjust the sensitivity and are documented in the shell script.

If you have paired reads in two files, you should trim both at the same time using the in1, in2, out1, and out2 flags, to prevent the loss of pairing information. From looking at your pictures, you probably DO have adapter contamination.

You do need to provide files with contaminant sequences for filtering, and it's best to provide adapter sequences for trimming, though the 'tbo' flag will allow most adapters to be trimmed even without specifying what they are. The BBMap package includes Illumina's TruSeq adapters in the /resources/ directory.

I also like to remove human contamination (when working with non-mammalian data), which is very common.

**GenoMax** · 08-15-2014, 11:47 AM

The simple answer to both of your questions is "yes". Your libraries look normal: most libraries have some adapter contamination (from short inserts, dimers, etc.) because size selection is never perfect at keeping only fragments of 300 bp or more.

Use any of the trimming programs you feel comfortable with and check the results with FastQC afterwards.

**ronton** · 08-15-2014, 01:05 PM

The data we receive is actually several .fastq.gz files per sample; FastQC calls this Casava. That is, each sample has several 'left' .fastq.gz files for the pair and matching 'right' .fastq.gz files. So there may be 3 left .fastq.gz files and 3 right .fastq.gz files for one sample.

I would have to unzip them first, correct? I know which are the left and right files of each pair, so I can enter that information into BBDuk. I would not need to merge all of the 'lefts' into one file though, right? Rather, I could just run BBDuk on one pair of files, and then on the next pair of files, and so on.

**Brian Bushnell** · 08-15-2014, 01:13 PM

BBDuk will accept gzipped input and output. And yes, you can just run 3 times, one pair at a time; no need to merge ahead of time.
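
Running one pair at a time can be scripted with a small loop. The following is a minimal sketch, not a tested pipeline: it assumes Casava-style names in which the left and right files differ only by `_R1_` versus `_R2_`, that `bbduk.sh` is on the PATH, and that `adapters.fa` is the adapter reference. The function names, the `trimmed_` output prefix, and the `DRY_RUN` switch are hypothetical.

```shell
# Derive the mate (R2) name from an R1 name.
# Assumes the name contains "_R1_" once, e.g. sample_L001_R1_001.fastq.gz.
mate_of() {
    printf '%s\n' "$1" | sed 's/_R1_/_R2_/'
}

# Trim one pair. With DRY_RUN=1 the command is printed instead of run.
# Output names keep the .gz extension, so BBDuk writes gzipped output.
trim_pair() {
    r1=$1
    r2=$(mate_of "$r1")
    set -- bbduk.sh in1="$r1" in2="$r2" \
        out1="trimmed_$(basename "$r1")" out2="trimmed_$(basename "$r2")" \
        ref=adapters.fa ktrim=r k=28 mink=13 hdist=1 tbo tpe
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Process every R1 file passed as an argument, one pair at a time.
trim_all() {
    for f in "$@"; do
        trim_pair "$f" || return 1
    done
}
```

Something like `DRY_RUN=1; trim_all *_R1_*.fastq.gz` prints the commands for review before anything is run; unsetting `DRY_RUN` executes them.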

**ronton** · 08-20-2014, 10:39 AM

I tried a few stepwise passes with BBDuk as a kind of experiment and it seems to be a definite improvement.

The original FastQC report for my sample:

After BBDuk to trim the read length to 100bp from 101 and to trim adapters from the reads:

/path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt

FastQC report:

Next, BBDuk to filter contaminants:

/path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_clean.fastq out2=/path/to/sample_R2_clean.fastq ref=/path/to/bbmap/resources/artifacts.fa.gz k=24 hdist=1 stats=/path/to/sample_stats2.txt

FastQC report:

Overall this looks much better, although there still appears to be some kmer content. One more pass to filter PhiX removed a tiny fraction of the total reads but did not seem to change much.

Any ideas on the kmer content? We can also try to address this, as well as the duplication, in the sample preparation.

So, what would be a next logical step to analyze the sample? Quality trimming, deduplication, mapping, and/or quality recalibration?
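
When passes are run stepwise like this, the filtering pass normally reads the output of the trimming pass; otherwise the filtered files still contain adapters. A minimal sketch of that chaining follows. It reuses the flags and reference file names from the commands above; the `clean_sample` and `run` functions, the sample-prefix argument, and the `DRY_RUN` switch are hypothetical.

```shell
# Directory holding the reference files (placeholder path, as in the thread).
RES=${RES:-/path/to/bbmap/resources}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Pass 1 trims adapters; pass 2 filters artifacts from the trimmed reads.
# $1 is a sample prefix, so the inputs are ${1}_R1.fastq and ${1}_R2.fastq.
clean_sample() {
    s=$1
    run bbduk.sh -Xmx2g in1="${s}_R1.fastq" in2="${s}_R2.fastq" \
        out1="${s}_R1_no_adapters.fastq" out2="${s}_R2_no_adapters.fastq" \
        ref="$RES/adapter_sequences.fa.gz" ktrim=r k=28 mink=13 hdist=1 \
        stats="${s}_stats.txt" &&
    run bbduk.sh -Xmx2g in1="${s}_R1_no_adapters.fastq" in2="${s}_R2_no_adapters.fastq" \
        out1="${s}_R1_clean.fastq" out2="${s}_R2_clean.fastq" \
        ref="$RES/artifacts.fa.gz" k=24 hdist=1 stats="${s}_stats2.txt"
}
```

Calling `clean_sample /path/to/sample` runs both passes in order and stops if the first one fails.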

**Brian Bushnell** · 08-20-2014, 12:11 PM

If you know what organism this is, and have a reference, you can try mapping to the reference and BLASTing the unmapped reads to something like nt or some database of synthetic oligos to see what they are, then filter them. Alternately, you can assemble and BLAST the contigs to see what those are; potentially some will be contaminants, which you can then remove from the reads.
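
The map-then-BLAST route might look roughly like the sketch below. It assumes BBMap and NCBI BLAST+ are installed and that a local `nt` database is available to `blastn`; the file names, the 1000-read subsample, and the `blast_unmapped` and `run` functions are illustrative choices, not something specified in this thread.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
}

# $1 = reference FASTA, $2 = R1 reads, $3 = R2 reads
blast_unmapped() {
    ref=$1 r1=$2 r2=$3
    # 1. Map the pairs; outu= receives the reads that did not map.
    run bbmap.sh ref="$ref" in1="$r1" in2="$r2" outu=unmapped.fq &&
    # 2. Subsample the unmapped reads and convert them to FASTA for BLAST.
    run reformat.sh in=unmapped.fq out=unmapped_sample.fa samplereadstarget=1000 &&
    # 3. BLAST the subsample; tabular output is easy to summarize by subject.
    run blastn -query unmapped_sample.fa -db nt -outfmt 6 -out unmapped_hits.tsv
}
```

Subjects that recur in `unmapped_hits.tsv` are candidates to add to the reference used for filtering.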

BBTools does include a deduplication tool, Dedupe, that does reference-free pair-based deduplication, but it requires a lot of memory (1kb per read). Whether that would help you is unclear, but you can try it like this:

dedupe.sh in1=r1.fq in2=r2.fq out1=dd1.fq out2=dd2.fq -Xmx30g

I also have a quality recalibration tool, but I don't see how that would help you; and you can use BBDuk to do quality-trimming or just remove the last few bases, which seem to have unusual kmer frequencies. But before doing additional preprocessing, I think it's important to know your goal - what kind of organism is this, what kind of data, what are you going to use it for, and do you have a reference? Even deduplication is inadvisable in many cases (like quantification), and it's possible that the remaining FastQC anomalies are not important, or perhaps expected from your data type. Also, posting the per-base quality profile and base frequencies would be useful.
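
For reference, the two BBDuk options mentioned here (quality-trimming, or simply removing the last few bases) could be sketched as follows. The Q10 threshold is only an illustrative value, and the function names and `DRY_RUN` switch are hypothetical; `qtrim`, `trimq`, and `ftr` are BBDuk parameters.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Quality-trim the 3' end of both reads. Arguments: in1 in2 out1 out2.
quality_trim() {
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" qtrim=r trimq=10
}

# Keep only the first $5 bases of each read. Arguments: in1 in2 out1 out2 n.
# ftr (forcetrimright) takes the 0-based index of the last base to keep,
# hence the "- 1".
keep_first_bases() {
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" ftr=$(( $5 - 1 ))
}
```

For example, `keep_first_bases r1.fq r2.fq t1.fq t2.fq 100` would cut 101 bp reads down to 100 bp.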

**ronton** · 08-21-2014, 10:18 AM

This is a human sample, normal tissue to be compared with tumor. I figure, why not start right with the easy stuff =)

We have a pipeline that we use, but I want to try going through the analysis to have a better understanding of what's going on.

Here are the quality and base profiles before trimming:

And after:

The goal is to call variants, and ultimately to identify whatever anomalies are responsible for or driving the mutations.

**Brian Bushnell** · 08-21-2014, 11:10 AM

If you want to remove more contaminants, you can try mapping to human and BLASTing some of the unmapped reads. Depending on what you discover, it may be prudent to do another filtering step.

The quality looks excellent and probably does not need any quality-trimming; for mapping + variant calling I think it's best to do quality-trimming after mapping to allow maximal information for the mapper, though that operation is slightly trickier. There is still a drift in the base frequencies toward the read tails, and that's probably due to residual adapter sequence. You can try to get rid of it by adding "tbo" and "tpe" to your adapter-trimming command:

/path/to/bbmap/bbduk.sh -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/adapter_sequences.fa.gz ktrim=r k=28 mink=13 hdist=1 stats=/path/to/sample_stats.txt tbo tpe

The problem is that currently adapters of length under 13bp were not trimmed, because BBDuk was run with 'mink=13'. It's not good to go much shorter than that as you will incur false positives. But if you have a paired-end fragment library, the 'tbo' flag will allow you to additionally trim by overlapping the two reads; this can catch adapter sequence down to 1bp long and gets rid of virtually all adapters, if the reads are high quality. The 'tpe' flag means "trim pairs evenly", so if an adapter is detected on one, it will be assumed to be in the same place on the other. These flags are not on by default because they are library-specific and should only be used with paired-end fragment libraries, not (for example) long-mate-pair libraries. Sorry for not mentioning them before!
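
The reasoning about 'mink' can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: it treats the adapter at the 3' end as exactly read length minus insert size, and ignores mismatches and read quality. The function names and labels are hypothetical.

```shell
# Bases of adapter expected at the 3' end of a read when the insert is
# shorter than the read: read_length - insert_size (0 if no read-through).
adapter_bases() {
    read_len=$1 insert=$2
    if [ "$insert" -ge "$read_len" ]; then
        echo 0
    else
        echo $(( read_len - insert ))
    fi
}

# Which mechanism can catch the adapter. Arguments: read_len insert mink.
#   kmer         : at least mink bases, so kmer matching can find it
#   overlap-only : shorter than mink, so only pair overlap (tbo) can find it
detectable_by() {
    n=$(adapter_bases "$1" "$2")
    mink=$3
    if [ "$n" -eq 0 ]; then
        echo none-needed
    elif [ "$n" -ge "$mink" ]; then
        echo kmer
    else
        echo overlap-only
    fi
}
```

With 100 bp reads and mink=13, a 95 bp insert leaves 5 adapter bases (`detectable_by 100 95 13` prints `overlap-only`), while a 60 bp insert leaves 40 (`kmer`).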

**ronton** · 08-21-2014, 02:25 PM

Thank you so much for all of your help, Brian. I am wondering about this too: whether preprocessing is necessary and how much of a difference it will make. Hopefully I can compare the two after calling variants, for example, and see any difference. I am going to try to read through Best Practices For Variant Calling With The GATK.

**Brian Bushnell** · 08-21-2014, 03:04 PM

There's no way of telling how much of a difference it will make without trying both ways. At a minimum, it should give you higher coverage and lower file sizes while allowing the mapping and variant-calling to go faster, but hopefully it will give better results, as well. A single false-positive variant due to a contaminant can waste hundreds of hours of analysis if it is in the wrong place.


Best adapter trimming?
