SEQanswers

Old 03-10-2011, 05:50 PM   #1
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 55
Default when do you pre-process Illumina reads before analysis?

I have some PE Illumina reads that I want to analyze with TopHat.
By looking at the quality plot, I see some deterioration of quality at the 3' end.

Is it advisable to trim the reads before feeding them to TopHat? If so, what criteria do I use to decide where to trim? Do I trim all reads at the same length?

Thanks
PFS
Old 03-10-2011, 11:07 PM   #2
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by PFS View Post
Is it advisable to trim the reads before feeding them to TopHat? If so, what criteria do I use to decide where to trim? Do I trim all reads at the same length?
Although alignment is less likely to break than de novo assembly, I'd still recommend trimming reads (unless the alignment tool itself does it).

Each read should be trimmed on its own merits, based on the quality score.

Typically I use an adapter removal step, a hard trim of all 'B' quality bases from the tail, removal of N calls from both ends, and a multi-base sliding window that cuts the read where the average score per base drops below 10-20, depending on the application.

I also usually drop reads that fall below a minimum length after this process (typically something like 36 bases, to give a 40-base read a reasonable chance of survival), since shorter reads are not usually informative. This leaves me with both paired reads and unpaired reads whose partner has not survived the cull.
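
A minimal sketch of the 'B'-tail trim, N removal and minimum-length filter described above (adapter removal and the sliding window are left out). It assumes Illumina 1.5-style Phred+64 quality strings, where a trailing run of 'B' marks an unreliable tail; the function names and the `min_len` parameter are hypothetical, not taken from any released tool.

```python
def trim_b_tail(seq, qual):
    """Drop the trailing run of 'B' quality characters (assumes Phred+64)."""
    keep = len(qual.rstrip("B"))
    return seq[:keep], qual[:keep]


def trim_n_ends(seq, qual):
    """Remove N calls from both ends, keeping sequence and qualities in step."""
    start = len(seq) - len(seq.lstrip("Nn"))
    end = len(seq.rstrip("Nn"))
    if end <= start:  # the read was empty or consisted only of N calls
        return "", ""
    return seq[start:end], qual[start:end]


def preprocess(seq, qual, min_len=36):
    """Apply both trims; return None when the read is too short to keep."""
    seq, qual = trim_b_tail(seq, qual)
    seq, qual = trim_n_ends(seq, qual)
    if len(seq) < min_len:
        return None
    return seq, qual
```

Each read is handled on its own, so the two mates of a pair can end up with different lengths, or one can be dropped while the other survives.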
Old 03-11-2011, 06:32 PM   #3
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 55
Default

Quote:
Originally Posted by tonybolger View Post
This gives me both paired reads and unpaired reads, where the partner has not survived the cull.
Thanks tonybolger!

One more question: when you are left with unpaired reads, do you try to remove them or do you keep them in the analysis and maybe use SAM flags to identify them?

THANKS
PFS
Old 03-17-2011, 08:36 AM   #4
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by PFS View Post
One more question: when you are left with unpaired reads, do you try to remove them or do you keep them in the analysis and maybe use SAM flags to identify them?
After filtering, I have 4 fastq files per lane: forward paired, reverse paired, forward unpaired and reverse unpaired.

The pipeline from then on generally treats the paired and unpaired data differently, e.g. with alignment tools I'd use paired mode vs single mode, but depending on the purpose, it might not make sense to use the unpaired data at all (e.g. scaffolding). On the other hand, sometimes I treat all the reads as single-ended (e.g. verifying de novo assembly, where I don't want the bias of assuming the pairing is correct to force a non-optimal alignment).

If I'm creating SAM files against a reference, I'll typically end up with 3: one for the paired data, and one for each of the two unpaired data files.
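
The four-way split can be sketched roughly as follows, assuming each mate has already been trimmed and is represented as `None` when it was dropped. The function name and the output keys are hypothetical labels for the four files.

```python
def split_survivors(pairs):
    """Route (forward, reverse) mates into four output lists.

    Each mate is a trimmed read, or None if it did not survive filtering.
    """
    out = {
        "fwd_paired": [],
        "rev_paired": [],
        "fwd_unpaired": [],
        "rev_unpaired": [],
    }
    for fwd, rev in pairs:
        if fwd is not None and rev is not None:
            # Both mates survived: keep them at the same index in both lists.
            out["fwd_paired"].append(fwd)
            out["rev_paired"].append(rev)
        elif fwd is not None:
            out["fwd_unpaired"].append(fwd)
        elif rev is not None:
            out["rev_unpaired"].append(rev)
        # If both mates were dropped, nothing is written.
    return out
```

Keeping the two paired lists in the same order matters, because paired-mode aligners generally match mates by their position in the two files.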
Old 04-01-2011, 02:36 AM   #5
Anelda
Member
 
Location: Cape Town, South Africa

Join Date: May 2010
Posts: 30
Default

Quote:
Originally Posted by tonybolger View Post
After filtering, i have 4 fastq files per lane, forward paired, reverse paired, forward unpaired and reverse unpaired.

The pipeline from then on generally treats the paired / unpaired data differently, e.g with alignment tools i'd use paired mode vs single mode, but depending on the purpose, it might not make sense to use the unpaired data at all (e.g. scaffolding). On the other hand, sometimes i treat all the reads as single ended (e.g. verifying denovo assembly, where i don't want the bias of assuming the pairing is correct to force a non-optimal alignment).

If i'm creating SAM files against a reference, i'll typically end up with 3 - one for the paired data, and one for each of the unpaired data files.
Hi TonyBolger,

Please can you tell me what software you use to do the trimming with? And did you write custom scripts to separate the paired vs unpaired into different files?

Thanks!
Anelda

Last edited by Anelda; 04-01-2011 at 02:37 AM. Reason: Wrong person addressed
Old 04-01-2011, 11:51 PM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Ideally we'd like to be able to leave the data alone and let the aligners use the quality values to determine how best to align the sequences. However, in practice we usually just trim off really bad sequence (where the majority of the library has dropped to somewhere close to Q0), since this means we can use more stringent parameters when mapping, which can greatly reduce the time taken to do the mapping. Fortunately, these days most runs stay at high quality past 50bp, which is enough for the types of experiment we run.
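
One rough way to find the point where most of the library has gone bad is to take the median quality at each cycle and cut every read at the first cycle whose median falls below a floor. This is only a sketch under that assumption; the function name and the `min_median` value are hypothetical and not a description of any particular pipeline.

```python
from statistics import median


def common_trim_length(quals, min_median=5):
    """Return how many leading cycles to keep for the whole library.

    quals is a list of per-read lists of Phred scores (integers).
    """
    longest = max((len(q) for q in quals), default=0)
    for pos in range(longest):
        # Quality of every read that is long enough to have this cycle.
        column = [q[pos] for q in quals if len(q) > pos]
        if median(column) < min_median:
            return pos
    return longest
```

Unlike per-read trimming, this gives a single cut position for all reads, so pairs stay paired and read lengths stay uniform.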
Old 04-02-2011, 10:47 PM   #7
reut
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 19
Default

Quote:
Originally Posted by tonybolger View Post
Typically i use an adapter removal step, a hard trim of all 'B' quality bases from the tail, removal of N calls from both ends, and a multi-base sliding window, typically cutting off when the average score per base drops below 10-20, depending on the application.
Can you please elaborate a little on the sliding window stage?
What size of window do you use and do you use any existing tool to do it?
thanks!
Old 04-03-2011, 11:48 PM   #8
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by Anelda View Post
Hi TonyBolger,

Please can you tell me what software you use to do the trimming with? And did you write custom scripts to separate the paired vs unpaired into different files?
It's an all-in-one custom app, which I plan to make publicly available (this week if I can get the time) since many people seem to want it.

You give it the input file(s), and a set of filtering steps, and it creates paired and unpaired output files with the appropriate trimming done.
Old 04-03-2011, 11:50 PM   #9
Jenzo
Member
 
Location: Bad Nauheim, Germany

Join Date: Feb 2011
Posts: 31
Default

Quote:
Originally Posted by tonybolger View Post
It's an all-in-one custom app, which I plan to make publicly available (this week if I can get the time) since many people seem to want it.

You give it the input file(s), and a set of filtering steps, and it creates paired and unpaired output files with the appropriate trimming done.
Would be great :-))
Old 04-04-2011, 12:25 AM   #10
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by reut View Post
Can you please elaborate a little on the sliding window stage?
What size of window do you use and do you use any existing tool to do it?
Normally I use a window width of 4 bases and a threshold of between 10 and 20 for the average quality per base within the window. It's a custom-written tool, soon to be made publicly available.
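
A rough sketch of such a window, assuming it scans from the 5' end and cuts the read at the start of the first window whose average falls below the threshold. The scan direction, the exact cut position and the names used here are assumptions, since the post only gives the window width and the threshold range.

```python
def phred_scores(qual, offset=64):
    """Decode a quality string to integers (offset 64 assumed; 33 for Sanger)."""
    return [ord(ch) - offset for ch in qual]


def sliding_window_keep(quals, window=4, min_avg=15):
    """Return the number of leading bases to keep."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start  # cut where the first failing window begins
    return len(quals)  # no window failed, or the read is shorter than the window
```

For example, with qualities `[30, 30, 30, 30, 30, 10, 10, 10, 10]` and `min_avg=15`, the first window averaging below 15 starts at index 5, so 5 bases are kept.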
Old 04-04-2011, 12:49 AM   #11
reut
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 19
Default thanks

thanks, please let us know when you publish the tool, it will be useful for us as well.
Old 04-14-2011, 10:24 AM   #12
MQ-BCBB
Member
 
Location: Maryland

Join Date: May 2009
Posts: 25
Default

@Tonybolger
Yes, such a tool would be nice to have! Thanks in advance!
Old 04-27-2011, 01:39 AM   #13
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by tonybolger View Post
It's an all-in-one custom app, which I plan to make publicly available (this week if I can get the time) since many people seem to want it.

You give it the input file(s), and a set of filtering steps, and it creates paired and unpaired output files with the appropriate trimming done.
Ah look, it's been almost a month already.

Anyway, the Trimmomatic is ready for release.

Just one issue: does anyone know if Illumina adapter and other sequences can be included in such a tool? I assume I would need to get specific clearance for this. Otherwise each user would need to find and organise the clipping sequences themselves, which is a bit of a pain.
Old 04-27-2011, 03:01 AM   #14
reut
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 19
Default FastQC includes adapters

I don't know if you can use the Illumina adapters in your tool,
but I do know the FastQC tool by Simon Andrews includes a library of adapters and possible contaminants.
If it's of any help...
Old 04-27-2011, 06:35 AM   #15
lletourn
Member
 
Location: Montreal

Join Date: Oct 2009
Posts: 63
Default

BTW, we've been using fastx for the adapter clipping, N removal and 3' trimming (no window though). It works fast and well. The only part missing, which we wrote in-house, is a pass over the files afterwards to see which pairs aren't pairs anymore.

Like tonybolger, when reads fall below ~30bp we discard them, so some pairs don't stay paired.

Our script creates 3 files: pair1, pair2 and singles.
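
A minimal sketch of that re-pairing pass, assuming the two files were trimmed independently and that mates share a read name apart from a trailing `/1` or `/2`. The function names are hypothetical, and reads are represented as `(header, seq, qual)` tuples rather than parsed from disk.

```python
def base_id(header):
    """Read name without the leading '@' and the /1 or /2 mate suffix."""
    name = header.lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def repair(reads1, reads2):
    """Split two independently filtered read lists into pair1, pair2, singles."""
    by_id2 = {base_id(r[0]): r for r in reads2}
    ids1 = {base_id(r[0]) for r in reads1}
    shared = ids1 & set(by_id2)

    # Keep pair1 in its original order and fetch the matching mates for pair2.
    pair1 = [r for r in reads1 if base_id(r[0]) in shared]
    pair2 = [by_id2[base_id(r[0])] for r in pair1]
    singles = [r for r in reads1 if base_id(r[0]) not in shared]
    singles += [r for r in reads2 if base_id(r[0]) not in shared]
    return pair1, pair2, singles
```

This holds the second file in memory, which may not be practical for a full lane; it is meant only to show the matching logic.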
Old 04-28-2011, 03:06 PM   #16
Adjuvant
Member
 
Location: Chicago, IL

Join Date: Sep 2010
Posts: 13
Default

While I've been waiting for the Trimmomatic's eventual release, I took it upon myself to crank out my own Perl script that seems to do much of what the Trimmomatic promises (except for adapter screening). My question for tonybolger is essentially this: how do you run the sliding window? Mine currently starts from the back end of the read and averages the quality of the last 4 bases (say bases 97, 98, 99, and 100). If this average is below the threshold, it clips the last base (100), moves the window backward by one base, and averages the quality of the next 4 bases (96, 97, 98, 99), repeating until the average of the four bases is above the threshold. It then does the same sliding window deal from the front end.

Does this sound similar to what you're doing? Would it make sense to slide the window up by half its size each time (i.e. clip bases 99 and 100, then average bases 95 - 98), or even by a full window size?
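
For comparison, the procedure described above can be sketched in Python with the step size as a parameter, so that sliding by one base, half a window or a full window is a one-argument change. The names are hypothetical, and treating an average exactly equal to the threshold as passing is an assumption.

```python
def trim_3prime(quals, window=4, threshold=20, step=1):
    """Clip from the 3' end; return the number of bases kept."""
    end = len(quals)
    # Clip `step` bases while the last `window` bases average below threshold.
    while end >= window and sum(quals[end - window:end]) / window < threshold:
        end -= step
    return max(end, 0)


def trim_both_ends(quals, window=4, threshold=20, step=1):
    """Trim the 3' end, then the 5' end; return (start, end) of the kept slice."""
    end = trim_3prime(quals, window, threshold, step)
    # Reversing the kept part lets the same routine trim the 5' end.
    kept = trim_3prime(quals[:end][::-1], window, threshold, step)
    return end - kept, end
```

Note that with a step of one, the window stops as soon as its average recovers, so a single poor base can survive at the edge when its three neighbours are good. Larger steps clip more per iteration but evaluate fewer windows.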

Jes' wonderin' how you (or others) have approached it.

- A

P.S. Even in its current half-baked form, my trimmer script resulted in significant decreases in my contig numbers and increases in my max contig size and N50s doing de novo assembly in Ray (with SSPACE) and improved, but less spectacular numbers in Velvet. I think I'd been working under the illusion that quality trimming was more of a factor in aligning, but that coverage-based assemblers weren't as perturbed by low quality bases so long as coverage was deep enough (which mine has been). It's certainly looking like I was wrong about that...
All times are GMT -8.