SEQanswers

Old 02-19-2016, 06:47 AM   #1
guilhem
Member
 
Location: USA

Join Date: Feb 2016
Posts: 10
Clip adapter Hisat2

Hi,

I am analyzing reads obtained from ribosome profiling experiments.
I first need to clip the adapter before mapping my reads.
This step, however, is very time-consuming with fastx_clipper.
I am wondering whether there is a faster way to do it; doing it directly in hisat2, for instance, would be awesome.
Thanks for your advice,

G
Old 02-19-2016, 07:15 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

Look at BBDuk.sh from BBMap. It should be intuitive to use and fast. You would want to process paired-end data files together if you have that kind of data.
Old 02-19-2016, 07:21 AM   #3
guilhem

Thanks, GenoMax.
I saw this interesting post before posting. Before trying it, I was just wondering whether hisat2 has this function natively, since I saw it can trim and 'soft clip', which I thought was similar to clipping adapters.
Old 02-19-2016, 07:27 AM   #4
GenoMax

Soft-clipping won't actually remove data. In that sense it is not the same thing as clipping adapter sequences using a dedicated trimming program.
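To make the distinction concrete, here is a minimal Python sketch, assuming a SAM-style CIGAR string. In the SAM format, soft-clipped bases (operation S) are excluded from the alignment but remain in the record's sequence field, whereas a trimming program deletes them from the read before mapping. The CIGAR value and the function names are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> int:
    """Count the bases marked 'S' (soft-clipped) in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def stored_sequence_length(cigar: str) -> int:
    """Length of the stored read sequence implied by the CIGAR.

    M, I, S, = and X consume read bases; D, N, H and P do not.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


# Hypothetical record: 30 footprint bases aligned, 17 adapter bases soft-clipped.
cigar = "30M17S"
print(soft_clipped_bases(cigar))      # 17 bases ignored by the alignment
print(stored_sequence_length(cigar))  # 47 bases still stored in the record
```

The adapter bases are masked in the alignment but are still carried along in the output, which is why soft-clipping is not a substitute for producing clean sequence files.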
Old 02-19-2016, 07:36 AM   #5
guilhem

Thanks a lot for the link. So it seems that I need to clip in an additional step before mapping with hisat2. I am going to use BBDuk.sh. Thanks, GenoMax!
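As a rough sketch of what that extra step could look like, the Python helper below only assembles a BBDuk command line for right-end (3') adapter trimming with a literal adapter sequence. The file names, the helper name and the specific values of mink and minlength are assumptions for illustration, not recommendations from this thread; check them against the BBDuk documentation for your data.

```python
def bbduk_trim_command(reads_in, reads_out, adapter, min_length=20):
    """Build (but do not run) a bbduk.sh call that trims a 3' adapter.

    All values here are illustrative assumptions; tune them for real data.
    """
    # The k-mer length cannot exceed the adapter length; cap it at 31.
    k = min(len(adapter), 31)
    return [
        "bbduk.sh",
        f"in={reads_in}",
        f"out={reads_out}",
        f"literal={adapter}",      # adapter given directly instead of a FASTA file
        "ktrim=r",                 # trim the adapter match and everything to its right
        f"k={k}",
        "mink=8",                  # allow shorter matches at the read end (assumed value)
        "hdist=0",                 # exact k-mer matches only
        f"minlength={min_length}", # drop reads shorter than this after trimming (assumed value)
    ]


# Hypothetical file names; the adapter is the one quoted later in this thread.
cmd = bbduk_trim_command("reads.fastq", "clipped.fastq", "CTGTAGGCACCATCAAT")
print(" ".join(cmd))
```

With BBTools installed, the list could be passed to subprocess.run(cmd, check=True); it is not executed here.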
Old 02-19-2016, 07:47 AM   #6
GenoMax

You don't have to trim, but if you need clean sequence files, then a pass through a trimming program will keep that data available.
Old 02-19-2016, 07:51 AM   #7
guilhem

But if I do not clip the adapters, mapping will be biased by the adapter sequence, won't it?
Old 02-19-2016, 08:05 AM   #8
GenoMax

If the adapter contamination is short/minimal, then the aligner should be able to manage, but if you know you have short inserts, adapter dimers, etc., then it would be best to trim independently. I like to pass all data through a trimming program. If there is no contamination, then the only thing invested is a bit of time.

Last edited by GenoMax; 02-19-2016 at 08:10 AM.
Old 02-19-2016, 08:09 AM   #9
guilhem

My adapter is CTGTAGGCACCATCAAT, which is quite long, I think. My reads are about 30 nt after clipping. I also need perfect mapping (no mismatches), so I think clipping is necessary here. I am trying the software you advised. Thanks, GenoMax!
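For readers who want to see what clipping does to a single read, here is a minimal Python sketch using the adapter quoted above. It handles exact matches only (no mismatches or indels inside the adapter), which is a simplification compared with dedicated trimmers; the function name, the min_overlap value and the example read are assumptions for illustration.

```python
ADAPTER = "CTGTAGGCACCATCAAT"  # adapter quoted in the post above


def clip_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 5) -> str:
    """Remove a 3' adapter (exact match only) and everything after it.

    A full copy of the adapter is searched for first. Otherwise, a prefix of
    the adapter of at least `min_overlap` bases sitting at the very end of
    the read is removed. Reads without an adapter are returned unchanged.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial adapter at the read end: try the longest prefix first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Hypothetical 30-nt footprint followed by the adapter and trailing bases.
footprint = "ACGT" * 7 + "AC"
print(clip_adapter(footprint + ADAPTER + "AAA") == footprint)
```

A pure-Python loop like this would be slow on a large dataset; it is only meant to show the operation that a tool such as BBDuk performs at scale.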
Old 02-19-2016, 08:13 AM   #10
GenoMax

What was the original read length (if post clip is 30 bp)? Is this miRNA data?
Old 02-19-2016, 12:03 PM   #11
guilhem

After clipping, the read length is around 30 nt. This is ribo-seq data (ribosome footprints: the RNA fragments covered by the ribosome).
Old 02-19-2016, 06:19 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

If you need perfect mapping, then absolutely, adapter-trimming is crucial. In general, requiring perfect mapping will incur sequence-dependent bias (as sequencing error rates are sequence-dependent), but that's more of an issue with long reads and may not matter with 30bp reads. Still, it also might matter since ribosomal sequences are typically low-diversity which makes them especially susceptible to sequence-dependent errors.

So... why are you requiring perfect mapping?
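The read-length effect mentioned above can be illustrated with a small Python calculation. It assumes a uniform, independent per-base error rate, which is a deliberate simplification: as the post points out, real error rates are sequence-dependent. The 1% error rate and the function name are assumptions for illustration only.

```python
def fraction_error_free(read_length: int, per_base_error: float) -> float:
    """Expected fraction of reads with zero sequencing errors.

    Assumes every base has the same, independent error probability.
    """
    return (1.0 - per_base_error) ** read_length


# Assumed 1% per-base error rate, compared across read lengths.
for length in (30, 100, 250):
    print(length, round(fraction_error_free(length, 0.01), 3))
```

Under these assumptions roughly three quarters of 30-nt reads are error-free, against about a third of 100-nt reads, so a perfect-match filter discards far more data as reads get longer.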
Old 02-19-2016, 06:31 PM   #13
guilhem

Thanks Brian.
I am not very familiar with NGS data analysis, so I tried to apply the exact protocol described in the original paper. Ribosome profiling is a technique for tracking translational pausing (Ingolia 2009). Translation is frozen at a given time t and the uncovered messenger RNA is digested, which leaves only the ribosome footprints, i.e. the parts of the messenger covered by the ribosome. These footprints are sequenced, and I am using the SRA data from that sequencing.
In the original method introduced by Ingolia et al. 2009, they clipped the adapter, mapped to the genome assembly, and kept only reads with a perfect match (retaining only reads with NM tag = 0).
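The perfect-match filter described above could be sketched in Python as follows, assuming the aligner writes the optional NM tag (edit distance to the reference) into its SAM output. The function names and the example records are made up for illustration, and this is a sketch of the filtering idea, not the code used in the original paper.

```python
def is_perfect_match(sam_line: str) -> bool:
    """True if a mapped SAM record carries the optional tag NM:i:0."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    if int(fields[1]) & 0x4:
        return False  # FLAG bit 0x4 means the read is unmapped
    # Optional tags start at the 12th column.
    return "NM:i:0" in fields[11:]


def keep_perfect(lines):
    """Yield header lines and perfectly matching records only."""
    for line in lines:
        if line.startswith("@") or is_perfect_match(line):
            yield line


# Hypothetical records: one perfect match and one with a single mismatch.
base = ["r1", "0", "chr1", "100", "60", "30M", "*", "0", "0", "A" * 30, "I" * 30]
perfect = "\t".join(base + ["NM:i:0"])
mismatch = "\t".join(base + ["NM:i:1"])
print(len(list(keep_perfect([perfect, mismatch]))))
```

Note that NM is computed over the aligned part of the read, so a read whose adapter was soft-clipped rather than trimmed can still report NM:i:0.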

As I said, I am not very familiar with NGS data, so I tried to follow the original protocol closely. I only switched to hisat2 because I found bowtie2 and tophat rather slow.
Old 02-19-2016, 06:41 PM   #14
GenoMax

You should add BBMap alignment as well. I wonder what fraction of your reads would be straight alignment and what fraction would have a splice site, with just 30 nt to work with. @Brian may have a suggestion about parameters to use with BBMap.
Old 02-19-2016, 07:05 PM   #15
Brian Bushnell

Quote:
Originally Posted by GenoMax
@Brian may have a suggestion about parameters to use with BBMap.
Normally, I use the defaults. But for 30bp ribosomal reads, you could add "maxindel=10" (just a random small number I picked). Searching for long indels (which BBMap does by default) is not necessary when aligning to ribosomes (which as far as I know are never spliced); it decreases both speed and sensitivity. BBMap does have a "perfectmode" flag which allows only perfect alignments, but I do not really think it is appropriate in this case (or most situations, especially those involving quantification).

There are a lot of papers written by people who do not fully understand all aspects of what they are doing - who can, these days, in any paper that is not purely theoretical? Often people try to make choices they think are safer and more conservative, overriding the suggested defaults, to minimize risk of a paper being rejected because something was hard to describe or explain. Particularly, in bioinformatics, it is common for people to throw out all reads with any mismatches, or quality-trim to Q30 prior to mapping, etc. These are almost never good ideas! They are typically devised by biologists on the assumption that "My data has variable quality, and is annotated with its actual quality. Therefore, if I throw away low-quality data, my results will be strictly better."

This is absolutely wrong, as it relies on a lot of implicit assumptions (that quality is unrelated to sequence, that quality scores are correct, that trimming low-quality bases yields better mapping, that differences between a read and the reference are due to errors, etc) which may seem obvious, but are false.

I am not trying to slam biologists here - they are experts in their field. It's just important to understand that being an expert in biology does not also make one an expert in statistics, or photonics, or any of the other numerous areas that go into bioinformatics. So, bioinformatics papers written, reviewed, and published solely by biologists will often have subtle errors in the non-biological part of the methodology - as in this case.
Old 02-20-2016, 06:56 AM   #16
guilhem

Thanks for all of your advice!
I did not know about the BBMap software, thank you!
Is it faster than Hisat2? I have billions of reads to map, although for now I think I will try to restrict my mapping to the transcriptome (especially for eukaryotic genomes); the full genome is very slow.
Old 02-20-2016, 07:18 AM   #17
GenoMax

If you have a multi-core machine, BBMap will be fast. Use the threads=N option to start with available threads.
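Putting the suggestions from this thread together, a BBMap call might be assembled as in the Python sketch below. It uses the threads and maxindel options named in the posts above together with BBMap's in, ref and out arguments; the file names and the helper name are hypothetical, and the command is only built, not run.

```python
import os


def bbmap_command(reads, reference, out_sam, threads=None, maxindel=10):
    """Build (but do not run) a bbmap.sh call with the options from this thread."""
    if threads is None:
        # Fall back to the number of cores reported by the OS.
        threads = os.cpu_count() or 1
    return [
        "bbmap.sh",
        f"in={reads}",
        f"ref={reference}",
        f"out={out_sam}",
        f"threads={threads}",
        f"maxindel={maxindel}",  # small value suggested above for 30bp reads
    ]


# Hypothetical file names for clipped reads and a transcriptome reference.
cmd = bbmap_command("clipped.fastq", "transcriptome.fa", "mapped.sam", threads=8)
print(" ".join(cmd))
```

With BBTools installed, the list could be handed to subprocess.run(cmd, check=True).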

Tags
bioinformactics, clipping, hisat2
