SEQanswers

Old 11-07-2015, 09:54 PM   #1
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Problem with alignment: I can only align 10% of reads (CLIP data, tophat/bowtie)

Dear all,

I'm new to CLIP analysis, so I want to work through a CLIP data processing pipeline to learn how it is done and maybe improve part of the pipeline in the future.
I got the data from GEO: GSE41288. It's a HITS-CLIP dataset in which the authors want to reveal miR-155-dependent AGO protein binding sites. But when I tried to align the reads to the mm9 genome, I found that I could map only 10% of the reads using bowtie or tophat. The commands I used are as follows.

tophat -p 8 --read-mismatches 5 --read-edit-dist 5 -o /output/MapResult/${name} /data/mm9/mm9 /data/miR155/FASTQ/${i}

bowtie -n 3 -e 150 -l 20 -p 8 /data/mm9/mm9 /data/miR155/FASTQ/${i} --un /output/BowtieResult_new/${name}/${name}.not_hit.fastq > /output/BowtieResult_new/${name}/${name}.hit.sam

I think I already set the threshold of mismatches quite high. Could someone give me some suggestions?

Thanks

Yue
Old 11-07-2015, 10:19 PM   #2
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Sorry, I found that I should trim 6 nucleotides from the 5' end of each read.
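
For reference, a minimal Python sketch of that trimming step, assuming a standard 4-line FASTQ layout. The function name `trim_five_prime` and the example record are made up for illustration. It removes the first `n` bases and the matching quality characters from each record.

```python
def trim_five_prime(fastq_lines, n=6):
    """Yield FASTQ lines with the first n bases and qualities removed.

    Assumes plain 4-line records (header, sequence, '+', qualities).
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        # Line 2 of each record is the sequence, line 4 the quality string.
        if i % 4 in (1, 3):
            line = line[n:]
        yield line


# Hypothetical record: a 6-nt barcode followed by 12 nt of insert.
record = ["@read1", "ACGTACGGGTTTAAACCC", "+", "IIIIIIHHHHHHGGGGGG"]
trimmed = list(trim_five_prime(record, n=6))
print("\n".join(trimmed))
```

bowtie (version 1) can also do this at alignment time with its `-5/--trim5` option, so a separate trimming pass may not be needed there.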
Old 11-07-2015, 11:24 PM   #3
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Update:
After adopting the parameters given on the dataset's webpage, I still get only about 10% of the reads mapped to the genome. Is this normal for CLIP data?
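
To be explicit about how a figure like "10% mapped" can be computed, here is a rough Python sketch. It assumes the aligner output is SAM and still contains the unaligned records (bowtie 1 only writes SAM when run with `-S/--sam`). The function name `mapped_fraction` and the toy records are hypothetical. It counts each read once by skipping secondary and supplementary lines, then checks the "unmapped" FLAG bit (0x4).

```python
def mapped_fraction(sam_lines):
    """Return the fraction of primary SAM records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) record
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy records with only the first columns filled in.
example = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100",
    "r2\t4\t*\t0",
    "r3\t16\tchr2\t500",
    "r3\t256\tchr3\t700",
    "r4\t4\t*\t0",
]
fraction = mapped_fraction(example)
print(fraction)
```

If the unaligned reads were written to a separate file instead (as with `--un`), the denominator has to come from the input read count rather than from the SAM file.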
Old 11-08-2015, 04:38 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

If this is a published data set have you tried to follow the method authors describe in their publication?
Old 11-08-2015, 04:43 AM   #5
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
If this is a published data set have you tried to follow the method authors describe in their publication?
Yes, I used the parameters they describe. They just discard the 6-nucleotide barcode at the 5' end.
Old 11-08-2015, 04:53 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

This is a perpetual bioinformatics data reproducibility issue (assuming the directions/settings are clear and you are exactly following them).

You are probably using the latest tophat/bowtie etc, which may not match what the authors used at the time of publication. You could go down the path of exactly matching the versions but not sure if that would be worth the trouble.

Looks like you are going to have to re-do the analysis again.
Old 11-08-2015, 05:09 AM   #7
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
This is a perpetual bioinformatics data reproducibility issue (assuming the directions/settings are clear and you are exactly following them).

You are probably using the latest tophat/bowtie etc, which may not match what the authors used at the time of publication. You could go down the path of exactly matching the versions but not sure if that would be worth the trouble.

Looks like you are going to have to re-do the analysis again.
Ok, I'll try. Thanks.
Old 11-08-2015, 05:29 AM   #8
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

You could try running fastqc, to check for the presence of any remaining adapter sequences or very low quality bases that should be trimmed before aligning.
Old 11-08-2015, 10:28 PM   #9
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Update:
After trimming the first 6 nucleotides, I tried tophat and novoalign, which can map junction reads. But their results are quite different. For one replicate, tophat reports only 2 million mapped reads while novoalign reports about 15 million. Which should I believe? I used default parameters for both.
Old 11-09-2015, 04:29 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

You should be using parameters described in the original paper otherwise there is no chance of replicating the result.

Since you are going to do an independent analysis with your samples you should set a pipeline up that works for you. Remember to adequately describe (version numbers, settings) when you publish.

As an outside chance it is always possible that the original publication has an error in the analysis. You could correspond with the authors (making it clear that you are only trying to adapt their pipeline for your use) and see if they can provide some additional clarification on what is going on.
Old 11-09-2015, 04:31 AM   #11
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
You should be using parameters described in the original paper otherwise there is no chance of replicating the result.

Since you are going to do an independent analysis with your samples you should set a pipeline up that works for you. Remember to adequately describe (version numbers, settings) when you publish.

As an outside chance it is always possible that the original publication has an error in the analysis. You could correspond with the authors (making it clear that you are only trying to adapt their pipeline for your use) and see if they can provide some additional clarification on what is going on.
Thanks for your suggestions.
I'll re-read the paper and do exactly what they did.
Old 11-09-2015, 04:35 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Sounds like you have spent enough time working on this data so no harm in checking with the authors. Most will be more than happy to help as long as you ask nicely.
Old 11-09-2015, 04:38 AM   #13
610617109
Member
 
Location: Beijing

Join Date: Nov 2015
Posts: 10
Default

Quote:
Originally Posted by GenoMax View Post
Sounds like you have spent enough time working on this data so no harm in checking with the authors. Most will be more than happy to help as long as you ask nicely.
Yes, I thought about it... but I'm afraid the problem is too naive.
I'll e-mail the authors if I fail to map most of the reads again.
Thank you. You're very kind.
Old 11-09-2015, 07:09 AM   #14
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 177
Default

Hi,

Are you sure you have to discard only the first 6 nucleotides? Usually for CLIP, people add more nucleotides: 4 N (which allow colony recognition if the library was sequenced with Illumina tech), and then the barcode...

Quite easy to check: just take the first 10 nucleotides of all the reads and count the different sequences you get...

edit: I just checked, it was sequenced with Illumina tech...
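
A minimal Python sketch of the check described above, assuming a plain 4-line FASTQ. The function name `count_prefixes` and the example reads are invented for illustration. It tallies the first `k` nucleotides of every read so the most frequent prefixes can be inspected.

```python
from collections import Counter


def count_prefixes(fastq_lines, k=10):
    """Count the first k nucleotides of every read in 4-line FASTQ input."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[line.strip()[:k]] += 1
    return counts


# Three hypothetical reads; real input would be an open FASTQ file handle.
reads = [
    "@r1", "ACGTTTGACAAAAAAA", "+", "IIIIIIIIIIIIIIII",
    "@r2", "GGCATTGACACCCCCC", "+", "IIIIIIIIIIIIIIII",
    "@r3", "ACGTTTGACAGGGGGG", "+", "IIIIIIIIIIIIIIII",
]
counts = count_prefixes(reads, k=10)
for prefix, n in counts.most_common(5):
    print(prefix, n)
```

If positions 5 to 10 turn out to be dominated by a handful of sequences while positions 1 to 4 look random, that would be consistent with 4 random nucleotides followed by a barcode, which would mean trimming 10 nucleotides rather than 6.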

Last edited by SylvainL; 11-09-2015 at 07:11 AM.
Tags
alignment, bowtie, clip-seq, tophat
