SEQanswers

Old 12-29-2009, 05:35 AM   #1
ramouz87
Member
 
Location: Doha, Qatar

Join Date: Oct 2009
Posts: 35
Unexpectedly high number of chromosomal translocations from paired-end data

Hi,

I'm carrying out an analysis of paired-end human cancer data (50 bp reads, 200-500 bp gap) generated on a GA II sequencer to find fusion genes.

For this dataset I:
- Align single reads to the hg19 reference using bowtie (-m1 --best --strata), keeping only the best (unique) mapping for each read
- Filter out poly-T/A stretches longer than 20 bases
- Match pairs of reads based on their ID
- Remove duplicates
- Keep pairs whose reads map to different chromosomes



I get the attached contingency table, which reports the chromosome each read of a pair maps to.
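A minimal sketch of the pairing and tabulation steps above, not the poster's actual code. It assumes each single-end alignment has already been reduced to a hypothetical `(read_id, chromosome, position)` row, that mates share an ID apart from a trailing `/1` or `/2`, and that "duplicate" means two pairs with identical coordinates on both mates. The poly-T/A filter is left out.

```python
from collections import Counter


def strip_mate_suffix(read_id):
    """Drop a trailing /1 or /2 so both mates share one key."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def load_hits(rows):
    """Map read ID -> (chromosome, position) from (id, chrom, pos) rows."""
    return {strip_mate_suffix(rid): (chrom, pos) for rid, chrom, pos in rows}


def interchromosomal_table(hits_1, hits_2):
    """Count de-duplicated pairs whose mates map to different chromosomes."""
    seen = set()
    table = Counter()
    # Only reads that aligned uniquely in BOTH files can be paired.
    for rid in hits_1.keys() & hits_2.keys():
        a, b = hits_1[rid], hits_2[rid]
        if a[0] == b[0]:
            continue  # same chromosome: not a translocation candidate
        if (a, b) in seen:
            continue  # identical coordinates on both mates: treat as duplicate
        seen.add((a, b))
        table[(a[0], b[0])] += 1
    return table


# Toy data (made up for illustration): r4 duplicates r1, r2 is intrachromosomal.
mate1 = load_hits([("r1/1", "chr1", 100), ("r2/1", "chr1", 500),
                   ("r3/1", "chr2", 40), ("r4/1", "chr1", 100)])
mate2 = load_hits([("r1/2", "chr5", 900), ("r2/2", "chr1", 800),
                   ("r3/2", "chr9", 70), ("r4/2", "chr5", 900)])
table = interchromosomal_table(mate1, mate2)
for (chrom_a, chrom_b), count in sorted(table.items()):
    print(chrom_a, chrom_b, count)
```

Because each mate is aligned on its own here, nothing ties a read to the location of its partner, so every unique but wrong placement of a single mate ends up in this table.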

The table shows that the number of chromosomal translocations is much higher than expected, so further filtering is needed to get rid of artifacts. But I'm unable to understand what is causing these artifacts.

Can you help me understand why there is such a high number of artifacts?

Thanks in advance.

Regards,
Ramzi
Attached Files
File Type: txt SV_table_PE_hg19.m1bs.txt (2.4 KB, 67 views)
Old 12-29-2009, 07:59 AM   #2
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

Do you have any data regarding the number of multiple hits/ambiguous alignments you are seeing? You say you are taking unique best hits, but what if the next best one (e.g. with one mismatch) is where it should be relative to its pair mate? How many unmated pairs are you seeing (one read aligns but its mate does not at default bowtie parameters)?

Have you tried doing a paired-end alignment using Bowtie and just subtracting the reads that align from the pool before doing your analysis?

Have you tried this against refseq sequences instead of the genome?
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--

Last edited by Zigster; 12-29-2009 at 08:33 AM.
Old 12-29-2009, 09:43 AM   #3
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I bet most of these translocations are misalignments. To find SVs, I would suggest two-phase alignment:

1) Fast alignment: align PE reads with bowtie/bwa in the paired-end mode.

2) Accurate alignment: align aberrant read pairs and singletons with a more accurate aligner such as novoalign. The aligner in use should be able to produce mapping quality.

If you are mainly interested in translocations where both ends map to unique regions, you should set a high threshold on mapping quality (e.g. 35-40). I am not sure what people do when repeats are involved. See this figure for why mapping quality helps to greatly reduce false alignments.
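The mapping-quality threshold can be sketched on plain SAM text as follows. This is an illustration under assumptions, not lh3's method: it assumes the aligner writes one SAM record per mate with a meaningful MAPQ in column 5, and the function names and the default of 35 are chosen here for illustration.

```python
def parse_sam_line(line):
    """Return (qname, rname, mapq) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[2], int(fields[4])


def confident_interchromosomal(sam_lines, min_mapq=35):
    """Return names of pairs with both mates >= min_mapq on different chromosomes."""
    mates = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        qname, rname, mapq = parse_sam_line(line)
        if rname == "*":
            continue  # unmapped record
        mates.setdefault(qname, []).append((rname, mapq))
    kept = []
    for qname, hits in mates.items():
        if len(hits) != 2:
            continue  # singletons or extra records are skipped in this sketch
        (chrom_a, q_a), (chrom_b, q_b) = hits
        if chrom_a != chrom_b and min(q_a, q_b) >= min_mapq:
            kept.append(qname)
    return kept


# Toy records (made up): p2 is dropped because one mate has MAPQ 3.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "p1\t65\tchr1\t100\t60\t50M\tchr5\t900\t0\t*\t*",
    "p1\t129\tchr5\t900\t60\t50M\tchr1\t100\t0\t*\t*",
    "p2\t65\tchr2\t10\t60\t50M\tchr7\t300\t0\t*\t*",
    "p2\t129\tchr7\t300\t3\t50M\tchr2\t10\t0\t*\t*",
]
kept = confident_interchromosomal(example)
print(kept)
```

Requiring the minimum of the two MAPQ values to pass means a pair is only as trustworthy as its weaker mate.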
Old 12-29-2009, 03:19 PM   #4
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

I second lh3's suggestion. This is nearly identical to the approach I use. One further caveat I should mention is that even after using BWA and Novoalign, there can remain pairs that appear to be aberrant owing to misalignment or chimeric molecules. To mitigate the latter, I cluster aberrant pairs (say, requiring two or more supporting pairs per cluster) under the assumption that chimeras occur randomly. I then realign the supporting pairs in all clusters with megablast or something similar (using ridiculously sensitive settings).
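One way the clustering idea could look, as a rough sketch rather than Aaron's actual implementation: each aberrant pair is assumed to be a hypothetical `(chrom_a, pos_a, chrom_b, pos_b)` tuple that is already oriented consistently, and the 500 bp window is an arbitrary illustrative value. The grouping is a simple greedy pass that compares each pair only with the previous one, so it can split clusters that a real SV caller would merge.

```python
def cluster_pairs(pairs, window=500, min_support=2):
    """Greedily group pairs whose two ends both lie within `window` bp
    of the previous pair, then keep groups with enough support."""
    ordered = sorted(pairs, key=lambda p: (p[0], p[2], p[1]))
    clusters, current = [], []
    for pair in ordered:
        if current:
            last = current[-1]
            same_chroms = pair[0] == last[0] and pair[2] == last[2]
            close = (abs(pair[1] - last[1]) <= window
                     and abs(pair[3] - last[3]) <= window)
            if same_chroms and close:
                current.append(pair)
                continue
            clusters.append(current)
        current = [pair]
    if current:
        clusters.append(current)
    # Isolated pairs are the ones most likely to be random chimeras.
    return [c for c in clusters if len(c) >= min_support]


# Toy data (made up): three pairs support one chr1/chr5 junction, two are isolated.
pairs = [
    ("chr1", 1000, "chr5", 2000),
    ("chr1", 1100, "chr5", 2150),
    ("chr1", 1200, "chr5", 1900),
    ("chr3", 50, "chr8", 70),
    ("chr1", 90000, "chr5", 400),
]
supported = cluster_pairs(pairs)
print(len(supported), [len(c) for c in supported])
```

The pairs in each surviving cluster are the ones that would then go on to the sensitive realignment step.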


Also, are you sure they are suggesting translocations? They can also be retrotransposon insertions that have occurred in your test DNA, but are not present in the reference genome. AluYs, LINEs and SVAs are still active.

Aaron
Old 12-31-2009, 01:12 AM   #5
ramouz87
Member
 
Location: Doha, Qatar

Join Date: Oct 2009
Posts: 35
Default

Hi Zigster,
Thanks for your answer.
I'm new to the field, so I'm still experimenting with aligners and trying to understand how they work.
I've changed the settings to -m1 -n0; with these options we keep only reads that align to a unique position in the reference with no mismatch. We get the following statistics:
for s_1_1_sequence.fq
# reads processed: 16479658
# reads with at least one reported alignment: 10592189 (64.27%)
# reads that failed to align: 3406969 (20.67%)
# reads with alignments suppressed due to -m: 2480500 (15.05%)

for s_1_2_sequence.fq
# reads processed: 16479673
# reads with at least one reported alignment: 10372746 (62.94%)
# reads that failed to align: 3704063 (22.48%)
# reads with alignments suppressed due to -m: 2402864 (14.58%)

When aligning in paired-end mode with -m1 -n0 -X1000 (-X sets the maximum insert size) I got a very poor alignment rate:
# reads processed: 16479658
# reads with at least one reported alignment: 947283 (5.75%)
# reads that failed to align: 15495410 (94.03%)
# reads with alignments suppressed due to -m: 36965 (0.22%)

This is surprising, because if I take the single-end alignments and match them by read ID, more than 3.2 million pairs match, even after filtering out duplicates and poly(A/T) reads.
I've attached 2 plots of the gap between reads.
In any case, pairs positioned within the normal range are automatically set aside, along with all pairs mapping to the same chromosome.

Hi lh3,
Thanks for your reply.
As mentioned above, for some reason the paired-end alignment with bowtie is giving an unexpected result.
I was thinking of shortcutting step one: take only the IDs of the read pairs mapping to different chromosomes in my analysis, extract those reads from the fastq files, and run novoalign on that selection. Do you think that's a good idea?
For the mapping quality, is it the -l parameter in novoalign that should be set to 35-40?
The default is log4(hg size / 2) + 5 = 20.xx
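As a quick arithmetic check of the formula quoted above (taken as stated in this post, not verified against the novoalign documentation), and assuming an approximate human genome size of 3.1 billion bases, which is a figure not given in the thread:

```python
import math

# Assumption: approximate hg19 length in bases; not a number from the thread.
genome_size = 3.1e9

# The formula as quoted in the post: log4(genome size / 2) + 5
default_threshold = math.log(genome_size / 2, 4) + 5
print(round(default_threshold, 2))
```

The result lands just above 20, consistent with the "20.xx" quoted in the post.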


Hi Aaron,
Thanks for your comment.
I landed in the field of NGS just two months ago, so my experience is limited; I used to work with microarrays before.
Could you give me more detail about your clustering approach to overcoming chimeric DNA? That would be helpful, as I have some experience with machine learning and could try to find out whether it can be improved.

Thanks to all of you and best wishes

Regards,
Ramzi
Old 01-05-2010, 04:28 AM   #6
ramouz87
Member
 
Location: Doha, Qatar

Join Date: Oct 2009
Posts: 35
Thumbs up

Hi,
Thanks for your suggestions.
I've run the analysis and a considerable fraction of the artefacts (98%) is now discarded by applying novoalign, but I still have 5861 pairs showing translocations.
I've attached the contingency table so you can get an idea.
Is there any other way to filter this data further?
Thanks again.

Regards,
Ramzi
Attached Files
File Type: txt after_novo_SV_table_PE_hg19.m1bs.txt (1.5 KB, 21 views)
Old 01-06-2010, 04:04 AM   #7
ramouz87
Member
 
Location: Doha, Qatar

Join Date: Oct 2009
Posts: 35
Bug in code (still a high number of artefacts even after novoalign)

Hi,
There was a small bug in the data fetching. After correcting it, it turns out that the number of artefacts only decreases from 303318 to 251374 (about 17% fewer), which is still a very high number of artefacts.
I've attached the contingency table so you can get an overview of how the reads map across chromosomes.
Thanks in advance for any suggestions.

Regards,
Ramzi
Attached Files
File Type: txt sv_S1_cancer_1.chimera.out.bedPE_CT_table_NOVO.txt (1.9 KB, 6 views)
Old 01-06-2010, 05:31 AM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Many people cluster aberrant reads with high mapping quality. But if SVs are your main interest, you should probably start digging into the literature (e.g. BreakDancer) and use a proper software package.
Old 01-07-2010, 01:59 AM   #9
ramouz87
Member
 
Location: Doha, Qatar

Join Date: Oct 2009
Posts: 35
Thumbs up

Hi Heng,
I wanted to use BreakDancer 2 months ago, but there was a problem converting the bam file (produced with bwa, then samtools) to a cfg file using the bam2cfg script. Hopefully the script has been updated in the new version of BreakDancer and I will be able to run it.
Thanks for your suggestions.
Regards,
Ramzi
Old 07-19-2010, 02:35 AM   #10
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 197
Default

I have just used breakdancer with bwa and it works 'fine' on Illumina 76 bp PE reads (I just plugged the Solexa reads directly into bwa; they are already in fastq).

One thing the documentation skipped is that you need to use sorted bams for breakdancer to work.
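A small sketch for checking what sort order a file declares before handing it to a downstream tool. It reads SAM header text (for a BAM file this can be dumped with `samtools view -H`) and looks at the `SO` tag on the `@HD` line. The function name is made up for illustration, and a missing tag does not prove the file is unsorted, since some tools do not write it.

```python
def declared_sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Toy header (made up for illustration).
header = ["@HD\tVN:1.0\tSO:coordinate", "@SQ\tSN:chr1\tLN:249250621"]
order = declared_sort_order(header)
print(order)
```

If the function returns anything other than `coordinate`, sorting the file with `samtools sort` first is the safer route.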

cheers

Tags
artefacts, paired-end, structure variation, translocation
