  • Unexpectedly high number of chromosomal translocations from paired-end data

    Hi,

    I'm carrying out an analysis of paired-end human cancer data (50 bp reads, 200-500 bp gap) generated on a GA II sequencer, in order to find fusion genes.

    For this dataset, I:
    · Align the reads as single ends with bowtie (-m1 --best --strata) to the hg19 reference, keeping only the best (unique) mapping for each read
    · Filter out reads with poly-A/T stretches longer than 20 bases
    · Match pairs of reads by their IDs
    · Remove duplicates
    · Keep pairs whose reads map to different chromosomes
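The filtering steps listed above can be sketched roughly as follows. This is only an illustration, assuming each single-end alignment has already been parsed into a (read_id, chrom, pos, strand) tuple and that mates share an ID apart from a trailing /1 or /2. The function names are hypothetical and are not part of bowtie.

```python
def base_id(read_id):
    """Strip a trailing /1 or /2 so both mates share one key (assumed ID convention)."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id

def has_long_poly_at(seq, max_run=20):
    """True if the read contains a run of A or T longer than max_run bases."""
    return "A" * (max_run + 1) in seq or "T" * (max_run + 1) in seq

def interchromosomal_pairs(hits_1, hits_2):
    """Pair uniquely mapped mates by ID, drop duplicates, keep cross-chromosome pairs.

    hits_1, hits_2: iterables of (read_id, chrom, pos, strand), one per mate file.
    """
    mate_2 = {base_id(rid): (chrom, pos, strand) for rid, chrom, pos, strand in hits_2}
    seen = set()   # coordinate signatures already emitted, for duplicate removal
    kept = []
    for rid, chrom, pos, strand in hits_1:
        key = base_id(rid)
        if key not in mate_2:
            continue   # the mate did not align uniquely
        end_a = (chrom, pos, strand)
        end_b = mate_2[key]
        if (end_a, end_b) in seen:
            continue   # same coordinates on both ends: treat as a duplicate
        seen.add((end_a, end_b))
        if end_a[0] != end_b[0]:
            kept.append((key, end_a, end_b))   # ends on different chromosomes
    return kept
```

Treating pairs with identical coordinates on both ends as duplicates is one common convention; the original post does not say which rule was used.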



    I get the attached contingency table, which reports the chromosome that each read of a pair maps to.
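For readers without the attachment, a contingency table of this kind could be built along these lines. It is a minimal sketch that assumes the pairs are available as (chromosome of read 1, chromosome of read 2) tuples; the function names are made up for illustration.

```python
from collections import Counter

def chromosome_contingency(pairs):
    """Count read pairs per (chromosome of read 1, chromosome of read 2)."""
    return Counter((chrom_1, chrom_2) for chrom_1, chrom_2 in pairs)

def format_table(counts):
    """Render the counts as a tab-separated chromosome-by-chromosome table."""
    chroms = sorted({chrom for pair in counts for chrom in pair})
    rows = ["\t".join([""] + chroms)]
    for row in chroms:
        cells = [str(counts.get((row, col), 0)) for col in chroms]
        rows.append("\t".join([row] + cells))
    return "\n".join(rows)
```

Cells that are much fuller than their neighbours are the ones worth inspecting first, since they point either to a real event or to a systematic artifact.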

    The table shows that the number of apparent chromosomal translocations is far higher than expected, so further filtering is needed to get rid of artifacts. However, I don't understand where these artifacts come from.

    Can you help me understand why there are so many artifacts?

    Thanks in advance.

    Regards,
    Ramzi
    Attached Files
    Research Scientist - Bioinformatics
    Sidra Medical and Research Center

  • #2
    Do you have any data regarding the number of multiple hits/ambiguous alignments you are seeing? You say you are taking unique best hits, but what if the next best one (e.g. with one mismatch) is where it should be relative to its pair mate? How many unmated pairs are you seeing (one read aligns but its mate does not at default bowtie parameters)?

    Have you tried doing a paired-end alignment using Bowtie and just subtracting the reads that align from the pool before doing your analysis?

    Have you tried this against refseq sequences instead of the genome?
    Last edited by Zigster; 12-29-2009, 09:33 AM.
    --
    Jeremy Leipzig
    Bioinformatics Programmer
    --
    My blog
    Twitter



    • #3
      I bet most of these translocations are misalignments. To find SVs, I would suggest two-phase alignment:

      1) Fast alignment: align PE reads with bowtie/bwa in the paired-end mode.

      2) Accurate alignment: align aberrant read pairs and singletons with a more accurate aligner such as novoalign. The aligner in use should be able to produce mapping quality.

      If you are mainly interested in translocations where both ends mapped to unique regions, you should set a high threshold on mapping quality (e.g. 35-40). I am not sure how people will do if repeats are involved. See this figure for why mapping quality helps to greatly reduce false alignments.
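As a rough illustration of the threshold mentioned above: mapping quality is Phred-scaled, so a value Q corresponds to an estimated probability of 10^(-Q/10) that the reported position is wrong. The sketch below assumes each pair is a dict with hypothetical keys `mapq_1` and `mapq_2`; it is not the output format of any particular aligner.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality to an estimated misplacement probability."""
    return 10 ** (-mapq / 10)

def both_ends_confident(pairs, min_mapq=35):
    """Keep pairs in which both ends reach min_mapq (35 is the low end of the range above)."""
    return [
        pair for pair in pairs
        if pair["mapq_1"] >= min_mapq and pair["mapq_2"] >= min_mapq
    ]
```

Requiring the threshold on both ends matters here, because a single misplaced end is enough to make a normal pair look like a translocation.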



      • #4
        I second lh3's suggestion. This is nearly identical to the approach I use. One further caveat I should mention is that even after using BWA and Novoalign, there can remain pairs that appear to be aberrant owing to misalignment or chimeric molecules. To mitigate the latter, I cluster aberrant pairs (say, requiring two or more supporting pairs) under the assumption that chimeras occur randomly. I then realign the supporting pairs in all clusters with megablast or something similar (using ridiculously sensitive settings).
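One simple way to picture the clustering idea is to bin both ends of each aberrant pair and keep only the bin combinations supported by at least two pairs. This is a sketch under assumptions, not the poster's actual method: the 1000-base window and the tuple layout are hypothetical choices.

```python
from collections import defaultdict

def cluster_aberrant_pairs(pairs, window=1000, min_support=2):
    """Group pairs whose two ends fall into the same pair of genomic bins.

    pairs: iterable of (chrom_a, pos_a, chrom_b, pos_b).
    window: bin width in bases (an assumed value; tune it to the insert-size spread).
    """
    clusters = defaultdict(list)
    for chrom_a, pos_a, chrom_b, pos_b in pairs:
        end_a = (chrom_a, pos_a // window)
        end_b = (chrom_b, pos_b // window)
        # Order the two ends so that A->B and B->A pairs land in the same cluster.
        key = tuple(sorted([end_a, end_b]))
        clusters[key].append((chrom_a, pos_a, chrom_b, pos_b))
    # Randomly occurring chimeras should rarely recur, so singletons are dropped.
    return {key: members for key, members in clusters.items() if len(members) >= min_support}
```

Fixed bins can split a genuine cluster that straddles a bin boundary, so a real implementation would more likely merge ends by distance.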


        Also, are you sure they are suggesting translocations? They can also be retrotransposon insertions that have occurred in your test DNA, but are not present in the reference genome. AluYs, LINEs and SVAs are still active.

        Aaron



        • #5
          Originally posted by Zigster
          Do you have any data regarding the number of multiple hits/ambiguous alignments you are seeing? You say you are taking unique best hits, but what if the next best one (e.g. with one mismatch) is where it should be relative to its pair mate? How many unmated pairs are you seeing (one read aligns but its mate does not at default bowtie parameters)?

          Have you tried doing a paired-end alignment using Bowtie and just subtracting the reads that align from the pool before doing your analysis?

          Have you tried this against refseq sequences instead of the genome?
          Hi Zigster,
          Thanks for your answer.
          I'm new to the field, so I'm still experimenting with aligners and trying to understand how they work.
          I've changed the settings to -m1 -n0; with these options we keep only reads that align to a unique position in the reference with no mismatch. We get the following statistics:
          for s_1_1_sequence.fq
          # reads processed: 16479658

          # reads with at least one reported alignment: 10592189 (64.27%)

          # reads that failed to align: 3406969 (20.67%)

          # reads with alignments suppressed due to -m: 2480500 (15.05%)

          for s_1_2_sequence.fq
          # reads processed: 16479673

          # reads with at least one reported alignment: 10372746 (62.94%)

          # reads that failed to align: 3704063 (22.48%)

          # reads with alignments suppressed due to -m: 2402864 (14.58%)

          When aligning in paired-end mode with -m1 -n0 -X1000 (-X sets the maximum insert size), I got a very poor alignment rate:
          # reads processed: 16479658

          # reads with at least one reported alignment: 947283 (5.75%)

          # reads that failed to align: 15495410 (94.03%)

          # reads with alignments suppressed due to -m: 36965 (0.22%)

          This is surprising, because if I take the single-end alignments and match them by read ID, more than 3.2 million pairs remain even after filtering out duplicates and poly(A/T) reads.
          I've attached 2 plots of the gap between reads.
          In any case, pairs that fall within the normal range are automatically set aside, as are all pairs mapping to the same chromosome.
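The two single-end runs above report different numbers of processed reads (16479658 and 16479673). Nothing in the thread confirms this as the cause of the poor paired-end result, but it may be worth ruling out that the two mate files are out of step, since paired-end mode expects the n-th record of each file to be mates. A minimal check, assuming 4-line FASTQ records and IDs that differ only by a trailing /1 or /2:

```python
import io
from itertools import zip_longest

def fastq_ids(handle):
    """Yield read IDs from a 4-line-per-record FASTQ stream, without the /1 or /2 suffix."""
    for index, line in enumerate(handle):
        if index % 4 != 0:
            continue   # only the first line of each record carries the ID
        header = line.strip()
        if not header:
            continue   # tolerate a trailing blank line
        name = header[1:].split(" ")[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name

def first_mismatch(handle_1, handle_2):
    """Return (record_index, id_1, id_2) for the first out-of-step record, or None."""
    ids = zip_longest(fastq_ids(handle_1), fastq_ids(handle_2))
    for record_index, (id_1, id_2) in enumerate(ids):
        if id_1 != id_2:
            return record_index, id_1, id_2
    return None

# Tiny in-memory demo; real use would pass open file handles instead.
demo_1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
demo_2 = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r3/2\nTTTT\n+\nIIII\n")
print(first_mismatch(demo_1, demo_2))
```

If one file is simply longer than the other, the function reports the first record that has no partner, with None in place of the missing ID.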

          Originally posted by lh3
          I bet most of these translocations are misalignments. To find SVs, I would suggest two-phase alignment:

          1) Fast alignment: align PE reads with bowtie/bwa in the paired-end mode.

          2) Accurate alignment: align aberrant read pairs and singletons with a more accurate aligner such as novoalign. The aligner in use should be able to produce mapping quality.

          If you are mainly interested in translocations where both ends mapped to unique regions, you should set a high threshold on mapping quality (e.g. 35-40). I am not sure how people will do if repeats are involved. See this figure for why mapping quality helps to greatly reduce false alignments.
          Hi lh3,
          Thanks for your reply.
          As mentioned above, for some reason the paired-end alignment with bowtie is giving an unexpected result.
          I was thinking of shortcutting step one by taking only the IDs of reads that map to different chromosomes in my analysis, extracting those reads from the FASTQ files, and running novoalign on that selection. Do you think that's a good idea?
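The extraction step described here could look roughly like the following. It is only a sketch, assuming plain 4-line FASTQ records and IDs that may carry a trailing /1 or /2; the function name is hypothetical.

```python
def extract_fastq_records(lines, wanted_ids):
    """Yield 4-line FASTQ records whose ID (without /1 or /2) is in wanted_ids.

    lines: any iterable of text lines, for example an open file handle.
    wanted_ids: a set of read IDs, so each membership test is fast.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:].split(" ")[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            if name in wanted_ids:
                yield record
            record = []
```

Running it once per mate file with the same ID set keeps the two output files in the same order, which matters if the selection is realigned in paired-end mode.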
          For the mapping quality, is it the -l parameter in novoalign that should be set to 35-40?
          The default is log4(hg size / 2) + 5 = 20.xx
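The arithmetic in the expression above can be checked quickly. This only evaluates the formula as quoted in the post; the genome size of about 3.1e9 bases is an assumed approximation for hg19, and whether this formula matches novoalign's actual default is not verified here.

```python
import math

def log4_threshold(genome_size):
    """Evaluate log4(genome_size / 2) + 5, the expression quoted above."""
    return math.log(genome_size / 2, 4) + 5

# Assumed genome size of roughly 3.1e9 bases (an approximation, not from the thread).
print(round(log4_threshold(3.1e9), 2))
```

With that assumed size the result falls between 20 and 21, which is consistent with the 20.xx quoted.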


          Originally posted by quinlana
          I second lh3's suggestion. This is nearly identical to the approach I use. One further caveat I should mention is that even after using BWA and Novoalign, there can remain pairs that appear to be aberrant owing to misalignment or chimeric molecules. To mitigate the latter, I cluster aberrant pairs (say, requiring two or more supporting pairs) under the assumption that chimeras occur randomly. I then realign the supporting pairs in all clusters with megablast or something similar (using ridiculously sensitive settings).


          Also, are you sure they are suggesting translocations? They can also be retrotransposon insertions that have occurred in your test DNA, but are not present in the reference genome. AluYs, LINEs and SVAs are still active.

          Aaron
          Hi Aaron,
          Thanks for your comment.
          I only moved into NGS two months ago, so my experience is limited; I used to work with microarrays.
          Could you give me more detail about your clustering approach for dealing with chimeric DNA? That would be helpful, as I have some experience with machine learning and could try to find out whether it can be improved.

          Thanks to all of you and best wishes

          Regards,
          Ramzi
          Attached Files
          Research Scientist - Bioinformatics
          Sidra Medical and Research Center



          • #6
            Hi,
            Thanks for your suggestions.
            I've run the analysis, and a considerable share of the artefacts (98%) is now discarded by applying novoalign, but I still have 5861 read pairs showing translocations.
            I've attached the contingency table so you can get an idea.
            Is there any other way to filter this data further?
            Thanks again.

            Regards,
            Ramzi
            Attached Files
            Research Scientist - Bioinformatics
            Sidra Medical and Research Center



            • #7
              Bug in code (still a high number of artefacts even after novoalign)

              Originally posted by ramouz87
              Hi,
              Thanks for your suggestions.
              I've run the analysis, and a considerable share of the artefacts (98%) is now discarded by applying novoalign, but I still have 5861 read pairs showing translocations.
              I've attached the contingency table so you can get an idea.
              Is there any other way to filter this data further?
              Thanks again.

              Regards,
              Ramzi
              Hi,
              There was a small bug in the data fetching. After correcting it, the number of artefacts decreased from 303318 to 251374 (about 17% fewer), but that is still a very high number.
              I've attached the contingency table so you can get an overview of how the reads map across chromosomes.
              Thanks in advance for any suggestions.

              Regards,
              Ramzi
              Attached Files
              Research Scientist - Bioinformatics
              Sidra Medical and Research Center



              • #8
                Many people cluster aberrant reads with high mapping quality. But you should probably start digging into the literature (e.g. BreakDancer) and use a proper software package if SVs are your main interest.



                • #9
                  Hi Heng,
                  I wanted to use BreakDancer two months ago, but there was a problem converting the BAM file (produced with bwa, then samtools) to a cfg file using the bam2cfg script. Hopefully there is a new version of BreakDancer where the script was updated, and I hope I'll be able to run it.
                  Thanks for your suggestions.
                  Regards,
                  Ramzi
                  Research Scientist - Bioinformatics
                  Sidra Medical and Research Center



                  • #10
                    I have just used BreakDancer with bwa and it works 'fine'
                    with Illumina 76 bp PE reads (I just fed the Solexa reads directly into bwa, as they are already in FASTQ format).

                    One thing the documentation skipped is that you need to use sorted BAMs for BreakDancer to work.
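One quick way to see what sort order a BAM claims is to read the SO tag on the @HD line of its header. The sketch below assumes the header text has already been obtained separately (for example with `samtools view -H`); the function name is hypothetical. Note that the SO tag is optional and only records what the producing tool declared, so it is a hint rather than proof.

```python
def sam_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None
```

Typical values are `unsorted`, `queryname` and `coordinate`; the post does not say which order BreakDancer needs, so that should be checked against its documentation.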

                    cheers
                    http://kevin-gattaca.blogspot.com/
