  • Problem with alignment: I can only align 10% of reads (CLIP data, tophat/bowtie)

    Dear all,

    I'm new to CLIP analysis, so I want to work through a CLIP data processing pipeline to learn how the data are processed and perhaps improve part of the pipeline in the future.
    I got the data from GEO: GSE41288. It's a HITS-CLIP dataset in which the authors aim to reveal miR-155-dependent AGO protein binding sites. But when I tried to align the reads to the mm9 genome, I could map only 10% of them using bowtie or tophat. The commands I used are as follows.

    tophat -p 8 --read-mismatches 5 --read-edit-dist 5 -o /output/MapResult/${name} /data/mm9/mm9 /data/miR155/FASTQ/${i}

    bowtie -n 3 -e 150 -l 20 -p 8 /data/mm9/mm9 /data/miR155/FASTQ/${i} --un /output/BowtieResult_new/${name}/${name}.not_hit.fastq > /output/BowtieResult_new/${name}/${name}.hit.sam

    I think I already set the threshold of mismatches quite high. Could someone give me some suggestions?

    Thanks

    Yue

  • #2
    I'm sorry, I found that I should trim 6 nucleotides from the 5' end of each read.
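
The trimming step mentioned above could look roughly like the following. This is a minimal sketch, not the authors' pipeline: it assumes plain 4-line FASTQ records, and the function names (`read_fastq`, `trim_fastq`) and the `min_len` cutoff are made up for illustration. The number of bases to drop is a parameter, since the right value depends on the library design.

```python
from typing import Iterable, Iterator, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; an incomplete trailing record is dropped."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups consecutive lines by four.
    for header, seq, sep, qual in zip(it, it, it, it):
        yield header, seq, sep, qual


def trim_fastq(lines: Iterable[str], n: int = 6, min_len: int = 1) -> Iterator[FastqRecord]:
    """Remove the first n bases (and their qualities) from every read.

    Reads shorter than min_len after trimming are skipped (an assumption;
    adjust or remove this filter as needed).
    """
    for header, seq, sep, qual in read_fastq(lines):
        trimmed_seq, trimmed_qual = seq[n:], qual[n:]
        if len(trimmed_seq) >= min_len:
            yield header, trimmed_seq, sep, trimmed_qual
```

To use it, pass an open file handle as `lines` and write each returned record back out as four lines.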

    • #3
      Update:
      After adopting the parameters given on the dataset's webpage, I still get only about 10% of reads mapped to the genome. Is this normal for CLIP data?

      • #4
        If this is a published data set have you tried to follow the method authors describe in their publication?

        • #5
          Originally posted by GenoMax
          If this is a published data set have you tried to follow the method authors describe in their publication?
          Yes, I used the parameters they describe. They just discard a 6-nucleotide barcode at the 5' end.

          • #6
            This is a perpetual bioinformatics data reproducibility issue (assuming the directions/settings are clear and you are exactly following them).

            You are probably using the latest tophat/bowtie etc, which may not match what the authors used at the time of publication. You could go down the path of exactly matching the versions but not sure if that would be worth the trouble.

            Looks like you are going to have to re-do the analysis again.

            • #7
              Originally posted by GenoMax
              This is a perpetual bioinformatics data reproducibility issue (assuming the directions/settings are clear and you are exactly following them).

              You are probably using the latest tophat/bowtie etc, which may not match what the authors used at the time of publication. You could go down the path of exactly matching the versions but not sure if that would be worth the trouble.

              Looks like you are going to have to re-do the analysis again.
              Ok, I'll try. Thanks.

              • #8
                You could try running fastqc, to check for the presence of any remaining adapter sequences or very low quality bases that should be trimmed before aligning.
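
FastQC produces a full report; as a rough stand-in for the two checks mentioned above, something like this could be run on the FASTQ file. It is only a sketch under assumptions: qualities are taken to be Phred+33 encoded (older Illumina data may use an offset of 64), the adapter sequence must be supplied by the user, and both function names are hypothetical.

```python
from typing import Iterable, List


def mean_quality_by_position(lines: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (quality line = every 4th line)."""
    totals: List[int] = []
    counts: List[int] = []
    for i, line in enumerate(lines):
        if i % 4 != 3:  # only quality lines
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def fraction_with_adapter(lines: Iterable[str], adapter: str) -> float:
    """Fraction of reads whose sequence contains the given adapter string."""
    total = hits = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence lines
            total += 1
            if adapter in line:
                hits += 1
    return hits / total if total else 0.0
```

A sharp drop in mean quality toward the 3' end, or a large fraction of reads containing the adapter, would both suggest trimming before alignment.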

                • #9
                  Update:
                  After trimming the first 6 nucleotides, I tried tophat and novoalign, which are able to map junction reads, but their results are quite different. For one replicate, tophat reports only 2 million mapped reads while novoalign reports about 15 million. Which should I believe? I used default parameters for both.
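
One thing worth ruling out when two aligners report very different totals is whether the numbers count the same thing (alignment records versus distinct reads). The sketch below counts both from SAM text; it assumes single-end reads and SAM input (TopHat writes BAM, so the text would have to come from a converter such as `samtools view -h`), and the function name is made up for illustration.

```python
from typing import Iterable, Tuple


def count_mapped(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (mapped alignment records, distinct mapped read names).

    A multi-mapped read can appear in several records, so the first number
    may exceed the second. Assumes single-end data, where each read name
    identifies exactly one read.
    """
    records = 0
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:  # SAM flag bit 0x4: read is unmapped
            continue
        records += 1
        names.add(fields[0])
    return records, len(names)
```

Comparing the distinct-read count from each aligner against the number of reads in the input FASTQ gives mapping rates that are directly comparable.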

                  • #10
                    You should be using parameters described in the original paper otherwise there is no chance of replicating the result.

                    Since you are going to do an independent analysis with your samples you should set a pipeline up that works for you. Remember to adequately describe (version numbers, settings) when you publish.

                    As an outside chance it is always possible that the original publication has an error in the analysis. You could correspond with the authors (making it clear that you are only trying to adapt their pipeline for your use) and see if they can provide some additional clarification on what is going on.

                    • #11
                      Originally posted by GenoMax
                      You should be using parameters described in the original paper otherwise there is no chance of replicating the result.

                      Since you are going to do an independent analysis with your samples you should set a pipeline up that works for you. Remember to adequately describe (version numbers, settings) when you publish.

                      As an outside chance it is always possible that the original publication has an error in the analysis. You could correspond with the authors (making it clear that you are only trying to adapt their pipeline for your use) and see if they can provide some additional clarification on what is going on.
                      Thanks for your suggestions.
                      I'll re-read the paper and do exactly what they did.

                      • #12
                        Sounds like you have spent enough time working on this data so no harm in checking with the authors. Most will be more than happy to help as long as you ask nicely.

                        • #13
                          Originally posted by GenoMax
                          Sounds like you have spent enough time working on this data so no harm in checking with the authors. Most will be more than happy to help as long as you ask nicely.
                          Yes, I thought about it... but I'm afraid the problem is too naive.
                          I'll e-mail the authors if I again fail to map most of the reads.
                          Thank you. You're very kind.

                          • #14
                            Hi,

                            are you sure you have to discard only the first 6 nucleotides? Usually for CLIP, people add more nucleotides, meaning 4 N (which allow colony recognition if the library was sequenced with Illumina tech), and then the barcode...

                            Quite easy to check: just take the first 10 nucleotides of all the reads and count the different sequences you get...

                            edit: I just checked, it was sequenced with Illumina tech...
                            Last edited by SylvainL; 11-09-2015, 08:11 AM.
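
The check suggested above (take the first 10 nucleotides of every read and count the distinct sequences) might be sketched like this. It assumes plain 4-line FASTQ input; the function names are illustrative, and the per-position composition is an extra not mentioned in the post.

```python
from collections import Counter
from typing import Iterable, List


def prefix_counts(lines: Iterable[str], k: int = 10) -> Counter:
    """Count how often each k-nucleotide read prefix occurs."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence line of each FASTQ record
            counts[line.rstrip("\n")[:k]] += 1
    return counts


def position_composition(lines: Iterable[str], k: int = 10) -> List[Counter]:
    """Base counts at each of the first k read positions."""
    comp: List[Counter] = [Counter() for _ in range(k)]
    for i, line in enumerate(lines):
        if i % 4 == 1:
            for pos, base in enumerate(line.rstrip("\n")[:k]):
                comp[pos][base] += 1
    return comp
```

Positions dominated by a single base in every read would point to a fixed barcode, while positions with a near-even mix of A, C, G and T would be consistent with random N bases; `prefix_counts(...).most_common(20)` shows the most frequent prefixes.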
