  • Question on SOLiD peak length

    Hi,

    I am new to analyzing SOLiD sequencing data. In our data, which is ChIP-seq for histone modifications,
    the maximum peak length detected was about 150 bp;
    there were only about 18 peaks longer than 100 bp;
    and about 3500 peaks shorter than 100 bp, of which 3400 are about 51 bp.

    I think this is unusual, since histone pull-down data should give wider peaks.
    Am I correct, or is this kind of result possible in certain cases?

    Thanks for your suggestions.

    sridhar

  • #2
    I think you would need to give us more details about your experiment, especially the size distribution of the SOLiD fragment library created for sequencing and the size of the chromatin you used for your IP.

    --
    Phillip


    • #3
      Hi Phillip,

      Thanks for the reply. It took some time to get the details.
      ---The size of the chromatin used for IP is about 300 to 500 bp.
      ---The size of the SOLiD fragment library created for sequencing is between 150 and 250 bp (including the 50 bp of adapters).


      • #4
        What method did you use to fragment your chromatin to 300-500 bp? Also, after IP, what method (if any) was used to further fragment the IPed DNA down into the size range you give above?

        --
        Phillip


        • #5
          Hi Phillip,

          For both steps, the method used was shearing by sonication.

          Specific information for the second step (fragmentation of the IPed DNA):

          Sonication with a Covaris S2 System; Intensity: 5; Cycles/burst: 100; Time: 60 seconds


          • #6
            Seems to me that if the sizes of your IPed chromatin were 300-500 bp, then your SOLiD peak widths should be a minimum of 600-1000 bp. But this is not from personal experience, so I could be missing a key point here.

            How many reads and how many starts does your data set contain?

            --
            Phillip


            • #7
              Hi Phillip,

              There were:

              Number of reads: 15,277,642
              Matched: 1,794,556 (11.72 %)

              There were 3491 starts.

              About 95% of the peaks were of length 51.


              • #8
                By "starts", I mean start points of mapping on your reference sequence. So you can't have fewer starts than you have peaks.

                Any idea why your mapping % was so low? Is your reference sequence the same species as what was sequenced?

                --
                Phillip


                • #9
                  Yes, by 3491 starts, I was referring to the number of peaks. Yes, the reference was the same species (human).
                  For the other runs (input, i.e. the genomic DNA control), the peaks look normal.


                  • #10
                    So, what I was getting at with the "starts" question is the possibility of duplicate reads. Any read that aligns to exactly the same position in the genome as another is giving you duplicate data at that position. So one stat the SOLiD pipeline usually provides is the total number of "starts", that is, the total number of starting points in the reference sequence where a read started an alignment.

                    If you have large numbers of mapped reads but low numbers of starts, then that could be an indication that your library was bottlenecked at some step. Via PCR you can end up massively amplifying a small number of amplicons.

                    So, could you find the number of starts for the mapping?

                    Also, again, any idea what the deal is with the ~90% of your reads that do not map to your reference sequence?
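
To make the "starts" idea concrete, here is a minimal Python sketch. It is not part of the SOLiD pipeline; the function name `start_point_stats`, the (reference, strand, start) tuple layout, and the toy data are all assumptions made for illustration. It counts how many distinct start points a set of alignments has and how many reads pile up on each one on average.

```python
from collections import Counter

def start_point_stats(alignments):
    """Summarize start points for an iterable of (reference, strand, start) tuples.

    Returns (number of reads, number of distinct starts, mean reads per start).
    """
    counts = Counter(alignments)          # reads observed at each distinct start
    n_reads = sum(counts.values())
    n_starts = len(counts)
    reads_per_start = n_reads / n_starts if n_starts else 0.0
    return n_reads, n_starts, reads_per_start

# Hypothetical toy data: 9 reads that collapse onto only 3 start points.
toy = [("chr1", "+", 100)] * 5 + [("chr1", "+", 250)] * 3 + [("chr2", "-", 900)]

n_reads, n_starts, reads_per_start = start_point_stats(toy)
print(n_reads, n_starts, reads_per_start)
```

A library with good complexity would show a reads-per-start value close to 1 at this kind of coverage; a value far above 1 points to many duplicates.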
                    --
                    Phillip


                    • #11
                      No idea about the non-mapping reads. We have outsourced the analysis, so I don't have a clear idea about the details.

                      Do you know of any R/Bioconductor package for SOLiD data analysis? Basically, can we do SOLiD data analysis in Bioconductor?


                      • #12
                        You don't get mapping rates that low unless there is an issue somewhere, either with the mapping itself or with the lab work. I suggest taking 100, or even 10, of the highest quality reads, converting them to base space (something I would normally not recommend) and just BLASTing them.

                        I don't know if there is an R/Bioconductor package for SOLiD data analysis. But if you look here for applications:



                        --
                        Phillip


                        • #13
                          Hi Phillip,

                          Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on some of these aspects?

                          What is the difference between a bead and a read? There were only 1788027 (12%) mapped reads out of 15030084 beads.

                          The number of start points within uniquely placed tags was 6684, with an average number of reads per start of 210.

                          There is another value: the number of start points within uniquely and randomly placed tags = 77681, with 23 reads per start point.

                          What is the difference between these two? Is the second one the start points for reads matching at multiple locations?

                          I paste below the output of the stats file.
                          ######################################################
                          15030084 total beads found

                          1788027 Mapped Reads using parameter settings listed below.
                          Mapped Reads at Read Length 50
                          0 mismatches 758839 ( 42.44%)
                          1 mismatches 234002 ( 13.09%) 992841 ( 55.53%)
                          2 mismatches 267037 ( 14.93%) 1259878 ( 70.46%)
                          3 mismatches 150956 ( 8.44%) 1410834 ( 78.90%)
                          4 mismatches 148757 ( 8.32%) 1559591 ( 87.22%)
                          5 mismatches 112738 ( 6.31%) 1672329 ( 93.53%)
                          6 mismatches 115698 ( 6.47%) 1788027 (100.00%)


                          Uniquely Placed Beads
                          0 mismatches 628667 ( 35.16%)
                          1 mismatches 195373 ( 10.93%) 824040 ( 46.09%)
                          2 mismatches 176500 ( 9.87%) 1000540 ( 55.96%)
                          3 mismatches 115120 ( 6.44%) 1115660 ( 62.40%)
                          4 mismatches 105651 ( 5.91%) 1221311 ( 68.30%)
                          5 mismatches 89627 ( 5.01%) 1310938 ( 73.32%)
                          6 mismatches 92050 ( 5.15%) 1402988 ( 78.47%)


                          Valid Adjacents within Uniquely Placed Beads
                          0 valid adjacents 1200996 ( 7.99%)
                          1 valid adjacents 177447 ( 1.18%) 1378443 ( 9.17%)
                          2 valid adjacents 21377 ( 0.14%) 1399820 ( 9.31%)
                          3 valid adjacents 3168 ( 0.02%) 1402988 ( 9.33%)



                          Errors within Uniquely Placed Tags
                          Total Errors 2316772
                          Single Errors 1525950 (65.87% of Total)
                          Adjacent Errors 395411 (34.13% of total)
                          Valid 229705 (19.83% of Total) (58.09% of Adjacent Errors)
                          Invalid 165706 (14.30% of Total) (41.91% of Adjacent Errors)



                          Starting Points within Uniquely Placed Tags
                          Number of Starting Points 6684
                          Average Number of Reads per Start Point 209.90

                          Starting Points within Uniquely and Randomly Placed Tags
                          Number of Starting Points 77681
                          Average Number of Reads per Start Point 23.02



                          Coverage of Uniquely Placed Tags
                          Bases Not Covered 3095367910(99.99%)
                          ###########################################################


                          • #14
                            Originally posted by sridharacharya
                            Hi Phillip,

                            Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on some of these aspects?

                            What is the difference between a bead and a read? There were only 1788027 (12%) mapped reads out of 15030084 beads.

                            In this context, a bead is a read. No difference.

                            Originally posted by sridharacharya
                            The number of start points within uniquely placed tags was 6684, with an average number of reads per start of 210.

                            The simplest explanation here is that either the amount of DNA used for the experiment was limiting, or there was a bottleneck in library construction caused by poor yields at one or more steps. As a result, a small number of amplicons came to compose nearly all of your mappable reads.
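
As a quick sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. The values are copied from the stats file posted above; the variable names are just illustrative, and treating every read beyond the first at a start point as a duplicate is a simplifying assumption, not something the pipeline reports.

```python
# Values copied from the Corona-Lite stats output in post #13.
total_beads = 15030084      # "total beads found"
mapped_reads = 1788027      # "Mapped Reads"
unique_reads = 1402988      # uniquely placed beads (cumulative, 0-6 mismatches)
unique_starts = 6684        # starting points within uniquely placed tags

mapped_fraction = mapped_reads / total_beads
reads_per_start = unique_reads / unique_starts

# Simplifying assumption: one read per start point is "original",
# every additional read at that start is counted as a duplicate.
duplicate_fraction = 1 - unique_starts / unique_reads

print(f"mapped: {mapped_fraction:.1%}")
print(f"reads per start: {reads_per_start:.2f}")
print(f"apparent duplicates: {duplicate_fraction:.1%}")
```

Under that assumption, more than 99% of the uniquely placed reads are redundant, which fits the bottleneck explanation.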


                            Originally posted by sridharacharya
                            There is another value: the number of start points within uniquely and randomly placed tags = 77681, with 23 reads per start point.

                            What is the difference between these two? Is the second one the start points for reads matching at multiple locations?

                            "Uniquely" placed means that the read mapped to a single location in your reference sequence. If a read maps to more than one position in your reference sequence (as it would if it mapped to repetitive DNA, for example), then it could have originated from multiple places in the genome and there is no way to place it uniquely. It may nevertheless be placed in one of the multiple positions it maps to, by choosing randomly among the possible mapping positions.

                            You still have one unresolved issue. Where do the other 88% of your reads derive from? Here are some possibilities:

                            (1) Your ChIP DNA was contaminated with DNA from another source. Commonly this might be yeast, because yeast tRNAs were used as a co-precipitant in some step and the tRNAs were contaminated with yeast genomic DNA. If you did not use yeast tRNAs as a co-precipitant, this is unlikely.

                            (2) The sequence was of very low quality and therefore most of the reads could not be mapped. From the distribution of mapping stats, this does not seem likely to me. But it is possible. You might ask your sequencing facility for the cycle scan report for your reads, specifically the % of good plus best beads for each of the 50 ligations.

                            (3) The amount of ChIP DNA was substantially less than the 10 ng minimum called for in the frag library construction protocol, and/or what DNA was there was non-ligatable/non-replicatable due to some sort of damage to the DNA. Various issues result in these circumstances. Those fragments that do happen to be usable come to compose the majority of the library molecules because they are all that will amplify via PCR.

                            (4) Alternatively to (3), most biologically-derived agents (e.g., enzymes) are contaminated with DNA/RNA from their host strain. Normally this small amount of contamination is swamped out by the sample DNA, but in cases where the sample DNA is limiting, the small amount of contaminating DNA ends up being a significant part of the library.

                            (5) During library preparation your DNA sample became contaminated with SOLiD amplicons from a previous experiment from another organism.


                            If you can run the Corona-Lite pipeline, I would suggest running it with E. coli as a reference sequence. If this does not result in a high number of hits, I would suggest choosing 10-100 of your highest quality reads and converting them to base space. Normally one does not want to do this, because a single sequence error will result in all bases downstream from that error being incorrect as well. But in this circumstance it is warranted. BLAST these sequences against a large database, like "nt". This should help you determine what went wrong.
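
For the color-space to base-space step, a minimal Python sketch follows. It assumes reads in csfasta form (a leading primer base followed by color calls 0-3, with "." for a missing call) and the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function name `decode_colorspace` and the example reads are made up for illustration; this is not the Corona-Lite converter.

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}

def decode_colorspace(cs_read):
    """Decode a csfasta-style read such as 'T0123' into base space.

    The leading primer base comes from the adapter, so it seeds the
    decoding but is not included in the returned sequence. After a
    missing or invalid color call, all remaining bases become 'N'.
    """
    previous = CODE[cs_read[0]]
    decoded = []
    for color in cs_read[1:]:
        if previous is None or color not in "0123":
            previous = None               # decoding cannot recover past this point
            decoded.append("N")
            continue
        previous ^= int(color)            # color = previous base code XOR next base code
        decoded.append(BASES[previous])
    return "".join(decoded)

print(decode_colorspace("T0123"))  # correct read
print(decode_colorspace("T0023"))  # same read with one wrong color call
```

Comparing the two printed sequences shows why this conversion is normally avoided: a single wrong color call changes every base from that point to the end of the read.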

                            Finally, as a general note: sometimes it is better to move on than to spend weeks or months figuring out what went wrong with an experiment. It is very rare that you can publish "what did not work" experiments. But this can be a difficult decision to make. If you have an adviser or mentor, you might want to consult them.

                            One possibility is just to use the limited data you have to move onto a validation experiment. But again, there are strategic issues to consider before making this sort of decision.

                            --
                            Phillip
                            Last edited by pmiguel; 07-27-2010, 03:43 AM.


                            • #15
                              Phillip,

                              Thanks a lot for your suggestions, which have pointed the way forward for me. Yes, I will have to discuss with my advisor and collaborators what the best course of action would be.

                              Thanks again. I greatly appreciate your insight.

                              sridhar
