Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
how to trim solid reads length? | lei | Bioinformatics | 7 | 12-14-2012 08:55 AM |
ChIP-seq peak length varies from 0 to 381 | rahulr1485 | Bioinformatics | 0 | 08-11-2011 10:37 AM |
Long peak length from ChIP-seq data | Chiper | Epigenetics | 12 | 03-17-2010 04:08 PM |
Long peak length from ChIP-seq data | Chiper | SOLiD | 1 | 12-25-2009 11:20 PM |
read length of SOLiD and Solexa | seqAll | General | 8 | 12-16-2009 05:50 AM |
#1
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi,

I am new to analyzing SOLiD sequencing data. In our data, which is ChIP-seq for histone modifications, the maximum peak length detected was about 150 bp. There were only about 18 peaks longer than 100 bp and about 3,500 peaks shorter than 100 bp, of which roughly 3,400 are about 51 bp. I think this is unusual, since a histone pull-down should give wider peaks. Am I correct, or is this kind of result possible in certain cases?

Thanks for your suggestions,
sridhar
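As a quick way to check a distribution like this, peak widths can be tabulated straight from the peak caller's BED output. A minimal sketch in Python; the file name peaks.bed is a placeholder, and only the standard chrom/start/end columns are assumed:

from collections import Counter

# Tally peak widths from a BED file of called peaks.
# BED intervals are 0-based, half-open, so width = end - start.
widths = []
with open("peaks.bed") as bed:                      # placeholder file name
    for line in bed:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        widths.append(end - start)

print("peaks:", len(widths))
print("max width (bp):", max(widths))
print("peaks > 100 bp:", sum(w > 100 for w in widths))
print("five most common widths:", Counter(widths).most_common(5))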
#2
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
I think you would need to give us more details about your experiment, especially the size distribution of the SOLiD fragment library created for sequencing and the size of the chromatin you used for your IP.
-- Phillip

#3
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

Thanks for the reply. It took some time to get the details.
- The size of the chromatin used for IP is about 300 to 500 bp.
- The size of the SOLiD fragment library created for sequencing is between 150 and 250 bp, including the 50 bp of adapters.
#4
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
What method did you use to fragment your chromatin to 300-500 bp? Also, after IP, what method (if any) was used to further fragment the IPed DNA down into the size range you give above?
-- Phillip

#5
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

For both steps, the method used was shearing by sonication. Specifically, for the second step (fragmenting the IPed DNA), sonication was done on a Covaris S2 system: intensity 5, 100 cycles/burst, 60 seconds.
#6
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
Seems to me that if the sizes of your IPed chromatin fragments were 300-500 bp, then your SOLiD peak widths should be a minimum of 600-1000 bp. But this is not from personal experience, so I could be missing a key point here.

How many reads and how many starts does your data set contain?

-- Phillip
#7
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

There were:
Number of reads: 15,277,642
Matched: 1,794,556 (11.72%)

There were 3,491 starts. About 95% of the peaks were of length 51.
#8
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
By "starts" I mean start points of mapping on your reference sequence, so you can't have fewer starts than you have peaks.

Any idea why your mapping % was so low? Is your reference sequence the same species as what was sequenced?

-- Phillip
#9
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Yes, by 3,491 starts I was referring to the number of peaks. And yes, the reference was the same species (human).

For the other runs (the input genomic DNA control), the peaks look normal.
#10
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
So, what I was getting at with the "starts" question is the possibility of duplicate reads. Any read that aligns to exactly the same position in the genome as another is giving you duplicate data at that position. So one stat the SOLiD pipeline usually provides is the total number of "starts", that is, the total number of positions in the reference sequence where a read alignment started.

If you have large numbers of mapped reads but low numbers of starts, that could be an indication that your library was bottlenecked at some step: via PCR you can end up massively amplifying a small number of amplicons. So, could you find the number of starts for the mapping? Also, again, any idea what the deal is with the ~90% of your reads that do not map to your reference sequence?

-- Phillip
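The same check can be run directly on the alignments. A rough sketch, assuming the SOLiD mappings have already been converted to a BAM file (chip.bam is a placeholder name) and that pysam is installed:

import pysam

# Count distinct alignment start points (chromosome, strand, 5' position).
# Many mapped reads but few distinct starts suggests PCR duplication,
# i.e. a bottlenecked library. 'chip.bam' is a placeholder file name.
starts = set()
mapped = 0
with pysam.AlignmentFile("chip.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped:
            continue
        mapped += 1
        # use the read's 5' end as its start point
        pos = read.reference_end if read.is_reverse else read.reference_start
        starts.add((read.reference_name, read.is_reverse, pos))

print("mapped reads:", mapped)
print("distinct start points:", len(starts))
if starts:
    print("reads per start:", round(mapped / len(starts), 1))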
#11
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
No idea about the non-mapping reads. We have outsourced the analysis, so I don't have a clear idea about the details.

Do you know of any R/Bioconductor package for SOLiD data analysis? Basically, can we do SOLiD data analysis in Bioconductor?
#12
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
You don't get mapping rates that low unless there is an issue somewhere, either with the mapping itself or with the lab work. I suggest taking 100, or even 10, of the highest quality reads, converting them to base space (something I would normally not recommend) and just BLASTing them.

I don't know if there is an R/Bioconductor package for SOLiD data analysis, but you could look for applications here: http://seqanswers.com/wiki/Software

-- Phillip
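For readers unfamiliar with the conversion: each SOLiD color call (0-3) encodes a transition between two adjacent bases, so decoding simply walks along the read from the known primer base, and a single miscalled color corrupts every base downstream, which is why it is normally avoided. A minimal Python sketch of the standard transition table (the example read is made up):

# Decode a SOLiD colorspace read (csfasta style: a primer base followed by
# color digits 0-3; '.' marks a missing call) into base space.
COLOR_MAP = {
    "A": "ACGT",   # base reached from A via color 0, 1, 2, 3
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}

def decode_colorspace(csread):
    base, decoded = csread[0], []
    for color in csread[1:]:
        if color == ".":           # missing color call: stop decoding here
            break
        base = COLOR_MAP[base][int(color)]
        decoded.append(base)
    # the primer base itself is not part of the sequenced insert
    return "".join(decoded)

print(decode_colorspace("T32001032"))   # made-up example read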
#13
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on a few of these aspects?

What is the difference between a bead and a read? There were only 1,788,027 (12%) mapped reads out of 15,030,084 beads.

The number of start points within uniquely placed tags was 6,684, with an average of 210 reads per start. There is another value, the number of start points within uniquely and randomly placed tags, which is 77,681 with 23 reads per start point. What is the difference between these two? Is the second one the start points for reads that match at multiple locations?

I paste below the output of the stats file.

######################################################
15030084 total beads found
1788027 Mapped Reads using parameter settings listed below.

Mapped Reads at Read Length 50
0 mismatches   758839 ( 42.44%)
1 mismatches   234002 ( 13.09%)   992841 ( 55.53%)
2 mismatches   267037 ( 14.93%)  1259878 ( 70.46%)
3 mismatches   150956 (  8.44%)  1410834 ( 78.90%)
4 mismatches   148757 (  8.32%)  1559591 ( 87.22%)
5 mismatches   112738 (  6.31%)  1672329 ( 93.53%)
6 mismatches   115698 (  6.47%)  1788027 (100.00%)

Uniquely Placed Beads
0 mismatches   628667 ( 35.16%)
1 mismatches   195373 ( 10.93%)   824040 ( 46.09%)
2 mismatches   176500 (  9.87%)  1000540 ( 55.96%)
3 mismatches   115120 (  6.44%)  1115660 ( 62.40%)
4 mismatches   105651 (  5.91%)  1221311 ( 68.30%)
5 mismatches    89627 (  5.01%)  1310938 ( 73.32%)
6 mismatches    92050 (  5.15%)  1402988 ( 78.47%)

Valid Adjacents within Uniquely Placed Beads
0 valid adjacents  1200996 (  7.99%)
1 valid adjacents   177447 (  1.18%)  1378443 (  9.17%)
2 valid adjacents    21377 (  0.14%)  1399820 (  9.31%)
3 valid adjacents     3168 (  0.02%)  1402988 (  9.33%)

Errors within Uniquely Placed Tags
Total Errors     2316772
Single Errors    1525950 (65.87% of Total)
Adjacent Errors   395411 (34.13% of Total)
  Valid           229705 (19.83% of Total) (58.09% of Adjacent Errors)
  Invalid         165706 (14.30% of Total) (41.91% of Adjacent Errors)

Starting Points within Uniquely Placed Tags
Number of Starting Points                  6684
Average Number of Reads per Start Point  209.90

Starting Points within Uniquely and Randomly Placed Tags
Number of Starting Points                 77681
Average Number of Reads per Start Point   23.02

Coverage of Uniquely Placed Tags
Bases Not Covered  3095367910 (99.99%)
###########################################################
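One way to read the "Starting Points" section above is as a back-of-the-envelope complexity check (a sketch only; the comparison assumes the human genome at roughly 3.1 Gb):

# Library-complexity sanity check from the Corona-Lite stats above.
unique_reads  = 1402988   # uniquely placed beads, all mismatch classes
unique_starts = 6684      # starting points within uniquely placed tags

print(round(unique_reads / unique_starts, 1))   # ~209.9 reads per start point

# For a complex library, ~1.4 million reads spread over ~3.1 Gb of human
# sequence would give close to one read per start point, so ~210 reads per
# start points toward heavy PCR duplication (a bottlenecked library) rather
# than broad ChIP enrichment.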
#14
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
You still have one unresolved issue: where do the other 88% of your reads derive from? Here are some possibilities:

(1) Your ChIP DNA was contaminated with DNA from another source. Commonly this might be yeast, because yeast tRNAs were used as a co-precipitant in some step and the tRNAs were contaminated with yeast genomic DNA. If you did not use yeast tRNAs as a co-precipitant, this is unlikely.

(2) The sequence was of very low quality and therefore most of the reads could not be mapped. From the distribution of mapping stats, this does not seem likely to me, but it is possible. You might ask your sequencing facility for the cycle scan report for your reads, specifically the % of good plus best beads for each of the 50 ligations.

(3) The amount of ChIP DNA was substantially less than the 10 ng minimum called for in the frag library construction protocol, and/or what DNA was there was non-ligatable/non-replicatable due to some sort of damage to the DNA. Various issues can result in these circumstances. The fragments that do happen to be usable come to compose the majority of the library molecules, because they are all that will amplify via PCR.

(4) Alternatively to (3), most biologically derived reagents (e.g., enzymes) are contaminated with DNA/RNA from their host strain. Normally this small amount of contamination is swamped out by the sample DNA, but in cases where the sample DNA is limiting, the small amount of contaminating DNA ends up being a significant part of the library.

(5) During library preparation your DNA sample became contaminated with SOLiD amplicons from a previous experiment on another organism.

If you can run the Corona-Lite pipeline, I would suggest running it with E. coli as a reference sequence. If this does not result in a high number of hits, I would suggest choosing 10-100 of your highest quality reads and converting them to base space. Normally one does not want to do this, because a single sequencing error will make all bases downstream of that error incorrect as well, but in this circumstance it is warranted. BLAST these sequences against a large database, like "nt". This should help you determine what went wrong.

Finally, as a general note: sometimes it is better to move on than to spend weeks or months figuring out what went wrong with an experiment. It is very rare that you can publish "what did not work" experiments. But this can be a difficult decision to make. If you have an adviser or mentor, you might want to consult them. One possibility is just to use the limited data you have and move on to a validation experiment. But again, there are strategic issues to consider before making this sort of decision.

-- Phillip

Last edited by pmiguel; 07-27-2010 at 04:43 AM.
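A sketch of that last suggestion, which picks the highest-quality beads, decodes them with the same transition table shown earlier, and writes a small FASTA that can then be searched against nt (for example with BLAST+: blastn -query top_reads.fa -db nt -remote -outfmt 7). The csfasta and QV file names below are placeholders:

# Select roughly the 100 highest-quality beads, decode them to base space,
# and write a FASTA for BLAST.
COLOR_MAP = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}

def read_records(path):
    """Yield (name, list-of-data-lines) pairs from a csfasta or .qual file."""
    name, data = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, data
                name, data = line[1:], []
            else:
                data.append(line)
    if name is not None:
        yield name, data

def decode(csread):
    base, out = csread[0], []
    for color in csread[1:]:
        if color == ".":                 # stop at the first missing call
            break
        base = COLOR_MAP[base][int(color)]
        out.append(base)
    return "".join(out)

reads = {n: "".join(d) for n, d in read_records("chip_F3.csfasta")}        # placeholder
quals = {n: [int(q) for q in " ".join(d).split()]
         for n, d in read_records("chip_F3_QV.qual")}                      # placeholder

# rank beads by mean quality value, keep the top 100 that have a sequence
top = sorted((n for n in quals if n in reads),
             key=lambda n: sum(quals[n]) / len(quals[n]), reverse=True)[:100]

with open("top_reads.fa", "w") as out:
    for name in top:
        out.write(">%s\n%s\n" % (name, decode(reads[name])))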
#15
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Phillip,

Thanks a lot for your suggestions, which have pointed the way forward for me. Yes, I will have to discuss with my advisor and collaborators what the best course of action would be.

Thanks again. I highly appreciate the insight you have given.

sridhar
Tags
peak length, solid |