SEQanswers

Old 07-13-2010, 07:31 AM   #1
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Question on SOLiD peak length

Hi,

I am new to analyzing SOLiD sequencing data. In our data, which is ChIP-seq for histone modifications,
the maximum peak length detected was about 150 bp;
only about 18 peaks were longer than 100 bp;
and about 3500 peaks were shorter than 100 bp, of which about 3400 were about 51 bp long.

I think this is unusual, since histone pull-down data should give wider peaks.
Am I correct, or is this kind of result possible in certain cases?

Thanks for your suggestions.

sridhar
Old 07-15-2010, 04:56 AM   #2
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

I think you would need to give us more details about your experiment. Especially the size distribution of the SOLiD fragment library created for sequencing and the size of the chromatin you used for your IP.

--
Phillip
Old 07-15-2010, 02:17 PM   #3
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Hi Phillip,

Thanks for the reply. It took some time to get the details.
---The size of the chromatin used for IP was about 300 to 500 bp.
---The size of the SOLiD fragment library created for sequencing was between 150 and 250 bp (including the adapters, 50 bp).
Old 07-16-2010, 04:33 AM   #4
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

What method did you use to fragment your chromatin to 300-500 bp? Also, after IP, what method (if any) was used to further fragment the IPed DNA down into the size range you give above?

--
Phillip
Old 07-18-2010, 09:13 PM   #5
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Hi Phillip,

For both the steps, the method used was shearing by sonication.

Specific information for the second step (fragmenting the IPed DNA):

Sonication with a Covaris S2 System. Intensity: 5; Cycles/burst: 100; Time: 60 seconds
Old 07-19-2010, 04:38 AM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Seems to me that if the sizes of your IPed chromatin were 300-500 bp, then your SOLiD peak widths should be a minimum of 600-1000 bp. But this is not from personal experience, so I could be missing a key point here.
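
A rough way to see where a figure of about twice the fragment size could come from is the toy model below. It is only a sketch under stated assumptions: a single point-like site, fragments of one fixed length that must all cover the site, and one read starting at each end of every fragment. The function name and the model are illustrative and not part of any SOLiD pipeline.

```python
def tag_start_span(fragment_length, site=0):
    """Width of the region where read starts can fall around one site.

    Toy model (an assumption for illustration): every fragment has the
    same length and must cover the site; a forward-strand read starts at
    the fragment's left end, a reverse-strand read at its right end.
    """
    starts = []
    # Slide the fragment across every placement that still covers the site.
    for left in range(site - fragment_length + 1, site + 1):
        right = left + fragment_length - 1
        starts.append(left)   # forward-strand read start
        starts.append(right)  # reverse-strand read start
    return max(starts) - min(starts) + 1


for length in (300, 500):
    print(length, tag_start_span(length))
```

In this model the read starts spread over 2 x length - 1 positions, which is roughly the 600-1000 bp range mentioned above for 300-500 bp chromatin. Real peak widths also depend on the peak caller and on how broad the modified region is.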

How many reads and how many starts does your data set contain?

--
Phillip
Old 07-19-2010, 06:58 AM   #7
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Hi Phillip,

There were,

Number of reads: 15,277,642
Matched : 1,794,556 (11.72 %)

There were 3491 starts.

About 95% of the peaks were 51 bp long.
Old 07-19-2010, 07:14 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

By "starts", I mean start points of mapping on your reference sequence. So you can't have fewer starts than you have peaks.

Any idea why your mapping % was so low? Is your reference sequence the same species as what was sequenced?

--
Phillip
Old 07-19-2010, 08:01 AM   #9
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Yes, by 3491 starts I was referring to the number of peaks. Yes, the reference was the same species (human).
For the other runs (input genomic DNA control), the peaks look normal.
Old 07-19-2010, 09:45 AM   #10
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

So, what I was getting at with the "starts" question is the possibility of duplicate reads. Any read that aligns to exactly the same position in the genome as another is giving you duplicate data at that position. So one stat the SOLiD pipeline usually provides is the total number of "starts", that is, the total number of points in the reference sequence where a read started an alignment.

If you have large numbers of mapped reads but low numbers of starts, then that could be an indication that your library was bottle-necked at some step. Via PCR you can end up massively amplifying a small number of amplicons.
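
As a minimal sketch of what counting "starts" means, the snippet below tallies distinct start points among mapped reads. The (chrom, strand, start) tuple layout and the function name are hypothetical stand-ins for whatever the mapper actually reports; this is not how the SOLiD pipeline itself is implemented.

```python
from collections import Counter


def start_point_stats(alignments):
    """Summarize distinct start points among mapped reads.

    `alignments` is an iterable of (chrom, strand, start) tuples; this
    layout is a hypothetical stand-in for whatever the mapper reports.
    """
    counts = Counter(alignments)
    n_reads = sum(counts.values())
    n_starts = len(counts)
    reads_per_start = n_reads / n_starts if n_starts else 0.0
    return n_reads, n_starts, reads_per_start


# Toy data: five reads, three of them stacked on the same start point.
reads = [
    ("chr1", "+", 1000),
    ("chr1", "+", 1000),
    ("chr1", "+", 1000),
    ("chr1", "-", 1000),
    ("chr2", "+", 52),
]
print(start_point_stats(reads))
```

A high reads-per-start value relative to what the sequencing depth would predict is the kind of signal that points to a bottlenecked library.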

So, could you find the number of starts for the mapping?

Also, again, any idea what the deal is with the ~90% of your reads that do not map to your reference sequence?
--
Phillip
Old 07-19-2010, 10:59 AM   #11
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

No idea about the non-mapping reads. We have outsourced the analysis, so I don't have a clear idea about the details.

Do you know of any R/Bioconductor package for SOLiD data analysis? Basically, can we do SOLiD data analysis in Bioconductor?
Old 07-19-2010, 12:21 PM   #12
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

You don't get mapping rates that low unless there is an issue somewhere, either with the mapping itself or with the lab work. I suggest taking 100, or even 10, of the highest quality reads, converting them to base space (something I would normally not recommend), and just BLASTing them.
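
One possible way to pick the highest quality reads is sketched below. It assumes the usual paired layout of a csfasta file and a matching quality file (a ">name" header line followed by one data line, with "#" comment lines), and it ranks reads by mean quality value, which is just one reasonable criterion. The function names and the toy read names are made up for illustration.

```python
def parse_records(text):
    """Parse '>name' / data line pairs, skipping '#' comment lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            records[name] = line
            name = None
    return records


def top_reads_by_mean_quality(csfasta_text, qual_text, n=10):
    """Return the n (name, color_read) pairs with the highest mean quality."""
    reads = parse_records(csfasta_text)
    quals = parse_records(qual_text)
    scored = []
    for name, read in reads.items():
        if name not in quals:
            continue  # no quality line for this read
        values = [int(v) for v in quals[name].split()]
        if values:
            scored.append((sum(values) / len(values), name, read))
    scored.sort(reverse=True)
    return [(name, read) for _, name, read in scored[:n]]


# Toy input, not real data.
csfasta = "# toy data\n>1_1_1_F3\nT0123\n>1_1_2_F3\nT3210\n>1_1_3_F3\nT0000\n"
qual = ">1_1_1_F3\n30 30 30 30\n>1_1_2_F3\n10 12 8 9\n>1_1_3_F3\n25 25 25 25\n"
print(top_reads_by_mean_quality(csfasta, qual, n=2))
```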

I don't know if there is an R/Bioconductor package for SOLiD data analysis. But you could look here for applications:

http://seqanswers.com/wiki/Software

--
Phillip
Old 07-26-2010, 08:04 AM   #13
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Hi Phillip,

Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on some of these aspects?

What is the difference between a bead and a read? There were only 1788027 (12%) mapped reads out of 15030084 beads.

The number of start points within uniquely placed tags was 6684, with an average of 210 reads per start.

There is another value: the number of start points within uniquely and randomly placed tags = 77681, with 23 reads per start point.

What is the difference between these two? Is the second one the start points for reads matching at multiple locations?

I paste below the output of the stats file.
######################################################
15030084 total beads found

1788027 Mapped Reads using parameter settings listed below.

Mapped Reads at Read Length 50
0 mismatches     758839 ( 42.44%)
1 mismatches     234002 ( 13.09%)     992841 ( 55.53%)
2 mismatches     267037 ( 14.93%)    1259878 ( 70.46%)
3 mismatches     150956 (  8.44%)    1410834 ( 78.90%)
4 mismatches     148757 (  8.32%)    1559591 ( 87.22%)
5 mismatches     112738 (  6.31%)    1672329 ( 93.53%)
6 mismatches     115698 (  6.47%)    1788027 (100.00%)

Uniquely Placed Beads
0 mismatches     628667 ( 35.16%)
1 mismatches     195373 ( 10.93%)     824040 ( 46.09%)
2 mismatches     176500 (  9.87%)    1000540 ( 55.96%)
3 mismatches     115120 (  6.44%)    1115660 ( 62.40%)
4 mismatches     105651 (  5.91%)    1221311 ( 68.30%)
5 mismatches      89627 (  5.01%)    1310938 ( 73.32%)
6 mismatches      92050 (  5.15%)    1402988 ( 78.47%)

Valid Adjacents within Uniquely Placed Beads
0 valid adjacents    1200996 (  7.99%)
1 valid adjacents     177447 (  1.18%)    1378443 (  9.17%)
2 valid adjacents      21377 (  0.14%)    1399820 (  9.31%)
3 valid adjacents       3168 (  0.02%)    1402988 (  9.33%)

Errors within Uniquely Placed Tags
Total Errors        2316772
Single Errors       1525950 (65.87% of Total)
Adjacent Errors      395411 (34.13% of total)
  Valid              229705 (19.83% of Total) (58.09% of Adjacent Errors)
  Invalid            165706 (14.30% of Total) (41.91% of Adjacent Errors)

Starting Points within Uniquely Placed Tags
Number of Starting Points                    6684
Average Number of Reads per Start Point    209.90

Starting Points within Uniquely and Randomly Placed Tags
Number of Starting Points                   77681
Average Number of Reads per Start Point     23.02

Coverage of Uniquely Placed Tags
Bases Not Covered    3095367910 (99.99%)
###########################################################
Old 07-27-2010, 04:39 AM   #14
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by sridharacharya
Hi Phillip,

Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on some of these aspects?

What is the difference between a bead and a read? There were only 1788027 (12%) mapped reads out of 15030084 beads.

In this context, a bead is a read. No difference.

Quote:
Originally Posted by sridharacharya
The number of start points within uniquely placed tags was 6684, with an average of 210 reads per start.

The simplest explanation here is that either the amount of DNA used for the experiment was limiting, or there was a bottleneck in library construction caused by poor yields at one or more steps. As a result, a small number of amplicons came to compose nearly all of your mappable reads.
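
For what it is worth, the averages in the posted stats file can be reproduced with simple division. The sketch below assumes that the first average is uniquely placed reads divided by unique start points, and the second is all mapped reads divided by unique plus random start points. That reading matches the reported numbers, but it is an inference from the figures rather than documented Corona-Lite behavior.

```python
# Figures copied from the stats file posted in this thread.
total_beads = 15030084
mapped_reads = 1788027
unique_reads = 1402988
unique_starts = 6684
unique_plus_random_starts = 77681

# Fraction of all beads that mapped at all.
mapped_fraction = mapped_reads / total_beads
# Assumed definitions of the two "reads per start point" averages.
reads_per_unique_start = unique_reads / unique_starts
reads_per_any_start = mapped_reads / unique_plus_random_starts

print(f"mapped: {mapped_fraction:.1%}")
print(f"reads per start (unique): {reads_per_unique_start:.2f}")
print(f"reads per start (unique + random): {reads_per_any_start:.2f}")
```

Roughly 1.4 million uniquely placed reads stacked onto fewer than 7,000 start points is what makes the bottleneck explanation plausible.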


Quote:
Originally Posted by sridharacharya
There is another value: the number of start points within uniquely and randomly placed tags = 77681, with 23 reads per start point.

What is the difference between these two? Is the second one the start points for reads matching at multiple locations?

"Uniquely" placed means that the read mapped to a single location in your reference sequence. If a read maps to more than one position in your reference sequence -- as it would if it mapped to repetitive DNA, for example -- then it could have originated from multiple places in the genome and there is no way to place it uniquely. It may nevertheless be placed in one of the multiple positions it maps to, by choosing randomly among the possible mapping positions.

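The unique versus random distinction can be illustrated with a small sketch. It assumes each read comes with a list of candidate hit positions; the function name, the labels, and the tuple layout are hypothetical and do not reflect how Corona-Lite stores or chooses placements internally.

```python
import random


def place_read(candidate_positions, rng=random):
    """Pick a reported position for a read from its candidate hits.

    Returns (position, placement), where placement is "unique" for a
    single hit and "random" when one of several hits is chosen.
    Returns (None, "unmapped") if there are no hits.
    """
    if not candidate_positions:
        return None, "unmapped"
    if len(candidate_positions) == 1:
        return candidate_positions[0], "unique"
    return rng.choice(candidate_positions), "random"


rng = random.Random(0)  # seeded so the toy run is repeatable
print(place_read([("chr1", 1000)], rng))
print(place_read([("chr1", 500), ("chr7", 42), ("chrX", 9)], rng))
```

Start points counted over "uniquely and randomly placed" tags therefore include positions that a repeat-derived read may not actually have come from.
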
You still have one unresolved issue. Where do the other 88% of your reads derive from? Here are some possibilities:

(1) Your ChIP DNA was contaminated with DNA from another source. Commonly this might be yeast, because yeast tRNAs were used as a co-precipitant in some step and the tRNAs were contaminated with yeast genomic DNA. If you did not use yeast tRNAs as a co-precipitant, this is unlikely.

(2) The sequence was of very low quality and therefore most of the reads could not be mapped. From the distribution of mapping stats, this does not seem likely to me. But it is possible. You might ask your sequencing facility for the cycle scan report for your reads -- specifically the % of good plus best beads for each of the 50 ligations.

(3) The amount of ChIP DNA was substantially less than the 10 ng minimum called for in the frag library construction protocol, and/or what DNA was there was non-ligatable/non-replicatable due to some sort of damage to the DNA. Various issues result in these circumstances. Those fragments that do happen to be usable come to compose the majority of the library molecules because they are all that will amplify via PCR.

(4) Alternatively to (3), most biologically derived agents (e.g., enzymes) are contaminated with DNA/RNA from their host strain. Normally this small amount of contamination is swamped out by the sample DNA, but in cases where the sample DNA is limiting, the small amount of contaminating DNA ends up being a significant part of the library.

(5) During library preparation your DNA sample became contaminated with SOLiD amplicons from a previous experiment from another organism.


If you can run the Corona-lite pipeline, I would suggest running it with E. coli as a reference sequence. If this does not result in a high number of hits, I would suggest choosing 10-100 of your highest quality reads and converting them to base space. Normally one does not want to do this, because a single sequence error will result in all bases downstream from that error being incorrect as well. But in this circumstance it is warranted. BLAST these sequences against a large database, like "nt". This should help you determine what went wrong.
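
To make the conversion and the error-propagation caveat concrete, here is a minimal sketch of decoding a color-space read. It relies on the standard two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function name and the toy reads are illustrative, and the sketch simply refuses reads that contain missing color calls.

```python
BASES = "ACGT"


def color_to_base_space(cs_read):
    """Decode a color-space read such as 'T0123' into bases.

    Uses the standard two-base encoding, where each color is the XOR of
    the 2-bit codes (A=0, C=1, G=2, T=3) of two adjacent bases. The
    leading primer base is used for decoding but is not returned.
    """
    prev = BASES.index(cs_read[0])
    decoded = []
    for color in cs_read[1:]:
        if color not in "0123":
            raise ValueError(f"cannot decode color call {color!r}")
        prev ^= int(color)  # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


good = "T0123"
bad = "T0023"  # one wrong color call at the second position
print(color_to_base_space(good))
print(color_to_base_space(bad))
```

The two toy reads differ in a single color call, yet every decoded base from that point onward differs, which is why converting to base space is normally avoided and is suggested here only for a quick BLAST check.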

Finally, as a general note: sometimes it is better to move on than to spend weeks or months figuring out what went wrong with an experiment. It is very rare that you can publish "what did not work" experiments. But this can be a difficult decision to make. If you have an adviser or mentor, you might want to consult them.

One possibility is just to use the limited data you have to move on to a validation experiment. But again, there are strategic issues to consider before making this sort of decision.

--
Phillip

Last edited by pmiguel; 07-27-2010 at 04:43 AM.
Old 07-27-2010, 06:25 AM   #15
sridharacharya
Member
 
Location: Institute, WV

Join Date: May 2010
Posts: 24
Default

Phillip,

Thanks a lot for your suggestions, which have shown me the way forward. Yes, I have to discuss with my advisor and collaborators what the best action would be.

Thanks again. I greatly appreciate your insight.

sridhar

Tags
peak length, solid
