Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
how to trim solid reads length? | lei | Bioinformatics | 7 | 12-14-2012 08:55 AM |
ChIP-seq peak length varies from 0 to 381 | rahulr1485 | Bioinformatics | 0 | 08-11-2011 10:37 AM |
Long peak length from ChIP-seq data | Chiper | Epigenetics | 12 | 03-17-2010 04:08 PM |
Long peak length from ChIP-seq data | Chiper | SOLiD | 1 | 12-25-2009 11:20 PM |
read length of SOLiD and Solexa | seqAll | General | 8 | 12-16-2009 05:50 AM |
#1
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi,

I am new to analyzing SOLiD sequencing data. In our data, which is ChIP-seq for histone modifications, the maximum peak length detected was about 150 bp. There were only about 18 peaks longer than 100 bp and about 3,500 peaks shorter than 100 bp, of which roughly 3,400 are about 51 bp. I think this is unusual, since a histone pull-down should give wider peaks. Am I correct, or is this kind of result possible in certain cases?

Thanks for your suggestions,
sridhar
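As a quick way to check a distribution like this, peak widths can be tabulated straight from the peak caller's BED output. A minimal sketch in Python; the file name peaks.bed is a placeholder, and only the standard chrom/start/end columns are assumed:

from collections import Counter

# Tally peak widths from a BED file of called peaks.
# BED intervals are 0-based, half-open, so width = end - start.
widths = []
with open("peaks.bed") as bed:                      # placeholder file name
    for line in bed:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        widths.append(end - start)

print("peaks:", len(widths))
print("max width (bp):", max(widths))
print("peaks > 100 bp:", sum(w > 100 for w in widths))
print("five most common widths:", Counter(widths).most_common(5))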
#2
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
I think you would need to give us more details about your experiment, especially the size distribution of the SOLiD fragment library created for sequencing and the size of the chromatin you used for your IP.
-- Phillip

#3
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

Thanks for the reply. It took some time to get the details.
- The size of the chromatin used for IP is about 300 to 500 bp.
- The size of the SOLiD fragment library created for sequencing is between 150 and 250 bp, including the 50 bp of adapters.
#4
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
What method did you use to fragment your chromatin to 300-500 bp? Also, after IP, what method (if any) was used to further fragment the IPed DNA down into the size range you give above?
-- Phillip

#5
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

For both steps, the method used was shearing by sonication. Specifically, for the second step (fragmenting the IPed DNA), sonication was done on a Covaris S2 system: intensity 5, 100 cycles/burst, 60 seconds.
#6
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
Seems to me that if the sizes of your IPed chromatin fragments were 300-500 bp, then your SOLiD peak widths should be a minimum of 600-1000 bp. But this is not from personal experience, so I could be missing a key point here.

How many reads and how many starts does your data set contain?

-- Phillip
#7
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

There were:
Number of reads: 15,277,642
Matched: 1,794,556 (11.72%)

There were 3,491 starts. About 95% of the peaks were of length 51.
#8
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
By "starts" I mean start points of mapping on your reference sequence, so you can't have fewer starts than you have peaks.

Any idea why your mapping % was so low? Is your reference sequence the same species as what was sequenced?

-- Phillip
#9
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Yes, by 3,491 starts I was referring to the number of peaks. And yes, the reference was the same species (human).

For the other runs (the input genomic DNA control), the peaks look normal.
#10
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
So, what I was getting at with the "starts" question is the possibility of duplicate reads. Any read that aligns to exactly the same position in the genome as another is giving you duplicate data at that position. So one stat the SOLiD pipeline usually provides is the total number of "starts", that is, the total number of positions in the reference sequence where a read alignment started.

If you have large numbers of mapped reads but low numbers of starts, that could be an indication that your library was bottlenecked at some step: via PCR you can end up massively amplifying a small number of amplicons. So, could you find the number of starts for the mapping? Also, again, any idea what the deal is with the ~90% of your reads that do not map to your reference sequence?

-- Phillip
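The same check can be run directly on the alignments. A rough sketch, assuming the SOLiD mappings have already been converted to a BAM file (chip.bam is a placeholder name) and that pysam is installed:

import pysam

# Count distinct alignment start points (chromosome, strand, 5' position).
# Many mapped reads but few distinct starts suggests PCR duplication,
# i.e. a bottlenecked library. 'chip.bam' is a placeholder file name.
starts = set()
mapped = 0
with pysam.AlignmentFile("chip.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped:
            continue
        mapped += 1
        # use the read's 5' end as its start point
        pos = read.reference_end if read.is_reverse else read.reference_start
        starts.add((read.reference_name, read.is_reverse, pos))

print("mapped reads:", mapped)
print("distinct start points:", len(starts))
if starts:
    print("reads per start:", round(mapped / len(starts), 1))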
#11
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
No idea about the non-mapping reads. We have outsourced the analysis, so I don't have a clear idea about the details.

Do you know of any R/Bioconductor package for SOLiD data analysis? Basically, can we do SOLiD data analysis in Bioconductor?
#12
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
You don't get mapping rates that low unless there is an issue somewhere, either with the mapping itself or with the lab work. I suggest taking 100, or even 10, of the highest quality reads, converting them to base space (something I would normally not recommend) and just BLASTing them.

I don't know if there is an R/Bioconductor package for SOLiD data analysis, but you could look for applications here: http://seqanswers.com/wiki/Software

-- Phillip
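For readers unfamiliar with the conversion: each SOLiD color call (0-3) encodes a transition between two adjacent bases, so decoding simply walks along the read from the known primer base, and a single miscalled color corrupts every base downstream, which is why it is normally avoided. A minimal Python sketch of the standard transition table (the example read is made up):

# Decode a SOLiD colorspace read (csfasta style: a primer base followed by
# color digits 0-3; '.' marks a missing call) into base space.
COLOR_MAP = {
    "A": "ACGT",   # base reached from A via color 0, 1, 2, 3
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}

def decode_colorspace(csread):
    base, decoded = csread[0], []
    for color in csread[1:]:
        if color == ".":           # missing color call: stop decoding here
            break
        base = COLOR_MAP[base][int(color)]
        decoded.append(base)
    # the primer base itself is not part of the sequenced insert
    return "".join(decoded)

print(decode_colorspace("T32001032"))   # made-up example read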
#13
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Hi Phillip,

Thanks for your inputs. I was able to run the Corona-Lite matching pipeline on the csfasta file. Could you throw some light on a few of these aspects?

What is the difference between a bead and a read? There were only 1,788,027 (12%) mapped reads out of 15,030,084 beads.

The number of start points within uniquely placed tags was 6,684, with an average of 210 reads per start. There is another value, the number of start points within uniquely and randomly placed tags, which is 77,681 with 23 reads per start point. What is the difference between these two? Is the second one the start points for reads that match at multiple locations?

I paste below the output of the stats file.

######################################################
15030084 total beads found
1788027 Mapped Reads using parameter settings listed below.

Mapped Reads at Read Length 50
0 mismatches   758839 ( 42.44%)
1 mismatches   234002 ( 13.09%)   992841 ( 55.53%)
2 mismatches   267037 ( 14.93%)  1259878 ( 70.46%)
3 mismatches   150956 (  8.44%)  1410834 ( 78.90%)
4 mismatches   148757 (  8.32%)  1559591 ( 87.22%)
5 mismatches   112738 (  6.31%)  1672329 ( 93.53%)
6 mismatches   115698 (  6.47%)  1788027 (100.00%)

Uniquely Placed Beads
0 mismatches   628667 ( 35.16%)
1 mismatches   195373 ( 10.93%)   824040 ( 46.09%)
2 mismatches   176500 (  9.87%)  1000540 ( 55.96%)
3 mismatches   115120 (  6.44%)  1115660 ( 62.40%)
4 mismatches   105651 (  5.91%)  1221311 ( 68.30%)
5 mismatches    89627 (  5.01%)  1310938 ( 73.32%)
6 mismatches    92050 (  5.15%)  1402988 ( 78.47%)

Valid Adjacents within Uniquely Placed Beads
0 valid adjacents  1200996 (  7.99%)
1 valid adjacents   177447 (  1.18%)  1378443 (  9.17%)
2 valid adjacents    21377 (  0.14%)  1399820 (  9.31%)
3 valid adjacents     3168 (  0.02%)  1402988 (  9.33%)

Errors within Uniquely Placed Tags
Total Errors     2316772
Single Errors    1525950 (65.87% of Total)
Adjacent Errors   395411 (34.13% of Total)
  Valid           229705 (19.83% of Total) (58.09% of Adjacent Errors)
  Invalid         165706 (14.30% of Total) (41.91% of Adjacent Errors)

Starting Points within Uniquely Placed Tags
Number of Starting Points                  6684
Average Number of Reads per Start Point  209.90

Starting Points within Uniquely and Randomly Placed Tags
Number of Starting Points                 77681
Average Number of Reads per Start Point   23.02

Coverage of Uniquely Placed Tags
Bases Not Covered  3095367910 (99.99%)
###########################################################
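One way to read the "Starting Points" section above is as a back-of-the-envelope complexity check (a sketch only; the comparison assumes the human genome at roughly 3.1 Gb):

# Library-complexity sanity check from the Corona-Lite stats above.
unique_reads  = 1402988   # uniquely placed beads, all mismatch classes
unique_starts = 6684      # starting points within uniquely placed tags

print(round(unique_reads / unique_starts, 1))   # ~209.9 reads per start point

# For a complex library, ~1.4 million reads spread over ~3.1 Gb of human
# sequence would give close to one read per start point, so ~210 reads per
# start points toward heavy PCR duplication (a bottlenecked library) rather
# than broad ChIP enrichment.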
#14
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,317
You still have one unresolved issue: where do the other 88% of your reads derive from? Here are some possibilities:

(1) Your ChIP DNA was contaminated with DNA from another source. Commonly this might be yeast, because yeast tRNAs were used as a co-precipitant in some step and the tRNAs were contaminated with yeast genomic DNA. If you did not use yeast tRNAs as a co-precipitant, this is unlikely.

(2) The sequence was of very low quality and therefore most of the reads could not be mapped. From the distribution of mapping stats, this does not seem likely to me, but it is possible. You might ask your sequencing facility for the cycle scan report for your reads, specifically the % of good plus best beads for each of the 50 ligations.

(3) The amount of ChIP DNA was substantially less than the 10 ng minimum called for in the frag library construction protocol, and/or what DNA was there was non-ligatable/non-replicatable due to some sort of damage to the DNA. Various issues can result in these circumstances. The fragments that do happen to be usable come to compose the majority of the library molecules, because they are all that will amplify via PCR.

(4) Alternatively to (3), most biologically derived reagents (e.g., enzymes) are contaminated with DNA/RNA from their host strain. Normally this small amount of contamination is swamped out by the sample DNA, but in cases where the sample DNA is limiting, the small amount of contaminating DNA ends up being a significant part of the library.

(5) During library preparation your DNA sample became contaminated with SOLiD amplicons from a previous experiment on another organism.

If you can run the Corona-Lite pipeline, I would suggest running it with E. coli as a reference sequence. If this does not result in a high number of hits, I would suggest choosing 10-100 of your highest quality reads and converting them to base space. Normally one does not want to do this, because a single sequencing error will make all bases downstream of that error incorrect as well, but in this circumstance it is warranted. BLAST these sequences against a large database, like "nt". This should help you determine what went wrong.

Finally, as a general note: sometimes it is better to move on than to spend weeks or months figuring out what went wrong with an experiment. It is very rare that you can publish "what did not work" experiments. But this can be a difficult decision to make. If you have an adviser or mentor, you might want to consult them. One possibility is just to use the limited data you have and move on to a validation experiment. But again, there are strategic issues to consider before making this sort of decision.

-- Phillip

Last edited by pmiguel; 07-27-2010 at 04:43 AM.
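A sketch of that last suggestion, which picks the highest-quality beads, decodes them with the same transition table shown earlier, and writes a small FASTA that can then be searched against nt (for example with BLAST+: blastn -query top_reads.fa -db nt -remote -outfmt 7). The csfasta and QV file names below are placeholders:

# Select roughly the 100 highest-quality beads, decode them to base space,
# and write a FASTA for BLAST.
COLOR_MAP = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}

def read_records(path):
    """Yield (name, list-of-data-lines) pairs from a csfasta or .qual file."""
    name, data = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, data
                name, data = line[1:], []
            else:
                data.append(line)
    if name is not None:
        yield name, data

def decode(csread):
    base, out = csread[0], []
    for color in csread[1:]:
        if color == ".":                 # stop at the first missing call
            break
        base = COLOR_MAP[base][int(color)]
        out.append(base)
    return "".join(out)

reads = {n: "".join(d) for n, d in read_records("chip_F3.csfasta")}        # placeholder
quals = {n: [int(q) for q in " ".join(d).split()]
         for n, d in read_records("chip_F3_QV.qual")}                      # placeholder

# rank beads by mean quality value, keep the top 100 that have a sequence
top = sorted((n for n in quals if n in reads),
             key=lambda n: sum(quals[n]) / len(quals[n]), reverse=True)[:100]

with open("top_reads.fa", "w") as out:
    for name in top:
        out.write(">%s\n%s\n" % (name, decode(reads[name])))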
#15
Member
Location: Institute, WV Join Date: May 2010
Posts: 24
Phillip,

Thanks a lot for your suggestions, which have pointed the way forward for me. Yes, I will have to discuss with my advisor and collaborators what the best course of action would be.

Thanks again. I highly appreciate the insight you have given.

sridhar
Tags
peak length, solid |