SEQanswers

Old 01-06-2015, 12:42 PM   #1
Mark2
Junior Member
 
Location: Ohio

Join Date: Dec 2014
Posts: 8
Where to find public sequencing data with signal intensity for each base?

Hello. I was uncertain about whether to put this in 'General' or 'Bioinformatics' but erred on the side of the latter because it seems to get more traffic.

I am rather new to analyzing sequencing data, and am looking for publicly available sequencing data (can be whole genome, exome, or targeted gene sequencing data) that contains information about signal intensity for each base. I do not know, however, what format such data would be in. I have already tried getting data from 1000genomes, specifically .bam files, and viewed them using IGV. However, these do not contain the information I'm looking for (or if they do, I couldn't find it).

After reading something I thought maybe I need data in .bcl format? In any case, how might I find (and view) public data with signal intensity at the base level? Thanks.
Old 01-06-2015, 04:30 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

See posts #2 and #3 in this thread: http://seqanswers.com/forums/showthread.php?t=20248. What you need are the .cif files, which the majority of people who run sequencers have not saved for the last 2-3 years. I am not sure why you need the intensity files, but your best bet would be to ask someone who owns a MiSeq whether they would be willing to save them for a run.

I found this document, in which a .cif file simulator is discussed (see section 6.1): http://www.wpi.edu/Pubs/E-project/Av...Correction.pdf If you can use simulated data, then you may want to contact its authors.
Old 01-07-2015, 08:46 AM   #3
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

What use is the signal intensity data to you, Mark?

--
Phillip
Old 01-07-2015, 09:39 AM   #4
Mark2
Junior Member
 
Location: Ohio

Join Date: Dec 2014
Posts: 8

Thanks for your responses. I am interested in detecting heterogeneity in cell populations. More specifically, I am thinking about when one sequences cancer cells from a tumor in which some cells have a certain mutation, and others do not (there may be multiple subclones, or it may just be that there are some normal cells mixed in with the cancer cells, especially if it's a solid tumor).

For example, suppose you have a population of cells in which half of the cells have a G at a given locus and the other half have a C, due to a mutation in an 'ancestor.' How well would one be able to detect this sort of heterogeneity at the base level with sequencing data? In any event, this is why I am interested in base signal intensity.
Old 01-07-2015, 09:55 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

Quote:
Originally Posted by Mark2 View Post
Thanks for your responses. I am interested in detecting heterogeneity in cell populations. More specifically, I am thinking about when one sequences cancer cells from a tumor in which some cells have a certain mutation, and others do not (there may be multiple subclones, or it may just be that there are some normal cells mixed in with the cancer cells, especially if it's a solid tumor).

For example, suppose you have a population of cells in which half of the cells have a G at a given locus and the other half have a C, due to a mutation in an 'ancestor.' How well would one be able to detect this sort of heterogeneity at the base level with sequencing data? In any event, this is why I am interested in base signal intensity.
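This will not be detectable via the intensity files of next-gen sequencing data. In brief, each cluster (and hence each read) derives from a single template molecule, so its intensities reflect one molecule rather than the mixed population.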

I guess you are thinking about Sanger sequencing intensity files. These are .ab1 files, for example. For Sanger sequencing each base intensity reading is a summation of all the signal from thousands or millions of sequence product strands. Importantly, these product strands potentially derive from a mixed population of templates.

Usage of Sanger sequencing has fallen off dramatically, as the price per base of next-gen sequencing is many orders of magnitude lower.

To obtain the equivalent of Sanger intensity values from next-gen data sets, you would count how many reads show each base at each position of interest in the .bam file. This is arguably more accurate than Sanger for this purpose.

There are, of course, caveats to using either method depending on details of the samples and assays used.
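
To make the counting idea concrete, here is a minimal Python sketch. It is only an illustration: the function names `base_fractions` and `fraction_standard_error` are made up for this example, the input is a plain string of base calls for one position rather than a real .bam file, and the error estimate assumes reads are independent draws (it ignores sequencing error and mapping bias).

```python
from collections import Counter
from math import sqrt


def base_fractions(bases):
    """Return {base: fraction} for the base calls observed at one position."""
    # Keep only unambiguous calls; anything else (e.g. N) is ignored.
    counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


def fraction_standard_error(fraction, depth):
    """Binomial standard error of an observed fraction at a given read depth."""
    return sqrt(fraction * (1.0 - fraction) / depth)


# Hypothetical locus covered by 40 reads: 20 show G and 20 show C.
calls = "G" * 20 + "C" * 20
fractions = base_fractions(calls)
print(fractions)
print(fraction_standard_error(fractions["G"], len(calls)))
```

With 40 reads and a true 50/50 mix, the standard error on the observed fraction is roughly 0.08, which is one reason read depth matters when judging whether a position is really mixed.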

--
Phillip
Old 01-07-2015, 11:15 AM   #6
Mark2
Junior Member
 
Location: Ohio

Join Date: Dec 2014
Posts: 8

Thanks pmiguel. Would counting numbers of bases at each position be simple to do in IGV? (I ask about IGV because it's the only tool for viewing bam files I'm aware of, feel free to suggest another if preferable).

Edit: actually, can one just use R to view bam files? I just discovered the Rsamtools package. This might be easier as I'm more familiar with R.

Last edited by Mark2; 01-07-2015 at 11:20 AM.
Old 01-07-2015, 11:48 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You could use the coverage histogram in IGV, which would be somewhat simpler than manual counting. An even simpler method would be to just do variant calling with a tool that's intended for complex samples (just google "variant call admixture" or "variant call heterogenous"). Such tools are more likely to directly do what it is you want.

I would generally recommend against processing BAM files in R. Rsamtools works fine, but the R model for this sort of thing generally involves reading the whole BAM file into memory and then processing it...which is often not desirable.
Old 01-08-2015, 09:55 AM   #8
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

Quote:
Originally Posted by Mark2 View Post
Thanks pmiguel. Would counting numbers of bases at each position be simple to do in IGV? (I ask about IGV because it's the only tool for viewing bam files I'm aware of, feel free to suggest another if preferable).

Edit: actually, can one just use R to view bam files? I just discovered the Rsamtools package. This might be easier as I'm more familiar with R.
It's simple but not scalable. In IGV, IIRC, you just mouse over the position of interest in the coverage histogram and you get the percentage of each possible base at that position. If you wanted to check a few positions, then IGV might be your tool.
I am unfamiliar with Rsamtools.
I agree with dpryan that a variant caller of some sort is the way to go if you want to assess a large number of positions.

--
Phillip
Old 01-08-2015, 10:26 AM   #9
Mark2
Junior Member
 
Location: Ohio

Join Date: Dec 2014
Posts: 8

Thanks for the suggestions. I am currently looking at a public data set in IGV and am pleasantly surprised at how easy it was to see the coverage histogram.

It would be useful to be able to find all loci at which one base doesn't get 100% of the reads, as opposed to just checking specified loci for this condition. Would a variant caller allow me to do this?

Edit: actually, following dpryan's suggested google search I found a few variant callers that claim to be able to detect this sort of heterogeneity, including one from Illumina: http://www.illumina.com/documents/pr...ant_caller.pdf

Anyone familiar with any particular variant callers of this sort?

dpryan: would using python for this necessarily have the same problem you describe regarding R?

Thanks.
Old 01-08-2015, 10:53 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

No, python wouldn't suffer from the same issues. The simplest route would be to use pysam and just make a pileup of a sorted and indexed BAM file that way (you could also simply use "samtools mpileup" and pipe the output into a python script).
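
A rough sketch of the second route, parsing "samtools mpileup" text output line by line so that nothing has to be held in memory. It assumes the standard tab-separated pileup columns (chromosome, position, reference base, depth, read bases, qualities). The names `count_pileup_bases` and `mixed_sites` and the `min_depth` / `max_major_fraction` thresholds are invented for this example, not part of samtools or pysam. A threshold below 100% is used because sequencing errors mean that, at high depth, few positions show exactly one base.

```python
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count A/C/G/T calls in the read-bases column of one pileup line."""
    counts = Counter()
    i, n = 0, len(read_bases)
    while i < n:
        c = read_bases[i]
        if c == "^":
            i += 2  # read-start marker plus its mapping-quality character
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            i += 1
            digits = ""
            while i < n and read_bases[i].isdigit():
                digits += read_bases[i]
                i += 1
            if digits:
                i += int(digits)
            continue
        # '.' and ',' mean "matches the reference" on forward/reverse strand.
        base = ref_base.upper() if c in ".," else c.upper()
        if base in "ACGT":
            counts[base] += 1
        # Anything else ('$', '*', '<', '>', N) is skipped.
        i += 1
    return counts


def mixed_sites(lines, min_depth=10, max_major_fraction=0.95):
    """Yield (chrom, pos, counts) where the most common base is not dominant."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        counts = count_pileup_bases(fields[2], fields[4])
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            continue
        if max(counts.values()) / depth <= max_major_fraction:
            yield fields[0], int(fields[1]), dict(counts)


# Hand-written pileup lines for illustration only (not real data).
demo = [
    "chr1\t100\tG\t6\t.,.CcC\tIIIIII",
    "chr1\t101\tA\t4\t^I.,.$,\tIIII",
    "chr1\t102\tT\t3\t.+2AG,c\tIII",
]
for site in mixed_sites(demo, min_depth=3):
    print(site)
```

In a real run you would pass `sys.stdin` to `mixed_sites` and pipe into the script, e.g. "samtools mpileup -f ref.fa sample.bam | python find_mixed.py" (the file and script names here are placeholders).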

I'm not personally familiar with variant callers for this use case, I just knew they existed. You might post a new question asking about that.
Old 01-08-2015, 11:30 AM   #11
Mark2
Junior Member
 
Location: Ohio

Join Date: Dec 2014
Posts: 8

Quote:
Originally Posted by dpryan View Post
No, python wouldn't suffer from the same issues. The simplest route would be to use pysam and just make a pileup of a sorted and indexed BAM file that way (you could also simply use "samtools mpileup" and pipe the output into a python script).

I'm not personally familiar with variant callers for this use case, I just knew they existed. You might post a new question asking about that.
Ok, thanks, I'll try it with python.
All times are GMT -8.