  • Where to find public sequencing data with signal intensity for each base?

    Hello. I was uncertain about whether to put this in 'General' or 'Bioinformatics' but erred on the side of the latter because it seems to get more traffic.

    I am rather new to analyzing sequencing data, and am looking for publicly available sequencing data (whole genome, exome, or targeted gene sequencing) that contains information about signal intensity for each base. However, I do not know what format such data would be in. I have already tried getting data from 1000genomes, specifically .bam files, and viewed them using IGV. However, these do not contain the information I'm looking for (or if they do, I couldn't find it).

    After reading something I thought maybe I need data in .bcl format? In any case, how might I find (and view) public data with signal intensity at the base level? Thanks.

  • #2
    See posts #2 and #3 in this thread: http://seqanswers.com/forums/showthread.php?t=20248. What you need are the .cif files, which most people who run sequencers have not saved for the last 2-3 years. I am not sure why you need the intensity files, but your best bet would be to ask someone who owns a MiSeq whether they would be willing to save them for a run.

    I found this document, which discusses a .cif file simulator (see section 6.1): http://www.wpi.edu/Pubs/E-project/Av...Correction.pdf If you can use simulated data, then you may want to contact the authors.


    • #3
      What use is the signal intensity data to you, Mark?

      --
      Phillip


      • #4
        Thanks for your responses. I am interested in detecting heterogeneity in cell populations. More specifically, I am thinking about when one sequences cancer cells from a tumor in which some cells have a certain mutation, and others do not (there may be multiple subclones, or it may just be that there are some normal cells mixed in with the cancer cells, especially if it's a solid tumor).

        For example, suppose you have a population of cells in which half of the cells have a G at a given locus and the other half have a C, due to a mutation in an 'ancestor.' How well would one be able to detect this sort of heterogeneity at the base level with sequencing data? In any event, this is why I am interested in base signal intensity.


        • #5
          Originally posted by Mark2
          Thanks for your responses. I am interested in detecting heterogeneity in cell populations. More specifically, I am thinking about when one sequences cancer cells from a tumor in which some cells have a certain mutation, and others do not (there may be multiple subclones, or it may just be that there are some normal cells mixed in with the cancer cells, especially if it's a solid tumor).

          For example, suppose you have a population of cells in which half of the cells have a G at a given locus and the other half have a C, due to a mutation in an 'ancestor.' How well would one be able to detect this sort of heterogeneity at the base level with sequencing data? In any event, this is why I am interested in base signal intensity.
          This will not be detectable via intensity files of next gen sequencing data for reasons I won't go into at the moment.

          I guess you are thinking about Sanger sequencing intensity files. These are .ab1 files, for example. For Sanger sequencing each base intensity reading is a summation of all the signal from thousands or millions of sequence product strands. Importantly, these product strands potentially derive from a mixed population of templates.

          Usage of Sanger sequencing has fallen off dramatically because the price per base of next-gen sequence is many orders of magnitude lower.

          To obtain the equivalent of Sanger intensity values from next gen data sets you would count the numbers of bases at each position of interest in the .bam file. This is arguably more accurate than Sanger for this purpose.

          There are, of course, caveats to using either method depending on details of the samples and assays used.
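As a rough illustration of the counting idea, and not anything taken from a particular tool, the fraction of reads supporting each base at one position could be tallied like this. The function name `base_fractions` and the example string of base calls are made up for the sketch; in practice the calls would come from the reads in the .bam file that cover the position of interest.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def base_fractions(bases):
    """Return the fraction of reads supporting each base at one position.

    `bases` is any iterable of single-character base calls, one per read
    covering the position. Calls other than A/C/G/T (e.g. N) are ignored.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}

# Hypothetical base calls from 10 reads covering a single locus
observed = "GGCGCCGGCC"
fractions = base_fractions(observed)
print(fractions)
```

With an even G/C mixture like the one in the question, both fractions come out near 0.5; real data will wobble around that because of sampling and sequencing error.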

          --
          Phillip


          • #6
            Thanks pmiguel. Would counting numbers of bases at each position be simple to do in IGV? (I ask about IGV because it's the only tool for viewing bam files I'm aware of, feel free to suggest another if preferable).

            Edit: actually, can one just use R to view bam files? I just discovered the Rsamtools package. This might be easier as I'm more familiar with R.
            Last edited by Mark2; 01-07-2015, 12:20 PM.


            • #7
              You could use the coverage histogram in IGV, which would be somewhat simpler than manual counting. An even simpler method would be to just do variant calling with a tool that's intended for complex samples (just google "variant call admixture" or "variant call heterogenous"). Such tools are more likely to do directly what you want.

              I would generally recommend against processing BAM files in R. Rsamtools works fine, but the R model for this sort of thing generally involves reading the whole BAM file into memory and then processing it...which is often not desirable.


              • #8
                Originally posted by Mark2
                Thanks pmiguel. Would counting numbers of bases at each position be simple to do in IGV? (I ask about IGV because it's the only tool for viewing bam files I'm aware of, feel free to suggest another if preferable).

                Edit: actually, can one just use R to view bam files? I just discovered the Rsamtools package. This might be easier as I'm more familiar with R.
                It's simple but not scalable. In IGV, IIRC, you just mouse over the position of interest in the coverage histogram and you get the percentage of each possible base at that position. If you wanted to check a few positions, then IGV might be your tool.
                I am unfamiliar with Rsamtools.
                I agree with dpryan that a variant caller of some sort is the way to go if you want to assess a large number of positions.

                --
                Phillip


                • #9
                  Thanks for the suggestions. I am currently looking at a public data set in IGV and am pleasantly surprised at how easy it was to see the coverage histogram.

                  It would be useful to be able to find all loci at which one base doesn't get 100% of the reads, as opposed to just checking specified loci for this condition. Would a variant caller allow me to do this?
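To make that condition concrete, here is a minimal sketch of the filter, assuming per-position base counts are already available as a dictionary. The function `mixed_loci`, its thresholds `min_depth` and `min_minor_fraction`, and the example counts are all hypothetical; a real variant caller does considerably more (base qualities, strand bias, error models). The thresholds are there because sequencing errors alone mean that almost no well-covered position shows exactly 100% of one base.

```python
def mixed_loci(counts_by_pos, min_depth=10, min_minor_fraction=0.05):
    """Yield (position, minor_fraction) for loci that look mixed.

    counts_by_pos maps a position key, e.g. (chrom, pos), to a dict of
    base -> read count. minor_fraction is the share of reads that do NOT
    carry the most common base at that position.
    """
    for pos, counts in counts_by_pos.items():
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        major = max(counts.values())
        minor_fraction = 1 - major / depth
        if minor_fraction >= min_minor_fraction:
            yield pos, minor_fraction

# Made-up counts for four positions
example = {
    ("chr1", 100): {"G": 20},           # one base only
    ("chr1", 101): {"G": 10, "C": 10},  # even mixture
    ("chr1", 102): {"A": 99, "T": 1},   # likely a sequencing error
    ("chr1", 103): {"A": 2, "T": 2},    # mixed, but depth too low
}
flagged = list(mixed_loci(example))
print(flagged)
```

Lowering `min_minor_fraction` to 0 would report every position where any read disagrees with the majority, which is the literal condition but is usually dominated by noise.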

                  Edit: actually, following dpryan's suggested google search, I found a few variant callers that claim to be able to detect this sort of heterogeneity, including one from Illumina: http://www.illumina.com/documents/pr...ant_caller.pdf

                  Anyone familiar with any particular variant callers of this sort?

                  dpryan: would using python for this necessarily have the same problem you describe regarding R?

                  Thanks.


                  • #10
                    No, python wouldn't suffer from the same issues. The simplest route would be to use pysam and just make a pileup of a sorted and indexed BAM file that way (you could also simply use "samtools mpileup" and pipe the output into a python script).

                    I'm not personally familiar with variant callers for this use case, I just knew they existed. You might post a new question asking about that.
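For the "samtools mpileup" route, a minimal sketch of the receiving Python side might look like the following. It assumes the standard tab-separated pileup columns (chromosome, position, reference base, depth, read bases, qualities) and that a reference was supplied to mpileup so that `.` and `,` stand for the reference base. The function names and the two demo lines are made up for illustration; the lines are processed one at a time, so nothing requires holding the whole file in memory.

```python
import re
from collections import Counter

def count_pileup_bases(ref, bases):
    """Count base calls in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read start marker plus its mapping-quality character
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases
            m = re.match(r"\d+", bases[i + 1:])
            if m:
                i += 1 + len(m.group()) + int(m.group())
                continue
        elif c in ".,":
            counts[ref.upper()] += 1  # match to reference (either strand)
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1    # mismatch (either strand)
        # "$" (read end), "*" (deleted base), "<" and ">" (skips) are ignored
        i += 1
    return counts

def iter_mpileup_counts(lines):
    """Yield (chrom, pos, Counter) for each pileup line, streaming."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        chrom, pos, ref, _depth, bases = fields[:5]
        yield chrom, int(pos), count_pileup_bases(ref, bases)

# Two made-up pileup lines standing in for piped samtools output
demo = [
    "chr1\t100\tG\t7\t..,^]Cc$-2AT.+1G*\tIIIIIII\n",
    "chr1\t101\tA\t3\t.,.\tIII\n",
]
results = list(iter_mpileup_counts(demo))
for chrom, pos, counts in results:
    print(chrom, pos, dict(counts))
```

To consume real piped output, `sys.stdin` could be passed to `iter_mpileup_counts` in place of the demo list.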


                    • #11
                      Originally posted by dpryan
                      No, python wouldn't suffer from the same issues. The simplest route would be to use pysam and just make a pileup of a sorted and indexed BAM file that way (you could also simply use "samtools mpileup" and pipe the output into a python script).

                      I'm not personally familiar with variant callers for this use case, I just knew they existed. You might post a new question asking about that.
                      Ok, thanks, I'll try it with python.
