
  • introduction to BED peak files

    I am new to bioinformatics, coming from an applied math background. I need some help understanding a couple of files and interpreting their data.

    I have ChIP-seq data from a paper I read. The data and abstract are on GEO.

    As you can see, it has 22 "samples", which, if I interpret correctly, means the data from the experiment. These files are peak files from the ChIP-seq experiment for particular histone modifications. If we pick one, let's say GSM721288 H3K4me1_MB, this file is supposed to tell me the peaks of the ChIP-seq experiment. In other words, it's supposed to tell me where the genome is enriched for this particular modification.

    If I look at the BED file, it has lines like this:
    Code:
    chr1    4696042 4696748
    chr1    4735272 4735958
    chr1    4736192 4736368
    chr1    4736438 4736693
    This obviously doesn't show me the peak. It doesn't show me how high the peak was or the number of tags/reads in the peak. I have two questions:

    1) Does the line chr1 4696042 4696748 imply that between those nucleosomes (706 bp of them), they all have this histone modification? If so, do I have any knowledge of the number of reads in this region? Would I have to go back to the raw data for this?

    2) OR does the data mean that EACH LINE is a read? If so, should I see overlapping regions? Furthermore, taking the first line, how can a read be 706 bp long?

    I would appreciate it if anyone could explain a simple workflow for what happens after a ChIP-seq experiment, and what exactly the BED files mean.

    Thanks.

  • #2
    Hey,

    The 22 samples are really 22 different experiments, in that each one refers to a different histone marker or transcription factor in a specific cell type.

    You should watch out for mixing up "nucleosome" and "nucleotide" by the way... The 706 bp are wrapped around the nucleosomes.

    There is a bunch of software that can be used to call peaks based on an alignment of the reads to a reference genome (one common one is MACS: http://liulab.dfci.harvard.edu/MACS/). What you're looking at are those calls, which in this case don't include the number of reads, just the genomic coordinates.

    So "chr1 4696042 4696748" are the coordinates of a single peak call stretching over that region. The nucleosomes in that region are enriched for the modification relative to background, but I don't think every nucleosome necessarily has the modification.

    Here is a good paper on ChIP-seq analysis pipelines:
    Mapping the chromosomal locations of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, as evidenced by the Encyclopedia of DNA Elements (ENCODE) Project. To this end, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard methodology. Mapping such protein-DNA interactions in vivo using ChIP-seq presents multiple challenges not only in sample preparation and sequencing but also for computational analysis. Here, we present step-by-step guidelines for the computational analysis of ChIP-seq data. We address all the major steps in the analysis of ChIP-seq data: sequencing depth selection, quality checking, mapping, data normalization, assessment of reproducibility, peak calling, differential binding analysis, controlling the false discovery rate, peak annotation, visualization, and motif analysis. At each step in our guidelines we discuss some of the software tools most frequently used. We also highlight the challenges and problems associated with each step in ChIP-seq data analysis. We present a concise workflow for the analysis of ChIP-seq data in Figure 1 that complements and expands on the recommendations of the ENCODE and modENCODE projects. Each step in the workflow is described in detail in the following sections.


    Cheers,
    Gavin
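To make the reading of those BED lines concrete, here is a minimal Python sketch that parses the four lines quoted in the first post and reports the width of each peak call. It assumes the standard BED convention (0-based start, end not included), so the width is simply end minus start. The names `Peak` and `parse_bed` are made up for this illustration and do not come from any library.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)

    @property
    def length(self) -> int:
        # Width of the called region in base pairs.
        return self.end - self.start


def parse_bed(text: str) -> List[Peak]:
    """Parse the first three columns of BED-formatted text into Peak records."""
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        peaks.append(Peak(chrom, int(start), int(end)))
    return peaks


# The four lines quoted in the original question.
BED_TEXT = """\
chr1    4696042 4696748
chr1    4735272 4735958
chr1    4736192 4736368
chr1    4736438 4736693
"""

peaks = parse_bed(BED_TEXT)
for p in peaks:
    print(f"{p.chrom}:{p.start}-{p.end}  width={p.length} bp")
```

Each line is one called region, not one read: the first region is 706 bp wide, far longer than a single short read. Nothing in a three-column BED file records peak height or read counts.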


    • #3
      Thank you. I understand it a lot better now. From the GEO website, is it possible to know the number of reads that were in this region? In other words, I was digging through the GEO website and got to this: http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR202856

      On the reads tab, I see stuff like
      Code:
      >gnl|SRA|SRR202856.1 ILLUMINA-EA6B0C_1:6:1:995:16675
      NGAGGTGTGGAAGGTGAGACTCTGAGCCCAGTGTCA
      And I've read that paper. I am going to read it again but I don't think it explained what the file formats are and stuff like that.
      Last edited by masfenix; 09-01-2014, 11:41 AM.


      • #4
        You would need files with the positions of the aligned reads (usually BAM or SAM files). It doesn't look like they have them. The text files (e.g. "GSM721286_MB_PolII_processed_replicate1.txt.gz") look like they might be what you need though; hopefully you can find a description of those files somewhere...

        Gavin
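As a rough illustration of why aligned read positions are needed, the sketch below counts how many reads overlap each peak interval. The two peak coordinates come from the BED lines in the first post; the read coordinates are invented for the example (36 bp each) and are not taken from the GEO data. The function name `count_reads_in_peaks` is hypothetical. In practice one would extract read positions from a BAM/SAM file with a dedicated tool such as samtools or bedtools rather than hand-written loops.

```python
def count_reads_in_peaks(peaks, reads):
    """Count reads overlapping each peak.

    Both peaks and reads are (chrom, start, end) tuples using half-open
    coordinates, so two intervals overlap when each starts before the
    other one ends.
    """
    counts = {peak: 0 for peak in peaks}
    for r_chrom, r_start, r_end in reads:
        for peak in peaks:
            p_chrom, p_start, p_end = peak
            if r_chrom == p_chrom and r_start < p_end and r_end > p_start:
                counts[peak] += 1
    return counts


# Peak calls from the BED file in the first post.
peaks = [("chr1", 4696042, 4696748), ("chr1", 4735272, 4735958)]

# Hypothetical aligned reads, made up for illustration only.
reads = [
    ("chr1", 4696030, 4696066),  # straddles the start of the first peak
    ("chr1", 4696500, 4696536),  # fully inside the first peak
    ("chr1", 4696748, 4696784),  # begins exactly at the peak end: no overlap
    ("chr1", 4735900, 4735936),  # fully inside the second peak
    ("chr2", 4696100, 4696136),  # same coordinates, different chromosome
]

counts = count_reads_in_peaks(peaks, reads)
for peak, n in counts.items():
    print(peak, n)
```

The nested loop is fine for a handful of intervals; for whole-genome data an interval index or a sorted sweep would be the usual choice.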


        • #5
          That looks like a raw (unmapped to a reference genome) read to me, although I'm not sure what the numbers correspond to, but they don't look like coordinates... Probably some sort of ID

          If you want to convince yourself you can use BLAT, which is a useful tool if you want to figure out where a sequence maps to a reference genome: http://genome.ucsc.edu/cgi-bin/hgBlat

          Gavin
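On the question of what the numbers in the read header mean: the sketch below splits the header quoted in post #3 into its parts. It assumes the header follows the older Illumina read-naming layout (instrument name, flowcell lane, tile, then x and y cluster coordinates on the tile), which is an assumption about this particular run rather than something stated on the GEO or SRA pages. The function name and dictionary keys are made up for the example.

```python
def parse_sra_fasta_header(header: str) -> dict:
    """Split an SRA FASTA header with an old-style Illumina read name.

    Assumed layout: >gnl|SRA|<run>.<spot> <instrument>:<lane>:<tile>:<x>:<y>
    """
    accession, read_name = header.lstrip(">").split(None, 1)
    run, spot = accession.split("|")[-1].rsplit(".", 1)
    instrument, lane, tile, x, y = read_name.strip().split(":")
    return {
        "run": run,                # SRA run accession
        "spot": int(spot),         # index of the read within the run
        "instrument": instrument,  # sequencer name (possibly with run number)
        "lane": int(lane),         # flowcell lane
        "tile": int(tile),         # tile within the lane
        "x": int(x),               # cluster x-coordinate on the tile
        "y": int(y),               # cluster y-coordinate on the tile
    }


header = ">gnl|SRA|SRR202856.1 ILLUMINA-EA6B0C_1:6:1:995:16675"
sequence = "NGAGGTGTGGAAGGTGAGACTCTGAGCCCAGTGTCA"

info = parse_sra_fasta_header(header)
print(info)
print("read length:", len(sequence))
```

Under that assumption the numbers describe where the cluster sat on the flowcell, not where the read maps in the genome, which is consistent with this being a raw, unaligned 36 bp read.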
