
  • problem on file format in ChIP-Seq data analysis

    Hi all,
    I have a question about file formats in ChIP-Seq data analysis.
    I only have aligned data in BED format. What should I do if I want to run the data through software that does not recognize BED, such as PeakSeq or QuEST? Is there any way to convert the BED file to ELAND or a similar format?
    Thanks a lot.

  • #2
    You could largely convert a BED format file to ELAND format. BED format files don't usually contain anything about mismatches to the reference sequence, so you'd have to fudge that. Also, you'd have to look up the sequence for each read, though that's trivial. Frankly, those are the biggest differences in the formats and I doubt that any of the peak finders actually care about those fields. So, in short, yeah, you could probably convert the file type enough to work with a one line command using awk.


    • #3
      Hi, dpryan
      That really does make sense. I tried fudging those fields and found that it has no effect on the called peaks.
      Thanks very much.


      • #4
        Dear sir

        I am working with ChIP-seq data. I have tried SISSRS, QuEST, MACS, and SICER.

        My problem is that I cannot recognize the file types. I have ChIP-seq data in several file formats, but I don't know whether all of these files can be used with all of the programs mentioned above. Could you please tell me what kind of data each of these is?

        I thought ChIP-seq data was always in the following format:

        chr4 130135336 130135360 U0 0 -
        chr1 110547319 110547343 U0 0 -
        chr10 63922216 63922240 U0 0 -
        chr2 71081880 71081904 U0 0 +

        I used SISSRS for such files (BED files).

        Now there are also other formats, such as:

        1. E2H2.aligned.txt

        chr13 81419432 81419468 + 205E9.6.559265 2
        chr11 44462781 44462817 + 205E9.6.559267 0
        chr1 89426606 89426642 - 205E9.6.559270 3
        chr12 103518323 103518359 - 205E9.6.559271 0
        chrX 128953935 128953971 - 205E9.6.559272 2
        chr19 4888146 4888182 - 205E9.6.559274 5
        chr4 137770387 137770423 + 205E9.6.559275 1

        2. densities.txt

        chr1 25 -1
        chr1 50 -1
        chr1 75 -1
        chr1 100 -1
        chr1 125 -1
        chr1 150 -1
        chr1 175 -1
        chr1 200 -1

        3. chip3034_multi_hg18.txt

        AGAGTGTTTCAAACCTGCTCCATGAA 13000 13
        AGACGAAGTCTCACTCTGTCACCCAG 13000 164
        ATTCCATTCCACTCTGTTCCATTCCA 11953 24
        AGTAACCCTTATTCTACTTAATAATG 13000 2
        ATGGTAGTTCACACCTATAATCCCCG 11953 11
        ATTGGCCAGATGCAGAGGCTCACACC 11953 9
        ATAGCACAAAGGCAATAACACTTAAT 10906 3

        I used this file format for QuEST.

        4. BED file

        chr1 454 489 CCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCC 0 + - - 0,0,255
        chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGTTCTC 0 - - - 255,0,0
        chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGCTCTC 0 - - - 255,0,0
        chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGTTCTC 0 - - - 255,0,0


        5. BAM files (these files will not open on my system)

        6. BED files

        6 38662156 38662189 +
        8 102050882 102050916 +
        16 16805607 16805640 -
        10 18950674 18950708 -
        4 52586623 52586657 -
        8 126508725 126508748 -
        5 83713731 83713758 +
        1 217224630 217224664 -
        2 234129500 234129531 -
        5 116295091 116295124 -
        17 36024302 36024336 -



        7. BED files

        chr1 564621 564687 . 0 . 5.575970 3.58854 -1
        chr1 569893 569962 . 0 . 7.441230 6.19321 -1
        chr1 712868 713455 . 0 . 11.857200 11.4429 -1
        chr1 713653 713670 . 0 . 7.278470 4.21542 -1
        chr1 713880 714756 . 0 . 87.115402 246.909 -1
        chr1 715081 715443 . 0 . 18.861601 21.5467 -1
        chr1 761030 763152 . 0 . 99.675797 201.571 -1


        8. peaks.txt

        chr1 6216808 6219103 985 186 5.29979577395856 799 1.34744732317805e-129
        chr6 158010381 158011325 686 65 10.5893955160332 621 1.43057401891788e-129
        chr5 33110401 33111074 644 51 12.7903624851984 593 1.50406065933793e-129
        chr3 197589215 197590103 652 54 12.2534188623185 598 3.17417576226315e-129
        chr3 150539977 150541729 852 129 6.62571157437829 723 3.84605198529492e-129

        9. BED file

        chr1 5319 6069
        chr1 15612 16329
        chr1 81077 82406
        chr1 227508 228733
        chr1 456299 456770
        chr1 477582 478232
        chr1 501635 501985
        chr1 584463 586213

        10. BED file

        chr14 68535052 68535087 Neg2 1 - 68535052 68535087 153,255,153
        chr10 72774109 72774144 Neg3 1 - 72774109 72774144 153,255,153
        chr6 163049829 163049864 Pos4 14 + 163049829 163049864 0,0,102
        chr7 144599649 144599684 Neg5 1 - 144599649 144599684 153,255,153
        chr9 106823345 106823380 Pos6 1 + 106823345 106823380 153,153,255


        • #5
          Originally posted by Pravara_@bioinformatics
          I thought ChIP-seq data was always in the following format:

          chr4 130135336 130135360 U0 0 -
          chr1 110547319 110547343 U0 0 -
          chr10 63922216 63922240 U0 0 -
          chr2 71081880 71081904 U0 0 +

          I used SISSRS for such files (BED files).

          As you're finding out, there are a LOT of different file formats. Most of these are interchangeable. BED format can have anywhere between 3 and 12 columns. You tend to find data with the first 6 columns, but if you find pre-aligned paired-end sequences, they may have only the first 3 (required) columns. Also, this is all pre-aligned data as raw data will tend to be in fastq format.
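As a rough illustration of the column rule above, here is a minimal Python sketch that checks whether a single line is plausibly BED. The function name `looks_like_bed` is hypothetical, and the checks (3 to 12 whitespace-separated fields, numeric start and end, a strand of `+`, `-` or `.` when a sixth column exists) are a heuristic, not a full validator. Passing this check does not tell you whether the file holds aligned reads or peak-caller output.

```python
def looks_like_bed(line):
    """Heuristic check that one line could be a BED record (not a full validator)."""
    fields = line.split()
    # BED allows 3 required columns plus up to 9 optional ones.
    if not 3 <= len(fields) <= 12:
        return False
    start, end = fields[1], fields[2]
    # Start and end must be non-negative integers.
    if not (start.isdigit() and end.isdigit()):
        return False
    if int(start) > int(end):
        return False
    # If a strand column is present, it should be "+", "-" or ".".
    if len(fields) >= 6 and fields[5] not in {"+", "-", "."}:
        return False
    return True


print(looks_like_bed("chr4 130135336 130135360 U0 0 -"))
print(looks_like_bed("chr1 25 -1"))
```

Lines from the densities file fail because the third field is not a valid end coordinate, while the six-column read lines quoted above pass.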

          I'm assuming you're getting these datasets from GEO. If so, the formats of the files are normally described there. Otherwise, #1-3 I'm not familiar with. #4 is a BED format file, you could use this in SISSRS like above. #5 is a BAM format file, that can be directly used in things like MACS and can also be converted to BED using bamtools if whatever program you prefer can't use BAM format. #6 looks like a modified BED format, it's actually close to the format I usually keep things in. I imagine you can put a "chr" in front of the number in the first column and add two columns of periods between columns 3 and 4 to make it a usable BED file. #7 and #8 look like the output of a peak finder. #9 is probably also the output of a peak finder, since the regions are quite broad and there's no strand information. #10 is another BED file. Presumably it was intended for visualization in the genome browser since someone bothered to fill in the itemRgb field.
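The fix suggested for #6 can be sketched in a few lines of Python. This assumes each input line has exactly four whitespace-separated fields (chromosome number, start, end, strand), as in the example shown; the function name `numbered_to_bed6` is made up for illustration. Some tools may expect a numeric score in column 5, in which case `0` could be substituted for the second period.

```python
def numbered_to_bed6(line):
    """Turn 'chrom start end strand' into a 6-column, tab-separated BED line."""
    chrom, start, end, strand = line.split()
    # Add the "chr" prefix only if it is missing.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    # Columns 4 (name) and 5 (score) are filled with "." placeholders.
    return "\t".join([chrom, start, end, ".", ".", strand])


print(numbered_to_bed6("6 38662156 38662189 +"))
```

To convert a whole file, apply the function to every non-empty line and write the results out one per line.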

          BTW, it's probably best to only compare results within a single peak caller. Otherwise, differences in peaks you see between datasets may be due solely to the different algorithms behind the peak callers. Also, it can sometimes be easier to just realign things yourself and thereby produce a BED or BAM format file, since that's pretty quick.


          • #6
            Technical or biological difference between the two datasets?

            Hello,

            I have two ChIP-seq samples for the same protein, one from embryonic stem (ES) cells and one from retinoic acid-induced cells. I obtained around 800 peaks in ES cells and around 7500 peaks in induced cells. The protocol, antibody, peak calling parameters (MACS), and the person who performed the experiments are all the same. The number of reads obtained is similar in both samples, with a similar level of background. The peaks in my new dataset show good enrichment compared to the old one at the same regions (~50% higher enrichment). I want to know whether this is a real biological difference or whether, because of deep sequencing, the new dataset shows good enrichment of tags that is not seen in the old dataset. How can I rule out technical problems, if there are any? Any suggestions are most welcome. Thanks.
