SEQanswers

Old 12-06-2011, 05:16 AM   #1
sp_wade
Junior Member
 
Location: China

Join Date: Mar 2010
Posts: 9
problem on file format in ChIP-Seq data analysis

Hi all,
I have a question about file formats in ChIP-Seq data analysis.
I only have aligned data in BED format. What should I do if I want to run the data through software that does not recognize BED format, such as PeakSeq or QuEST? Is there any way to convert the BED file to ELAND or a similar format?
Thanks a lot.
Old 12-06-2011, 02:58 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You could largely convert a BED format file to ELAND format. BED format files don't usually contain anything about mismatches to the reference sequence, so you'd have to fudge that. Also, you'd have to look up the sequence for each read, though that's trivial. Frankly, those are the biggest differences in the formats and I doubt that any of the peak finders actually care about those fields. So, in short, yeah, you could probably convert the file type enough to work with a one line command using awk.
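
A rough sketch of that kind of conversion, written in Python rather than awk so the fudged fields can be commented. The ELAND result column order used here (read name, sequence, match code, three hit counts, reference file, 1-based position, F/R strand, N-interpretation field) is an assumption and should be checked against whatever the peak caller actually expects. The read names, the all-N placeholder sequence, and the `U0`/`1 0 0` mismatch fields are made up, and the function names are hypothetical.

```python
def bed_to_eland_line(bed_line, read_index):
    """Turn one BED line into one ELAND-style result line (tab-separated)."""
    fields = bed_line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # BED column 6 holds the strand; default to "+" if it is absent.
    strand = fields[5] if len(fields) >= 6 else "+"

    read_length = end - start
    eland_fields = [
        f">read_{read_index}",   # made-up read name
        "N" * read_length,       # placeholder sequence (BED carries none)
        "U0",                    # fudged: unique match with 0 mismatches
        "1", "0", "0",           # fudged counts of 0/1/2-mismatch hits
        f"{chrom}.fa",           # assumed reference file naming
        str(start + 1),          # BED start is 0-based; ELAND assumed 1-based
        "F" if strand == "+" else "R",
        "..",                    # fudged N-interpretation field
    ]
    return "\t".join(eland_fields)


def convert(bed_lines):
    """Convert an iterable of BED lines, skipping blank lines."""
    return [
        bed_to_eland_line(line, i)
        for i, line in enumerate(bed_lines, start=1)
        if line.strip()
    ]
```

To use it on a file, pass the open BED file to `convert` and write the returned lines out. If the peak caller turns out to need real sequences, the placeholder would have to be replaced by a lookup in the reference genome.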
Old 12-06-2011, 05:16 PM   #3
sp_wade
Junior Member
 
Location: China

Join Date: Mar 2010
Posts: 9

Hi dpryan,
That really does make sense. I tried fudging those fields and found that it had no effect on the called peaks.
Thanks very much.
Old 12-12-2011, 03:54 AM   #4
Pravara_@bioinformatics
Junior Member
 
Location: PUNE

Join Date: Dec 2011
Posts: 2

Dear sir,

I am working with ChIP-seq data and have tried SISSRS, QuEST, MACS and SICER.

My problem is that I cannot identify my files. I have ChIP-seq data in several different file formats, and I don't know whether all of these files can be used with all of the software mentioned above. Could you please tell me what kind of data each of these is?

As far as I know, ChIP-seq data is always presented in the following format:

chr4 130135336 130135360 U0 0 -
chr1 110547319 110547343 U0 0 -
chr10 63922216 63922240 U0 0 -
chr2 71081880 71081904 U0 0 +

I used SISSRS for such files (BED files).


There are also other formats, such as:

1.E2H2.aligned.txt




chr13 81419432 81419468 + 205E9.6.559265 2
chr11 44462781 44462817 + 205E9.6.559267 0
chr1 89426606 89426642 - 205E9.6.559270 3
chr12 103518323 103518359 - 205E9.6.559271 0
chrX 128953935 128953971 - 205E9.6.559272 2
chr19 4888146 4888182 - 205E9.6.559274 5
chr4 137770387 137770423 + 205E9.6.559275 1

2.densities.txt

chr1 25 -1
chr1 50 -1
chr1 75 -1
chr1 100 -1
chr1 125 -1
chr1 150 -1
chr1 175 -1
chr1 200 -1

3.chip3034_multi_hg18.txt

AGAGTGTTTCAAACCTGCTCCATGAA 13000 13
AGACGAAGTCTCACTCTGTCACCCAG 13000 164
ATTCCATTCCACTCTGTTCCATTCCA 11953 24
AGTAACCCTTATTCTACTTAATAATG 13000 2
ATGGTAGTTCACACCTATAATCCCCG 11953 11
ATTGGCCAGATGCAGAGGCTCACACC 11953 9
ATAGCACAAAGGCAATAACACTTAAT 10906 3

I used this file format for QuEST.

4.bed file

chr1 454 489 CCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCC 0 + - - 0,0,255
chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGTTCTC 0 - - - 255,0,0
chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGCTCTC 0 - - - 255,0,0
chr1 512 547 TTTCGGTGGTACTCTGAAGGCGGAGCACAGTTCTC 0 - - - 255,0,0


5.bam files (these files will not open on my system)

6.bed files



6 38662156 38662189 +
8 102050882 102050916 +
16 16805607 16805640 -
10 18950674 18950708 -
4 52586623 52586657 -
8 126508725 126508748 -
5 83713731 83713758 +
1 217224630 217224664 -
2 234129500 234129531 -
5 116295091 116295124 -
17 36024302 36024336 -



7.bed files

chr1 564621 564687 . 0 . 5.575970 3.58854 -1
chr1 569893 569962 . 0 . 7.441230 6.19321 -1
chr1 712868 713455 . 0 . 11.857200 11.4429 -1
chr1 713653 713670 . 0 . 7.278470 4.21542 -1
chr1 713880 714756 . 0 . 87.115402 246.909 -1
chr1 715081 715443 . 0 . 18.861601 21.5467 -1
chr1 761030 763152 . 0 . 99.675797 201.571 -1


8.peaks.txt

chr1 6216808 6219103 985 186 5.29979577395856 799 1.34744732317805e-129
chr6 158010381 158011325 686 65 10.5893955160332 621 1.43057401891788e-129
chr5 33110401 33111074 644 51 12.7903624851984 593 1.50406065933793e-129
chr3 197589215 197590103 652 54 12.2534188623185 598 3.17417576226315e-129
chr3 150539977 150541729 852 129 6.62571157437829 723 3.84605198529492e-129

9.bed file

chr1 5319 6069
chr1 15612 16329
chr1 81077 82406
chr1 227508 228733
chr1 456299 456770
chr1 477582 478232
chr1 501635 501985
chr1 584463 586213

10.bed file


chr14 68535052 68535087 Neg2 1 - 68535052 68535087 153,255,153
chr10 72774109 72774144 Neg3 1 - 72774109 72774144 153,255,153
chr6 163049829 163049864 Pos4 14 + 163049829 163049864 0,0,102
chr7 144599649 144599684 Neg5 1 - 144599649 144599684 153,255,153
chr9 106823345 106823380 Pos6 1 + 106823345 106823380 153,153,255
Old 12-12-2011, 05:14 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Quote:
Originally Posted by Pravara_@bioinformatics
As far as I know, ChIP-seq data is always presented in the following format:

chr4 130135336 130135360 U0 0 -
chr1 110547319 110547343 U0 0 -
chr10 63922216 63922240 U0 0 -
chr2 71081880 71081904 U0 0 +

I used SISSRS for such files (bed files)
As you're finding out, there are a LOT of different file formats. Most of these are interchangeable. BED format can have anywhere between 3 and 12 columns. You tend to find data with the first 6 columns, but if you find pre-aligned paired-end sequences, they may have only the first 3 (required) columns. Also, this is all pre-aligned data as raw data will tend to be in fastq format.
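
A quick way to sort a pile of unknown files is to check whether their lines even have the shape of BED. The sketch below is only a plausibility check under simple assumptions (whitespace-separated columns, 3 to 12 of them, integer start and end with start not after end, strand in column 6 if present). The function name `looks_like_bed` is made up for illustration, and passing the check does not tell you whether a file holds reads or called peaks.

```python
def looks_like_bed(line):
    """Return True if a line has the basic shape of a BED record."""
    fields = line.split()
    # BED has 3 required columns and at most 12.
    if not 3 <= len(fields) <= 12:
        return False
    start, end = fields[1], fields[2]
    # Columns 2 and 3 must be non-negative integers.
    if not (start.isdigit() and end.isdigit()):
        return False
    if int(start) > int(end):
        return False
    # If there is a strand column, it should be "+", "-" or ".".
    if len(fields) >= 6 and fields[5] not in {"+", "-", "."}:
        return False
    return True
```

On the samples posted above, the six-column lines pass, while the densities lines (`chr1 25 -1`) and the sequence-first lines from the multi_hg18 file do not.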

I'm assuming you're getting these datasets from GEO. If so, the formats of the files are normally described there. Otherwise, #1-3 I'm not familiar with. #4 is a BED format file, you could use this in SISSRS like above. #5 is a BAM format file, that can be directly used in things like MACS and can also be converted to BED using bamtools if whatever program you prefer can't use BAM format. #6 looks like a modified BED format, it's actually close to the format I usually keep things in. I imagine you can put a "chr" in front of the number in the first column and add two columns of periods between columns 3 and 4 to make it a usable BED file. #7 and #8 look like the output of a peak finder. #9 is probably also the output of a peak finder, since the regions are quite broad and there's no strand information. #10 is another BED file. Presumably it was intended for visualization in the genome browser since someone bothered to fill in the itemRgb field.
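
For #6, the fix described above can be sketched as follows. This assumes every line has exactly four whitespace-separated columns (chromosome number, start, end, strand); the function name `to_bed6` is hypothetical. The name and score placeholders default to periods as suggested, but some tools may want a numeric score, so both can be overridden.

```python
def to_bed6(line, name=".", score="."):
    """Convert a 'chrom start end strand' line into a 6-column BED line."""
    chrom, start, end, strand = line.split()
    # Add the "chr" prefix unless it is already there.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    # Insert placeholder name and score columns between end and strand.
    return "\t".join([chrom, start, end, name, score, strand])
```

For example, `to_bed6("6 38662156 38662189 +")` gives a tab-separated line with `chr6`, the two coordinates, two periods and `+`.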

BTW, it's probably best to only compare results within a single peak caller. Otherwise, differences in peaks you see between datasets may be due solely to the different algorithms behind the peak callers. Also, it can sometimes be easier to just realign things yourself and thereby produce a BED or BAM format file, since that's pretty quick.
Old 03-08-2012, 06:43 AM   #6
sikidiri
Member
 
Location: france

Join Date: May 2011
Posts: 13
technical or biological difference between the two datasets?

Hello,

I have two ChIP-seq samples for the same protein, one in embryonic stem (ES) cells and one in retinoic acid induced cells. I obtained around 800 peaks in ES cells and around 7500 peaks in induced cells. The protocol, antibody, peak calling parameters (MACS) and the person who did the experiments are all the same. The number of reads obtained in both samples is similar, with a similar level of background. When I look at peaks in my new dataset, they show good enrichment compared to the old one at the same regions (~50% higher enrichment). I want to know whether this is a real biological difference, or whether, because of deep sequencing, the new dataset shows good enrichment of tags that is not seen in the old dataset. How can I rule out technical problems, if there are any? Any suggestions are most welcome. Thanks

Tags
chip-seq, data analysis, format
