SEQanswers

Old 01-23-2009, 07:11 PM   #1
forevermark4
Junior Member
 
Location: Europe

Join Date: Jan 2009
Posts: 6
How to convert a .txt file to .bed or .gff; how can we use ChIP-seq data in R?

Hi Everyone,

My name is Yogesh Kumar, and I am new to Illumina/Solexa ChIP-seq analysis (with the CisGenome software). I only need help converting this dataset into a format supported by the UCSC browser or by CisGenome; I can do the rest of the analysis myself. The links to my datasets are below.

http://www.well.ox.ac.uk/htseq/qXZxn...s_1_export.txt

http://www.well.ox.ac.uk/htseq/qXZxn...s_3_export.txt

Some information about these datasets:

The “s_#” part of the file name gives the lane number used in the flow cell.
The “export” format is defined by multiple tab-separated columns.
For the definition of each column see below.


EXPORT file definitions

Export files are generated by the GERALD step* of the Illumina pipeline. The ELAND program aligns each read from each lane to the reference genome (in our case the Mus musculus genome, UCSC release mm8).
When a match is found, its position on the genome is reported in the “s_#_export.txt” file (only chromosome, start position and strand, F or R).
If no match is found, “NM” (no match) is reported.
There is a line for each read, whether it aligns or not, and multiple lines for the same read if it aligns in multiple positions.
You can either parse these files yourself (R/Bioconductor, Perl script, …) or use publicly available software (e.g. CisGenome).
The file contains information on the physical position of the reads in the flow cell, the nucleotide sequence of the read itself, a string of quality calls for each nucleotide in the read (a code developed by Illumina), various flags and the genomic position (if found).
For many purposes not all of this information is needed.

*The Illumina pipeline consists of several steps, starting with cluster recognition, passing through base calling and ending with the alignment of the reads to the reference genome.
Not all fields are relevant to a single-read analysis.
1. Machine (Parsed from Run Folder name)
2. Run Number (Parsed from Run Folder name)
3. Lane
4. Tile
5. X Coordinate of cluster
6. Y Coordinate of cluster
7. Index string (Blank for a non-indexed run)
8. Read number (1 or 2 for paired-read analysis, blank for a single-read analysis)
9. Read
10. Quality string: In symbolic ASCII format (ASCII character code = quality value + 64) by default (Set QUALITY_FORMAT --numeric in the GERALD config file for numeric values)
11. Match chromosome: Name of chromosome match OR code indicating why no match resulted
12. Match Contig: Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
13. Match Position: Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
14. Match Strand: “F” for forward, “R” for reverse (Blank if no match found)
15. Match Descriptor: Concise description of alignment (Blank if no match found)
• A numeral denotes a run of matching bases
• A letter denotes substitution of a nucleotide: for a 35 base read, “35” denotes an exact match and “32C2” denotes substitution of a “C” at the 33rd position
16. Single-Read Alignment Score: Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read (Blank if no match found)
17. Paired-Read Alignment Score: Alignment score of a paired read and its partner, taken as a pair (Blank for single-read analysis)
18. Partner Chromosome: Name of the chromosome if the read is paired and its partner aligns to another chromosome (Blank for single-read analysis)
19. Partner Contig: Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs (Blank for single-read analysis)
20. Partner Offset: If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner (Blank for single-read analysis)
21. Partner Strand: To which strand did the partner of the paired read align? “F” for forward, “R” for reverse (Blank if no match found, blank for single-read analysis)
22. Filtering: Did the read pass quality filtering? “Y” for yes, “N” for no
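
Going by the field list above, an export-to-BED conversion could be sketched roughly as follows in Python. This is only an illustration, not a tested pipeline tool: the function names `export_line_to_bed` and `convert` are made up here, the BED score is simply set to 0, reads that failed filtering are dropped by choice, and the stripping of a ".fa" suffix from the chromosome name is an assumption about how the reference files were named. It also assumes ungapped alignments, so the read length gives the end coordinate.

```python
import io

# 1-based column numbers, taken from the export field list above.
COL_READ, COL_CHROM, COL_POS, COL_STRAND, COL_FILTER = 9, 11, 13, 14, 22


def export_line_to_bed(line):
    """Return a BED6 line for one export record, or None if it is unusable."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 22:
        return None
    read = fields[COL_READ - 1]
    chrom = fields[COL_CHROM - 1]
    pos = fields[COL_POS - 1]
    strand = fields[COL_STRAND - 1]
    passed = fields[COL_FILTER - 1]
    # Unaligned reads have a code such as "NM" in field 11 and a blank position.
    if passed != "Y" or not pos.isdigit() or strand not in ("F", "R"):
        return None
    start = int(pos) - 1      # export positions are 1-based, BED is 0-based
    end = start + len(read)   # BED end is exclusive; assumes an ungapped match
    # Assumption: the chromosome may be reported as a file name like "chr1.fa".
    if chrom.endswith(".fa"):
        chrom = chrom[:-3]
    name = "_".join(fields[0:6])  # machine, run, lane, tile, x, y
    bed_strand = "+" if strand == "F" else "-"
    return "\t".join([chrom, str(start), str(end), name, "0", bed_strand])


def convert(in_handle, out_handle):
    """Write one BED line per usable read; return the number written."""
    written = 0
    for line in in_handle:
        bed = export_line_to_bed(line)
        if bed is not None:
            out_handle.write(bed + "\n")
            written += 1
    return written
```

To use it on a file, open the export file for reading and a .bed file for writing and pass both handles to `convert`.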

Do you have any idea how we can convert these files to WIG, BED or GFF for the UCSC browser? Any one format is sufficient for me. Otherwise, how can we convert a .txt file to a .BED file? I am planning to use the CisGenome software (two-sample analysis) to look at the data mapped in its original genomic context.

It would be a great favour to me.

Thanks
Yogesh Kumar
Old 01-23-2009, 10:52 PM   #2
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,355

Hi Yogesh,

Your URLs are badly formed.
Old 01-24-2009, 12:06 AM   #3
forevermark4
Junior Member
 
Location: Europe

Join Date: Jan 2009
Posts: 6

Hi Everyone,

My name is Yogesh Kumar, and I am new to Illumina/Solexa ChIP-seq analysis (with the CisGenome software). I only need help converting this dataset into a format supported by the UCSC browser or by CisGenome; I can do the rest of the analysis myself. The links to my datasets are below.

http://www.well.ox.ac.uk/htseq/qXZxnawHREhej1c2lFut

http://www.well.ox.ac.uk/htseq/qXZxn...s_1_export.txt

http://www.well.ox.ac.uk/htseq/qXZxn...s_3_export.txt

Old 01-24-2009, 08:29 AM   #4
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,355

OK Yogesh, you don't need to repost the same message. Just click "edit" and fix the URLs.
Old 01-24-2009, 11:23 AM   #5
graveley
Member
 
Location: Hartford, CT

Join Date: Jan 2009
Posts: 11

Dear Yogesh,

We do this by writing a Perl script that reads in the alignment information and writes a new file in the appropriate format. I would send you what we use, but we do not use export.txt files. We are currently doing alignments with Bowtie and then converting the output to .gff and .wig files.

Brent
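
For readers without such a script, here is a minimal sketch of the same idea. It is not Brent's script (which was not posted), and it is written in Python rather than Perl. It assumes Bowtie's default tab-separated output, where the columns are read name, strand, reference name, 0-based leftmost offset, read sequence, qualities and further fields. The function name `bowtie_to_gff`, the source and feature labels, and the `Name=` attribute are illustrative choices only.

```python
def bowtie_to_gff(line, source="bowtie", feature="read"):
    """Turn one line of default Bowtie output into a GFF line (or None)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        return None
    name, strand, ref, offset, seq = fields[0], fields[1], fields[2], fields[3], fields[4]
    start = int(offset) + 1        # Bowtie offsets are 0-based, GFF is 1-based
    end = int(offset) + len(seq)   # GFF end is inclusive
    # Score and frame are left as "." because Bowtie's default output
    # has no direct equivalent for them.
    return "\t".join([ref, source, feature, str(start), str(end),
                      ".", strand, ".", "Name=" + name])
```

Applying this to every line of the alignment file and writing the non-None results gives a file that can be loaded as a GFF track; the WIG case needs a pileup step on top of this.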
Old 01-26-2009, 02:02 AM   #6
forevermark4
Junior Member
 
Location: Europe

Join Date: Jan 2009
Posts: 6

Hi Brent,

Thanks. If you don't mind, could you send me that Perl script, so that I can try converting my .txt file to .gff or .wig format here? I mean the source code of the Perl script you use to convert .txt files to .gff or .wig, or the alignment script. I already know how to write Perl scripts that convert .fasta to .embl or other formats.
Old 01-27-2009, 03:53 PM   #7
Agent47
Junior Member
 
Location: Philadelphia

Join Date: Jan 2009
Posts: 3

Hi Brent,
I am stuck in almost the same position as Yogesh.
I am using MAQ to align Solexa data, but I am not able to convert the output into .WIG or .GFF format. If you can give me some directions for this, it would be a great help.
Thanks!

Arpit
Old 01-30-2009, 03:31 PM   #8
tabascoj
Junior Member
 
Location: New Haven, CT, USA

Join Date: Oct 2008
Posts: 1
Bowtie to .wig

Brent,
I would really appreciate any Perl suggestions for getting the Bowtie alignment into a WIG file (which I intend to use in GBrowse). I have no problem with the Perl conversion of tabulated documents (e.g. Bowtie to GFF), but I need help getting a pileup and getting the values into the WIG file.

Thanks very much.
Joe
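
One way the pileup step could look is sketched below, in Python rather than Perl. It assumes the alignments have already been reduced to (chromosome, 0-based start, exclusive end) tuples, for example from the Bowtie offset and read length. The names `pileup` and `write_wig` are hypothetical, and writing one variableStep line per covered base is the simplest layout, not the most compact one.

```python
from collections import defaultdict


def pileup(intervals):
    """Yield (chrom, start0, end0, depth) runs of constant, non-zero depth."""
    deltas = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in intervals:
        deltas[chrom][start] += 1   # coverage goes up where a read starts
        deltas[chrom][end] -= 1     # and down just past its last base
    for chrom in sorted(deltas):
        depth = 0
        prev = None
        for pos in sorted(deltas[chrom]):
            if prev is not None and depth > 0:
                yield chrom, prev, pos, depth
            depth += deltas[chrom][pos]
            prev = pos


def write_wig(intervals, out, track_name="coverage"):
    """Write per-base read depth as a variableStep WIG track."""
    out.write('track type=wiggle_0 name="%s"\n' % track_name)
    current = None
    for chrom, start, end, depth in pileup(intervals):
        if chrom != current:
            out.write("variableStep chrom=%s\n" % chrom)
            current = chrom
        for pos in range(start + 1, end + 1):   # WIG positions are 1-based
            out.write("%d %d\n" % (pos, depth))
```

Counting only start and end events keeps memory proportional to the number of reads rather than the chromosome length; positions with zero coverage are simply left out of the track.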
Old 01-30-2009, 03:40 PM   #9
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

I feel silly promoting my own software, but MAQ-to-WIG and ELAND-to-WIG conversions are both handled well by FindPeaks.

http://vancouvershortr.wiki.sourcefo...indPeaksManual

You may not need the ChIP-Seq features, but you can certainly just use it for a quick conversion. (There are converters in the package for creating BED files as well.)

As for Bowtie, you can always have it produce a .map file and then do the same conversion.

Good luck.
__________________
The more you know, the more you know you don't know. —Aristotle
Old 02-05-2009, 07:22 AM   #10
jperin
Member
 
Location: Philadelphia

Join Date: Feb 2009
Posts: 10

This may be the wrong place to ask, but I've just tried FindPeaks for creating our WIG files and it works great. The only problem is that the WIG file appears to be offset to the very beginning of the chromosome. Our reference sequence is only a small piece of chromosome 10 in this case. It appears that at some stage of the MAQ alignment a tag isn't set properly, so the WIG file carries the label "hg18_dna" instead of "chr10", and the start position for the first base pair in the first header is 1 instead of 17M-something, where it should be.

The FASTA reference file has the correct tag in it, with the right reference, but at some stage this doesn't get passed to FindPeaks and the offset is not correctly inserted. I can see the beautiful WIG image in a browser, but it displays at the beginning of chr10. I had to manually change hg18_dna to chr10 for it to work at all, but changing the offset isn't as simple, since each of the multiple headers has its own offset position and it would be hard to calculate by hand. I'm assuming there's probably a simple way to fix this?

Thanks for any advice.
Juan
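
If the fix has to be made on the WIG file itself, a small script can shift every header, so the per-header offsets do not need to be worked out by hand. The sketch below is in Python and is not part of FindPeaks. The function name `shift_wig` is made up, and the offset is assumed to be the 1-based genomic coordinate of the first base of the reference fragment minus 1. It renames the chromosome and moves `start=` in fixedStep headers as well as the positions on variableStep data lines.

```python
import re


def shift_wig(lines, new_chrom, offset):
    """Yield WIG lines with the chromosome renamed and coordinates shifted."""
    mode = None
    for line in lines:
        text = line.rstrip("\r\n")
        if text.startswith(("fixedStep", "variableStep")):
            # Declaration line: remember the mode, fix chrom= and start=.
            mode = text.split()[0]
            text = re.sub(r"chrom=\S+", lambda m: "chrom=" + new_chrom, text)
            text = re.sub(r"start=(\d+)",
                          lambda m: "start=%d" % (int(m.group(1)) + offset),
                          text)
        elif mode == "variableStep" and text[:1].isdigit():
            # variableStep data line: "position value"
            parts = text.split(None, 1)
            if len(parts) == 2:
                text = "%d %s" % (int(parts[0]) + offset, parts[1])
        # fixedStep data lines and track/browser lines pass through unchanged.
        yield text + "\n"
```

For a fragment whose first base sits at, say, position 17,000,001 of chr10 (an invented number), the call would use `new_chrom="chr10"` and `offset=17000000`.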
Old 02-06-2009, 06:07 AM   #11
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138

Hi,

I tested several published ChIP-seq applications, so I would like to mention the spp package for R (http://compbio.med.harvard.edu/Supplements/ChIP-seq/), which was the only piece of software that directly produced reasonably convincing output on two-sample comparison data (I also tested PeakSeq, MACS and CisGenome).
Old 02-06-2009, 10:41 AM   #12
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

I'm adding several control modes to FindPeaks this week. I hope you'll revisit that list at some point (=

Anthony
__________________
The more you know, the more you know you don't know. —Aristotle
Old 02-07-2009, 02:54 AM   #13
mudshark
Senior Member
 
Location: Munich

Join Date: Jan 2009
Posts: 138

Hi Anthony,
AFAIK FindPeaks does not (yet?) support two-sample analysis, i.e. IP vs. input. Is that correct?
T.
Old 02-07-2009, 05:50 AM   #14
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

That's correct. The code is currently in development, but should be ready shortly. (I hate saying that about software, but a lot of the code has already been written.)

Cheers,
Anthony
__________________
The more you know, the more you know you don't know. —Aristotle
Old 02-11-2009, 11:00 AM   #15
alperyilmaz
Member
 
Location: Columbus, OH

Join Date: Feb 2009
Posts: 11
R package from Bioconductor

There's another R package to be released, which will be available through Bioconductor. It was mentioned in a recent workshop. The workshop material can be viewed here.
Old 06-26-2009, 02:17 AM   #16
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
MAQ to GFF2 format

Hi,

I know this is quite an old thread, but it is useful for me at the moment.
I am looking for a method to convert a MAQ alignment (to the human reference genome) from MeDIP-seq into GFF2 format for Batman, but have had no luck so far. Is there a script already available for this, or a conversion tool that does it?

GFF2 example:
col1 col2 col3 col4 col5 col6 col7 col8 col9
Chr1 Homo_build36 Reference 1 45 . + 1 Sequence1

How does one define column 6 (score) and column 8 (phase) from the MAQ alignment file?

Help appreciated

L
Old 06-26-2009, 05:12 AM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,153

Layla,

For the score I would use the mapping quality (column 7 of the MAQ .aln file). Phase is meaningless in this context; it only applies to coding sequence features. In your case just put a '.' in the phase column (column 8 in GFF2; column 7 is the strand).
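
Put together, the conversion could look something like the Python sketch below. It assumes the tab-separated text produced by `maq mapview`, with read name, chromosome, 1-based position and strand in columns 1 to 4, mapping quality in column 7 and read length in column 14. The function name `mapview_to_gff2` and the default source and feature labels are placeholders, since the exact labels Batman expects are not stated in this thread.

```python
def mapview_to_gff2(line, source="maq", feature="read"):
    """Turn one line of maq mapview text output into a GFF2 line (or None)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 14:
        return None
    name = fields[0]
    chrom = fields[1]
    start = int(fields[2])             # mapview positions are already 1-based
    strand = fields[3]
    mapq = fields[6]                   # column 7: mapping quality, used as score
    end = start + int(fields[13]) - 1  # column 14: read length; end is inclusive
    # Column 8 (phase/frame) is "." because reads are not coding features.
    return "\t".join([chrom, source, feature, str(start), str(end),
                      mapq, strand, ".", name])
```

Running this over every line of the mapview output and writing the results, one per line, gives the nine GFF2 columns in the order of the example above.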
Old 06-28-2009, 09:27 AM   #18
joseph
Member
 
Location: ca

Join Date: Feb 2008
Posts: 39

Quote:
Originally Posted by graveley View Post
Dear Yogesh,

We do this by writing a perl script that reads in the alignment information and writes a new file in the appropriate format. I would send you what we use, but we do not use export.txt files. We are currently doing alignments with Bowtie and then converting the output to .gff and .wig files.

Brent
Hi Brent,
I would really appreciate it if you could send me your Perl scripts for converting Bowtie output to .gff and .wig files.
My e-mail: [email protected]
Thanks
Joseph
Old 07-16-2009, 05:48 PM   #19
Ka123$
Member
 
Location: MD

Join Date: Jul 2009
Posts: 27
Bowtie to BED files

Hi Brent,
Could you also help me with the Perl script to convert Bowtie output to BED? I am not a Perl person. Also, can FindPeaks convert Bowtie alignment files to a CisGenome-compatible file?
Thanks
Ka
Old 07-16-2009, 05:50 PM   #20
Ka123$
Member
 
Location: MD

Join Date: Jul 2009
Posts: 27

Hi Brent,
My email is [email protected]. Thanks.
What format are the (GERALD) aligned output files? I suppose "aligned files" and "mapped files" are the same thing, just two terms for it?