
**GenoMax** · 10-16-2013, 03:43 AM

When you refer to the "eland extended" files, which exact files do you mean?

Are the files named "s_*_eland_multi.txt" or "s_*_eland_results.txt"?

As I recall, the "eland_extended" analysis referred to alignments of reads longer than 32 bp (that seems odd now, but it was the state of the art in 2008) and should have produced files with "multi" in the name.

**danielecook** · 10-16-2013, 10:42 AM

I suspect the names have changed - here is a sample of the file:

>HWUSI-EAS107_0008:5:1:18780:1121#0/1 GTGGGGCAGTACCTCTCCTGCAGCTGTTGTTAGTGG 1:1:0 chr19.fa:16326479F30G2CA1,16326543F36
>HWUSI-EAS107_0008:5:1:18802:1128#0/1 TGCCATAGCCTTCCCATGATGCATACTTAGCCTCAC 1:0:0 chr18.fa:24911372R36
>HWUSI-EAS107_0008:5:1:18874:1125#0/1 AAACCCCCACAGTAACAACAGCTCCTCTGGCCCCAA 1:0:0 chr3.fa:130007959R35G
>HWUSI-EAS107_0008:5:1:19029:1128#0/1 AGATTCCTTGACTGGTCTATTGACATTGGCATATTT 1:0:0 chr19.fa:49024613R36
>HWUSI-EAS107_0008:5:1:19073:1118#0/1 TTACCATCCCCTCTATTAATCACATGGAACCTGATA 255:33:1 -
>HWUSI-EAS107_0008:5:1:19096:1122#0/1 CATATGCCTTTAATCCCAGCACTTGGGAGGCAGAGG 148:255:255 -
>HWUSI-EAS107_0008:5:1:19214:1128#0/1 TCAACACACAACTGTATGCTAATGTTCTGATTAATC 1:0:0 chr11.fa:3052382F32T3
>HWUSI-EAS107_0008:5:1:19252:1121#0/1 CTGGCTAGGCAGTCTAGCCCAGTCTGTGAGATCCCG 1:0:0 chr3.fa:138169276F36
>HWUSI-EAS107_0008:5:1:19319:1123#0/1 GGGCTGCTACTCTCACAGAGTCCTGGGGTGGTAGGG 1:0:0 chr11.fa:71607540R36
>HWUSI-EAS107_0008:5:1:19357:1120#0/1 GGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC NM -
>HWUSI-EAS107_0008:5:1:19415:1126#0/1 ATGGACCCAACAGCCTTCCACACTACAGAAGGATGA 1:0:0 chr15.fa:86824347R36
>HWUSI-EAS107_0008:5:1:19463:1119#0/1 GGGTGTGTTTTAGTTCACAATTCCAAGTTGTAGTCC 1:0:0 chr7.fa:114170220R36
>HWUSI-EAS107_0008:5:1:19527:1128#0/1 TGGGGAGAGGGAAGAGGAATGGCAGCAAGGCACGCC 1:0:0 chr4.fa:114142946F36
>HWUSI-EAS107_0008:5:1:19773:1128#0/1 AGATGCGGTCCCAGTATCAACTAGTTAGTATAGACA 1:0:0 chr7.fa:68841823R36

The file name simply ends in '.extended'

**GenoMax** · 10-16-2013, 11:10 AM

Excerpt from your example:

>HWUSI-EAS107_0008:5:1:19214:1128#0/1 TCAACACACAACTGTATGCTAATGTTCTGATTAATC 1:0:0 chr11.fa:3052382F32T3
>HWUSI-EAS107_0008:5:1:19252:1121#0/1 CTGGCTAGGCAGTCTAGCCCAGTCTGTGAGATCCCG 1:0:0 chr3.fa:138169276F36
>HWUSI-EAS107_0008:5:1:19319:1123#0/1 GGGCTGCTACTCTCACAGAGTCCTGGGGTGGTAGGG 1:0:0 chr11.fa:71607540R36
>HWUSI-EAS107_0008:5:1:19357:1120#0/1 GGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC NM -
>HWUSI-EAS107_0008:5:1:19415:1126#0/1 ATGGACCCAACAGCCTTCCACACTACAGAAGGATGA 1:0:0 chr15.fa:86824347R36

Description of the "eland_multi" format (your example looks slightly different in the "matches found" section):

1. Sequence name
2. Sequence
3. Either NM, QC, RM or the following:

• NM—No match found
• QC—No matching done: QC failure (too many Ns)
• RM—No matching done: repeat masked (may be seen if repeatFile.txt was specified)
• U0—Best match found was a unique exact match
• U1—Best match found was a unique 1-error match
• U2—Best match found was a unique 2-error match
• R0—Multiple exact matches found
• R1—Multiple 1-error matches found, no exact matches
• R2—Multiple 2-error matches found, no exact or 1-error matches

4. x:y:z where x, y, and z are the number of exact, single-error, and 2-error matches found
5. Matches found (blank if no matches found), e.g. BAC_plus_vector.fa:163022R1,170128F2,E_coli.fa:3909847R1
This says there are two matches to BAC_plus_vector.fa: one in the reverse direction starting at position 163022 with one error, and one in the forward direction starting at position 170128 with two errors. There is also a single-error match to E_coli.fa. [Note: Your example looks different in this section; otherwise the format seems to align well.]
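
To make the field layout above concrete, here is a minimal Python sketch that splits one line into its fields and breaks the "matches found" field into individual hits. It is only an illustration based on the description and the sample in this thread: the function names are made up, it assumes whitespace- or tab-separated fields, and it treats everything after the F/R strand letter as an opaque descriptor (an error count such as "1" in the documented format, or a string such as "30G2CA1" in the sample above).

```python
import re

# Position, strand (F or R), then an uninterpreted descriptor ("1", "36", "30G2CA1", ...).
_HIT = re.compile(r"^(\d+)([FR])(\w*)$")


def parse_matches(field):
    """Split the 'matches found' field into (reference, position, strand, descriptor) tuples."""
    hits = []
    if field in ("", "-"):
        return hits  # no matches reported
    ref = None
    for item in field.split(","):
        # A hit without a "name:" prefix belongs to the most recently named reference.
        if ":" in item:
            ref, _, item = item.rpartition(":")
        m = _HIT.match(item)
        if m is None:
            raise ValueError(f"unrecognised hit: {item!r}")
        hits.append((ref, int(m.group(1)), m.group(2), m.group(3)))
    return hits


def parse_line(line):
    """Return (read name, sequence, status/counts field, list of hits) for one line."""
    parts = line.split()
    name = parts[0].lstrip(">")
    matches = parse_matches(parts[3]) if len(parts) > 3 else []
    return name, parts[1], parts[2], matches


example = "BAC_plus_vector.fa:163022R1,170128F2,E_coli.fa:3909847R1"
for hit in parse_matches(example):
    print(hit)
```

Keeping the descriptor as a plain string avoids guessing at the exact mismatch encoding, which differs between the documented example and the file shown earlier in the thread.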

Problem is you do not have information about the quality values in this file (there should be a corresponding s_*_sequence.txt file in fastq format). Is there a chance you can get that?

**danielecook** · 10-16-2013, 11:45 AM

Unfortunately - this is the only file I was given to work with. I guess the quality scores were truncated before I got to this? Not sure...

Thanks for your help. Do you know if there are any tools out there for working with this?

**GenoMax** · 10-16-2013, 11:59 AM

"Eland_multi.txt" files never had the quality scores in them. Those would be included in the corresponding "s_*_export.txt" or "s_*_sorted.txt" files (I will assume that those are unavailable).

At a minimum you can create a multi-fasta sequence file (from the file you have). Not having the quality values would make it tough to judge quality of the basecalls.
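
A rough sketch of that multi-fasta conversion in Python, assuming the layout shown in the sample earlier in the thread (read name with a leading ">" in the first column, sequence in the second, separated by whitespace or tabs). The function name is hypothetical.

```python
def extended_to_fasta(lines):
    """Yield FASTA records (header line + sequence line) from ELAND-extended-style lines."""
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].startswith(">"):
            continue  # skip blank or unexpected lines
        # The read name already starts with ">", so it can serve as the FASTA header as-is.
        yield f"{parts[0]}\n{parts[1]}\n"


sample = [
    ">HWUSI-EAS107_0008:5:1:19357:1120#0/1\tGGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC\tNM\t-\n",
]
print("".join(extended_to_fasta(sample)), end="")
```

For a real file, the generator can be fed an open file handle and passed to `writelines()` on an output handle. The result contains sequences only, so it can be realigned but carries no base qualities.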

Ultimately what are you trying to do with this data?

**danielecook** · 10-16-2013, 12:25 PM

This was a project that was passed on to me. It's a ChIP-seq experiment. We are hoping to compare our data against another group's ChIP-seq work, which was comparable but in a different cell line. We used Illumina's pipeline to align (which used Eland) and the other group used Bowtie. I doubt I would be able to go back and obtain the fasta/fastq files.

The previous people who worked on this project used this script to convert this file to bed format. Any reads with multiple alignments or any mismatches are excluded. I will check back and see if I can find out whether any other filtering happened prior to me getting this file (I am certainly hoping that is the case!).
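
For readers without access to that script, the filtering described here could look roughly like the Python sketch below. It is not the original script. It assumes that a read is usable only when exactly one hit is listed and the part after the strand letter is a plain number equal to the read length (as in "R36" for a 36 bp read), that the reported position is 1-based and is the leftmost aligned base on the forward strand, and that the ".fa" suffix should be dropped from reference names. All of these should be checked against the actual data, for example in IGV.

```python
import re

# reference:position, strand, then digits only (any letter in the descriptor means a mismatch).
_HIT = re.compile(r"^(.+):(\d+)([FR])(\d+)$")


def line_to_bed(line):
    """Return a BED6 line for a read with a single, mismatch-free hit, otherwise None."""
    parts = line.split()
    if len(parts) < 4 or "," in parts[3]:
        return None  # truncated line, or more than one hit listed
    m = _HIT.match(parts[3])
    if m is None:
        return None  # unaligned ("-") or descriptor contains mismatches
    ref, pos, strand, matched = m.group(1), int(m.group(2)), m.group(3), int(m.group(4))
    seq = parts[1]
    if matched != len(seq):
        return None  # match does not span the whole read
    chrom = ref[:-3] if ref.endswith(".fa") else ref  # assumption: "chr18.fa" -> "chr18"
    start = pos - 1  # BED starts are 0-based; position assumed 1-based
    end = start + len(seq)
    name = parts[0].lstrip(">")
    return "\t".join([chrom, str(start), str(end), name, "0", "+" if strand == "F" else "-"])


print(line_to_bed(">read1\tACGTACGT\t1:0:0\tchr1.fa:101F8"))
```

Lines for which the function returns None are simply skipped when writing the BED file, which reproduces the "unique, no mismatches" behaviour described above under the stated assumptions.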

Using the script I was able to produce a bed file and use bedtools to convert to bam.

I have bowtie native files from the other group - which I was able to successfully convert to bam as well - so now everything is in bam format.

Thanks again for your help. Any ideas/concerns/comments are greatly appreciated.

**GenoMax** · 10-16-2013, 02:17 PM

Hopefully the mapping was done against the same build of the genome in both cases. Otherwise the comparisons may not be valid.

Since you mentioned ChIP-seq there is USeq: http://useq.sourceforge.net/ which also has a parser for eland_multi files: http://useq.sourceforge.net/applications.html

**danielecook** · 10-16-2013, 06:14 PM

Yeah - I am taking care to verify that things look correct in IGV as far as build is concerned. The past folks were confident it's mm9. I will for sure look into USeq. Greatly appreciate all your help! I am relatively new but very eager to learn.


eland extended
