  • eland extended

    I've been given some 'eland extended' files to work with. There really isn't much online about these. I'm trying to convert them to SAM/BAM format or something more useful, but I keep hitting dead ends. Does anyone know this format, what data it represents, and whether there are any tools for working with it?

    Thanks -

  • #2
    Which exact files do you mean by "eland extended"?

    Are the files named "s_*_eland_multi.txt" or "s_*_eland_results.txt"?

    As I recall the "eland_extended" analysis referred to alignments done with sequences that were > 32 bp long (seems odd now but that was the state of the art in 2008) and should have resulted in files with "multi" in the name.


    • #3
      I suspect the names have changed - here is a sample of the file:

      >HWUSI-EAS107_0008:5:1:18780:1121#0/1 GTGGGGCAGTACCTCTCCTGCAGCTGTTGTTAGTGG 1:1:0 chr19.fa:16326479F30G2CA1,16326543F36
      >HWUSI-EAS107_0008:5:1:18802:1128#0/1 TGCCATAGCCTTCCCATGATGCATACTTAGCCTCAC 1:0:0 chr18.fa:24911372R36
      >HWUSI-EAS107_0008:5:1:18874:1125#0/1 AAACCCCCACAGTAACAACAGCTCCTCTGGCCCCAA 1:0:0 chr3.fa:130007959R35G
      >HWUSI-EAS107_0008:5:1:19029:1128#0/1 AGATTCCTTGACTGGTCTATTGACATTGGCATATTT 1:0:0 chr19.fa:49024613R36
      >HWUSI-EAS107_0008:5:1:19073:1118#0/1 TTACCATCCCCTCTATTAATCACATGGAACCTGATA 255:33:1 -
      >HWUSI-EAS107_0008:5:1:19096:1122#0/1 CATATGCCTTTAATCCCAGCACTTGGGAGGCAGAGG 148:255:255 -
      >HWUSI-EAS107_0008:5:1:19214:1128#0/1 TCAACACACAACTGTATGCTAATGTTCTGATTAATC 1:0:0 chr11.fa:3052382F32T3
      >HWUSI-EAS107_0008:5:1:19252:1121#0/1 CTGGCTAGGCAGTCTAGCCCAGTCTGTGAGATCCCG 1:0:0 chr3.fa:138169276F36
      >HWUSI-EAS107_0008:5:1:19319:1123#0/1 GGGCTGCTACTCTCACAGAGTCCTGGGGTGGTAGGG 1:0:0 chr11.fa:71607540R36
      >HWUSI-EAS107_0008:5:1:19357:1120#0/1 GGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC NM -
      >HWUSI-EAS107_0008:5:1:19415:1126#0/1 ATGGACCCAACAGCCTTCCACACTACAGAAGGATGA 1:0:0 chr15.fa:86824347R36
      >HWUSI-EAS107_0008:5:1:19463:1119#0/1 GGGTGTGTTTTAGTTCACAATTCCAAGTTGTAGTCC 1:0:0 chr7.fa:114170220R36
      >HWUSI-EAS107_0008:5:1:19527:1128#0/1 TGGGGAGAGGGAAGAGGAATGGCAGCAAGGCACGCC 1:0:0 chr4.fa:114142946F36
      >HWUSI-EAS107_0008:5:1:19773:1128#0/1 AGATGCGGTCCCAGTATCAACTAGTTAGTATAGACA 1:0:0 chr7.fa:68841823R36
      The file name simply ends in '.extended'


      • #4
        Excerpt from your example:

        >HWUSI-EAS107_0008:5:1:19214:1128#0/1 TCAACACACAACTGTATGCTAATGTTCTGATTAATC 1:0:0 chr11.fa:3052382F32T3
        >HWUSI-EAS107_0008:5:1:19252:1121#0/1 CTGGCTAGGCAGTCTAGCCCAGTCTGTGAGATCCCG 1:0:0 chr3.fa:138169276F36
        >HWUSI-EAS107_0008:5:1:19319:1123#0/1 GGGCTGCTACTCTCACAGAGTCCTGGGGTGGTAGGG 1:0:0 chr11.fa:71607540R36
        >HWUSI-EAS107_0008:5:1:19357:1120#0/1 GGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC NM -
        >HWUSI-EAS107_0008:5:1:19415:1126#0/1 ATGGACCCAACAGCCTTCCACACTACAGAAGGATGA 1:0:0 chr15.fa:86824347R36
        Description of the "eland_multi" format (your example looks slightly different in the "matches found" section):

        1. Sequence name
        2. Sequence
        3. Either NM, QC, RM or the following:

        • NM—No match found
        • QC—No matching done: QC failure (too many Ns)
        • RM—No matching done: repeat masked (may be seen if repeatFile.txt was
        specified)
        • U0—Best match found was a unique exact match
        • U1—Best match found was a unique 1-error match
        • U2—Best match found was a unique 2-error match
        • R0—Multiple exact matches found
        • R1—Multiple 1-error matches found, no exact matches
        • R2—Multiple 2-error matches found, no exact or 1-error matches

        4. x:y:z where x, y, and z are the number of exact, single-error, and 2-error matches found
        5. Matches found (Blank, if no matches found): e.g. BAC_plus_vector.fa:163022R1,170128F2,E_coli.fa:3909847R1
        This says there are two matches to BAC_plus_vector.fa: one in the reverse direction starting at position 163022 with one error, and one in the forward direction starting at position 170128 with two errors. There is also a single-error match to E_coli.fa in the reverse direction starting at position 3909847. [Note: Your example looks different in this section, otherwise the format seems to align well]
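For what it's worth, the lines posted in #3 can be split mechanically. Below is a minimal Python sketch, not an official parser. It assumes whitespace-separated fields, that the third field holds either x:y:z counts or a code such as NM, and that in this '.extended' variant the text after F/R is an alignment descriptor in which digits are runs of matching bases and letters mark mismatches (30G2CA1 adds up to 36 positions, the read length in the sample). The names `Hit`, `ElandRecord`, `parse_hits` and `parse_line` are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    chrom: str        # reference name as written in the file, e.g. "chr19.fa"
    pos: int          # start position as written in the file
    strand: str       # "F" or "R"
    descriptor: str   # text after the strand letter, e.g. "30G2CA1"
    mismatches: int   # letters in the descriptor (assumed to mark mismatches)


class ElandRecord(NamedTuple):
    name: str
    seq: str
    counts: Optional[Tuple[int, ...]]  # (exact, 1-error, 2-error) or None for NM/QC/RM
    hits: List[Hit]


_HIT = re.compile(r"^(\d+)([FR])(.*)$")


def parse_hits(field: str) -> List[Hit]:
    """Parse the 'matches found' field; '-' means no positions were reported."""
    hits: List[Hit] = []
    if field == "-":
        return hits
    chrom = None
    for token in field.split(","):
        # A reference name is given once and applies to the hits that follow it.
        if ":" in token:
            chrom, token = token.rsplit(":", 1)
        m = _HIT.match(token)
        if m is None or chrom is None:
            raise ValueError(f"unrecognised match token: {token!r}")
        pos, strand, desc = m.groups()
        mismatches = sum(1 for c in desc if c.isalpha())
        hits.append(Hit(chrom, int(pos), strand, desc, mismatches))
    return hits


def parse_line(line: str) -> ElandRecord:
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    name = fields[0].lstrip(">")
    counts = None
    if ":" in fields[2]:
        counts = tuple(int(x) for x in fields[2].split(":"))
    hits = parse_hits(fields[3]) if len(fields) > 3 else []
    return ElandRecord(name, fields[1], counts, hits)
```

Note that in the plain eland_multi format described above, the number after F/R is the error count rather than a run of matches, so the `mismatches` field here only makes sense for the '.extended' style descriptors.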

        Problem is you do not have information about the quality values in this file (there should be a corresponding s_*_sequence.txt file in fastq format). Is there a chance you can get that?


        • #5
          Unfortunately - this is the only file I was given to work with. I guess the quality scores were truncated before I got to this? Not sure...

          Thanks for your help. Do you know if there are any tools out there for working with this?


          • #6
            "Eland_multi.txt" files never had the quality scores in them. Those would be included in the corresponding "s_*_export.txt" or "s_*_sorted.txt" files (I will assume that those are unavailable).

            At a minimum you can create a multi-fasta sequence file (from the file you have). Not having the quality values would make it tough to judge quality of the basecalls.
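Creating the multi-fasta file only needs the first two columns. A minimal Python sketch, assuming whitespace-separated fields with the read name (prefixed by ">") first and the sequence second, as in the sample in #3; the function name `eland_to_fasta` is hypothetical:

```python
from typing import Iterable, Iterator


def eland_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence line) per input line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        name = fields[0].lstrip(">")
        yield f">{name}\n{fields[1]}\n"


# Small demonstration on two lines taken from the sample in #3.
sample = [
    ">HWUSI-EAS107_0008:5:1:18802:1128#0/1 TGCCATAGCCTTCCCATGATGCATACTTAGCCTCAC 1:0:0 chr18.fa:24911372R36",
    ">HWUSI-EAS107_0008:5:1:19357:1120#0/1 GGCCTTGAAGTGTTAGGTTGTTGGGTTAAAGACTTC NM -",
]
for record in eland_to_fasta(sample):
    print(record, end="")
```

This keeps every read, including the unaligned (NM) ones, so the output could be realigned with another aligner, just without quality values.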

            Ultimately what are you trying to do with this data?
            Last edited by GenoMax; 10-16-2013, 12:01 PM.


            • #7
              This was a project that was passed on to me. It's a ChIP-seq experiment. We are hoping to compare our data against another group's ChIP-seq work, which was comparable but in a different cell line. We used Illumina's pipeline to align (which used Eland) and the other group used Bowtie. I doubt I would be able to go back and obtain the fasta/fastq files.

              The previous people that worked on this project used this script to convert this file to bed format. Any reads with multiple alignments or any mismatches are excluded. I will check back and see if I can find out if any other filtering happened prior to me getting this file (I am certainly hoping that is the case!).
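The script itself is not reproduced here, so the following is only a rough Python sketch of that kind of filter, not the script that was actually used. Assumptions: fields are whitespace-separated; a read is kept only if it has exactly one reported hit whose descriptor is purely numeric (no mismatch letters); the position in the file is 1-based and refers to the leftmost base on the forward strand; the ".fa" suffix is stripped to get the chromosome name. The name `eland_line_to_bed` is made up.

```python
import re
from typing import Optional, Tuple

# chrom:posSTRANDrun, where a purely numeric run means no mismatch letters.
_UNIQUE_PERFECT = re.compile(r"^(.+):(\d+)([FR])(\d+)$")

BedRow = Tuple[str, int, int, str, int, str]


def eland_line_to_bed(line: str) -> Optional[BedRow]:
    """Return a BED6 row for a uniquely, perfectly aligned read, else None."""
    fields = line.split()
    if len(fields) < 4:
        return None
    name, seq, _counts, matches = fields[:4]
    if "," in matches:
        return None  # more than one reported alignment
    m = _UNIQUE_PERFECT.match(matches)
    if m is None:
        return None  # "-" (no position) or a descriptor containing mismatches
    chrom, pos, strand, _run = m.groups()
    if chrom.endswith(".fa"):
        chrom = chrom[:-3]
    start = int(pos) - 1          # BED is 0-based, half-open (assumes 1-based input)
    end = start + len(seq)
    return (chrom, start, end, name.lstrip(">"), 0, "+" if strand == "F" else "-")


def format_bed(row: BedRow) -> str:
    """Join a BED6 row into a tab-separated line."""
    return "\t".join(str(x) for x in row)
```

One thing to watch: a line such as `1:0:0 chr3.fa:130007959R35G` has "1:0:0" counts but still shows a mismatch letter in the descriptor, so this sketch filters on the descriptor rather than on the counts column.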

              Using the script I was able to produce a bed file and use bedtools to convert to bam.

              I have bowtie native files from the other group - which I was able to successfully convert to bam as well - so now everything is in bam format.

              Thanks again for your help. Any ideas/concerns/comments are greatly appreciated.


              • #8
                Hopefully the mapping was done against the same build of the genome in both cases. Otherwise the comparisons may not be valid.

                Since you mentioned ChIP-seq there is USeq: http://useq.sourceforge.net/ which also has a parser for eland_multi files: http://useq.sourceforge.net/applications.html


                • #9
                  Yeah - I am taking care to verify that things look correct in IGV as far as build is concerned. The past folks were confident it's mm9. I will for sure look into USeq. Greatly appreciate all your help! I am relatively new but very eager to learn.
