
  • Backspaces in Chip-Seq data file

    Hi,

    I am new to NGS data analysis and I have some ELAND output files (specifically the sorted.txt file) which I am planning to analyse using MACS. However, MACS keeps falling over with a "Strand information can not be recognized in this line" error. I have deduced that this is due to backspace characters which have appeared between some characters in my file; because MACS can't find a tab there, it complains that the line is not in the correct format.

    Here is the offending line: (see the '^H' between the 0 and 1)

    HWI-EAS486 23 1 97 11471 15019 0^H1 CAGGGTCACCCAGAGTGAGTGTGAAGCCAGCCTGAGATC hhYghhhhhhhggfhhghhhgghghghhhghghhhdfch chr10.fa 80424503 F 34G1C1A 6

    Here is the same line as output by the MACS error (the backspace is shown as \x08, which I think is hex):
    HWI-EAS486\t23\t1\t97\t11471\t15019\t0\x081\tCAGGGTCACCCAGAGTGAGTGTGAAGCCAGCCTGAGATC\thhYghhhhhhhggfhhghhhgghghghhhghghhhdfch\tchr10.fa\t\t80424503\tF\t34G1C1A\t6","34G1C1A


    Does anyone have any idea how to replace these ^H (\x08) backspace characters with tabs? The problem I have is that there are numerous occurrences of ^H in the file which are legitimate.

    Any help or advice would be very useful.

    Thanks

  • #2
    regular expressions

    I would assume that you're running into a software bug in ELAND. I've only used data coming from it a couple times and don't recall ever running into backspaces.

    Anyway, you could just remove the backspaces with regular expressions in your preferred programming language (or even perl from the command line, if you're into that sort of thing).

    For example, in python, something like the following would replace all backspaces with tabs. A small amount of editing would restrict it to only the ones you want.
    Code:
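    import re

    # "\x08" is the backspace character (displayed as ^H).
    # Note that a raw r"\b" would mean "word boundary" in a regex, not backspace.
    with open("your_file.txt", "r") as f, open("a_new_file.txt", "w") as of:
        for line in f:
            # Replace every backspace on the line with a tab.
            of.write(re.sub("\x08", "\t", line))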
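
One way to restrict the substitution to only the unwanted backspaces is sketched below. It rests on an assumption taken from the example line in the question: the stray backspaces sit between two digits (as in 0^H1), while the legitimate ones do not. If that does not hold for your file, the pattern needs adjusting. The names fix_line and fix_file are made up for this illustration.

```python
import re

# Assumption: a stray backspace always has a digit on both sides ("0\x081").
# The lookbehind/lookahead check the neighbours without consuming them.
STRAY_BACKSPACE = re.compile(r"(?<=\d)\x08(?=\d)")


def fix_line(line):
    """Replace only digit-flanked backspaces with tabs; leave the rest alone."""
    return STRAY_BACKSPACE.sub("\t", line)


def fix_file(src_path, dst_path):
    """Copy src_path to dst_path, repairing each line on the way."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(fix_line(line))
```

Calling fix_file("your_file.txt", "a_new_file.txt") would then write a repaired copy and leave the original file untouched, so you can compare the two before feeding the new one to MACS.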
    Assuming that the backspaces are only ever replacing a single character and you know (or can look up) the ELAND file format, you could instead just use regex to parse the various fields. An unescaped "." matches any single character, so it accepts either a tab or a backspace as the separator:
    Code:
    re.search(r"(HWI-EAS\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).([ACGT]+)...", line)
    Or something along those lines. There are a lot of ways one could do that. If you're uncomfortable doing that sort of thing yourself then you can probably find someone to write a short script for you in return for a pint or two of decent beer.
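
A slightly fuller sketch of the field-parsing idea follows. It is based only on the example line in the question (a machine name, seven numeric fields, then the read sequence), not on the ELAND specification, so check the field count against your own files. It accepts either a tab or a backspace as the separator in the leading fields, rejoins those fields with tabs, and passes the rest of the line through unchanged. The function name repair_leading_fields is hypothetical.

```python
import re

# Separator between leading fields: a real tab or a stray backspace.
SEP = r"[\t\x08]"

# Assumed layout, taken from the example line only:
# machine name, 7 numeric fields, read sequence, then everything else.
LEADING = re.compile(
    r"^(HWI-EAS\d+)" + (SEP + r"(\d+)") * 7 + SEP + r"([ACGTN]+)(.*)$",
    re.DOTALL,  # let the final group capture the trailing newline too
)


def repair_leading_fields(line):
    """Rejoin the leading fields with tabs; return the line as-is if it
    does not match the assumed layout."""
    match = LEADING.match(line)
    if match is None:
        return line
    *fields, rest = match.groups()
    return "\t".join(fields) + rest
```

Because only the leading fields are rewritten, any backspaces later in the line (in the quality string, for instance) are preserved. Lines that do not fit the assumed layout come back unmodified, so it is worth counting how many lines fail to match before trusting the output.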
