  • MaximusPrime
    Junior Member
    • Jan 2015
    • 3

    Identifying the format of mysterious files

    Hi Everyone,

    I'm very new to this field and am currently working with existing data. My ultimate goal is to map sequence reads to a gene to make some nice visuals.

    I know I need to get the reads into SAM format and then into BAM format in order to give them to a visualization program (like IGV). My problem is that I can't figure out what format the data is currently in.

    From what I can tell, it is in a format SIMILAR to a SOAPalign output, but not quite. The file extension is .txt. Any hints would be greatly appreciated.

    Here is a link to the GEO page where the data lives:
    NCBI's Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.


    Here is the file header and first line:
    #Conf {confSoapsM = Just "fp_rich1_80610_chr_soap", confReadsM = Just "fp_rich1_80610_rrna_unalign", confOutputM = Just "fp_rich1_80610_chr", confVectorSequ = Chunk "AAAAAAAAAAAAAAAAAAAA" Empty, confFastaM = Just "/home/rawdata/Sequences/saccharomyces_cerevisiae.fa", confMinMatch = 32}
    #F 19:43:59 z
    AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0

    Here is the description in the README:
    Alignment files (chr_best):
    1: tag sequence
    2: tag count
    3: alignment score = # of matches, versus reference plus poly-A tail; # mismatches = length($1) - $3.
    4: reference sequence name
    5: reference sequence start position
    6: reference sequence end position
    7: maximum reference alignment length
    8: length ambiguity in match, due to poly-A tailing; minimum reference alignment length is $7 - $8
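
For anyone who wants to inspect these files before deciding what to do with them, here is a minimal sketch of reading the eight README columns in Python. It assumes the columns are separated by whitespace and that every line starting with "#" is a header; neither is stated in the README. The names `ChrBestRecord` and `read_chr_best` are made up for this example, and the sample line in the usage part is invented, not taken from the GEO file.

```python
from typing import Iterable, Iterator, NamedTuple


class ChrBestRecord(NamedTuple):
    """One alignment line; fields follow README columns 1-8 in order."""
    tag: str            # 1: tag sequence
    tag_count: int      # 2: tag count
    score: int          # 3: alignment score (# of matches)
    ref_name: str       # 4: reference sequence name
    ref_start: int      # 5: reference start position
    ref_end: int        # 6: reference end position
    max_ref_len: int    # 7: maximum reference alignment length
    len_ambiguity: int  # 8: length ambiguity due to poly-A tailing

    @property
    def mismatches(self) -> int:
        # README: # mismatches = length($1) - $3
        return len(self.tag) - self.score

    @property
    def min_ref_len(self) -> int:
        # README: minimum reference alignment length is $7 - $8
        return self.max_ref_len - self.len_ambiguity


def read_chr_best(lines: Iterable[str]) -> Iterator[ChrBestRecord]:
    """Yield one record per data line, skipping blanks and '#' headers."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' lines carry only configuration/header info
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 8:
            raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
        tag, count, score, name, start, end, max_len, ambiguity = fields
        yield ChrBestRecord(tag, int(count), int(score), name,
                            int(start), int(end), int(max_len), int(ambiguity))


# Usage with an invented line (not real data from the GEO record):
example = ["#F made-up header", "ACGTACGTAA 3 9 chrI 100 107 8 2"]
for rec in read_chr_best(example):
    print(rec.ref_name, rec.mismatches, rec.min_ref_len)  # prints: chrI 1 6
```

With a real file, passing the open file handle to `read_chr_best` should work the same way, since it only needs an iterable of lines.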
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I don't know what that format is. I suggest you strip out the first column only (sequence), reformat it as fasta, and remap it to produce a proper sam file with things like cigar strings and flags.
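
The "strip out the first column and reformat it as fasta" step could look roughly like the sketch below. It assumes whitespace-separated columns and "#" header lines, as in the excerpt posted above; the function name `tags_to_fasta` is made up, and using a running number as the header is just one choice. Note that it writes each tag once and ignores the tag count in column 2.

```python
from typing import Iterable


def tags_to_fasta(lines: Iterable[str]) -> str:
    """Turn alignment lines into FASTA text using only the first column."""
    records = []
    n = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' lines are headers, not reads
        n += 1
        tag = line.split()[0]  # first column = tag sequence
        records.append(f">{n}\n{tag}")  # running number as the FASTA header
    # End with a newline only if there is at least one record.
    return "\n".join(records) + ("\n" if records else "")


# Usage with invented lines:
print(tags_to_fasta(["#header", "ACGT 1 4 chrI 1 4 4 0", "TTTT 2 4 chrII 5 8 4 0"]), end="")
```

The returned string can be written straight to a file such as reads.fasta and used as the `in=` argument of a mapper.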

    • MaximusPrime
      Junior Member
      • Jan 2015
      • 3

      #3
      Thanks Brian. Sadly, I'm a novice -- I can strip out the sequence data and reformat it, but I wouldn't know where to go from there.

      I've been reading up on the various programs/tools available, though. Do you have any beginner's guides for NGS data processing that you're a particular fan of?

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Unfortunately, no, but you can use BBMap to get the sequence into a mapped, sorted, indexed bam file, which is what IGV needs:

        bbmap.sh nodisk ref=gene.fasta in=reads.fasta out=mapped.sam bs=sort.sh
        sh sort.sh


        The first command will map the reads and create a sam file plus a shell script (sort.sh). The second command will run that shell script, which uses samtools to transform the sam file into a sorted, indexed bam file. I added that option because I use IGV a lot.
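
If the generated script is not available, the same SAM to sorted, indexed BAM conversion can be done by calling samtools directly. The sketch below only builds the two command lines; it is not the content of the script BBMap writes, which is not shown in this thread. It assumes a samtools 1.x install on the PATH (where `samtools sort` accepts SAM input and `-o`), and the `_sorted.bam` output name and the function name are choices made for this example.

```python
from pathlib import Path
from typing import List


def sam_to_sorted_bam_commands(sam_path: str) -> List[List[str]]:
    """Return samtools command lines that sort a SAM file into BAM and index it."""
    sam = Path(sam_path)
    # Hypothetical naming scheme: mapped.sam -> mapped_sorted.bam
    bam = sam.with_name(sam.stem + "_sorted.bam")
    return [
        ["samtools", "sort", "-o", str(bam), str(sam)],  # coordinate-sort, write BAM
        ["samtools", "index", str(bam)],                 # create the index IGV needs
    ]


for cmd in sam_to_sorted_bam_commands("mapped.sam"):
    print(" ".join(cmd))
```

To actually run them, each list can be passed to `subprocess.run(cmd, check=True)`.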

        The BBTools package contains a lot of NGS data processing tools, but unfortunately there's no beginner's guide - I should write one.

        FYI, a correctly formatted fasta file will look like this:

        >1
        ACGTTTCG
        TTTGGGGGGG
        >2
        AAATTT

        ...etc. It needs to alternate between headers, which start with ">", and sequence, which can span multiple lines, but doesn't have to.
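
A quick way to check that a file follows this layout is to read it back and join the wrapped sequence lines. This is only a minimal sketch; `parse_fasta` is a made-up name, and real tools handle more cases (comments, lowercase masking, very large files).

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; sequences may span several lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))  # close previous record
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence line found before the first '>' header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))  # close the last record
    return records


# The example from this post:
print(parse_fasta(">1\nACGTTTCG\nTTTGGGGGGG\n>2\nAAATTT\n"))
# prints: [('1', 'ACGTTTCGTTTGGGGGGG'), ('2', 'AAATTT')]
```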
        Last edited by Brian Bushnell; 01-22-2015, 03:00 PM.

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          @MaximusPrime: You appear to have downloaded the wrong files. What you have appears to be some kind of processed data that is provided on the page for the samples (e.g. http://www.ncbi.nlm.nih.gov/geo/quer...?acc=GSM346111).

          Best solution here is to use the sratoolkit to download the fastq files directly. Here is an example of how to do this: http://seqanswers.com/forums/showpos...36&postcount=7

          You can find sratoolkit binaries here: http://www.ncbi.nlm.nih.gov/Traces/s...?view=software

          • MaximusPrime
            Junior Member
            • Jan 2015
            • 3

            #6
            Fantastic, thank you.

            I'm sure I'll be back if I run into any trouble.
