  • Identifying the format of mysterious files

    Hi Everyone,

    I'm very new to this field and am currently working with existing data. My ultimate goal is to map sequence reads to a gene to make some nice visuals.

    I know I need to get the reads into SAM format and then into BAM format in order to give them to a visualization program (like IGV). My problem is that I can't figure out what format the data is currently in.

    From what I can tell, it is in a format SIMILAR to a SOAPalign output, but not quite. The file extension is .txt. Any hints would be greatly appreciated.

    Here is a link to the GEO page where the data lives:
    NCBI's Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.


    Here is the file header and first line:
    #Conf {confSoapsM = Just "fp_rich1_80610_chr_soap", confReadsM = Just "fp_rich1_80610_rrna_unalign", confOutputM = Just "fp_rich1_80610_chr", confVectorSequ = Chunk "AAAAAAAAAAAAAAAAAAAA" Empty, confFastaM = Just "/home/rawdata/Sequences/saccharomyces_cerevisiae.fa", confMinMatch = 32}
    #F 19:43:59 z
    AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0

    Here is the description in the README:
    Alignment files (chr_best):
    1: tag sequence
    2: tag count
    3: alignment score = # of matches, versus reference plus poly-A tail; # mismatches = length($1) - $3.
    4: reference sequence name
    5: reference sequence start position
    6: reference sequence end position
    7: maximum reference alignment length
    8: length ambiguity in match, due to poly-A tailing; minimum reference alignment length is $7 - $8
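For anyone else who meets this layout, the README columns above can be read with a few lines of Python. This is only a sketch: it assumes the eight columns are whitespace-separated and that every line starting with "#" is a header, and the names `TagAlignment` and `parse_line` are made up for illustration rather than taken from any tool.

```python
from typing import NamedTuple, Optional


class TagAlignment(NamedTuple):
    """One row of a chr_best alignment file, named after the README columns."""
    sequence: str       # 1: tag sequence
    count: int          # 2: tag count
    score: int          # 3: alignment score (number of matches)
    ref_name: str       # 4: reference sequence name
    ref_start: int      # 5: reference start position
    ref_end: int        # 6: reference end position
    max_ref_len: int    # 7: maximum reference alignment length
    len_ambiguity: int  # 8: length ambiguity due to poly-A tailing

    @property
    def mismatches(self) -> int:
        # README: number of mismatches = length($1) - $3
        return len(self.sequence) - self.score

    @property
    def min_ref_len(self) -> int:
        # README: minimum reference alignment length = $7 - $8
        return self.max_ref_len - self.len_ambiguity


def parse_line(line: str) -> Optional[TagAlignment]:
    """Return a TagAlignment, or None for blank lines and '#' header lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    return TagAlignment(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7]),
    )


# The first data line quoted in the post above.
record = parse_line(
    "AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0"
)
print(record.ref_name, record.ref_start, record.ref_end,
      record.mismatches, record.min_ref_len)
```

For this line, column 7 agrees with end - start + 1 (145937 - 145916 + 1 = 22), which suggests the positions are inclusive, though the README does not say so explicitly.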

  • #2
    I don't know what that format is. I suggest you strip out the first column only (sequence), reformat it as fasta, and remap it to produce a proper sam file with things like cigar strings and flags.
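A minimal sketch of that stripping step, assuming whitespace-separated columns and "#" header lines as in the excerpt above. The function name `tags_to_fasta` and the file names in the comment are hypothetical, and each tag is written once regardless of its count in column 2, which is a simplification.

```python
from typing import Iterable, Iterator


def tags_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record per data line, using a running number as the header."""
    n = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#Conf' / '#F' style header lines.
        if not line or line.startswith("#"):
            continue
        n += 1
        sequence = line.split()[0]  # first column only
        yield f">{n}\n{sequence}\n"


example = [
    "#F 19:43:59 z",
    "AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0",
]
fasta_text = "".join(tags_to_fasta(example))
print(fasta_text, end="")

# To convert a whole file (paths are placeholders):
# with open("alignments.txt") as src, open("reads.fasta", "w") as dst:
#     dst.writelines(tags_to_fasta(src))
```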



    • #3
      Thanks Brian. Sadly, I'm a novice -- I can strip out the sequence data and reformat it, but I wouldn't know where to go from there.

      I've been reading up on the various programs/tools available, though. Do you have any beginner's guides for NGS data processing that you're a particular fan of?



      • #4
        Unfortunately, no, but you can use BBMap to get the sequence into a mapped, sorted, indexed bam file, which is what IGV needs:

        bbmap.sh nodisk ref=gene.fasta in=reads.fasta out=mapped.sam bs=sort.sh
        sh sort.sh


        The first command will map the reads and create a sam file plus a shellscript. The second command will run the shellscript, which uses samtools to transform the sam file into a sorted, indexed bam file. I added that option because I use IGV a lot.
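For readers who want to do the sort-and-index step by hand, here is a rough sketch of the equivalent samtools calls. It is an assumption, not a copy of what BBMap writes into sort.sh: it presumes samtools 1.x is on the PATH, and the helper names and the output file name `mapped_sorted.bam` are made up for illustration.

```python
import subprocess
from typing import List


def sam_to_sorted_bam_commands(sam_path: str, bam_path: str) -> List[List[str]]:
    """Build the two samtools calls: coordinate-sort to BAM, then index the BAM."""
    return [
        ["samtools", "sort", "-o", bam_path, sam_path],
        ["samtools", "index", bam_path],
    ]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = sam_to_sorted_bam_commands("mapped.sam", "mapped_sorted.bam")
for cmd in commands:
    print(" ".join(cmd))

# Uncomment once samtools is installed and mapped.sam exists:
# run_commands(commands)
```

IGV expects the index file to sit next to the bam file, which is where `samtools index` puts it by default.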

        The BBTools package contains a lot of NGS data processing tools, but unfortunately there's no beginner's guide - I should write one.

        FYI, a correctly formatted fasta file will look like this:

        >1
        ACGTTTCG
        TTTGGGGGGG
        >2
        AAATTT

        ...etc. It needs to alternate between headers, which start with ">", and sequence, which can span multiple lines, but doesn't have to.
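As a quick check that a file follows this layout, a small reader along these lines can help. It is a sketch under the rules just described (headers start with ">", sequence may span several lines); `read_fasta` is a hypothetical helper, not part of BBTools.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the '>') to its sequence, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            if name in chunks:
                raise ValueError(f"duplicate header: {name!r}")
            chunks[name] = []
        elif name is None:
            # Sequence data with no header in front of it is not valid FASTA.
            raise ValueError("sequence line found before the first '>' header")
        else:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


example = """>1
ACGTTTCG
TTTGGGGGGG
>2
AAATTT
"""
records = read_fasta(example.splitlines())
print(records)  # {'1': 'ACGTTTCGTTTGGGGGGG', '2': 'AAATTT'}
```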
        Last edited by Brian Bushnell; 01-22-2015, 03:00 PM.



        • #5
          @MaximusPrime: You appear to have downloaded the wrong files. What you have appears to be some kind of processed data that is provided on the page for the samples (e.g. http://www.ncbi.nlm.nih.gov/geo/quer...?acc=GSM346111).

          Best solution here is to use the sratoolkit to download the fastq files directly. Here is an example of how to do this: http://seqanswers.com/forums/showpos...36&postcount=7

          You can find sratoolkit binaries here: http://www.ncbi.nlm.nih.gov/Traces/s...?view=software



          • #6
            Fantastic, thank you.

            I'm sure I'll be back if I run into any trouble.
