  • Identifying the format of mysterious files

    Hi Everyone,

    I'm very new to this field and am currently working with existing data. My ultimate goal is to map sequence reads to a gene to make some nice visuals.

    I know I need to get the reads into SAM format and then into BAM format in order to give them to a visualization program (like IGV). My problem is that I can't figure out what format the data is currently in.

    From what I can tell, it is in a format SIMILAR to a SOAPalign output, but not quite. The file extension is .txt. Any hints would be greatly appreciated.

    Here is a link to the GEO page where the data lives:
    NCBI's Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.


    Here is the file header and first line:
    #Conf {confSoapsM = Just "fp_rich1_80610_chr_soap", confReadsM = Just "fp_rich1_80610_rrna_unalign", confOutputM = Just "fp_rich1_80610_chr", confVectorSequ = Chunk "AAAAAAAAAAAAAAAAAAAA" Empty, confFastaM = Just "/home/rawdata/Sequences/saccharomyces_cerevisiae.fa", confMinMatch = 32}
    #F 19:43:59 z
    AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0

    Here is the description in the README:
    Alignment files (chr_best):
    1: tag sequence
    2: tag count
    3: alignment score = # of matches, versus reference plus poly-A tail; # mismatches = length($1) - $3.
    4: reference sequence name
    5: reference sequence start position
    6: reference sequence end position
    7: maximum reference alignment length
    8: length ambiguity in match, due to poly-A tailing; minimum reference alignment length is $7 - $8
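
    Read literally, the README columns above could be parsed along these lines. This is only a minimal sketch: it assumes the fields are whitespace-separated and that lines starting with "#" are header lines, and the class, field, and function names are made up for illustration (the README only numbers the columns).

```python
from typing import NamedTuple, Optional


class TagAlignment(NamedTuple):
    # Field names are illustrative; the README only numbers the columns 1-8.
    sequence: str
    count: int
    score: int
    ref_name: str
    ref_start: int
    ref_end: int
    max_ref_len: int
    length_ambiguity: int

    @property
    def mismatches(self) -> int:
        # README: "# mismatches = length($1) - $3"
        return len(self.sequence) - self.score

    @property
    def min_ref_len(self) -> int:
        # README: "minimum reference alignment length is $7 - $8"
        return self.max_ref_len - self.length_ambiguity


def parse_line(line: str) -> Optional[TagAlignment]:
    """Parse one alignment line; return None for header (#) or blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return TagAlignment(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7]),
    )


# The first data line quoted in the post above.
rec = parse_line(
    "AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0"
)
print(rec.ref_name, rec.ref_start, rec.ref_end, rec.mismatches, rec.min_ref_len)
```

    The two properties simply restate the README arithmetic for columns 3, 7, and 8; nothing here converts the records to SAM.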

  • #2
    I don't know what that format is. I suggest you strip out only the first column (the sequence), reformat it as FASTA, and remap it to produce a proper SAM file with things like CIGAR strings and flags.
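
    The stripping step could look something like the sketch below. It assumes the columns are whitespace-separated and that header lines start with "#", as in the sample posted above; the function name and the file names in the usage note are hypothetical.

```python
import io
from typing import Iterable, TextIO


def tags_to_fasta(lines: Iterable[str], out: TextIO) -> int:
    """Write column 1 of each data line as a FASTA record; return the record count."""
    n = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and the "#Conf" / "#F" style header lines.
        if not line or line.startswith("#"):
            continue
        sequence = line.split()[0]
        n += 1
        # Use a running number as the FASTA header.
        out.write(f">{n}\n{sequence}\n")
    return n


# Small in-memory demonstration with made-up data lines.
demo = [
    "#F 19:43:59 z",
    "ACGT 1 4 chrI 1 4 4 0",
    "TTGA 2 4 chrI 9 12 4 0",
]
buffer = io.StringIO()
written = tags_to_fasta(demo, buffer)
print(written)
print(buffer.getvalue(), end="")
```

    On a real file this would be called as something like `with open("alignments.txt") as src, open("reads.fasta", "w") as dst: tags_to_fasta(src, dst)`. Note that each tag is written once, so the tag count in column 2 is ignored.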

    • #3
      Thanks Brian. Sadly, I'm a novice -- I can strip out the sequence data and reformat it, but I wouldn't know where to go from there.

      I've been reading up on the various programs/tools available, though. Do you have any beginner's guides for NGS data processing that you're a particular fan of?

      • #4
        Unfortunately, no, but you can use BBMap to get the sequence into a mapped, sorted, indexed bam file, which is what IGV needs:

        bbmap.sh nodisk ref=gene.fasta in=reads.fasta out=mapped.sam bs=sort.sh
        sh sort.sh


        The first command will map the reads and create a sam file and a shell script. The second command will run the shell script, which uses samtools to transform the sam file into a sorted, indexed bam file. I added that option because I use IGV a lot.

        The BBTools package contains a lot of NGS data processing tools, but unfortunately there's no beginner's guide - I should write one.

        FYI, a correctly formatted fasta file will look like this:

        >1
        ACGTTTCG
        TTTGGGGGGG
        >2
        AAATTT

        ...etc. It needs to alternate between headers, which start with ">", and sequence, which can span multiple lines, but doesn't have to.
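
        The header/sequence alternation described above can be checked with a small reader. This is a rough sketch rather than a full FASTA parser; the function name is made up, and it treats everything after ">" as the record name.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence}, joining sequences that span several lines."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            if name in records:
                raise ValueError(f"duplicate header: {name}")
            records[name] = ""
        elif name is None:
            # Sequence data must be preceded by a header.
            raise ValueError("sequence line found before any '>' header")
        else:
            # Sequence may span multiple lines; concatenate them.
            records[name] += line
    return records


# The example file from the post above.
example = """>1
ACGTTTCG
TTTGGGGGGG
>2
AAATTT
"""
records = read_fasta(example.splitlines())
for header, sequence in records.items():
    print(header, len(sequence), sequence)
```

        A file that starts with sequence instead of a ">" header raises an error here, which is the most common way a hand-made fasta goes wrong.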
        Last edited by Brian Bushnell; 01-22-2015, 03:00 PM.

        • #5
          @MaximusPrime: You appear to have downloaded the wrong files. What you have appears to be some kind of processed data that is provided on the page for the samples (e.g. http://www.ncbi.nlm.nih.gov/geo/quer...?acc=GSM346111).

          Best solution here is to use the sratoolkit to download the fastq files directly. Here is an example of how to do this: http://seqanswers.com/forums/showpos...36&postcount=7

          You can find sratoolkit binaries here: http://www.ncbi.nlm.nih.gov/Traces/s...?view=software

          • #6
            Fantastic, thank you.

            I'm sure I'll be back if I run into any trouble.
