SEQanswers

Old 01-22-2015, 02:25 PM   #1
MaximusPrime
Junior Member
 
Location: South Florida

Join Date: Jan 2015
Posts: 3
Identifying the format of mysterious files

Hi Everyone,

I'm very new to this field and am currently working with existing data. My ultimate goal is to map sequence reads to a gene to make some nice visuals.

I know I need to get the reads into SAM format and then into BAM format in order to give them to a visualization program (like IGV). My problem is that I can't figure out what format the data is currently in.

From what I can tell, it is in a format SIMILAR to a SOAPalign output, but not quite. The file extension is .txt. Any hints would be greatly appreciated.

Here is a link to the GEO page where the data lives:
http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE13750

Here is the file header and first line:
#Conf {confSoapsM = Just "fp_rich1_80610_chr_soap", confReadsM = Just "fp_rich1_80610_rrna_unalign", confOutputM = Just "fp_rich1_80610_chr", confVectorSequ = Chunk "AAAAAAAAAAAAAAAAAAAA" Empty, confFastaM = Just "/home/rawdata/Sequences/saccharomyces_cerevisiae.fa", confMinMatch = 32}
#F 19:43:59 z
AAAAAAAAAAAAAAAAAATCTGAAAAAAAAAAAAAA 1 35 chrII 145916 145937 22 0

Here is the description in the README:
Alignment files (chr_best):
1: tag sequence
2: tag count
3: alignment score = # of matches, versus reference plus poly-A tail; # mismatches = length($1) - $3.
4: reference sequence name
5: reference sequence start position
6: reference sequence end position
7: maximum reference alignment length
8: length ambiguity in match, due to poly-A tailing; minimum reference alignment length is $7 - $8
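
The README columns above can be read into named fields with a short script. This is a minimal sketch, not the tool that produced the file: it assumes the columns are whitespace-separated and that lines starting with "#" are header lines, as in the excerpt above. The names TagAlignment and parse_alignment_line are made up for illustration.

```python
from typing import NamedTuple, Optional


class TagAlignment(NamedTuple):
    """One data line, with fields named after the README columns."""
    sequence: str        # column 1: tag sequence
    count: int           # column 2: tag count
    score: int           # column 3: alignment score (number of matches)
    ref_name: str        # column 4: reference sequence name
    ref_start: int       # column 5: reference start position
    ref_end: int         # column 6: reference end position
    max_ref_len: int     # column 7: maximum reference alignment length
    len_ambiguity: int   # column 8: length ambiguity from poly-A tailing

    @property
    def mismatches(self) -> int:
        # README: number of mismatches = length($1) - $3
        return len(self.sequence) - self.score

    @property
    def min_ref_len(self) -> int:
        # README: minimum reference alignment length is $7 - $8
        return self.max_ref_len - self.len_ambiguity


def parse_alignment_line(line: str) -> Optional[TagAlignment]:
    """Parse one line; return None for blank lines and '#' header lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split()  # assumption: whitespace-separated columns
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return TagAlignment(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7]),
    )
```

Calling parse_alignment_line on the data line quoted above gives a record whose ref_name is "chrII"; its mismatches and min_ref_len properties apply the two formulas from the README.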
Old 01-22-2015, 02:36 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I don't know what that format is. I suggest you strip out the first column only (sequence), reformat it as fasta, and remap it to produce a proper sam file with things like cigar strings and flags.
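
Stripping out the first column and writing it as fasta might look roughly like the sketch below. It assumes whitespace-separated columns and "#" header lines, as in the excerpt from the first post. The function names and the file names in the usage note are placeholders, not anything from the dataset.

```python
def alignment_lines_to_fasta(lines):
    """Yield fasta lines (header, then sequence) for each data line.

    Blank lines and lines starting with '#' are skipped. Headers are
    simple running numbers: >1, >2, ...
    """
    n = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        n += 1
        sequence = line.split()[0]  # first column only
        yield f">{n}"
        yield sequence


def convert_file(in_path, out_path):
    """Read the alignment text file and write a fasta file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in alignment_lines_to_fasta(src):
            dst.write(out_line + "\n")
```

For example, convert_file("alignments.txt", "reads.fasta") would write one fasta record per data line. Note that this drops the tag count in column 2, so each distinct tag appears once regardless of how often it was sequenced.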
Old 01-22-2015, 02:44 PM   #3
MaximusPrime
Junior Member
 
Location: South Florida

Join Date: Jan 2015
Posts: 3

Thanks Brian. Sadly, I'm a novice -- I can strip out the sequence data and reformat it, but I wouldn't know where to go from there.

I've been reading up on the various programs/tools available, though. Do you have any beginner's guides for NGS data processing that you're a particular fan of?
Old 01-22-2015, 02:57 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Unfortunately, no, but you can use BBMap to get the sequence into a mapped, sorted, indexed bam file, which is what IGV needs:

bbmap.sh nodisk ref=gene.fasta in=reads.fasta out=mapped.sam bs=sort.sh
sh sort.sh


The first command will map the reads and create a sam file and a shell script. The second command will run the shell script, which uses samtools to transform the sam file into a sorted, indexed bam file. I added that option because I use IGV a lot.
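
For anyone who wants to do the sam-to-bam step by hand, here is a rough sketch of the usual samtools sequence, wrapped in Python. This is not the actual content of the generated sort.sh. It assumes samtools 1.3 or newer is on the PATH (so that "samtools sort" accepts sam input directly), and the function names and the "_sorted.bam" suffix are made up for illustration.

```python
import subprocess
from pathlib import Path


def sam_to_sorted_bam_commands(sam_path):
    """Build the samtools commands that sort a sam file into a bam and index it."""
    sam = Path(sam_path)
    bam = sam.with_name(sam.stem + "_sorted.bam")  # hypothetical naming scheme
    return [
        # Sort by coordinate; the .bam extension selects bam output.
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # Write the .bai index next to the sorted bam, which IGV needs.
        ["samtools", "index", str(bam)],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling run_commands(sam_to_sorted_bam_commands("mapped.sam")) would then produce a sorted bam and its index alongside the sam file.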

The BBTools package contains a lot of NGS data processing tools, but unfortunately there's no beginner's guide - I should write one.

FYI, a correctly formatted fasta file will look like this:

>1
ACGTTTCG
TTTGGGGGGG
>2
AAATTT

...etc. It needs to alternate between headers, which start with ">", and sequence, which can span multiple lines, but doesn't have to.
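
To check that a file follows this layout, a small reader like the sketch below can help. It is only an illustration of the header/sequence alternation described above, not part of BBTools, and the name read_fasta is made up.

```python
def read_fasta(lines):
    """Return {header: sequence}, joining sequences that span several lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = []
        elif name is None:
            # Sequence data with no header before it is not valid fasta.
            raise ValueError("sequence line found before the first header")
        else:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}
```

Run on the five example lines above, this gives two records: "1" with the two sequence lines joined into ACGTTTCGTTTGGGGGGG, and "2" with AAATTT.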

Last edited by Brian Bushnell; 01-22-2015 at 03:00 PM.
Old 01-22-2015, 03:35 PM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989

@MaximusPrime: You appear to have downloaded the wrong files. What you have appears to be some kind of processed data that is provided on the page for the samples (e.g. http://www.ncbi.nlm.nih.gov/geo/quer...?acc=GSM346111).

The best solution here is to use the sratoolkit to download the fastq files directly. Here is an example of how to do this: http://seqanswers.com/forums/showpos...36&postcount=7

You can find sratoolkit binaries here: http://www.ncbi.nlm.nih.gov/Traces/s...?view=software
Old 01-22-2015, 04:15 PM   #6
MaximusPrime
Junior Member
 
Location: South Florida

Join Date: Jan 2015
Posts: 3

Fantastic, thank you.

I'm sure I'll be back if I run into any trouble.