SEQanswers

Old 11-02-2011, 01:49 PM   #1
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default short reads mapping software with GUI needed

(Warning message: total noob talking)

Hello all,

I know nothing about sequencing or sequence alignment.
I was given a collection of short reads (millions of them) and I would like to map them to the mouse genome and to visualize the surrounding sequence and the alignment (and do some kind of quantification of expression, maybe...).

I have no clue on how to do this and started looking around for some software that would help me (I won't blast millions of reads one by one, right?).

However, I found most of the software to be CLI-based. In addition to my total lack of experience in sequencing and sequence handling, I'm not very comfortable with the terminal either, and I fear I would not understand the output of such programs.

I am thus looking for GUI-based software (any platform is OK, Mac would be my favorite choice) that would let me visualize where all these reads map in the mouse genome.

I am ready to read tons of manual pages and tutorials if necessary but, again, I am a bit allergic to the terminal. I mean, if the software is CLI-based but the output is something easy to understand (like a zoomable image, in my dreams) I would go for it.

Tell me if you guys know any tool that could help me (or a better way to answer my questions).

Thank you very much in advance for your help.
Best regards
-a-
Old 11-02-2011, 05:31 PM   #2
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

IGV does a great job of visualizing nextgen read alignments. You first need to generate a BAM file (a binary representation of reads aligned to a reference), which you can do with a variety of mapping tools. If your data is Illumina I would recommend BWA or STAMPY, but there are lots of programs to choose from. They are easy to use but not GUI-based, and most run on Macs.
Old 11-02-2011, 09:07 PM   #3
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 124
Default

Maybe something like Partek Genomics Suite, but I think it's expensive. You can get a free trial from them though.
Old 11-03-2011, 08:42 AM   #4
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Thank you for these answers.

I think I will try to use IGV first.
Now I need to generate a BAM file from the multiple .sra files that I have.
I'll look into that (BWA or... can bowtie do this as well?).
If you know any "simple" way to do that, I'd be very interested to know about it.
Thanks again
-a-
Old 11-03-2011, 09:38 AM   #5
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Default

You should really consider how easy it is to learn how to use CLI programs.

This is coming from an ex-totally clueless noob.

http://korflab.ucdavis.edu/Unix_and_Perl/
Go through the first section on UNIX and you're ready to go. We're talking about an investment in the range of hours: minimal when you consider how much work you have put, and will put, into your project.
__________________
--------------
Ethan
Old 11-03-2011, 10:11 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,992
Default

Check this: http://seqanswers.com/wiki/Software/list

You do not say if you are looking for something free or are willing to buy. In either case you can find something you will like.

CLC Genomics workbench or Geneious (both commercial) would fit. I am not endorsing either. Just providing a pointer.

If you are not averse to using a web accessible resource try: http://usegalaxy.org. Check the wiki and learn links on the Galaxy site to get started.
Old 11-03-2011, 03:47 PM   #7
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

I would look into Simon Andrews' programs, SeqMonk (http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk/) and FASTQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
SeqMonk is for visualizing after you have done an alignment (used an alignment program like bowtie or bwa; which can be done in Galaxy, http://main.g2.bx.psu.edu/.) and produced an alignment file (preferably in SAM format). Very easy to use and you can go from a whole chromosome view down to a particular gene. Many other things can be done with SeqMonk as well.
First, you want to know how your file of sequences is formatted. If it has been provided by a sequencing facility using Illumina sequencers then it is most likely in FASTQ format. (Don't confuse FASTQ with FASTQC: FASTQC is a quality assessment program, FASTQ is a format for sequence files.) You can Google 'FASTQ' to see what this format looks like. It is very simple. Then look at a portion of your sequence file and see if it is in the same format. If it is a FASTQ file the rest will be easy.

Upload it to Galaxy under Get Data in the Options. You can then first do some quality assessment by finding FASTQC on Galaxy in the Options under NGS:QC and manipulation. FASTQC will work on the FASTQ file and provide you with some quality assessment of your sequences.
The FASTQ file can then be used to align the sequences with bowtie or bwa. Under the Options in Galaxy, go to NGS: mapping and you can then align the FASTQ file using bowtie or bwa. The resulting file can be downloaded to your computer and put into SeqMonk. Download and load the mouse genome into SeqMonk. This can easily be done using SeqMonk when you click 'New Project'. After loading the genome, load your alignment file and you will then be able to visualize the position of your sequences in relation to genes and other annotation on the genome.
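
For anyone who wants to check the format before uploading, here is a rough Python sketch of the "look at a portion of your sequence file" step. It assumes the common four-line FASTQ layout (header starting with @, sequence, a line starting with +, quality string) with no line-wrapped sequences. The function names and the file name in the usage note are made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not any(record):
            return  # end of file (a trailing blank line is tolerated)
        if len(record) < 4:
            raise ValueError("truncated FASTQ record at end of file")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"not a FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def length_histogram(path, limit=100000):
    """Count read lengths over the first `limit` records of a file."""
    with open_text(path) as handle:
        records = islice(read_fastq(handle), limit)
        return Counter(len(seq) for _, seq, _ in records)
```

Calling length_histogram("reads.fastq") on a real file returns a Counter mapping read length to number of reads, which is a quick way to see whether all reads have the length you expect. If the file is not FASTQ, the first malformed record raises a ValueError.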
Old 11-07-2011, 09:03 AM   #8
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Quote:
Originally Posted by mattanswers View Post
Thank you


Quote:
Originally Posted by mattanswers View Post
SeqMonk is for visualizing after you have done an alignment (used an alignment program like bowtie or bwa; which can be done in Galaxy, http://main.g2.bx.psu.edu/.) and produced an alignment file (preferably in SAM format). Very easy to use and you can go from a whole chromosome view down to a particular gene. Many other things can be done with SeqMonk as well.
My problem right now is to know how to "get" the data.
- I have 20 .sra files.
- In order to align them with BWA (or bowtie) I guess I should merge them into 1 .fastq file, right?
- NCBI says I should use "fastq-dump" to extract fastq from sra but I can also download the data directly in fastq (compressed in fastq.gz). I wonder if these are the same as the files that would be generated by "fastq-dump". I guess the answer is yes...
- I do not know how to "merge" 20 .fastq files into a single one. Can I just copy-paste using TextEdit for instance? (.fastq files are just text files after all, right?)
- the resulting .fastq file will be very large (I guess something like 10 gigabytes). Too large to upload?

Tell me what you think about it.

Thank you very much.
-a-
Old 11-07-2011, 09:11 AM   #9
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 124
Default

Converting .sra files to fastq with fastq-dump will give you the same thing as downloading the fastq files, although downloading sra and converting will in my experience be way faster than downloading the fastq files. Although recently I've had some trouble in that the files converted from sra weren't correct and I had to download the fastq files, no idea what went wrong...

Why do you need to merge the fastq files? Usually you would only do this if they are all different runs of the same sample, and you wanted to analyze them combined.
Old 11-07-2011, 09:19 AM   #10
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

Examine the fastq from SRA via the fastq-dump utility.
http://en.wikipedia.org/wiki/FASTQ_f...e_Read_Archive

The Short Read Archive folks added a complication. The fastq files are often not BWA compatible. You further need to cook the data. Here tools like perl, sed, awk, gcc come in handy.

You'd want to use the "cat" program (from the cygwin/bash/linux command line). A text editor would struggle with the data. "cat" concatenates input files into an output file.

At this point in the game "enterprise java-bean enabled cloud GUI just click and watchen the blinken lichten" ain't there. You really do need to interface with command line. For today at least.

10GB is pretty small these days. Your home DSL may not handle it well, but it's very manageable.
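
If "cat" is not at hand, the same concatenation can be sketched in a few lines of Python. This is only an illustration of what "cat" does (copy each input, byte for byte, onto the end of one output file); the function name and the file names in the usage note are invented.

```python
import shutil


def concatenate(inputs, output):
    """Append each input file to `output`, byte for byte, in the order given."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # copyfileobj streams in chunks, so a multi-gigabyte
                # file is never loaded into memory at once
                shutil.copyfileobj(src, out)
```

For example, concatenate(["run1.fastq", "run2.fastq"], "all_runs.fastq"). The same call works on .fastq.gz inputs, because gzip files joined end to end still form a valid gzip file, but compressed and uncompressed inputs should not be mixed in one output. Keep the output name out of the input list.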
Old 11-07-2011, 11:27 AM   #11
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Quote:
Originally Posted by biznatch View Post
Why do you need to merge the fastq files? Usually you would only do this if they are all different runs of the same sample, and you wanted to analyze them combined.
do you mean I could align the (millions of) reads contained in 20 .fastq files to the mouse genome? I mean, without getting 20 alignment files at the end, but a single one to look at with a viewer...

Tx
Old 11-07-2011, 11:29 AM   #12
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

You'll find it convenient to run fastqs in chunks through alignment software like BWA.
Check out samtools to merge and BAMize the resulting SAM files.
Old 11-07-2011, 11:47 AM   #13
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Quote:
Originally Posted by Richard Finney View Post
Examine the fastq from SRA via the fastq-dump utility.
http://en.wikipedia.org/wiki/FASTQ_f...e_Read_Archive

The Short Read Archive folks added a complication. The fastq files are often not BWA compatible.
Is that due to
"note that the NCBI have converted this FASTQ data from the original Solexa/Illumina encoding to the Sanger standard"
?

Quote:
Originally Posted by Richard Finney View Post
You further need to cook the data. Here tools like perl, sed, awk, gcc come in handy.
by "cooking" you mean going back to the illumina encoding? manually? ouch!!


Quote:
Originally Posted by Richard Finney View Post
You'd want to use the "cat" program (from the cygwin/bash/linux command line). A text editor would struggle with the data. "cat" concatenates input files into an output file.
in order to merge the 20 .fastq files into a single one, right?


Quote:
Originally Posted by Richard Finney View Post
At this point in the game "enterprise java-bean enabled cloud GUI just click and watchen the blinken lichten" ain't there. You really do need to interface with command line. For today at least.
to do what part of the job?
1) .sra -> .fastq conversion? (including encoding restoration)
2) .fastq files merge (I think I can read "man cat" and maybe even understand it)
3) align the .fastq file (millions of 25bp reads) with the mouse genome (using BWA or bowtie) to get a SAM/BAM file. (can I do this using Galaxy?)
4) view the alignment with a GUI viewer

did I miss anything?
I guess points 1) and 3) are the most difficult, right?


Quote:
Originally Posted by Richard Finney View Post
10GB is pretty small these days. Your home DSL may not handle it well, but it's very manageable.
OK... cool
Thank you
Old 11-07-2011, 12:05 PM   #14
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Quote:
Originally Posted by Richard Finney View Post
You'll find it convenient to run fastqs in chunks through alignment software like BWA.
Check out samtools to merge and BAMize the resulting SAM files.
So it would be more like:
1) convert .sra files to .fastq files using fastq-dump.
I don't know how to do that. Should I install the "SRA toolkit"?

2) align .fastq files to the mouse genome to generate SAM files. Using BWA or bowtie (via Galaxy?)
I don't know how to do that but it should be "easy" to find out.

3) merge SAM files and generate a BAM file using SAMtools.
I do not know how to do that. I guess there's a user manual for samtools

4) view the alignment using a graphical viewer (SeqMonk, IGV, others?)

am I right?

thanks
-a-
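
Steps 2) and 3) of the list above could be sketched in Python roughly as follows. This is a sketch under assumptions, not a tested pipeline: it assumes bwa and samtools are installed and on the PATH, that the reference FASTA has already been indexed once with "bwa index", and that samtools is version 1.3 or later (the 2011-era releases used a different "sort" syntax). The function names and all file names are made up. Step 1) is left out because the FASTQ files can be downloaded directly.

```python
import subprocess
from pathlib import Path


def plan_sample(reference, fastq):
    """Return the commands that take one FASTQ file to a sorted BAM file.

    Each step is (argv, stdout_path); stdout_path is None when the tool
    writes its own output file instead of printing to standard output.
    """
    stem = Path(fastq).name.split(".")[0]
    sai, sam, bam = f"{stem}.sai", f"{stem}.sam", f"{stem}.sorted.bam"
    return [
        (["bwa", "aln", reference, fastq], sai),         # find candidate alignments
        (["bwa", "samse", reference, sai, fastq], sam),  # write single-end SAM
        (["samtools", "sort", "-o", bam, sam], None),    # sort and convert to BAM
    ]


def plan_merge(bams, merged="merged.bam"):
    """Return the commands that merge sorted BAM files and index the result."""
    return [
        (["samtools", "merge", merged, *bams], None),
        (["samtools", "index", merged], None),
    ]


def run(steps):
    """Execute planned steps in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

The planning functions only build the command lines, so they can be printed and inspected before anything is run. A viewer such as IGV or SeqMonk would then open the merged BAM file for step 4).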
Old 11-07-2011, 01:29 PM   #15
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Quote:
Originally Posted by asheenlevrai View Post
So it would be more like:
1) convert .sra files to .fastq files using fastq-dump.
I don't know how to do that. Should I install the "SRA toolkit"?
So I used the "fastq-dump" command from the "SRA toolkit" in order to convert 1 of the .sra files to .fastq
I compared (using TextEdit) this file to the corresponding .fastq file I downloaded from NCBI. The first thing I saw is that the reads are 36bp long using fastq-dump. This is because the barcode (3bp) and the adapter (8bp) sequences are still there (flanking the "true" read). I guess I do not want these sequences for the alignment, right?

The quality values are different between the 2 .fastq files (as expected due to the re-encoding from NCBI).
______EDIT: Actually they are the same. The different size (36bp vs 25bp) made them just look different. I guess I should just download all the 20 files in .fastq format since this seems the easiest solution._____

_____(should I find a way to get rid of the barcode and adapter sequences?)________

Last edited by asheenlevrai; 11-07-2011 at 02:13 PM. Reason: new info
Old 11-07-2011, 02:18 PM   #16
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

When using bowtie on Galaxy you can choose to use the 'full parameter' list. This gives you additional parameters, among them the ability to trim bases from the 3' end, the 5' end, or both ends of your reads. Not trimming would greatly decrease the number of sequences that align.
There are also programs available where you can 'clip off' adaptor sequences from the fastq file.
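
To make the idea of fixed-length trimming concrete, here is a small Python sketch that removes a set number of bases from each end of every read, together with the matching quality characters. Which end carries the 3bp barcode and which the 8bp adapter is an assumption the reader must check against the data; the function names are invented, and a four-line FASTQ layout is assumed.

```python
def trim_read(seq, qual, five_prime=0, three_prime=0):
    """Remove a fixed number of bases from each end of a read and its qualities."""
    if five_prime < 0 or three_prime < 0:
        raise ValueError("trim lengths must be non-negative")
    if five_prime + three_prime >= len(seq):
        raise ValueError("trimming would remove the whole read")
    end = len(seq) - three_prime
    return seq[five_prime:end], qual[five_prime:end]


def trim_fastq(src, dst, five_prime=0, three_prime=0):
    """Copy four-line FASTQ records from `src` to `dst`, trimming each read."""
    while True:
        header = src.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        seq = src.readline().rstrip("\n")
        plus = src.readline()
        qual = src.readline().rstrip("\n")
        seq, qual = trim_read(seq, qual, five_prime, three_prime)
        dst.write(f"{header}{seq}\n{plus}{qual}\n")
```

With five_prime=3 and three_prime=8, a 36bp read comes out as 25bp. The bowtie command line offers the same kind of fixed trimming through its -5/--trim5 and -3/--trim3 options, so a separate trimming step is not strictly needed when aligning with bowtie.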
Old 11-08-2011, 09:06 AM   #17
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Well, since I get the same quality values when I download the .fastq files from NCBI or when I use fastq-dump on the .sra file, why wouldn't I just use the downloadable .fastq files? The reads are 25bp so there's no need to trim them at all.
However I don't know if the encoding could be a problem since the NCBI encoding is "Sanger" and the reads are supposed to be Illumina reads (but fastq-dump does not affect the encoding apparently, in this case).

Old 11-08-2011, 02:05 PM   #18
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

Since Illumina 1.8 the quality encoding is basically Sanger (Phred+33); Illumina 1.3 to 1.7 used a Phred+64 offset. If you have an older Illumina format, you can convert it to Sanger in Galaxy. In Galaxy, go to NGS: QC and Manipulation, then use FASTQ Groomer, which can convert FASTQ files between the various formats. But I think NCBI has already done this for you.
I think you have to have Sanger format to use bowtie in Galaxy.
This page tells a little more about quality scores, although it does not appear to be up to date with the newer Illumina 1.8+ encoding: http://en.wikipedia.org/wiki/FASTQ_format
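
As a rough illustration of what such a conversion involves, the Python sketch below guesses whether quality strings use the Sanger offset (33) or the older Illumina offset (64) and re-encodes Phred+64 strings as Phred+33. The guess is a heuristic based only on which characters appear, so it can be inconclusive or wrong on small samples. The function names are made up, and this is not how FASTQ Groomer is implemented. Very old Solexa scores use a different scale and are not handled here.

```python
def guess_offset(quality_lines):
    """Guess the quality offset (33 or 64) from raw quality strings.

    Returns None when the characters seen are compatible with both.
    """
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    if min(codes) < 59:   # characters below ';' never occur with offset 64
        return 33
    if max(codes) > 74:   # above 'J' would mean Q > 41 with offset 33, which is unusual
        return 64
    return None


def to_sanger(qual, offset=64):
    """Re-encode a quality string from `offset` to the Sanger offset of 33."""
    shift = offset - 33
    converted = []
    for ch in qual:
        code = ord(ch) - shift
        if code < 33:
            raise ValueError(f"{ch!r} is not valid for offset {offset}")
        converted.append(chr(code))
    return "".join(converted)
```

For instance, the Phred+64 string "hhhh" (quality 40) becomes "IIII" in Sanger encoding. If guess_offset returns 33 for a file, as would be expected for files NCBI has already converted, no conversion is needed.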
Old 11-30-2011, 09:28 AM   #19
asheenlevrai
Member
 
Location: USA

Join Date: Nov 2011
Posts: 28
Default

Hello again,

Can someone explain to me, in simple words, what the samse -n option does in BWA?

The man page says:

"-n INT Maximum number of alignments to output in the XA tag for reads paired
properly. If a read has more than INT hits, the XA tag will not be written.
[3]
"

XA corresponds to alternative hits.
I do not know exactly what "tags" are...

I guess that, using the default value (3), BWA will report up to 3 alternative hits for a given read. But what about reads that produce more than 3 hits?
a) Will the read be discarded (no hits reported)
b) Will the first (random) 3 hits be reported (other hits discarded)
c) Something else?

Thank you very much in advance for your help.
-a-
Old 12-01-2011, 06:25 AM   #20
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 87
Default

Quote:
Originally Posted by asheenlevrai View Post
I guess that, using the default value (3), BWA will report up to 3 alternative hits for a given read. But what about reads that produce more than 3 hits?
a) Will the read be discarded (no hits reported)
b) Will the first (random) 3 hits be reported (other hits discarded)
c) Something else?

Thank you very much in advance for your help.
-a-
You guessed it right for the default value, i.e. BWA will report up to 3 alternative hits for a given read.

However, if a read has more than 3 hits, BWA by default leaves the XA tag out of that read's record. The read itself is not removed from the SAM file: a single, randomly selected hit is still reported. To get around this, you can raise the limit with the -n INT option (e.g. with -n 100, the XA tag is written for reads that have up to 100 hits, and omitted for any read with more than 100 hits).

I hope this makes sense.
Thanks,
P
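
On the question of what "tags" are: in a SAM file each alignment is one tab-separated line with 11 mandatory columns, followed by optional fields written as TAG:TYPE:VALUE. BWA's XA tag is one of these, holding alternative hits as chromosome, signed position (the sign is the strand), CIGAR and edit distance, each hit terminated by a semicolon. A small Python sketch for pulling the alternative hits out of one line follows; the function name and the example values in the usage note are made up.

```python
def alternative_hits(sam_line):
    """Return the XA alternative hits of one SAM alignment line.

    Each hit is (chromosome, strand, position, cigar, edit_distance).
    An empty list means the line carries no XA tag.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:          # optional fields follow the 11 mandatory ones
        if field.startswith("XA:Z:"):
            hits = []
            for entry in field[5:].split(";"):
                if not entry:          # the value ends with a trailing ';'
                    continue
                chrom, pos, cigar, nm = entry.rsplit(",", 3)
                hits.append((chrom, pos[0], int(pos[1:]), cigar, int(nm)))
            return hits
    return []
```

A line whose optional fields include XA:Z:chr2,+200,25M,0;chrX,-3000,25M,1; would yield two hits, one on the forward strand of chr2 and one on the reverse strand of chrX. Reads with more hits than the -n limit simply return an empty list, since no XA tag is written for them.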
All times are GMT -8.