SEQanswers

Old 02-14-2013, 11:11 AM   #1
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default STAR: ultrafast universal RNA-seq aligner

Dear All,

I would like to formally introduce our RNA-seq mapper STAR here on SEQanswers.
Its advantages include:
  • Very high mapping speed:
    on a modest 12-core cluster STAR maps 400 Million pairs per hour for human 2x100 Illumina reads (>50 times faster than TopHat).
  • Accurate alignment of contiguous and spliced reads:
    in our tests on real and simulated data STAR showed better sensitivity and precision than TopHat.
  • Detection of polyA-tails, non-canonical splices and chimeric (fusion) junctions.
  • Mapping reads of any length:
    STAR can efficiently map reads of any length generated by current or emerging sequencing platforms, starting from ~15 bases (small RNA) and up to full length transcripts several kilobases long.
  • Thorough testing on large ENCODE datasets:
    STAR was used to map 64 Billion reads of long RNA-seq and 16 Billion reads of short RNA-seq, and will be used to map RNA-seq data in the next ENCODE phase.
STAR requires ~30 GB of RAM for mapping to the human genome (this can be reduced to ~16 GB in the "sparse" mode, at some cost in speed).
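For example, the sparse mode is selected at the genome generation step; a rough sketch of the command (paths and thread count are placeholders, --genomeSAsparseD 2 builds a sparser suffix array to save RAM):

STAR --runMode genomeGenerate --genomeDir /path/to/genomeDir --genomeFastaFiles /path/to/genome.fa --genomeSAsparseD 2 --runThreadN 12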

More information can be found in our recent paper.
If you decide to try it out, please download one of the latest STAR releases.
I will be happy to answer any questions via SEQanswers, the STAR discussion forum, or by e-mail: [email protected]

Cheers
Alex

Last edited by alexdobin; 06-20-2014 at 02:43 PM. Reason: fixed URL
alexdobin is offline   Reply With Quote
Old 02-15-2013, 06:01 AM   #2
aldeluis
Junior Member
 
Location: Logroño, Spain

Join Date: Jul 2009
Posts: 1
Default

Your program is great. It's doing my alignments in a fraction of the time compared to TopHat. Thank you!!
aldeluis is offline   Reply With Quote
Old 02-15-2013, 07:14 AM   #3
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Yes, I am also happy with the results from STAR. It is incredibly fast and easy to use. I will test the shared memory option again now that I have my own workstation machine to play with.

Compatibility with Cufflinks has also been great. Works with HT-seq as well.

Thank you!!
NGSfan is offline   Reply With Quote
Old 02-18-2013, 04:50 AM   #4
volks
Member
 
Location: hd.de

Join Date: Jun 2010
Posts: 81
Default

I also like STAR a lot. It is so much simpler and faster than other tools around.

Could you give an example of how to use gzipped input files (--readFilesCommand)? I cannot get it to work.

thanks!
volks is offline   Reply With Quote
Old 02-18-2013, 04:56 AM   #5
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

I've used it as follows, which works fine:

STAR_2.3.0e.Linux_x86_64/STAR --genomeDir /dir/STAR/Genome --readFilesIn /dir/*.fastq.gz --readFilesCommand zcat --outFileNamePrefix SampleA --runThreadN 20

I hope the implementation of the read counts will follow soon, since I'm curious to compare this tool to the ones I'm using at the moment.

Last edited by iris_aurelia; 02-18-2013 at 11:51 PM.
iris_aurelia is offline   Reply With Quote
Old 02-18-2013, 08:45 AM   #6
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

@volks:
Good example from @iris_aurelia.
If you have multiple files you wish to map in one run, they should be separated by commas, while paired-end mates are separated by space:
--readFilesIn Read1a.gz,Read1b.gz Read2a.gz,Read2b.gz

'zcat' should be on your PATH, or you can use an absolute path such as '/bin/zcat'.
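Putting the two together for gzipped paired-end data (file names and thread count are placeholders):

STAR --genomeDir /path/to/genomeDir --readFilesIn Read1a.fastq.gz,Read1b.fastq.gz Read2a.fastq.gz,Read2b.fastq.gz --readFilesCommand zcat --runThreadN 12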


@iris_aurelia:
Presently I am working on outputting sorted BAM, since 'samtools sort' is a bottleneck for many users (because samtools is not threaded). Then I will code "read counts per gene" (similar to HT-seq counts). Then you will be able to map the reads and get the counts in one run. Hopefully, it will be ready in 1-2 months.
alexdobin is offline   Reply With Quote
Old 02-19-2013, 12:01 AM   #7
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Alex,

Yes, it is a pity that samtools doesn't support multiple threads. We've tried Novosort and that seems to work fine; however, you need a license in order to run it multithreaded.

HT-seq is what I usually use to get the read counts. Are you going to make this step multithreaded as well? It would indeed be nice if you could just run a sample and get all of that information at once.
iris_aurelia is offline   Reply With Quote
Old 02-19-2013, 02:16 AM   #8
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

That is a great feature, Alex. I am looking forward to it.

I currently convert cufflinks output into read counts for DESeq.

You read in the length and coverage columns from the isoforms.fpkm_tracking file, together with the read length of your reads, and then calculate:

my $reads = $length * $coverage / $readlength;

I have compared the values to htseq-count and found them to be nearly the same.

The advantage over htseq is you can multi-thread cufflinks.
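For reference, a rough shell version of the same calculation (the read length RL=100 and the output file name are assumptions; it looks up the 'length' and 'coverage' columns by name in the Cufflinks header, so adjust if your Cufflinks version names them differently):

# RL = read length; skips isoforms with no reported coverage
awk -F'\t' -v RL=100 'NR==1 {for(i=1;i<=NF;i++){if($i=="length")L=i; if($i=="coverage")C=i}; next} $C!="-" {printf "%s\t%.0f\n", $1, $L*$C/RL}' isoforms.fpkm_tracking > counts.txt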

Last edited by NGSfan; 02-19-2013 at 02:25 AM.
NGSfan is offline   Reply With Quote
Old 02-19-2013, 02:24 AM   #9
JonB
Member
 
Location: Norway

Join Date: Jan 2010
Posts: 83
Default

I have problems generating a new genome.
My genome has more than 300,000 contigs and is ~300 MB in size.

Feb 18 11:25:11 ..... Started STAR run
Feb 18 11:25:11 ... Starting to generate Genome files
/var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed
JonB is offline   Reply With Quote
Old 02-19-2013, 02:27 AM   #10
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Quote:
Originally Posted by NGSfan View Post
That is a great feature, Alex. I am looking forward to it.

I currently convert cufflinks output into read counts for DESeq.

You read in the length and coverage columns from the isoforms.fpkm_tracking file, together with the read length of your reads, and then calculate:

my $reads = $length * $coverage / $readlength;

I have compared the values to htseq-count and found them to be nearly the same.

The advantage over htseq is you can multi-thread cufflinks.


In my experience Cufflinks (using the multithreaded option) is even slower than HTSeq-count.
We are experimenting with creating our own multithreaded count script that outputs exactly the same data as HTSeq-count. To make it multithreaded we process each chromosome separately, which might be an idea for STAR?
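As a rough illustration of the per-chromosome idea (a sketch only: it assumes a coordinate-sorted and indexed aln.bam, a genes.gtf annotation, and reasonably recent samtools and htseq-count on the PATH; option names may differ between versions, and pairs with mates on different chromosomes are not handled):

# count each chromosome in its own background job, then wait for all of them
for chr in $(samtools idxstats aln.bam | cut -f1 | grep -v '^\*$'); do
    ( samtools view -b aln.bam "$chr" > aln."$chr".bam && \
      htseq-count -f bam -r pos aln."$chr".bam genes.gtf > counts."$chr".txt ) &
done
wait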
iris_aurelia is offline   Reply With Quote
Old 02-19-2013, 05:25 AM   #11
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by JonB View Post
I have problems generating a new genome.
My genome has more than 300,000 contigs and is ~300 MB in size.

Feb 18 11:25:11 ..... Started STAR run
Feb 18 11:25:11 ... Starting to generate Genome files
/var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed


When the genome has a lot of contigs, I noticed that STAR needs a lot more memory... how much do you have? I would recommend finding a machine with 256 GB of RAM if possible.
NGSfan is offline   Reply With Quote
Old 02-19-2013, 05:30 AM   #12
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by iris_aurelia View Post
In my experience Cufflinks (using the multithreaded option) is even slower than HTSeq-count.
We are experimenting with creating our own multithreaded count script that outputs exactly the same data as HTSeq-count. To make it multithreaded we process each chromosome separately, which might be an idea for STAR?
That's good to know - I have been using 24 threads which is pretty fast, but I did try it on 8 threads and noticed it is not much faster at that level, especially if you include options for multi-read correction and fragment bias correction.

I would support any attempt to multi-thread HTSeq-count; let us know when you have a script. You can share it on code.google.com, which is where I put my most useful scripts.

Last edited by NGSfan; 02-19-2013 at 05:33 AM.
NGSfan is offline   Reply With Quote
Old 02-19-2013, 05:34 AM   #13
JonB
Member
 
Location: Norway

Join Date: Jan 2010
Posts: 83
Default

NGSfan:
Thanks. I think I had 60 GB of memory last time. I will try some more.
JonB is offline   Reply With Quote
Old 02-19-2013, 08:21 AM   #14
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

This looks great, I think I'll try it out! Have you tested STAR on mixed RNA datasets, i.e. RNA-seq on a sample containing both human and bacterial/viral RNA? That's what I'm working with at the moment and TopHat just isn't cutting it: there are millions of reads that are identified as both human and bacterial (depending on which aligning step I run first).
bob-loblaw is offline   Reply With Quote
Old 02-19-2013, 09:41 AM   #15
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default read count

@iris_aurelia:
My plan is to count the reads per gene within the STAR algorithm as they are mapped, so it will be as multi-threaded as STAR itself.
Multi-threading on a per-chromosome basis is a good idea. I am going to implement multi-threaded BAM sorting using this kind of parallelization.

@NGSfan:
Using Cufflinks' "coverage" for calculating gene counts is a nice trick. We might as well do it, since we typically run both Cufflinks and HT-seq on our datasets. In our experience Cufflinks' speed is acceptable for A+ datasets; however, it slows down considerably and often gets stuck on A- and total RNA samples.
alexdobin is offline   Reply With Quote
Old 02-19-2013, 10:01 AM   #16
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default large number of contigs

Quote:
Originally Posted by JonB View Post
I have problems generating a new genome.
My genome has more than 300,000 contigs and is ~300 MB in size.

Feb 18 11:25:11 ..... Started STAR run
Feb 18 11:25:11 ... Starting to generate Genome files
/var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed
As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize = 2^genomeChrBinNbits. By default --genomeChrBinNbits = 18, so BinSize = 2^18 ≈ 256 kb, and with 300,000 contigs you would need ~75 GB of RAM - that's what likely killed your job.

I suggest that you try a much smaller value, --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be a significant slowdown in mapping speed when the number of contigs is very large.
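For example (paths, file names and thread count are placeholders):

STAR --runMode genomeGenerate --genomeDir /path/to/genomeDir --genomeFastaFiles contigs.fa --genomeChrBinNbits 12 --runThreadN 8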
alexdobin is offline   Reply With Quote
Old 02-19-2013, 10:25 AM   #17
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default mixed datasets

Quote:
Originally Posted by bob-loblaw View Post
This looks great, I think I'll try it out! Have you tested STAR on mixed RNA datasets? i.e. RNA-Seq on a sample containing both human and bacterial/viral RNA? Thats what I'm working with at the moment and Tophat just isn't cutting it, there are millions of reads that are identified as being human and bacterial (depending on which aligning step I run first)
We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than 2-step mapping. When a mapper aligns a read to the combined genome it considers the various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. On the other hand, in the 2-step method you force alignments to one of the species in the 1st step, so your results will be strongly biased towards the 1st species.
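To build the combined index you can simply pass several FASTA files at the genome generation step, roughly like this (file names and thread count are placeholders):

STAR --runMode genomeGenerate --genomeDir /path/to/combinedGenomeDir --genomeFastaFiles human.fa viruses.fa --runThreadN 12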
alexdobin is offline   Reply With Quote
Old 02-19-2013, 10:26 AM   #18
cram
Member
 
Location: Canada

Join Date: Nov 2008
Posts: 16
Default

Quote:
As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize = 2^genomeChrBinNbits. By default --genomeChrBinNbits = 18, so BinSize = 2^18 ≈ 256 kb, and with 300,000 contigs you would need ~75 GB of RAM - that's what likely killed your job.

I suggest that you try a much smaller value, --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be a significant slowdown in mapping speed when the number of contigs is very large.
I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.
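A minimal sketch of this workaround (using Ns as the spacer; the file names, the 500-base spacer length and the offsets file are just illustrative assumptions):

# Build one "super contig" from contigs.fa with 500 Ns between contigs, and record each
# contig's 1-based start position in contig_offsets.tsv for converting coordinates back later.
awk 'BEGIN { spacer = sprintf("%500s", ""); gsub(/ /, "N", spacer); pos = 1; print ">super_contig" }
     /^>/  { name = substr($1, 2)
             if (seq != "") { printf "%s%s", seq, spacer; pos += length(seq) + 500 }
             print name "\t" pos > "contig_offsets.tsv"
             seq = ""; next }
           { seq = seq $0 }
     END   { print seq }' contigs.fa > super_contigs.fa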
cram is offline   Reply With Quote
Old 02-20-2013, 05:57 AM   #19
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default large number of contigs

Quote:
Originally Posted by cram View Post
I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.
This is a great idea. It will solve both the RAM problem and the slowdown problem I was talking about. However, you would need to post-process your alignments to convert the coordinates within the "super-contigs" back to coordinates within the real contigs. Also, while the Xs or Ns separating the contigs will prevent STAR from "extending" an alignment into adjacent contigs, they will not prevent splicing between contigs. So at the post-processing step you would also need to filter out alignments that span more than one real contig.
alexdobin is offline   Reply With Quote
Old 02-20-2013, 06:18 AM   #20
thomasblomquist
Member
 
Location: Ohio

Join Date: Jul 2012
Posts: 68
Default Trouble generating a small genome suffix array index...

I'm having trouble generating a small reference genome for running STAR. Below are the command, the error output, and the input "genome" file. Your thoughts on how I can rectify this are welcome. At present I am using BFAST and it works reasonably well, but I am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

Regards,

-Tom Blomquist

University of Toledo, Ohio



Command:

./STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ./barcode.fa --runThreadN 8



Error Output:

Feb 19 09:39:34 ..... Started STAR run
Feb 19 09:39:34 ... Starting to generate Genome files
Feb 19 09:39:34 ... starting to sort Suffix Array. This may take a long time...
Feb 19 09:39:34 ... sorting Suffix Array chunks and saving them to disk...
Feb 19 09:39:34 ... loading chunks from disk, packing SA...
Feb 19 09:39:34 ... writing Suffix Array to disk ...
Feb 19 09:39:34 ... Finished generating suffix array
Feb 19 09:39:34 ... starting to generate Suffix Array index...

BUG: next index is smaller than previous, EXITING



./barcode.fa is the following -->

>ATGC
TTTTCATGCGATCAGGCGTCTGTCGTGCTC

>CAGT
TTTTCCAGTGATCAGGCGTCTGTCGTGCTC

>GACA
TTTTCGACAGATCAGGCGTCTGTCGTGCTC

>TGTG
TTTTCTGTGGATCAGGCGTCTGTCGTGCTC

>ACAC
TTTTCACACGATCAGGCGTCTGTCGTGCTC

>TACG
TTTTCTACGGATCAGGCGTCTGTCGTGCTC

>GCTA
TTTTCGCTAGATCAGGCGTCTGTCGTGCTC

>CATA
TTTTCCATAGATCAGGCGTCTGTCGTGCTC

>TAGA
TTTTCTAGAGATCAGGCGTCTGTCGTGCTC

>GACT
TTTTCGACTGATCAGGCGTCTGTCGTGCTC

>CTGA
TTTTCCTGAGATCAGGCGTCTGTCGTGCTC

>TGCT
TTTTCTGCTGATCAGGCGTCTGTCGTGCTC
thomasblomquist is offline   Reply With Quote