Old 12-08-2015, 01:27 PM   #181
rvann
Junior Member
 
Location: Toronto

Join Date: Sep 2015
Posts: 3
Default

Quote:
Originally Posted by alexdobin View Post
Hi @rvann,

the error is caused by zero-length read sequences; STAR cannot process those.
Hopefully cutadapt has an option to remove them. This has to be done simultaneously on the read1 and read2 files to preserve the order of the reads.

Cheers
Alex
Thanks, Alex. I tried running cutadapt with -m 1 (minimum read length 1) so that any zero-length reads are discarded (see below). STAR had no problems with the trimmed fastq files.

Code:
cutadapt -q 10 -m 1 \
  -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
  -o outpath/file_R1.fastq.gz -p outpath/file_R2.fastq.gz \
  inpath/file_R1.fastq.gz inpath/file_R2.fastq.gz
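A quick way to confirm that the pairing stayed intact after trimming (a sketch using standard tools; the two counts must be identical):

Code:
# number of reads = number of FASTQ lines / 4
echo $(( $(zcat outpath/file_R1.fastq.gz | wc -l) / 4 ))
echo $(( $(zcat outpath/file_R2.fastq.gz | wc -l) / 4 ))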
rvann is offline   Reply With Quote
Old 01-13-2016, 06:09 AM   #182
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

I'm trying to run STAR at the moment, but it's throwing me an error while generating the genome.

I simply use the standard command:

Code:
./STAR --runMode genomeGenerate --genomeDir /2tb-internal/data/STAR_test/ \
  --genomeFastaFiles /2tb-internal/data/contig-100.without_polyA.fixed_names.fa \
  --runThreadN 5

The output I get:

Code:
Jan 13 15:53:51 ..... Started STAR run
Jan 13 15:53:51 ... Starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

Not sure what the problem could be.
I didn't compile STAR myself, because I don't have GCC 4.7 installed, so compiling doesn't work (I'm on Ubuntu 12.04). Could that be a problem?
The "genome" is actually a transcriptome, so the file itself is small (115 MB), but it has a lot of entries (~200,000). Could that also be a problem, or not?
The hardware should otherwise be sufficient: 16 GB RAM, and I can switch to a server with 256 GB if necessary.
Any hints as to what could be causing the problem here?

Thanks.
bastianwur is offline   Reply With Quote
Old 01-14-2016, 12:52 PM   #183
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Hi @bastianwur,

quoting from the manual:

"If you are using a genome with a large (>5,000) number of references (chrosomes/sca olds), you may need to reduce the --genomeChrBinNbits to reduce RAM consumption. The following scaling is recommended: --genomeChrBinNbits = min(18, log2(GenomeLength/NumberOfReferences))."

In you case it would be --genomeChrBinNbits 9 .
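As a quick sanity check, assuming a genome length of ~1.15e8 bases (from the 115 MB FASTA) and ~200,000 references:

Code:
# min(18, log2(GenomeLength/NumberOfReferences)) = min(18, log2(575)) = 9
python -c "import math; print(min(18, int(math.log2(1.15e8 / 2e5))))"

# the adjusted genomeGenerate call (paths as in the post above) would be:
./STAR --runMode genomeGenerate --genomeDir /2tb-internal/data/STAR_test/ \
  --genomeFastaFiles /2tb-internal/data/contig-100.without_polyA.fixed_names.fa \
  --runThreadN 5 --genomeChrBinNbits 9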

Cheers
Alex

Quote:
Originally Posted by bastianwur View Post
I'm trying to run STAR at the moment, but it's throwing me an error while generating the genome. [...]
alexdobin is offline   Reply With Quote
Old 02-03-2016, 10:19 AM   #184
emixaM
Member
 
Location: Quebec, QC, Canada

Join Date: May 2011
Posts: 25
Default

Hi alexdobin

Like many of the posters you have answered, I have a RAM problem.

The machines I use have 24GB of RAM and I am working with human data.

This is my command (I decreased the --genomeChrBinNbits argument, as suggested in your previous answers):

Code:
STAR --runMode genomeGenerate \
  --genomeDir test_star/reference.Merged \
  --genomeFastaFiles Homo_sapiens.GRCh38.fa \
  --runThreadN 2 \
  --genomeChrBinNbits 8 \
  --limitGenomeGenerateRAM 10000000000 \
  --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab \
  --limitIObufferSize 4000000000 \
  --sjdbOverhang 99
This is the log:

Code:
Feb 03 11:15:54 ..... Started STAR run
Feb 03 11:15:54 ... Starting to generate Genome files
Feb 03 11:17:11 ... starting to sort  Suffix Array. This may take a long time...
Feb 03 11:17:35 ... sorting Suffix Array chunks and saving them to disk...
Feb 03 12:37:51 ... loading chunks from disk, packing SA...
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/var/spool/torque/mom_priv/jobs/1759223.moab.host.SC: line 41: 27870 Aborted                 STAR --runMode genomeGenerate --genomeDir test_star/reference.Merged --genomeFastaFiles Homo_sapiens.GRCh38.fa --runThreadN 2 --genomeChrBinNbits 8 --limitGenomeGenerateRAM 10000000000 --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab --limitIObufferSize 4000000000 --sjdbOverhang 99
The execution created 45 SA_* files of around 1 GB each.

What am I doing wrong in the parameters? I would like it to run within ~20 GB of RAM.

Thanks in advance for any help you might give me.
emixaM is offline   Reply With Quote
Old 02-05-2016, 12:46 PM   #185
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by emixaM View Post
Hi alexdobin

Like many of the posters you have answered, I have a RAM problem. [...]
Hi @emixaM,

with 24GB of RAM and the human genome you need to use --genomeSAsparseD 2 (sparse suffix array, which reduces index RAM).
You do not need --genomeChrBinNbits 8 or --limitIObufferSize 4000000000, and you can increase --limitGenomeGenerateRAM to 20000000000:
Code:
STAR --runMode genomeGenerate \
  --genomeDir test_star/reference.Merged \
  --genomeFastaFiles Homo_sapiens.GRCh38.fa \
  --runThreadN 2 \
  --genomeSAsparseD 2 \
  --limitGenomeGenerateRAM 20000000000 \
  --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab \
  --sjdbOverhang 99
Cheers
Alex
alexdobin is offline   Reply With Quote
Old 02-08-2016, 06:47 AM   #186
emixaM
Member
 
Location: Quebec, QC, Canada

Join Date: May 2011
Posts: 25
Default

Thank you very much Alex!

For good measure, I tried launching STAR from the beginning (not directly on the 2nd pass) on my 24GB machine, and I got this error.

Here is the command:
Code:
STAR --runMode alignReads \
  --genomeDir Homo_sapiens.GRCh38/genome/star_index/Ensembl77.sjdbOverhang99 \
  --readFilesIn \
    sample_L001.trim.pair1.fastq.gz \
    sample_L001.trim.pair2.fastq.gz \
  --runThreadN 2 \
  --readFilesCommand zcat \
  --outStd Log \
  --outSAMunmapped Within \
  --outSAMtype BAM Unsorted \
  --outFileNamePrefix alignment_1stPass/sample/ \
  --outSAMattrRGline ID:"sample_L001" PL:"ILLUMINA" PU:"CXX_1" SM:"sample" CN:"seq" \
  --limitGenomeGenerateRAM 20000000000 \
  --limitIObufferSize 32000000 \
  --genomeSAsparseD 2
And the error:
Code:
Feb 08 10:39:30 ..... Started STAR run
Feb 08 10:39:31 ..... Loading genome

EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
Possible cause 1: not enough RAM. Check if you have enough RAM 31550092110 bytes
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 31550092110

Feb 08 10:39:31 ...... FATAL ERROR, exiting

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe
Is it the same problem?
emixaM is offline   Reply With Quote
Old 02-08-2016, 10:16 AM   #187
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by emixaM View Post
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 31550092110
@emixaM: Can you post the output of the following?

Code:
$ ulimit -a
GenoMax is offline   Reply With Quote
Old 02-08-2016, 10:22 AM   #188
emixaM
Member
 
Location: Quebec, QC, Canada

Join Date: May 2011
Posts: 25
Default

No problem:

Code:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 95162
max locked memory       (kbytes, -l) 4096000
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 95162
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
emixaM is offline   Reply With Quote
Old 02-08-2016, 10:26 AM   #189
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Are you the only user on this machine when you try to run STAR? Are there any other processes going on in background? Have you looked at output of say "top"? What does the memory section up at the top say?
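A couple of quick ways to check, with standard Linux tools (a sketch):

Code:
free -g                  # total / used / free RAM in GB
top -b -n 1 | head -n 5  # one-shot snapshot of the load and memory summary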
GenoMax is offline   Reply With Quote
Old 02-08-2016, 10:34 AM   #190
emixaM
Member
 
Location: Quebec, QC, Canada

Join Date: May 2011
Posts: 25
Default

Yes, I am the only one; it runs on a compute node of an HPC cluster.

Here is the top of a compute node (no STAR running):

Code:
$ top -M
top - 14:33:48 up 3 days, 23:33,  0 users,  load average: 0.00, 1.98, 5.10
Tasks: 333 total,   1 running, 332 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    23.584G total, 1948.043M used,   21.682G free,    0.000k buffers
Swap:    0.000k total,    0.000k used,    0.000k free, 1073.691M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31824 emixaM   20   0 27620 1620 1072 R  0.3  0.0   0:00.06 top
...
emixaM is offline   Reply With Quote
Old 02-08-2016, 10:42 AM   #191
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Are you using a job scheduler then? Perhaps the scheduler requires an explicit request for more RAM when not using the default (e.g. with LSF it would be -M 20).
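For a Torque/MOAB setup like the one in the log above, the request typically goes in the job script; a sketch (exact resource names and units vary by site):

Code:
#PBS -l nodes=1:ppn=2
#PBS -l mem=24gb    # explicit memory request; some sites use vmem or pmem instead
STAR --runMode genomeGenerate ...   # rest of the command as before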
GenoMax is offline   Reply With Quote
Old 02-08-2016, 11:37 AM   #192
emixaM
Member
 
Location: Quebec, QC, Canada

Join Date: May 2011
Posts: 25
Default

Yes, I am using a scheduler (MOAB), but I have never had any problem accessing all the RAM of the nodes.
emixaM is offline   Reply With Quote
Old 02-09-2016, 03:32 PM   #193
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by umn_bist View Post
I am attempting to cross over to STAR from TopHat for my RNA-seq workflow, but I have a question regarding its capabilities.

Specifically, TopHat had options for mate-inner-dist and mate-std-dev but relied on RSeQC to supply these values for each unique sample.

I was wondering if STAR provides an option to streamline/automate this process. Thank you very much for the awesome tool!
Hi @umn_bist

STAR does not need or use the "expected" insert length information for mapping, so you do not have to worry about it.
For mapping of DNA-seq data we can decide whether an alignment's insert size is too big or too small by comparing it to the "expected" distribution of insert sizes.
However, in RNA-seq the unsequenced portion of the insert between the mates can contain splice junctions: for example, a 300 bp fragment whose mates straddle a 2 kb intron spans over 2.3 kb of genome.
This means that we cannot simply calculate the insert size from an alignment and compare it with the "expected" insert size.

Cheers
Alex
alexdobin is offline   Reply With Quote
Old 02-16-2016, 01:16 PM   #194
harrike
Member
 
Location: St. Louis, MO

Join Date: Jun 2010
Posts: 29
Default

Hi Alex,

I have a simple question: how can I tell which strand a read mapped to (the data are strand-specific)? I viewed my output .bam file using "sambamba view". Is it possible to tell the strand from the example given below?

Quote:
HWI-ST1395:97:d29b4acxx:5:2105:5630:92677 99 chr4 15003609 3 100M = 15003773 264 GCATAATTATGAAGATATTTTTTTATATGGTCCGACTATATAAAATAAATATCATAACGGAAATCAACTTGAACAATGCTGTACCAAAACTACAGTGGTT CCCFFFFFHHHHGJIJJIJJJIJJJJJJJJGIJJIIJJJIJJJIJJGGJJIJJJJIJJIJIJHGGGFFDFDFECEEECDDDDDDCDDDDDDCCDD@@C@B NH:i:2 HI:i:1 AS:i:198 nM:i:0

HWI-ST1395:97:d29b4acxx:7:2301:9293:56479 99 chr4 15003609 3 100M = 15003701 192 GCATAATTATGAAGATATTTTTTTATATGGTCCGACTATATAAAATAAATATCATAACGGAAATCAACTTGAACAATGCTGTACCAAAACTACAGTGGTT CCCFFFFFHHHHHJJJIIJJJJJJHJJJJJJJIIJJJJIIJJGJIIJJJJIJJJJGGIJIGJFEHHEFFFEFECEEECDDDACCCDCDDBDDCDD>@CCD NH:i:2 HI:i:1 AS:i:198 nM:i:0
If not, how can I get the strand information? I am also interested to know what the second column in the above example stands for; I am not able to figure it out.

Thanks,

Rui

Last edited by harrike; 02-16-2016 at 01:18 PM.
harrike is offline   Reply With Quote
Old 02-16-2016, 01:41 PM   #195
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by harrike View Post
I am also interested to know what the second column in the above example stands for; I am not able to figure it out.

Thanks,

Rui
The second column is the bitwise FLAG. Check section 1.4 of the full SAM spec: https://samtools.github.io/hts-specs/SAMv1.pdf

The bitwise FLAG also tells you the strand information (https://www.biostars.org/p/59388/). You can "translate" the FLAG using this tool: https://broadinstitute.github.io/pic...ain-flags.html
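If samtools is available, its flags subcommand decodes the values directly (bit 0x10, REVERSE, is the strand bit); for the flags in the reads quoted above, the output looks along these lines:

Code:
$ samtools flags 99
0x63   99    PAIRED,PROPER_PAIR,MREVERSE,READ1
$ samtools flags 147
0x93   147   PAIRED,PROPER_PAIR,REVERSE,READ2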

Last edited by GenoMax; 02-16-2016 at 01:46 PM.
GenoMax is offline   Reply With Quote
Old 02-16-2016, 01:44 PM   #196
harrike
Member
 
Location: St. Louis, MO

Join Date: Jun 2010
Posts: 29
Default

Does the 2nd column indicate the strand? I checked a few lines of my data; the flag values (2nd column) and their frequencies are listed below.

Code:
Frequency  FLAG
847        147
75         163
902        339
94         355
97         403
909        419
71         83
847        99

Does the 419/339 pair stand for the "+" strand and 147/99 for the "-" strand? What about the others?
harrike is offline   Reply With Quote
Old 02-16-2016, 01:47 PM   #197
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by harrike View Post
Does the 2nd column indicate the strand? [...]
See the added info in my post above.
GenoMax is offline   Reply With Quote
Old 02-16-2016, 01:54 PM   #198
harrike
Member
 
Location: St. Louis, MO

Join Date: Jun 2010
Posts: 29
Default

Hi Genomax,

Thanks for providing the info. It is quite helpful. I am clear now.

Rui
harrike is offline   Reply With Quote
Old 02-19-2016, 06:31 AM   #199
harrike
Member
 
Location: St. Louis, MO

Join Date: Jun 2010
Posts: 29
Default

Hi Alex,

This time I am using STAR on another set of data, which is strand-specific, paired-end, with 150 bp reads. The command I used is:

Code:
STAR-STAR_2.4.2a/bin/Linux_x86_64/STAR \
  --genomeDir Zmay_AGPv2_STAR_index/ \
  --runThreadN 24 \
  --readFilesIn Zm_TGACCA_L001_R1_001.fastq Zm_TGACCA_L001_R2_001.fastq \
  --outSAMtype BAM Unsorted \
  --outFilterMultimapNmax 20 \
  --alignIntronMax 10000 \
  --alignMatesGapMax 10000 \
  --alignEndsType EndToEnd \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outFilterType BySJout \
  --outFileNamePrefix Zm_TGACCA_L001_R1R2_

56.34% of the reads are unmapped as "too short"; see the Log.final.out file below:

Quote:
Started job on | Feb 19 07:56:41
Started mapping on | Feb 19 07:57:00
Finished on | Feb 19 08:02:30
Mapping speed, Million of reads per hour | 209.57

Number of input reads | 19210657
Average input read length | 302
UNIQUE READS:
Uniquely mapped reads number | 5863394
Uniquely mapped reads % | 30.52%
Average mapped length | 301.78
Number of splices: Total | 5527054
Number of splices: Annotated (sjdb) | 5329770
Number of splices: GT/AG | 5445210
Number of splices: GC/AG | 76305
Number of splices: AT/AC | 5539
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.83%
Deletion rate per base | 0.07%
Deletion average length | 3.02
Insertion rate per base | 0.07%
Insertion average length | 1.98
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1335209
% of reads mapped to multiple loci | 6.95%
Number of reads mapped to too many loci | 5625
% of reads mapped to too many loci | 0.03%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 4.69%
% of reads unmapped: too short | 56.34%
% of reads unmapped: other | 1.46%
What are the possible reasons for this low mapping rate? Thanks,

Rui
harrike is offline   Reply With Quote
Old 02-19-2016, 07:06 AM   #200
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

I suggest that you start by looking at a few (10-20) unmapped reads and BLAST them against nt to see what they align to. You may be surprised by what you find, and it may explain the low % alignment.
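A sketch of how to pull a few unmapped reads out for such a check; this assumes the run kept unmapped reads in the BAM (e.g. --outSAMunmapped Within) and uses STAR's default output name with the prefix above:

Code:
# 0x4 = segment unmapped; print read name and sequence of the first 20
samtools view -f 4 Zm_TGACCA_L001_R1R2_Aligned.out.bam | head -n 20 | cut -f 1,10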
GenoMax is offline   Reply With Quote
Tags
alignment, genome, mapping, rna-seq, transcriptome
