SEQanswers

Old 05-08-2013, 04:38 AM   #41
bruce01
Senior Member
 
Location: Dublin

Join Date: Mar 2011
Posts: 156

OK, I asked over on Stack Overflow, and this works:

group1=( $files/Sample1*r1* )   # array of all read-1 files for this sample
group2=( $files/Sample1*r2* )   # array of the matching read-2 files
( IFS=,; STAR --readFilesIn "${group1[*]}" "${group2[*]}" [OPTIONS] )   # in a subshell, "${array[*]}" joins elements with IFS, i.e. commas
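
For illustration, with two lanes per mate (hypothetical file names), the subshell expands to a single STAR call with comma-separated lists:

Code:
STAR --readFilesIn Sample1_L001_r1.fq,Sample1_L002_r1.fq Sample1_L001_r2.fq,Sample1_L002_r2.fq [OPTIONS]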

Thanks for the help and ideas, Dpryan.

##Edit: DPryan, sorry, I was getting my wires crossed between here and Stack Overflow. I was asking how to give STAR the input I had created; the above works. I am reluctant to concatenate gzip files: I don't want to create duplicates and don't want to change the gzips in any way before aligning. Paranoia!

Last edited by bruce01; 05-08-2013 at 04:48 AM. Reason: Miscommunication with poster
Old 05-09-2013, 08:25 AM   #42
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24

You can also try the following commands; they work for me.
fq1=`ls -m *_R1_*.fastq.gz | tr -d '\n' | tr -d ' '`   # ls -m emits a comma-separated list; strip the newlines and spaces it adds
fq2=${fq1//"_R1_"/"_R2_"}                              # derive the R2 list from the R1 list by substitution
STAR --readFilesIn $fq1 $fq2
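
One caveat: STAR does not read gzipped FASTQ directly by default, so for these .fastq.gz inputs a decompression command is needed as well; a minimal sketch using STAR's --readFilesCommand option (in STAR versions that support it):

Code:
STAR --readFilesIn $fq1 $fq2 --readFilesCommand zcat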
Old 05-29-2013, 06:55 AM   #43
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 56

Quote:
Originally Posted by alexdobin


If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the --library-type option. For example,
cufflinks ... ... --library-type fr-firststrand
should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
--outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated

Hi Alex,
I am trying STAR to align the reads and then using Cufflinks to look at expression values. I have stranded RNA-seq data. May I know why it is recommended to remove the non-canonical junctions for Cufflinks runs? How would it affect Cufflinks if I used the default parameter, "no filtering"?
Old 05-29-2013, 07:04 AM   #44
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Hi priya, you may want to post this and carry on the conversation at the Google Group for RNA-STAR:

https://groups.google.com/forum/#!forum/rna-star
Old 05-29-2013, 11:47 AM   #45
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

I believe it's best to feed Cufflinks only the highest-confidence alignments, and in my experience non-canonical junctions contain more false positives.
Also, many non-canonical splices occur just a few bases away from highly expressed canonical ones, which could be caused by sequencing/mapping errors, and possibly by spliceosome errors. These splices will likely throw Cufflinks' assembly off.
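
As a concrete reference, that filtering is applied at the STAR stage; a minimal sketch using the option named earlier in the thread (paths and read file names are placeholders):

Code:
STAR --genomeDir /path/to/index --readFilesIn reads_1.fastq reads_2.fastq --outFilterIntronMotifs RemoveNoncanonical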
Old 06-12-2013, 03:54 AM   #46
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19

Dear Alex,

I am currently using STAR on a new strand-specific paired-end RNA-seq dataset, and I am very happy with the output.
I just have one question though (as a matter of interest, really): does STAR calculate the MAPQ value for the reads? I am asking because at the moment, with my new dataset (and also with a simulated dataset), all my mapped reads have a MAPQ value of 255; see below:
Code:
FCC1GLHACXX:3:1101:16858:25563#GTGAAA   163     5       19275940        255     1S65M24S        =       19275940        65      GCCTTTTGAGACAATACAAATCAAAATATTTACAGAGATAAGGCAGAATCAAACTACATTAAGGAGGGTCTGAGATCGGAAGAGCGTCGT      bbbeeeeegggggiiiiiiiiiiiiiiiiiiiiihiiiiiiiiiiiiihihiiiiiiiiihiiiiiiidggggggeeeeccccccccccc      NH:i:1  HI:i:1	AS:i:124        nM:i:2
FCC1GLHACXX:3:1101:16858:25563#GTGAAA   83      5       19275940        255     19S65M6S        =       19275940        -65     CGTGTGCTCTTCCGATCTGCCTTTTGAGACAATACAAATCAAAATATTTACAGAGATAAGGCAGAATCAAACTACATTAAGGAGGGTCTG      cdddeeeegggihhhiiiiiihhfhiihghiihgiiihghiiiiiiiiiiihiiiiiiiiiiiiiiiiiiihhgiiigggggeeeeebbb      NH:i:1  HI:i:1	AS:i:124        nM:i:2
FCC1GLHACXX:3:1101:16918:25584#GTGAAA   99      14      81019100        255     90M     =       81019191        181     TTGTACCAGTTATCAAACTGTGTTTTGATGGGATAGAGATTGATATTTTGTTTGCAAGATTAGCACTGCAGACTATTCCAGAAGACTTGG      bbbeeeeeggggghiiiiiiehdghiifhhiifghiegfhhhihiiiiiigiihifhihhhiiiiiiiiiiiiiiiihgggfggeee_cd      NH:i:1  HI:i:1  AS:i:176	nM:i:1
FCC1GLHACXX:3:1101:16918:25584#GTGAAA   147     14      81019191        255     90M     =       81019100        -181    CTTAAGAGATGACAGTCTGCTTAAAAATTTAGATATAAGATGTATAAGAAGTCTTAACGGTTGCAGGGTAACCGATGAAATTTTACATCT      ddddddeeedeeeggggihdeiiiiiiiiiiihiihiiihhiihhhhiiihhihiihihhhiiiiiiiiiiiiiiiigggggeeeeeab_      NH:i:1  HI:i:1  AS:i:176	nM:i:1
Thanks a lot for your help. Regards,
Nicolas
Old 06-12-2013, 07:08 AM   #47
Nino
Member
 
Location: New York City

Join Date: Mar 2013
Posts: 27

Hey Nicolas,

STAR doesn't actually compute a mapping quality score for the reads; its MAPQ field is assigned fixed values, set up like this:

255 = uniquely mapped reads
3 = read maps to 2 locations
2 = read maps to 3 locations
1 = read maps to 4-9 locations
0 = read maps to 10 or more locations

So in essence there are only five possible values for the mapping quality: 255, 3, 2, 1, and 0, as listed above. I hope this helps answer your question.
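
Given that scheme, one way to keep only the uniquely mapped reads downstream is to filter on MAPQ; a minimal sketch (file names assumed; Aligned.out.sam is STAR's default output, and samtools view -q skips alignments below the given MAPQ):

Code:
samtools view -bS -q 255 Aligned.out.sam > unique.bam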

Thanks,
Nino
Old 06-12-2013, 07:15 AM   #48
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19

Dear Nino,
That's great, thanks a lot for your answer.
It makes a lot of sense since I set up the alignment to output only uniquely mapping reads.
Thanks for the help.
Regards,
Nicolas
Old 06-12-2013, 03:54 PM   #49
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423

Quote:
Originally Posted by alexdobin
I believe it's best to feed Cufflinks only the highest-confidence alignments, and in my experience non-canonical junctions contain more false positives.
Also, many non-canonical splices occur just a few bases away from highly expressed canonical ones, which could be caused by sequencing/mapping errors, and possibly by spliceosome errors. These splices will likely throw Cufflinks' assembly off.
To add to that recommendation: I've also found that Cufflinks seems to perform worse if your alignments contain both primary and secondary positions for reads. It's best to have only one alignment per read or pair.
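
A minimal sketch of one way to enforce that before Cufflinks (file names assumed; 0x100 is the standard SAM flag for secondary alignments, and -F excludes reads carrying it):

Code:
samtools view -bS -F 0x100 Aligned.out.sam > primary_only.bam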
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-18-2013, 06:59 AM   #50
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Issue with shared memory

Dear STAR community and developers,

STAR is a great and efficient tool for mapping RNA reads, thank you.

Meanwhile, I think I have an issue: STAR does not seem to share the genome between parallel runs.
When running htop while submitting parallel STAR jobs, I observe a proportional increase in the amount of RAM used, about 25 GB per genome for the bovine genome (the memory bar was essentially empty before submitting the jobs, and the cache filled by 25 GB x 5 after submitting 5 jobs).

For the record, in htop the increase in RAM used affects the cache memory (yellow bars), while the "used" RAM (green) remains very low (about 5 GB).

The system is Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-34-generic x86_64)

Here are two commands among the several I submitted in parallel (replace some paths with ...):

STAR --runMode alignReads --genomeDir /.../STAR2.3.0e_no_annotation/ --genomeLoad LoadAndRemove --readFilesIn /.../N1855_CN_0H_pe1.fastq /.../N1855_CN_0H_pe2.fastq --runThreadN 3 --outFilterMultimapNmax 10 --outSAMmode Full --outSAMattributes Standard --outFileNamePrefix ./N1855_CN_0H_ --outReadsUnmapped Fastx

STAR --runMode alignReads --genomeDir /.../STAR2.3.0e_no_annotation/ --genomeLoad LoadAndRemove --readFilesIn /.../N1178_CN_0H_pe1.fastq /.../N1178_CN_0H_pe2.fastq --runThreadN 3 --outFilterMultimapNmax 10 --outSAMmode Full --outSAMattributes Standard --outFileNamePrefix ./N1178_CN_0H_ --outReadsUnmapped Fastx

Here is the memory status (as output from top) after 5 jobs submitted.
Mem: 264131104k total, 193475272k used, 70655832k free, 89844k buffers


Note:
I obtained the numbers above by submitting the jobs right after restarting the machine. About 15 min into the jobs, the 256 GB of RAM were fully in use:
Mem: 264131104k total, 263666924k used, 464180k free, 62412k buffers


Am I right in stating that the genome is loaded separately in memory for the different jobs, as they seem to add up in the htop memory bar? If so, is there something wrong in the command lines I submitted?

Many thanks,
Kevin Rue
Old 06-18-2013, 07:47 AM   #51
NitaC
Member
 
Location: Philadelphia

Join Date: Apr 2013
Posts: 17

I may not be entirely correct, as I suppose I am still a novice, but yes, I do believe STAR loads it all into memory. However, after the first run the remaining alignments are faster (at least, this is what I've observed). I also attempted to submit multiple runs at one time, and I ended up just killing the process. If you have the computing power, just run STAR with a few more threads, one alignment at a time. It really is a fast tool.
Old 06-18-2013, 08:26 AM   #52
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Issue with shared memory (2nd post)

Hi NitaC,

Thank you for your answer. Submitting one job at a time with more resources is an option, but sharing the genome in memory is definitely a powerful and attractive feature which I would like to use.

I am not sure what you mean by:
Quote:
loads it all into memory
From my point of view, "all" would be a single instance of the bovine genome. I suspect you meant "loads all the genomes for each separate job"?
From what the manual states, if one instance is already present in memory, any subsequent job should use that instance instead of loading another instance of the same genome (the LoadAndRemove option).

Meanwhile, what I observed is that for each concurrent job submitted (all pointing at the same genome folder), the amount of cached RAM used increases by approx. 25 GB.
In htop, for each job, I read the three columns:
VIRT RES SHR
25.9G 25.7G 25.1G

These numbers suggest that the genome is loaded in shared memory; yet we recently submitted 10 jobs simultaneously, the machine almost froze (extremely slow, even to log in), and even the SWAP memory was fully used.

I hope this explains more clearly my dilemma between my understanding of the manual (one instance loaded) and my observations (increasing RAM usage, frozen machine).

Does anyone have further insight into genome sharing?

Many thanks
Kevin
Old 06-18-2013, 01:43 PM   #53
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423

First, load the genome into shared memory without aligning anything at all:
Code:
STAR --genomeDir /.../STAR2.3.0e_no_annotation/ --genomeLoad LoadAndExit
Then, for each aligning instance, specify the option '--genomeLoad LoadAndKeep'. This instructs STAR to look for the genome in shared memory, use it for the run, and then leave it loaded there.

When you're finally done aligning, run the following to unload the genome from shared memory:

Code:
STAR --genomeDir /.../STAR2.3.0e_no_annotation/ --genomeLoad Remove
They also suggest that if a genome has been loaded in shared memory for some time, it may need to be unloaded and reloaded, because it may get "paged out" by the system if it isn't being used. This can seriously impact STAR's performance.
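
Putting the three steps together, a minimal wrapper sketch (hypothetical paths; assumes mate files named *_pe1.fastq and *_pe2.fastq, and reuses --runThreadN 3 from Kevin's commands):

Code:
#!/bin/bash
# 1) load the genome into shared memory once
STAR --genomeDir /path/to/index --genomeLoad LoadAndExit
# 2) launch the alignment jobs; each attaches to the shared copy
for fq1 in /path/to/fastq/*_pe1.fastq; do
    fq2=${fq1/_pe1/_pe2}                      # matching mate-2 file
    prefix=$(basename "$fq1" _pe1.fastq)      # sample name for output files
    STAR --genomeDir /path/to/index --genomeLoad LoadAndKeep \
         --readFilesIn "$fq1" "$fq2" --runThreadN 3 \
         --outFileNamePrefix "./${prefix}_" &
done
wait   # block until all background jobs finish
# 3) unload the genome from shared memory
STAR --genomeDir /path/to/index --genomeLoad Remove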
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-19-2013, 09:07 AM   #54
kevinrue
Member
 
Location: London

Join Date: Jan 2013
Posts: 25
Issue with shared memory (solved)

Hi sdriscoll,

Many thanks for the very helpful answer; it works on our system exactly the way you described.

For those interested, I have attached a file containing the memory usage as output by top when submitting multiple jobs the way sdriscoll described.

In short:
  • Loading the genome was accompanied by a 52 GB increase in RAM usage (some independent process probably accounts for part of this, as the bovine genome is expected to take ~27 GB)
  • Each job was accompanied by a marginal increase in RAM usage (the jobs were confirmed by a 3% increase in CPU usage)
  • Killing the jobs left the RAM usage constant (but reduced the CPU usage)
  • Removing the genome reduced the RAM usage by ~27 GB (the expected size of the bovine genome)

Our confusion was that we understood from the STAR manual that the LoadAndRemove option would share a genome the same way as LoadAndKeep, except for removing the genome from memory after the last job finishes.
Apparently, this is not the case.

If I got it right:
  • each job using LoadAndRemove will ignore any pre-loaded instance of the genome, load a new instance and remove it when done
  • each job using LoadAndKeep will share the existing instance of the genome, or load one if absent

Please correct me if I am wrong! (peer review is always appreciated)
Kevin
Attached Files
File Type: txt STAR_shared_genome.txt (3.1 KB, 48 views)
Old 06-19-2013, 11:55 AM   #55
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423

I think the documentation should include some examples, because the explanation is a little confusing. Glad that worked, though.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-19-2013, 12:18 PM   #56
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19

Hi everyone,

At the moment I am having an issue with STAR 2.3.0e, which I am trying to use to align a 2x90-base paired-end read sample against the Mycobacterium bovis genome NC_002945.3 from NCBI.
I get the error below:

Code:
Jun 19 19:45:09 ..... Started STAR run
Jun 19 19:45:09 ..... Started mapping
Segmentation fault (core dumped)
The command used to generate the genome was (this seems to work fine):

Code:
STAR --runMode genomeGenerate --genomeDir /path/STAR2.3.0e --genomeFastaFiles /path/Mycobacterium_bovis_NC_002945.3.fasta --runThreadN 1
The command used to align the reads was (this gives me the error above):

Code:
STAR --runMode alignReads --genomeDir /workspace/storage/genomes/Mycobacterium_bovis/NC_002945.3/STAR2.3.0e --genomeLoad NoSharedMemory --readFilesIn /home/dmagee/scratch/ALV_MAC_RNAseq/fastq_sequence/N1178_CN_0H_pe1.fastq /home/dmagee/scratch/ALV_MAC_RNAseq/fastq_sequence/N1178_CN_0H_pe2.fastq --runThreadN 1 --outFilterMultimapNmax 10 --outSAMmode Full --outSAMattributes Standard --outFileNamePrefix ./N1178_CN_0H_ --outReadsUnmapped Fastx
I am not sure what is going on here, since my command worked perfectly when I aligned the exact same read sample against the Bos taurus genome.

Any idea on what the problem could be will be very much appreciated.
Thanks a lot.
Best wishes,
Nicolas
Old 06-19-2013, 01:44 PM   #57
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

Hi Kevin, Shawn,

It's great that Shawn's work-around worked. However, the --genomeLoad LoadAndRemove option is supposed to work the same way, allowing one copy of the genome to be shared between jobs. The only thing I can think of at the moment is that if you submit the jobs at precisely the same moment, they might not see each other's shared memory and each decide to allocate its own. Could you please try running 2-3 jobs (without killing your server), pausing for 10 sec between them, and send me the Log.out output for each job? Also, while they are running, you can run the Linux 'ipcs' command, which will tell us which shared memory segments are being used.

The fact that 50 GB of RAM is used for the 27 GB index also concerns me a bit.
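
For reference, a quick sketch of that check (ipcs -m is the standard invocation; output format varies by system):

Code:
ipcs -m   # list shared memory segments (size, owner, attach count); STAR's genome should appear as one large segment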

Cheers
Alex
Old 06-19-2013, 01:47 PM   #58
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

Quote:
Originally Posted by Nicolas Nalpas
Hi everyone,

At the moment I am having an issue with STAR 2.3.0e, which I am trying to use to align a 2x90-base paired-end read sample against the Mycobacterium bovis genome NC_002945.3 from NCBI.
I get the error below:

Code:
Jun 19 19:45:09 ..... Started STAR run
Jun 19 19:45:09 ..... Started mapping
Segmentation fault (core dumped)
Nicolas
Hi Nicolas,

This is a known problem with very small genomes. At the genome generation step, please try reducing the value of --genomeSAindexNbases to <=8, and then re-run the mapping step.
Generally, --genomeSAindexNbases needs to be scaled with the genome length, as ~min(14, log2(ReferenceLength)/2 - 1). I will need to incorporate this scaling in a future release.
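
As a worked example of that rule (genome length assumed, roughly that of M. bovis), it can be evaluated with a shell one-liner:

Code:
L=4350000   # approx. M. bovis genome length, assumed for illustration
awk -v L="$L" 'BEGIN { n = log(L)/log(2)/2 - 1; print (n < 14 ? int(n) : 14) }'   # prints 10

Note the formula gives ~10 here, while 7-8 were the values that finally worked later in this thread.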

Cheers
Alex
Old 06-20-2013, 01:37 AM   #59
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19

Dear Alex,

Thanks for your help. I have tried generating my Mycobacterium bovis genome as you recommended:

Code:
STAR --runMode genomeGenerate --genomeDir /workspace/storage/genomes/Mycobacterium_bovis/NC_002945.3/STAR2.3.0e --genomeFastaFiles /workspace/storage/genomes/Mycobacterium_bovis/NC_002945.3/source_file/Mycobacterium_bovis_NC_002945.3.fasta --genomeSAindexNbases 8 --runThreadN 1
And I now obtain a different error while doing the alignment:

Code:
Jun 20 09:23:01 ...... FATAL ERROR, exiting
Jun 20 09:23:01 ..... Started STAR run

EXITING because of FATAL error, could not open file /path/sjdbInfo.txt
SOLUTION: check that the path to genome files, specified in --genomDir is correct and the files are present, and have user read permsissions
So I checked the genome generation directory, on which read and write permissions are set; however, there is no such "sjdbInfo.txt" file.

My command for alignment was:

Code:
STAR --runMode alignReads --genomeDir /path/STAR2.3.0e --genomeLoad LoadAndKeep --readFilesIn /path/N1178_CN_0H_pe1.fastq /path/N1178_CN_0H_pe2.fastq --runThreadN 1 --outFilterMultimapNmax 10 --outSAMmode Full --outSAMattributes Standard --outFileNamePrefix ./N1178_CN_0H_ --outReadsUnmapped Fastx
Any idea on how I can sort this out?

Thanks a lot for all your help.
Regards,
Nicolas
Old 06-20-2013, 01:59 AM   #60
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19

Dear Alex,

I actually sorted it out by reading your previous answer again: I tried --genomeSAindexNbases set to 7, and it seems to work fine now, no errors so far.

Thanks again for all your help, much appreciated.
Regards,
Nicolas
Tags
alignment, genome, mapping, rna-seq, transcriptome
