**alexdobin** · 02-19-2013, 10:01 AM

large number of contigs

Originally posted by JonB

I have problems generating a new genome.
My genome is more than 300.000 contigs and ~300MB in size.

Feb 18 11:25:11 ..... Started STAR run
Feb 18 11:25:11 ... Starting to generate Genome files
/var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed

As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize=2^genomeChrBinNbits. By default, --genomeChrBinNbits = 18, so BinSize=2^18~256kb, and with 300,000 contigs you would need ~75GB of RAM - that's what likely killed your job.

I suggest that you try a much smaller value of --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be significant slowdown in the mapping speed when the number of contigs is too big.
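The overhead arithmetic above can be checked with a minimal sketch. It assumes roughly one byte of RAM per padded base, which is an assumption made here to turn bases into bytes; the function name is hypothetical and this is not STAR's own code.

```python
def bin_overhead_bases(n_contigs: int, chr_bin_nbits: int) -> int:
    """Worst-case padding when every contig starts at a new bin.

    Implements Nchromosomes * BinSize with BinSize = 2^genomeChrBinNbits.
    """
    bin_size = 2 ** chr_bin_nbits
    return n_contigs * bin_size


# Compare the default (18) with the suggested value (12) for 300,000 contigs.
# Assumption: about one byte per padded base, so bases are read as bytes.
for nbits in (18, 12):
    overhead = bin_overhead_bases(300_000, nbits)
    print(f"--genomeChrBinNbits {nbits}: ~{overhead / 1e9:.1f} GB of padding")
```

Under that assumption the default gives about 79 GB (roughly 73 GiB), in line with the ~75GB estimate, while 12 bits brings the padding down to a little over 1 GB.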

**alexdobin** · 02-19-2013, 10:25 AM

mixed datasets

Originally posted by bob-loblaw

This looks great, I think I'll try it out! Have you tested STAR on mixed RNA datasets? i.e. RNA-Seq on a sample containing both human and bacterial/viral RNA? That's what I'm working with at the moment and Tophat just isn't cutting it, there are millions of reads that are identified as being human and bacterial (depending on which aligning step I run first).

We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to a combined genome it considers various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. On the other hand, in the 2-step method you force alignments to one of the species in the 1st step, so your results will be strongly biased towards the 1st species.
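As a rough illustration of the ~10*TotalGenomeSize rule of thumb, here is a small sketch. The combined size of 5 billion bases is a round figure picked only because it reproduces the ~50GB quoted above; it is not a measured human+mouse genome size, and the function name is hypothetical.

```python
def estimated_ram_bytes(total_genome_bases: int, bytes_per_base: int = 10) -> int:
    """Rule-of-thumb RAM estimate: about 10 bytes per genome base."""
    return total_genome_bases * bytes_per_base


# Illustrative combined genome size (assumption, not a measured value).
combined_bases = 5_000_000_000
print(f"~{estimated_ram_bytes(combined_bases) / 1e9:.0f} GB of RAM")
```

The padding from --genomeChrBinNbits comes on top of this when the combined reference has many small sequences.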

**cram** · 02-19-2013, 10:26 AM

As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize=2^genomeChrBinNbits. By default, --genomeChrBinNbits = 18, so BinSize=2^18~256kb, and with 300,000 contigs you would need ~75GB of RAM - that's what likely killed your job.

I suggest that you try a much smaller value of --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be significant slowdown in the mapping speed when the number of contigs is too big.

I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.

**alexdobin** · 02-20-2013, 05:57 AM

large number of contigs

Originally posted by cram

I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.

This is a great idea. It will solve both the RAM problem and the slowdown problem I was talking about. However, you would need to post-process your alignments to extract alignments' coordinates within the real contigs from the coordinates within "super-contigs". Also, while Xs or Ns separating the contigs will prevent STAR from "extending" the alignment into adjacent contigs, it will not prevent splicing between contigs. So at the post-processing step you would also need to filter out alignments that span more than one real contig.
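A minimal sketch of the super-contig approach and the post-processing step might look like the following. The spacer length of 200 Ns, the function names, and the 0-based half-open coordinate convention are all assumptions made for illustration; neither post specifies them. Passing the full span of an alignment (first to last aligned base, introns included) means an alignment spliced between two contigs is rejected along with any that run into a spacer.

```python
from bisect import bisect_right

SPACER = "N" * 200  # assumed length; the post only says "a couple hundred"


def build_supercontig(contigs):
    """Concatenate (name, seq) pairs with spacers; return sequence and lookup tables."""
    parts, starts, names, lengths = [], [], [], []
    pos = 0
    for i, (name, seq) in enumerate(contigs):
        if i > 0:
            parts.append(SPACER)
            pos += len(SPACER)
        starts.append(pos)        # where this contig begins in the super-contig
        names.append(name)
        lengths.append(len(seq))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), starts, names, lengths


def to_contig_coords(start, end, starts, names, lengths):
    """Map a 0-based half-open [start, end) super-contig span to one real contig.

    Returns (name, local_start, local_end), or None when the span touches a
    spacer or more than one contig.
    """
    i = bisect_right(starts, start) - 1   # last contig starting at or before `start`
    if i < 0:
        return None
    local_start = start - starts[i]
    local_end = end - starts[i]
    if local_end > lengths[i]:            # runs past the end of this contig
        return None
    return names[i], local_start, local_end


contigs = [("c1", "ACGTACGT"), ("c2", "TTTTGGGG")]
seq, starts, names, lengths = build_supercontig(contigs)
print(to_contig_coords(210, 214, starts, names, lengths))  # falls inside c2
print(to_contig_coords(4, 212, starts, names, lengths))    # spans c1 and c2
```

Keeping the start offsets in a sorted list makes each lookup a binary search, which matters when there are hundreds of thousands of contigs.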

**thomasblomquist** · 02-20-2013, 06:18 AM

Trouble generating a small genome suffix array index...

I'm having trouble generating a small reference genome for running STAR. Below is the command, Error output and the input "genome" file. Your thoughts on how I can rectify are welcome. At present, I am using BFAST and it works reasonably well, but am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

Regards,

-Tom Blomquist

University of Toledo, Ohio

Command:

./STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ./barcode.fa --runThreadN 8

Error Output:

Feb 19 09:39:34 ..... Started STAR run
Feb 19 09:39:34 ... Starting to generate Genome files
Feb 19 09:39:34 ... starting to sort Suffix Array. This may take a long time...
Feb 19 09:39:34 ... sorting Suffix Array chunks and saving them to disk...
Feb 19 09:39:34 ... loading chunks from disk, packing SA...
Feb 19 09:39:34 ... writing Suffix Array to disk ...
Feb 19 09:39:34 ... Finished generating suffix array
Feb 19 09:39:34 ... starting to generate Suffix Array index...

BUG: next index is smaller than previous, EXITING

./barcode.fa is the following -->

>ATGC
TTTTCATGCGATCAGGCGTCTGTCGTGCTC

>CAGT
TTTTCCAGTGATCAGGCGTCTGTCGTGCTC

>GACA
TTTTCGACAGATCAGGCGTCTGTCGTGCTC

>TGTG
TTTTCTGTGGATCAGGCGTCTGTCGTGCTC

>ACAC
TTTTCACACGATCAGGCGTCTGTCGTGCTC

>TACG
TTTTCTACGGATCAGGCGTCTGTCGTGCTC

>GCTA
TTTTCGCTAGATCAGGCGTCTGTCGTGCTC

>CATA
TTTTCCATAGATCAGGCGTCTGTCGTGCTC

>TAGA
TTTTCTAGAGATCAGGCGTCTGTCGTGCTC

>GACT
TTTTCGACTGATCAGGCGTCTGTCGTGCTC

>CTGA
TTTTCCTGAGATCAGGCGTCTGTCGTGCTC

>TGCT
TTTTCTGCTGATCAGGCGTCTGTCGTGCTC

**alexdobin** · 02-20-2013, 06:30 AM

small reference

Originally posted by thomasblomquist

I'm having trouble generating a small reference genome for running STAR. Below is the command, Error output and the input "genome" file. Your thoughts on how I can rectify are welcome. At present, I am using BFAST and it works reasonably well, but am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

Regards,

-Tom Blomquist

University of Toledo, Ohio

Tom,
please use the following parameters:
--genomeChrBinNbits 6 --genomeSAindexNbases 4
I was able to run the genome generation step successfully for your small reference.
It looks like you want to use STAR in a quite non-standard way.
If you explain it a bit more I could suggest the parameters for the mapping step.
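For reference, the suggested parameters can be combined with the original command as in this sketch. It only assembles and prints the command line, reusing the paths and thread count from the original post; the helper function is hypothetical and nothing is executed.

```python
import shlex


def genome_generate_cmd(fasta, genome_dir, chr_bin_nbits, sa_index_nbases, threads=8):
    """Assemble a STAR genomeGenerate command line as an argument list."""
    return [
        "./STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--runThreadN", str(threads),
        "--genomeChrBinNbits", str(chr_bin_nbits),
        "--genomeSAindexNbases", str(sa_index_nbases),
    ]


cmd = genome_generate_cmd("./barcode.fa", "./", chr_bin_nbits=6, sa_index_nbases=4)
print(shlex.join(cmd))
```

The list form can be handed directly to `subprocess.run(cmd, check=True)` if you want to launch STAR from a script.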

**thomasblomquist** · 02-20-2013, 06:38 AM

Originally posted by alexdobin

Tom,
please use the following parameters:
--genomeChrBinNbits 6 --genomeSAindexNbases 4
I was able to run the genome generation step successfully for your small reference.
It looks like you want to use STAR in a quite non-standard way.
If you explain it a bit more I could suggest the parameters for the mapping step.

Thank you. I do tend to go about things in a non-standard way. I am performing targeted resequencing of an RNA-seq library, also known as amplicon sequencing. It is also semi-quantitative. My end goal is to count the number of times a given sequence (e.g. for a gene target) or genetic variation (alleles with anywhere from 1 to 6 bases substituted) occurs. I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

Thank you for your time. I look forward to seeing how your software works in my studies.

Regards,

-Tom Blomquist

**alexdobin** · 02-21-2013, 01:10 PM

Originally posted by thomasblomquist

Thank you. I do tend to go about things in a non-standard way. I am performing targeted resequencing of an RNA-seq library, also known as amplicon sequencing. It is also semi-quantitative. My end goal is to count the number of times a given sequence (e.g. for a gene target) or genetic variation (alleles with anywhere from 1 to 6 bases substituted) occurs. I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

Thank you for your time. I look forward to seeing how your software works in my studies.

Regards,

-Tom Blomquist

If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
--outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin L (L=shortest of the reference sequences)

Please let me know whether it works for you.
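One way to fill in N and L is sketched below: L is taken as the length of the shortest sequence in the reference FASTA, and N is left to the user. The helper names are hypothetical, the mismatch count of 1 is only an example value, and the two-record FASTA is an excerpt of the barcode file posted earlier in the thread.

```python
def shortest_reference_length(fasta_text):
    """Return the length of the shortest sequence in FASTA-formatted text."""
    lengths, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)   # close the previous record
            current = 0
        elif current is not None:
            current += len(line)          # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return min(lengths)


def mapping_filter_args(max_mismatches, min_matched_bases):
    """Assemble the four output-filter options suggested above."""
    return [
        "--outFilterMismatchNmax", str(max_mismatches),
        "--outFilterScoreMinOverLread", "0",
        "--outFilterMatchNminOverLread", "0",
        "--outFilterMatchNmin", str(min_matched_bases),
    ]


fasta = """>ATGC
TTTTCATGCGATCAGGCGTCTGTCGTGCTC

>CAGT
TTTTCCAGTGATCAGGCGTCTGTCGTGCTC
"""
shortest = shortest_reference_length(fasta)
# max_mismatches=1 is an illustrative choice, not a recommendation.
print(mapping_filter_args(max_mismatches=1, min_matched_bases=shortest))
```

These arguments would be appended to the usual mapping command alongside `--genomeDir` and `--readFilesIn`.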

**thomasblomquist** · 02-21-2013, 01:31 PM

Originally posted by alexdobin

If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
--outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin L (L=shortest of the reference sequences)

Please let me know whether it works for you.

In essence, yes. Since I generally know where the alleles are, the sequence reads are trimmed to include the variant loci and ~15 bases of sequencing read on either side of the expected locus or loci. So, I'm matching ~40 base sequences to a reference library of chromosomes that are ~20-30 bases.

Thanks again for your help. I will try the modifications later tonight. -Tom

**thomasblomquist** · 02-22-2013, 02:18 AM

Had the chance to test your recommended parameters for my specific needs. This is not my normal mode of operation, but... HOLY POOP!!! I cut a 2 hour BFAST match and sorting into bins down to 10 seconds!!! And the specificity and sensitivity is off the hook! When I do some more formal testing I will provide back metrics.

You are a computer jedi master! This will make analysis of clinical sequencing data a breeze. Very intuitive command lines.

-Tom Blomquist

**bob-loblaw** · 02-22-2013, 02:24 AM

Originally posted by alexdobin

We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to a combined genome it considers various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. On the other hand, in the 2-step method you force alignments to one of the species in the 1st step, so your results will be strongly biased towards the 1st species.

Interesting, thanks. Unfortunately for mixed-species mapping (at least when one species has alternative splicing and the other does not), the way tophat works is essentially the same as running bowtie for bacterial reads first (though I think there might be a way around it). There are also problems with the max size of a bowtie index :/ Would the size be a concern for STAR? Or could it run a database of any size? (Hopefully RAM shouldn't be too much of an issue for me.)

Also just to clarify, you don't know how STAR would perform with bacterial reads?

**thomasblomquist** · 02-22-2013, 07:28 AM

For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

Regards,

-Tom

**Nino** · 03-27-2013, 11:54 AM

Hey, does anyone know whether the reference genome needs to be indexed specifically for STAR? I know that for tophat2 the reference genome needs a bowtie2 index (*.bt2).

Thanks,
Nino

**alexdobin** · 03-27-2013, 01:46 PM

Originally posted by bob-loblaw

Interesting, thanks. Unfortunately for mixed-species mapping (at least when one species has alternative splicing and the other does not), the way tophat works is essentially the same as running bowtie for bacterial reads first (though I think there might be a way around it). There are also problems with the max size of a bowtie index :/ Would the size be a concern for STAR? Or could it run a database of any size? (Hopefully RAM shouldn't be too much of an issue for me.)

Also just to clarify, you don't know how STAR would perform with bacterial reads?

Sorry, I did not check this thread for a while and just found your questions.

STAR will work with a database of any size; however, the required RAM scales linearly with database size, at ~10*DBsize bytes.

STAR works fine with bacterial reads; we have used it on long and small RNA-seq data for some bacteria.

**alexdobin** · 03-27-2013, 01:56 PM

Originally posted by thomasblomquist

For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

Regards,

-Tom

Sorry, I did not check this thread for a while and just found your question.

With default parameters, if two alignments differ by one mismatch, only the best one would be reported. This is controlled by --outFilterMultimapScoreRange (=1 by default), which defines the range of alignment scores, relative to the best score, that are reported as multi-mappers.

The max number of alignments that are allowed for output is controlled by --outFilterMultimapNmax. It's 10 by default, so any read with 10 or fewer alignments with scores >= BestScore-1 will be reported.
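The reporting rule described above can be modelled with a short sketch. This is a simplified model written for illustration, not STAR's actual code; the function name is hypothetical, and the example scores assume that a single mismatch lowers the alignment score by 2 (one lost match plus a mismatch penalty), which should be checked against the scoring parameters of your STAR version.

```python
def reported_alignments(scores, score_range=1, multimap_nmax=10):
    """Return the alignment scores that would be reported for one read.

    score_range   models --outFilterMultimapScoreRange
    multimap_nmax models --outFilterMultimapNmax
    """
    if not scores:
        return []
    best = max(scores)
    # Alignments within `score_range` of the best score count as multi-mappers.
    candidates = [s for s in scores if s >= best - score_range]
    # Too many candidates: nothing is reported for this read.
    if len(candidates) > multimap_nmax:
        return []
    return candidates


# Perfect match vs. one mismatch (assumed score difference of 2): only the best.
print(reported_alignments([40, 38]))
# Two equally good alleles with defaults: both are reported.
print(reported_alignments([40, 40]))
# Same tie with the limit set to 1: the read is dropped.
print(reported_alignments([40, 40], multimap_nmax=1))
```

Under this model, setting --outFilterMultimapNmax 1 gives the behaviour asked for: a single best match is reported, and a tie between two equivalent matches reports nothing.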
