-
Originally posted by alexdobin:
We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.
Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
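The RAM rule of thumb quoted above (~10 bytes per genome base) is easy to check before building a combined index. A minimal sketch of that arithmetic; the genome sizes are rounded, assumed figures, not exact assembly lengths, and the function name is made up for illustration:

```python
def estimate_star_ram_bytes(genome_sizes_bp, bytes_per_base=10):
    """Rule of thumb from the thread: RAM ~ 10 * total genome size in bases."""
    return bytes_per_base * sum(genome_sizes_bp)


# Approximate genome sizes in bases (assumed round figures for illustration).
HUMAN_BP = 3_100_000_000
MOUSE_BP = 2_700_000_000

ram = estimate_star_ram_bytes([HUMAN_BP, MOUSE_BP])
print(f"Estimated RAM for human+mouse: {ram / 1e9:.0f} GB")
```

With these rounded sizes the estimate lands in the same range as the ~50GB quoted above; the exact figure depends on which scaffolds are included.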
-
Originally posted by apredeus:
Alex, thank you for the great tool - STAR is indeed very impressive!
Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
To generate a combined mouse/human genome with STAR, you need to slightly modify your FASTA and GTF files:
1. Modify the chromosome names so that mouse and human chromosomes have distinct names, e.g. chr1h/chr1m etc. In the FASTA files, make this change in every sequence name line (i.e. those starting with ">"). In the GTF files, change the chromosome name in field 1 of every line.
2. Make sure that the transcript_id values in the mouse and human GTF files are distinct. This is usually the case; for instance, Gencode uses "ENSMUSTxxxxx" for mouse and "ENSTxxxxx" for human.
3. Concatenate the mouse and human GTF files into a single GTF file.
4. Run genome generation with:
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir ./ --genomeFastaFiles /path/to/human.fa /path/to/mouse.fa --sjdbGTFfile /path/to/mouse_human.gtf --sjdbOverhang 100
If you want to use mRNA GTF files instead of, or in addition to, standard annotations, I would recommend checking the splice junctions in these files for very short introns and filtering them out - please see this post.
Cheers
Alex
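Steps 1-3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the "h"/"m" suffixes are hypothetical, the example records are made up, and the sequence name is taken to be the text between ">" and the first space.

```python
def tag_fasta(lines, suffix):
    """Append `suffix` to the sequence name on every '>' header line."""
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the text up to the first space.
            name, sep, rest = line[1:].rstrip("\n").partition(" ")
            yield ">" + name + suffix + sep + rest + "\n"
        else:
            yield line


def tag_gtf(lines, suffix):
    """Append `suffix` to the chromosome name (field 1) of every GTF record."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line  # leave comment and blank lines untouched
        else:
            chrom, tab, rest = line.partition("\t")
            yield chrom + suffix + tab + rest


# Tiny made-up inputs standing in for the real files.
human_fa = [">chr1 human\n", "ACGT\n"]
human_gtf = ['chr1\tsrc\texon\t1\t4\t.\t+\t.\ttranscript_id "ENST0001";\n']
mouse_gtf = ['chr1\tsrc\texon\t1\t4\t.\t+\t.\ttranscript_id "ENSMUST0001";\n']

tagged_fa = list(tag_fasta(human_fa, "h"))
# Step 3: concatenate the two renamed annotations into one GTF.
combined_gtf = list(tag_gtf(human_gtf, "h")) + list(tag_gtf(mouse_gtf, "m"))
print("".join(tagged_fa + combined_gtf), end="")
```

For real genomes the same generators can be fed open file handles and written out line by line, so the files never need to fit in memory.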
-
Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.
What would "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collections for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hs19), so I'm not sure what else people use.
-
Also, I meant to ask - do you normally include random chromosomes and "hap" chromosomes from human genome into the overall index? I know it shouldn't make a huge difference, however there's quite a few transcripts that are mapped to these.
Thank you!
-
Originally posted by apredeus:
Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.
What would "standard" annotation be in this case? RefSeq? It's just that we have always used mRNA collection for both humans and mice (there's about 1.5 mil for mm9 and 2.5 or so for hs19), I'm not sure what else do people use.
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to start by generating the genome with the mRNA GTF file, then filter suspicious junctions from the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
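The junction filtering step could look roughly like the sketch below. It assumes the junction list is tab-separated with chromosome, intron start, intron end and strand in the first four columns (the layout --sjdbFileChrStartEnd takes); the function name and the 21-base cutoff are illustrative assumptions, so pick a threshold that suits your organism.

```python
def filter_short_introns(lines, min_intron_len=21):
    """Keep only junctions whose intron spans at least `min_intron_len` bases."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        start, end = int(fields[1]), int(fields[2])
        # Start and end are both intron bases, hence the +1.
        if end - start + 1 >= min_intron_len:
            kept.append(line)
    return kept


# Made-up junctions: a 6-base "intron" (suspicious) and a 1001-base intron.
junctions = ["chr1h\t100\t105\t+\n", "chr1h\t1000\t2000\t-\n"]
filtered = filter_short_introns(junctions)
print("".join(filtered), end="")
```

The filtered lines can be written to a new file and passed to --sjdbFileChrStartEnd when the genome is re-generated.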
-
Originally posted by apredeus:
Also, I meant to ask - do you normally include random chromosomes and "hap" chromosomes from human genome into the overall index? I know it shouldn't make a huge difference, however there's quite a few transcripts that are mapped to these.
Thank you!
I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
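Dropping the haplotype scaffolds while keeping everything else is a simple pass over the FASTA file. A minimal sketch, assuming hg19-style names in which haplotype scaffolds contain "_hap"; the function name and the example records are made up for illustration:

```python
def drop_sequences(lines, unwanted_substring="_hap"):
    """Yield FASTA lines, skipping records whose name contains the substring."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the text up to the first space.
            name = line[1:].strip().split(" ")[0]
            keep = unwanted_substring not in name
        if keep:
            yield line


fasta = [
    ">chr6\n", "ACGT\n",
    ">chr6_apd_hap1\n", "TTTT\n",
    ">chrUn_gl000220\n", "GGCC\n",
]
kept = list(drop_sequences(fasta))
print("".join(kept), end="")
```

Here the unplaced gl scaffold is kept and only the hap record is removed.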
-
Originally posted by alexdobin:
I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
So, definitely including the extra scaffolds!
-
Originally posted by alexdobin:
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to start by generating the genome with the mRNA GTF file, then filter suspicious junctions from the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
-
failed to generate genome using STAR
Hi, I built the genome using this command:
STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles file.fa.gz --runThreadN 10
It failed with the message: "BUG: next index is smaller than previous, EXITING".
Also, does anyone have a more detailed manual for STAR? The manual I downloaded from the website shows:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
What are the other options in "…"? I tried unzipping the fa.gz file to a fa file and then got this error message: "limitGenomeGenerateRAM=28is too small for your genome
SOLUTION: please specify limitGenomeGenerateRAM not less than114 GB and make that much RAM available".
For other aligners we can type -h or --help to find the details, but not for STAR...
Last edited by shangzhong0619; 06-19-2014, 02:38 PM.
-
Hi Shangzhong,
please try one of the latest STAR patches.
Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.
You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:
sjdbGTFfile -
    string: path to the GTF file with annotations
sjdbOverhang 0
    int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
    if =0, splice junction database is not used
Cheers
Alex
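For choosing a --genomeChrBinNbits value, the STAR manual suggests scaling it roughly as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A minimal sketch of that arithmetic follows; the function name and the example numbers are hypothetical, and rounding down is an assumption made here to err on the side of smaller bins, in line with the "14 or smaller" advice above.

```python
import math


def suggest_chr_bin_nbits(genome_length, n_references, read_length, cap=18):
    """min(cap, log2(max(genome_length / n_references, read_length))), rounded down."""
    per_reference = max(genome_length / n_references, read_length)
    return min(cap, math.floor(math.log2(per_reference)))


# Made-up example: a 3 Gb assembly split into 100,000 scaffolds, 100-base reads.
many_scaffolds = suggest_chr_bin_nbits(3_000_000_000, 100_000, 100)
# The same genome size in 25 chromosomes hits the cap.
few_chromosomes = suggest_chr_bin_nbits(3_000_000_000, 25, 100)
print(many_scaffolds, few_chromosomes)
```

The result would then be passed as --genomeChrBinNbits at the genome generation step.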
-
Thanks for your reply. Yes, my reference FASTA has many scaffolds. When I try to install the latest version, it shows the following error:
samtools/libbam.a(bgzf.o): In function `bgzf_compress':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:187: undefined reference to `deflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:188: undefined reference to `deflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:189: undefined reference to `deflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
samtools/libbam.a(bgzf.o): In function `bgzf_dopen':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:160: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `bgzf_open':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:142: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `inflate_block':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:224: undefined reference to `inflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:228: undefined reference to `inflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:229: undefined reference to `inflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:233: undefined reference to `inflateEnd'
samtools/libbam.a(bam_import.o): In function `ks_getuntil2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `__bam_get_lines':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:92: undefined reference to `gzclose'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_close':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:485: undefined reference to `gzclose'
samtools/libbam.a(bam_import.o): In function `sam_open':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `ks_getc':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:147: undefined reference to `gzclose'
collect2: ld returned 1 exit status
make: *** [STAR] Error 1
I have samtools-0.1.19 on my computer. What is this error about? Thank you.
-
Hi Shangzhong,
please try this patch http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz
If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.
Cheers
Alex
-
Originally posted by shangzhong0619:
Thanks. It works. I have another problem: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me and I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.
For genome generation, STAR needs an unzipped FASTA file. You do it once per genome, and can delete the FASTA after the genome is generated. The '--readFilesCommand zcat' option only applies to fastq/fasta reads at the mapping stage.
Cheers
Alex
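Unzipping the FASTA before genome generation can be done with the Python standard library if gunzip is not at hand. A minimal sketch; the function name and file names are placeholders, and the demo writes a throwaway file rather than touching a real genome:

```python
import gzip
import os
import shutil
import tempfile


def gunzip_fasta(gz_path, out_path):
    """Stream-decompress a gzipped FASTA to a plain file without loading it whole."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Round-trip demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "file.fa.gz")
    fa_path = os.path.join(tmp, "file.fa")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b">scaffold1\nACGT\n")
    gunzip_fasta(gz_path, fa_path)
    with open(fa_path, "rb") as handle:
        restored = handle.read()
print(restored.decode(), end="")
```

The plain file is what goes to --genomeFastaFiles, and it can be deleted once the genome has been generated.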