  • alexdobin
    Senior Member
    • Feb 2009
    • 161

    #31
    Originally posted by Nino View Post
    Hey, does anyone know if you need the reference genome indexed according to Star because I know for tophat2 the reference genome needs to be indexed *.b2t (bowtie2)

    Thanks,
    Nino
    You will need to generate special genome files for STAR.
    This is done with the following command:
    STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
    If you want to use annotations for improved mapping accuracy, you also need to use:
    --sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.
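As a rough sketch of the command above, the script below computes --sjdbOverhang as ReadMateLength-1 and assembles the genomeGenerate call. The read length, thread count and variable names are assumptions for illustration; the paths are the placeholders from the post. It only prints the command so it can be checked before running.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; replace with your own paths and read length.
genome_dir=/path/to/GenomeDir
fasta_files=(/path/to/genome/fasta1 /path/to/genome/fasta2)
annotation=/path/to/Annot.gtf
read_mate_length=100   # length of one mate, e.g. 100 for 2x100 reads
threads=8

# Ideal overhang is ReadMateLength-1, as described in the post.
sjdb_overhang=$(( read_mate_length - 1 ))

cmd=(STAR --runMode genomeGenerate
     --genomeDir "$genome_dir"
     --genomeFastaFiles "${fasta_files[@]}"
     --runThreadN "$threads"
     --sjdbGTFfile "$annotation"
     --sjdbOverhang "$sjdb_overhang")

# Print the command for inspection; run it with "${cmd[@]}" once the paths are real.
printf '%s ' "${cmd[@]}"; echo
```

Keeping the command in an array means paths with spaces survive intact when it is eventually executed.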

    Comment

    • Auction
      Member
      • Jul 2009
      • 24

      #32
      Originally posted by alexdobin View Post
      You will need to generate special genome files for STAR.
      This is done with the following command:
      STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
      If you want to use annotations for improved mapping accuracy, you also need to use:
      --sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.
      Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

      Thanks.

      Comment

      • alexdobin
        Senior Member
        • Feb 2009
        • 161

        #33
        Originally posted by Auction View Post
        Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

        Thanks.
        To switch on chimeric detection and output, you need to specify a non-zero --chimSegmentMin, which is the minimum length of each segment (piece) that a chimera is made of. For example, if you have 2x100 PE reads and specify --chimSegmentMin 20, you could have a chimera in which one segment of (100-mate1+80-mate2) bases maps non-chimerically to one chromosome, and another segment of 20b-mate2 maps to another chromosome.
        The Chimeric output will go into Chimeric.out.sam and Chimeric.out.junction files.

        Note that the same read can have both acceptable non-chimeric alignments (output to Aligned.out.sam) and chimeric alignments (output to Chimeric.out.*). A read is considered "unmapped" if it does not have an acceptable non-chimeric alignment. --outSAMunmapped Within will output "unmapped" reads into Aligned.out.sam without alignment coordinates (which allows the fastq file to be fully reconstructed from the SAM file), while --outReadsUnmapped Fastx will output them into fastq or fasta files.

        There are other parameters that control chimeric detection:
        chimJunctionOverhangMin 20
        int>0: minimum overhang for a chimeric junction
        chimScoreMin 0
        int>0: minimum total (summed) score of the chimeric segments
        chimScoreDropMax 20
        int>0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
        chimScoreSeparation 10
        int>0: minimum difference (separation) between the best chimeric score and the next one
        chimScoreJunctionNonGTAG -1
        int: penalty for a non-GT/AG chimeric junction
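The 2x100 example can be worked through in a few lines, under the assumption that --chimSegmentMin is set to 20 to match the 20-base segment it describes. The sketch below does the arithmetic and prints a STAR call that uses only flags named in this thread; the paths and read file names are placeholders, and the chimJunctionOverhangMin value is simply the one listed above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Worked numbers for the 2x100 paired-end example.
mate_length=100
chim_segment_min=20                                   # assumed value for --chimSegmentMin
short_segment=$chim_segment_min                       # 20 bases of mate2 on the second chromosome
long_segment=$(( 2 * mate_length - short_segment ))   # 100 (mate1) + 80 (mate2)
echo "segments: ${long_segment} + ${short_segment} bases"

# Hypothetical paths and read files; printed only, not executed.
cmd=(STAR --genomeDir /path/to/GenomeDir
     --readFilesIn mate1.fq mate2.fq
     --chimSegmentMin "$chim_segment_min"
     --chimJunctionOverhangMin 20
     --outSAMunmapped Within)

printf '%s ' "${cmd[@]}"; echo
```

With chimeric detection switched on this way, the chimeric alignments would be expected in Chimeric.out.sam and Chimeric.out.junction, as described in the post.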

        Comment

        • Sipkovandam@gmail.com
          Member
          • Mar 2013
          • 13

          #34
          I am pretty new to RNA-seq analysis. I am now using STAR instead of TopHat, and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss, though, is the .GTF file I get from TopHat that contains new genes predicted based on the reads and splice junctions.
          Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?

          Comment

          • alexdobin
            Senior Member
            • Feb 2009
            • 161

            #35
            Originally posted by Sipkovandam@gmail.com View Post
            I am pretty new to RNA-seq analysis. I am now using STAR instead of TopHat, and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss, though, is the .GTF file I get from TopHat that contains new genes predicted based on the reads and splice junctions.
            Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?
            As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.

            You can run Cufflinks on STAR alignments.
            If you have un-stranded RNA-seq data you will need to run STAR with --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.

            If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the --library-type option. For example,
            cufflinks ... ... --library-type fr-firststrand
            should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
            It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
            --outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated
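The advice above can be summarised as a small helper that prints which options go to which tool. This is only a sketch: the function name strand_options and its "stranded"/"unstranded" arguments are made up for illustration, and fr-firststrand is assumed to be the right library type only for the dUTP protocol mentioned in the post.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the STAR and Cufflinks options suggested in the post for a library type.
# Usage: strand_options stranded|unstranded   (hypothetical helper)
strand_options() {
  local library=$1
  # Non-canonical junctions are removed in both cases.
  local star_opts="--outFilterIntronMotifs RemoveNoncanonical"
  local cufflinks_opts=""
  case "$library" in
    unstranded) star_opts+=" --outSAMstrandField intronMotif" ;;   # adds the XS attribute
    stranded)   cufflinks_opts="--library-type fr-firststrand" ;;  # dUTP protocol
    *) echo "unknown library type: $library" >&2; return 1 ;;
  esac
  echo "STAR options: $star_opts"
  echo "Cufflinks options: ${cufflinks_opts:-(none)}"
}

strand_options unstranded
strand_options stranded
```

The point of the split is that strandedness is handled on exactly one side: STAR adds the strand attribute for unstranded data, while Cufflinks is told the protocol for stranded data.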

            Comment

            • Sipkovandam@gmail.com
              Member
              • Mar 2013
              • 13

              #36
              As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.
              You are right, sorry I mixed it up a bit. Thanks for the information on the options I should use.

              Comment

              • bruce01
                Senior Member
                • Mar 2011
                • 160

                #37
                Hi all, sorry for the basic question:

                I am writing a bash script to submit STAR jobs, remove duplicates, get counts etc. The dataset I have has multiple fastq files per sample, but a different number for each sample. I have made files containing the fastq names in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? I.e.:


                STAR [options] readFilesIn $files/file_read1 $files/file_read2

                ?

                Have tried a few ways to do this but can't figure it out or get STAR to accept input. I am a 'midrange' bioinformatics PhD, so don't hold back on most efficient or crazy way of doing this!

                Thanks in advance,

                Bruce.
                Last edited by bruce01; 05-07-2013, 05:12 AM.

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #38
                  Originally posted by bruce01 View Post
                  I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? Ie:

                  STAR [options] readFilesIn $files/file_read1 $files/file_read2
                  Have you just tried the following?
                  Code:
                  STAR --readFilesIn Sample1_r1_1.fq,Sample1_r1_2.fq,Sample1_r1_3.fq... Sample1_r2_1.fq,Sample1_r2_2.fq,Sample1_r2_3.fq...
                  You could also just concatenate the files together as appropriate and use the result.

                  Comment

                  • bruce01
                    Senior Member
                    • Mar 2011
                    • 160

                    #39
                    Dpryan, yes have tried using wildcards as input to test it works, I get a segmentation fault. When I run it with all filenames included as standard it runs fine. I have a lot of samples, with variable numbers of fastq files per sample, and want a single script to submit to a queue. So inputting all fastq by hand is not an option, hence my original question.

                    Concatenating the fastqs would mean uncompressing them, which uses computing time, and I am keen to work directly from the .gz files that my facility has supplied. This can't be too big a problem, can it?

                    Comment

                    • dpryan
                      Devon Ryan
                      • Jul 2011
                      • 3478

                      #40
                      My example didn't use wildcards, so I'm not sure where that idea came from.

                      You can just concatenate the gzipped files together without uncompressing them first.

                      The other normal process would be to simply write your script to generate the comma-separated list that's then fed to STAR. You should be able to do that easily enough in bash, which whatever you're using for job scheduling can probably already handle.
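The claim about concatenating gzipped files can be checked on toy data. The sketch below builds two tiny stand-in FASTQ files (the names and reads are invented), gzips them, concatenates the compressed files, and compares the decompressed result with a plain concatenation.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two tiny stand-in FASTQ files (hypothetical names and reads).
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/lane1.fq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/lane2.fq"
gzip -c "$workdir/lane1.fq" > "$workdir/lane1.fq.gz"
gzip -c "$workdir/lane2.fq" > "$workdir/lane2.fq.gz"

# Concatenate the compressed files directly, with no decompression step.
cat "$workdir/lane1.fq.gz" "$workdir/lane2.fq.gz" > "$workdir/merged.fq.gz"

# Decompress the merged file and compare it with the plain concatenation.
gzip -dc "$workdir/merged.fq.gz" > "$workdir/merged.fq"
cat "$workdir/lane1.fq" "$workdir/lane2.fq" > "$workdir/expected.fq"
if cmp -s "$workdir/merged.fq" "$workdir/expected.fq"; then
  result=identical
else
  result=different
fi
echo "merged vs. plain concatenation: $result"
```

This works because a gzip file may contain several compressed members back to back, and gzip decompresses them in sequence. The original .gz files are left untouched.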

                      Comment

                      • bruce01
                        Senior Member
                        • Mar 2011
                        • 160

                        #41
                        Ok, asked over on Stackoverflow, this works:

                        group1=( $files/Sample1*r1* );
                        group2=( $files/Sample1*r2* );
                        ( IFS=,; STAR --readFilesIn "${group1[*]}" "${group2[*]}" [OPTIONS]);

                        Thanks for the help and ideas Dpryan.

                        ##Edit: DPryan, sorry, I was getting my wires crossed between here and Stackoverflow. I was asking how to give STAR the input that I had created; the above works. I am reluctant to concatenate the gzip files: I don't want to create duplicates and don't want to change the gzips in any way before aligning. Paranoia!
                        Last edited by bruce01; 05-08-2013, 03:48 AM. Reason: Miscommunication with poster
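For many samples with varying numbers of fastq files, the IFS trick above can be wrapped in a loop. The sketch below is one possible way to do it: the helper name join_by_comma, the sample names and the file naming scheme are assumptions, and the STAR command is echoed rather than run so the lists can be inspected first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Join all arguments with commas, e.g. a b c -> a,b,c (hypothetical helper).
join_by_comma() {
  local IFS=,
  echo "$*"
}

# Demo directory with empty stand-in files; the naming scheme is an assumption.
files=$(mktemp -d)
trap 'rm -rf "$files"' EXIT
touch "$files"/Sample1_r1_{1,2,3}.fq.gz "$files"/Sample1_r2_{1,2,3}.fq.gz
touch "$files"/Sample2_r1_{1,2}.fq.gz   "$files"/Sample2_r2_{1,2}.fq.gz

for sample in Sample1 Sample2; do
  # Globs expand in sorted order, so mate 1 and mate 2 lists line up.
  group1=( "$files/$sample"_r1_* )
  group2=( "$files/$sample"_r2_* )
  reads1=$(join_by_comma "${group1[@]}")
  reads2=$(join_by_comma "${group2[@]}")
  # Printed rather than executed; drop the echo to submit the real job.
  echo STAR --readFilesIn "$reads1" "$reads2"
done
```

Because the comma is set as IFS only inside the helper function, the rest of the script keeps normal word splitting.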

                        Comment

                        • Auction
                          Member
                          • Jul 2009
                          • 24

                          #42
                          You can also try the following commands; they work for me.
                          fq1=`ls -m *_R1_*.fastq.gz | tr -d '\n' | tr -d ' '`
                          fq2=${fq1//"_R1_"/"_R2_"}
                          STAR --readFilesIn $fq1 $fq2
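A variant of the commands above, sketched here under the same _R1_/_R2_ naming assumption, avoids parsing ls output and checks that every R1 file actually has its R2 mate before building the lists. The file names are invented for the demo, and the STAR command is only echoed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Empty stand-in files for the demo; names are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
touch "$workdir"/S1_R1_001.fastq.gz "$workdir"/S1_R2_001.fastq.gz
touch "$workdir"/S1_R1_002.fastq.gz "$workdir"/S1_R2_002.fastq.gz

fq1=""
fq2=""
for r1 in "$workdir"/*_R1_*.fastq.gz; do
  name=${r1##*/}                        # file name without the directory
  r2="$workdir/${name/_R1_/_R2_}"       # swap _R1_ for _R2_ in the name only
  if [ ! -e "$r2" ]; then
    echo "missing mate for $r1" >&2
    exit 1
  fi
  fq1+="${fq1:+,}$r1"                   # add a comma only between entries
  fq2+="${fq2:+,}$r2"
done

# Printed rather than executed; drop the echo to run the alignment.
echo STAR --readFilesIn "$fq1" "$fq2"
```

Deriving each R2 name from its R1 name keeps the two lists in the same order, which is what a paired-end run needs.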

                          Comment

                          • priya
                            Member
                            • Apr 2013
                            • 57

                            #43
                            Originally posted by alexdobin View Post


                            If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the library option --library-type options. For example,
                            cufflinks ... ... --library-type fr-firststrand
                            should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
                            It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
                            --outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated

                            Hi Alex,
                            I am trying STAR to align the reads and then using Cufflinks to get expression values. I have stranded RNA-seq data. May I know why it is recommended to remove the non-canonical junctions for Cufflinks runs? How will it affect Cufflinks if I use the default parameter (no filtering)?

                            Comment

                            • NGSfan
                              Senior Member
                              • Apr 2009
                              • 181

                              #44
                              hi priya, you may want to post this and carry on the conversation at the google groups for rna-star:

                              Comment

                              • alexdobin
                                Senior Member
                                • Feb 2009
                                • 161

                                #45
                                I believe it's best to feed Cufflinks only with the highest confidence alignments, and non-canonical junctions in my experience contain more false positives.
                                Also, many non-canonical splices occur just a few bases away from the highly expressed canonical, which could be caused by sequencing/mapping errors, and possibly by spliceosome errors. These splices will likely throw Cufflinks assembly off.

                                Comment
