Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by dpryan View Post
    The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

    BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).
    First: Thank you so much for your input - I really appreciate your time and hope that others, which later on potentially will end up reading this via google, will also benefit!

    Next: Having recently finished my Bioinformatics PhD in statistical methodology and vaccine development, the world of NGS is new to me!

    Finally: I am using the STAR aligner (STAR_2.4.0h1) and I have build genome indices using GRCh38 [1] as described in [2]. I have then aligned my human 101bp Illumina paired-end reads to this build, using options "ENCODE standard options for long RNA-seq" in [2] section 3.2.1 and then finally I am trying to use featureCounts (Version 1.4.6) to generate a count matrix. Based on this I aim at performing DEG analysis.

    Questions:
    a) Is this approach valid/recommendable or have I missed something?
    b) Given the above scenario, would you recommend using the 'primary' or the 'top level' assembly?
    c) Any other input?

    Cheers again!

    1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
    2. https://github.com/alexdobin/STAR/bl...STARmanual.pdf

    Comment


    • #17
      Welcome to the NGS side of bioinfo!

      a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
      b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
      c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

      Anyway, it sounds like you're on the right track!

      Comment


      • #18
        Originally posted by dpryan View Post
        Welcome to the NGS side of bioinfo!

        a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
        b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
        c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

        Anyway, it sounds like you're on the right track!
        Thanks! So far, I find it a bit of a strange world, with a lot of 'best guesses'!

        Regarding 'trimming', I have been told, that this is not necessary, since the aligner will simply 'skip' adapters/bad quality bases, apparently using something called 'soft-clipping'?

        Comment


        • #19
          Originally posted by LeonDK View Post
          Regarding 'trimming', I have been told, that this is not necessary, since the aligner will simply 'skip' adapters/bad quality bases, apparently using something called 'soft-clipping'?
          Jein, as my German colleagues would say. Aligners can nicely soft-clip low-quality or otherwise mismatched stretches of sequence. However, adapter sequence is going to be both good quality and easy to spot. The general worry is that by leaving it in there, you can bias what should otherwise be multimapping reads/pairs to aberrantly become uniquely mapping. Since adapter sequences can be spotted with high fidelity, they can just be trimmed off to avoid this. Quality trimming doesn't gain much for RNAseq (there are some papers on this), though it seems that trimming off the really low quality bases (<5 or so) seems to have some benefit). Anyway, I believe that STAR has an option to adapter trim for you, which is probably the most convenient way to go.

          Comment


          • #20
            Originally posted by dpryan View Post
            Jein, as my German colleagues would say. Aligners can nicely soft-clip low-quality or otherwise mismatched stretches of sequence. However, adapter sequence is going to be both good quality and easy to spot. The general worry is that by leaving it in there, you can bias what should otherwise be multimapping reads/pairs to aberrantly become uniquely mapping. Since adapter sequences can be spotted with high fidelity, they can just be trimmed off to avoid this. Quality trimming doesn't gain much for RNAseq (there are some papers on this), though it seems that trimming off the really low quality bases (<5 or so) seems to have some benefit). Anyway, I believe that STAR has an option to adapter trim for you, which is probably the most convenient way to go.
            Ok, I think that makes sense! And 'Jein' would be 'Nja' in Danish

            Looking in the manual, STAR does indeed have options for adapter removal:
            ----------------------------------------
            --clip3pAdapterSeq
            default: -
            string(s): adapter sequences to clip from 3p of each mate. If one value is given,
            it will be assumed the same for both mates.
            --clip3pAdapterMMp
            default: 0.1
            double(s): max proportion of mismatches for 3p adpater clipping for each mate.
            If one value is given, it will be assumed the same for both mates.
            --clip3pAfterAdapterNbases
            default: 0
            int(s): number of bases to clip from 3p of each mate after the adapter clipping.
            If one value is given, it will be assumed the same for both mates.
            ----------------------------------------

            However I am not quite sure of:
            a) Which would be the right option to use
            b) How do I find out the appropriate adapter sequence to state to STAR?

            Comment


            • #21
              a) --clip3pAdapterSeq
              b) You'd need to either run fastQC (or something like that) or just ask whomever did the sequencing. They can give you the adapter sequence (it's presumably just the standard Illumina adapter sequence).

              Comment


              • #22
                Originally posted by dpryan View Post
                a) --clip3pAdapterSeq
                b) You'd need to either run fastQC (or something like that) or just ask whomever did the sequencing. They can give you the adapter sequence (it's presumably just the standard Illumina adapter sequence).
                Ok, so the kit used is [1] and the standard Illumina adapter sequences can be found in [2], where it on p12 says:

                TruSeq Universal Adapter
                5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

                Is this then the sequence, I should state after the '--clip3pAdapterSeq' option? Or should I reverse it like so:

                3' TCTAGCCTTCTCGCAGCACATCCCTTTCTCACATCTAGAGCCACCAGCGGCATAGTAA

                ?


                1. http://www.illumina.com/products/tru..._prep_kit.html
                2. http://supportres.illumina.com/docum...nce-letter.pdf

                Comment


                • #23
                  For read mapping, you may also try Subread aligner. Its performance has been demonstrated in the recent SEQC/MAQC III study:

                  We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA …


                  This publication includes detailed results from the Subread-featureCounts-limma/voom pipeline and comparisons with other pipelines from using billions of RNA-seq reads generated in six different sequencing centers.
                  Last edited by shi; 01-15-2015, 02:46 PM.

                  Comment


                  • #24
                    Originally posted by shi View Post
                    For read mapping, you may also try Subread aligner. Its performance has been demonstrated in the recent SEQC/MAQC III study:

                    We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA …


                    This publication includes detailed results from the Subread-featureCounts-limma/voom pipeline and comparisons with other pipelines from using billions of RNA-seq reads generated in six different sequencing centers.
                    Thank you for your input. The reason for using star is speed and that it "is-the-aligner-of-choice", where I am working

                    Comment


                    • #25
                      Hi all,

                      Similar to the question posted by Leon regarding the chromosomes no: 270 with latest release 78, GRCh38.gtf from ensemble I found the same issue in my analysis. I am using RNA-seq paired end data illumina 75bp read outs. After QC and trimming adapters and low quality bp, I have aligned trimmed paired reads with Tophat2.0/Bowtie2. Before aligning, concatenated all .fa files for chromosomes 1:22, +M+X+Y, and then built index on the concatenated.GrCh38.fa. Using the bam file generated featureCount program, read summarisation was performed keeping parameter s -2 (52.1% assigned fragments) as s -1 (~1.5%) was giving significantly low successfully assigned fragments.

                      Result
                      Meta-features : 58674
                      Chromosomes : 270
                      This was tested in standalone FeatureCount program coded in C.

                      When the same data was used in Rsubread featureCount function
                      Meta-features : 64253
                      Chromosomes : 270

                      Questions
                      1. Is there any discrepancy in the program in C and R package.
                      2. Chromosomes 270 and meta feature 58674 makes sense ?
                      3. Increasing number of threads to 20 in both (C and R) is not increasing the speed of the program, and none of the cores/threads are being utilised.

                      Comment


                      • #26
                        Can you provide your commands and also the versions of programs you use? This will help find out the difference between your two runs.

                        Also, the percentages of assigned fragments suggest your data are reversely stranded reads. So you should use '-s 2', not '-s 1'.

                        Cheers,
                        Wei

                        Comment


                        • #27
                          I just found when when gene_name is used both the packages it provides consistent meta-features.
                          C version
                          featureCounts -p -s 2 -M \
                          -a ../Homo_sapiens.GRCh38.78.gtf -o counts.txt -t exon -g gene_name -T 10 *accepted_hits.bam

                          R subread
                          count<-featureCounts(files="accepted_hits.bam",
                          ,annot.ext="Homo_sapiens.GRCh38.78.gtf",isGTFAnnotationFile=TRUE,
                          GTF.featureType="exon",GTF.attrType="gene_name",useMetaFeatures=TRUE,
                          allowMultiOverlap=FALSE,isPairedEnd=TRUE,requireBothEndsMapped=FALSE,
                          checkFragLength=FALSE,minFragLength=50,maxFragLength=1000,
                          nthreads=20,strandSpecific=2,minMQS=0,
                          readExtension5=0,readExtension3=0,read2pos=NULL,
                          minReadOverlap=1,countSplitAlignmentsOnly=FALSE,
                          countMultiMappingReads=TRUE,countPrimaryAlignmentsOnly=FALSE,
                          countChimericFragments=TRUE,ignoreDup=FALSE,chrAliases=NULL,reportReads=FALSE)

                          However, when multiple featurecounts ran on different tabs for different samples. Multi-threading does not get used up. However, for single run, its much faster.

                          Comment


                          • #28
                            total assigned fragments : 47.9%

                            Summary
                            Status 34b1_accepted_hits.bam
                            Assigned 28314328
                            Unassigned_Ambiguity 6375026
                            Unassigned_MultiMapping 0
                            Unassigned_NoFeatures 24480296
                            Unassigned_Unmapped 0
                            Unassigned_MappingQuality 0
                            Unassigned_FragementLength 0
                            Unassigned_Chimera 0
                            Unassigned_Secondary 0
                            Unassigned_Nonjunction 0
                            Unassigned_Duplicate 0

                            Any reason why there are so many unassigned Nofeature and ambiguous reads. When the aligned bam file was tested with samtools flagstats function it showed 100% mapped reads and ~86% properly paired reads.

                            Comment


                            • #29
                              Having a high mapping percentage does not mean you will have a high percentage of reads being assigned to genes included in your annotation. This is affected by many factors such as the completeness of your annotation, your mRNA enrichment quality and etc.

                              I typically see around 50-60% of reads successfully assigned to genes when NCBI RefSeq gene annotation is used.

                              Wei

                              Comment


                              • #30
                                Thank you very much for your advise and time.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 11:49 AM
                                0 responses
                                15 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                61 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X