  • alekzs
    Junior Member
    • Jan 2018
    • 8

    ERCC - no gene counts

    Hi!
    I have a problem recovering the ERCC spike-ins I added to my RNA-seq samples. Briefly, I'm doing Smart-seq2 on single human T cells and aligning with STAR to hg19 + ERCC sequences. Both the FASTA and the GTF file contain the ERCCs, but they don't show up in the gene count results at all, regardless of whether I use STAR's GeneCounts option, htseq-count or RSEM.
    Let's assume I "forgot" to add the ERCC spike-ins and the count is actually 0. Shouldn't the gene names still appear in the downstream files, just with a value of 0? If I run samtools idxstats on the STAR output (sorted or unsorted BAM file), it shows the ERCC "chromosomes".
    I'm confused by all of this and can't even figure out where in the pipeline my mistake might be. Help!

    Here are my commands:
    STAR --runMode genomeGenerate --runThreadN 8 --genomeDir indices/STAR --genomeFastaFiles path/to/genomeE.fa --sjdbGTFfile path/to/genesE.gtf --genomeChrBinNbits 12

    STAR --runMode alignReads \
    --genomeLoad NoSharedMemory \
    --genomeDir indices/STAR \
    --readFilesIn XX_R1_001.fastq.gz XX_R2_001.fastq.gz \
    --outFileNamePrefix /results/ercc/$i \
    --quantMode GeneCounts \
    --twopassMode Basic \
    --outSAMtype BAM Unsorted SortedByCoordinate \
    --readFilesCommand zcat

    htseq-count --mode=union --idattr=gene_name -f bam --order=pos --stranded=no XX-Aligned.bam /path/to/genesE.gtf > XX-gene.count
    ### I tried --stranded=yes and --stranded=reverse, but that didn't help either.

    Any pointers highly appreciated!!

    Alex
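A minimal sketch of the `samtools idxstats` check described in the post above, assuming the spike-in contigs all carry an `ERCC-` prefix as they do in this thread. `samtools idxstats` prints one tab-separated row per reference (name, length, mapped read-segments, unmapped read-segments); the helper name `ercc_idxstats_summary` and the BAM file name in the usage line are placeholders, not something from the original post.

```shell
# Sum mapped read-segments on ERCC contigs versus all other contigs.
# Reads `samtools idxstats` output from stdin.
ercc_idxstats_summary() {
    awk -F'\t' '
        $1 ~ /^ERCC-/ { ercc += $3; next }   # column 3 = mapped read-segments
        $1 != "*"     { other += $3 }        # the "*" row holds unplaced reads
        END { printf "ERCC\t%d\nother\t%d\n", ercc, other }
    '
}

# Usage (file name is a placeholder):
#   samtools idxstats sample.sortedByCoord.out.bam | ercc_idxstats_summary
```

A non-zero ERCC total here would suggest the spike-ins were sequenced and aligned, so the problem would sit in the counting step rather than at the bench.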
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    You made a "new" reference by appending the fasta ERCC sequences to end of human genome and then created the STAR indexes from this hybrid file?


    • alekzs
      Junior Member
      • Jan 2018
      • 8

      #3
      Originally posted by GenoMax
      You made a "new" reference by appending the fasta ERCC sequences to end of human genome and then created the STAR indexes from this hybrid file?
      Yes, I added both the FASTA and the GTF annotations and used the hybrid!


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Then I am inclined to speculate that someone forgot to spike the ERCC aliquots. Unless alignments are not being reported since they fail STAR's multi-mapping threshold. Look into that as well.

        Did you make the libraries (and add ERCC)?
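One way to look into the multi-mapping question, as a sketch only: STAR writes the number of loci a read maps to in the `NH:i:` tag, and both `--quantMode GeneCounts` and htseq-count (with its default settings) leave out reads that are not uniquely mapped. The helper below tallies ERCC alignments by that tag. The name `ercc_nh_summary` and the file name in the usage line are hypothetical.

```shell
# Count uniquely vs multi-mapped alignments on ERCC contigs.
# Reads SAM records from stdin.
ercc_nh_summary() {
    awk -F'\t' '
        /^@/ { next }                       # skip SAM header lines
        $3 !~ /^ERCC-/ { next }             # column 3 = reference name
        {
            nh = ""
            for (i = 12; i <= NF; i++)      # optional tags start at column 12
                if ($i ~ /^NH:i:/) nh = substr($i, 6)
            if (nh == "1") unique++; else multi++
        }
        END { printf "unique\t%d\nmulti_or_untagged\t%d\n", unique, multi }
    '
}

# Usage (file name is a placeholder):
#   samtools view sample.sortedByCoord.out.bam | ercc_nh_summary
```

If nearly all ERCC alignments come out as `unique`, the multi-mapping threshold is unlikely to explain the missing counts.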


        • alekzs
          Junior Member
          • Jan 2018
          • 8

          #5
          Originally posted by GenoMax
          Then I am inclined to speculate that someone forgot to spike the ERCC aliquots. Unless alignments are not being reported since they fail STAR's multi-mapping threshold. Look into that as well.

          Did you make the libraries (and add ERCC)?
          I did everything myself, so chances are 50-50, I guess.
          Anyhow, even if I didn't add the spike-ins, shouldn't the gene names from the reference appear in a gene count file? Like, normal genes get 0 alignments/counts but they're still in the list, right?


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            When you added them to the GTF file they were in the correct format?

            Are you able to see alignments for them in the BAM file?
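A rough sketch for the GTF format question. Both STAR (`--sjdbGTFfeatureExon`, default `exon`) and htseq-count (`--type`, default `exon`) build their gene models from `exon` records, so ERCC entries that only have a `gene` line would not be counted with default settings. Whether that applies to this particular GTF is an assumption to verify, not something established in the thread. The helper name `ercc_gtf_features` is made up for illustration.

```shell
# Summarise ERCC records in a GTF: how many are exon features, how many are
# other feature types, and how many lack the 9 tab-separated GTF columns.
ercc_gtf_features() {
    awk -F'\t' '
        /^#/ { next }                       # skip comment lines
        $1 ~ /^ERCC-/ {
            if (NF != 9) malformed++        # a GTF record has exactly 9 columns
            if ($3 == "exon") exon++; else non_exon++
        }
        END {
            printf "exon\t%d\nnon_exon\t%d\nmalformed\t%d\n", exon, non_exon, malformed
        }
    ' "$1"
}

# Usage (path is a placeholder):
#   ercc_gtf_features path/to/genesE.gtf
```

An `exon` count of 0 would mean the counting tools have nothing to assign ERCC reads to, even when the alignments are present in the BAM.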


            • alekzs
              Junior Member
              • Jan 2018
              • 8

              #7
              Originally posted by GenoMax
              When you added them to the GTF file they were in the correct format?

              Are you able to see alignments for them in the BAM file?
              Code:
              Tail of FASTA file:
              >ERCC-00171 DQ854994 Ac03459967_a1 Ac03460063_a1
              CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCGCCCGTCTCTGGGACTATGGACGGGCACGCTCATATCAGGCTATATTTGGTCCGGGTTATTATCGTCGCGGTTACCGTAATACTTCAGATCAGTTAAGTAGGGCCATATGCCTCGGGAATAAGCTGACGGTGACAAGGTTTCCCCCTAATCGAGACGCTGCAATAACACAGGGGCATACAGTAACCAGGCAAGAGTTCAATCGCTTAGTTTCGTGGCGGGATTTGAGGAAAACTGCGACTGTTCTTTAACCAAACATCCGTGCGATTCGTGCCACTCGTAGACGGCATCTCACAGTCACTGAAGGCTATTAAAGAGTTAGCACCCACCATTGGATGAAGCCCAGGATAAGTGACCCCCCCGGACCTTGGAGTTTCATGCTAATCAAAGAAGAGCTAATCCGACGTAAAGTTGCGGCGTTGATTACGCAGGATTGCGACCAAAGAACGAGAAAAAAAAAAAAAAAAAAAAAAAA
              
              Tail of GTF file
              >ERCC-00171	ercc	gene	1	506	.	+	.	gene_id "GERCC-00171"; gene_version "1"; gene_name "ERCC-00171"; gene_source "ercc"; gene_biotype "ercc";
              
              samtools view -h 10BTreg02_S290_L003Aligned.sortedByCoord.out.bam ERCC-00171
              >NS500597:113:HH5HKBGX5:3:11406:4418:20117	83	ERCC-00171	441	255	9S29M	=	60	-410	ACGACGTAGGTTGCGGCGTTGATTACGCAGGATTGCGA	EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NH:i:1	HI:i:1	AS:i:65	nM:i:0
              NS500597:113:HH5HKBGX5:3:21612:3414:12240	89	ERCC-00171	442	255	36M	*	0	0	TTGCGGCGTTGATTACGCAGGATTGCTACCAAAGAA	EAEEEEEEAEAEEEEE/EEEEEEEAEEEEEEAAAAA	NH:i:1	HI:i:1	AS:i:33	nM:i:1
              (there's many more lines, finds other ERCC numbers as well)
              
              tail -n 20 10BTreg02_S290_L003ReadsPerGene.out.tab
              ENSG00000224240	0	0	0
              ENSG00000227629	0	0	0
              ENSG00000237917	0	0	0
              ENSG00000231514	0	0	0
              ENSG00000235857	0	0	0
              That's all I have to offer.
              Last edited by GenoMax; 04-27-2018, 11:52 AM. Reason: Added [code] tags
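Since `tail` only shows the last few rows, the ERCC entries could sit elsewhere in `ReadsPerGene.out.tab`; I'm not certain of the row order STAR uses. A sketch that searches the whole file instead is below. It assumes the first column holds the GTF `gene_id` values (so `GERCC-00171` for the record shown above) and that column 2 is the unstranded count. The helper name `ercc_reads_per_gene` is hypothetical.

```shell
# Report how many ERCC rows a STAR ReadsPerGene.out.tab file contains and
# how many of them have a non-zero unstranded count.
ercc_reads_per_gene() {
    awk -F'\t' '
        $1 ~ /ERCC-/ {             # unanchored: also matches IDs like "GERCC-00171"
            rows++
            if ($2 > 0) nonzero++  # column 2 = counts for unstranded libraries
        }
        END { printf "ercc_rows\t%d\nnonzero\t%d\n", rows, nonzero }
    ' "$1"
}

# Usage:
#   ercc_reads_per_gene 10BTreg02_S290_L003ReadsPerGene.out.tab
```

`ercc_rows` of 0 would point at the annotation used to build the index, while rows present with all-zero counts would point at how reads are being assigned.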


              • r.rosati
                Member
                • Aug 2015
                • 95

                #8
                Here I am with makeshift solutions, but if you make the BAM into a SAM, you can `grep` it to see if the sequences are there.


                • alekzs
                  Junior Member
                  • Jan 2018
                  • 8

                  #9
                  Originally posted by r.rosati
                  Here I am with makeshift solutions, but if you make the BAM into a SAM, you can `grep` it to see if the sequences are there.
                  ha, that approach was far easier...

                  grep "ERCC-" 10B02_3.sam -c
                  6770

                  So, yes... they are there, just don't end up in any count file.
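A small caveat on the count above, with a sketch: `grep -c "ERCC-"` counts every line containing that string, which can include `@SQ` header lines if the SAM was written with a header. Restricting the match to the reference-name column gives the number of alignment records on ERCC contigs. The helper name `ercc_record_count` is made up for this example.

```shell
# Count SAM alignment records whose reference name (column 3) is an ERCC contig,
# ignoring header lines.
ercc_record_count() {
    awk -F'\t' '!/^@/ && $3 ~ /^ERCC-/ { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   ercc_record_count 10B02_3.sam
```

If the SAM has no header, the two numbers should agree.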


                  • r.rosati
                    Member
                    • Aug 2015
                    • 95

                    #10
                    I meant like grepping for
                    CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCGCCC
                    (or any other fragment in the ERCC controls, I copy-pasted the one you wrote in a previous post)


                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Can you try featureCounts to do the counts? It will not count multi-mapping reads by default.


                      • arnollito
                        Junior Member
                        • Jul 2018
                        • 1

                        #12
                        Hi alekzs, how did you solve this issue in the end? Greetings from Switzerland.


                        • alekzs
                          Junior Member
                          • Jan 2018
                          • 8

                          #13
                          Originally posted by arnollito
                          Hi alekzs, how did you solve this issue in the end? Greetings from Switzerland.
                          Yes... I used RSEM for the counting, and the index generation with the ERCC-appended hg19 file had failed because the chr labels weren't compatible, so RSEM used an old index without the ERCC genes.
                          I edited my fused ERCC-hg19 file, re-ran the index step, and then it worked. Hope that helps!
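For anyone hitting the same thing, here is a sketch of a label check that could be run before building an index. It assumes the incompatibility is between the FASTA header names and column 1 of the GTF (for example `chr1` versus `1`), which the post above does not spell out. The helper name `gtf_names_missing_from_fasta` is hypothetical.

```shell
# Print each sequence name used in the GTF (column 1) that has no matching
# FASTA header. Arguments: FASTA path, then GTF path.
gtf_names_missing_from_fasta() {
    fasta=$1
    gtf=$2
    awk -F'\t' '
        NR == FNR {                          # first file: the FASTA
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)     # keep the name up to the first whitespace
                seen[name] = 1
            }
            next
        }
        /^#/ { next }                        # second file: the GTF, skip comments
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}

# Usage (paths are placeholders):
#   gtf_names_missing_from_fasta path/to/genomeE.fa path/to/genesE.gtf
```

Empty output means every GTF sequence name has a FASTA counterpart. Anything printed is a label that an index builder may reject or silently skip.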
