Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sort problem (Tophat --> Cufflinks)

    Hi all,

    I am trying to follow the simple recipe for differential analysis in RNA-seq data without gene and transcript discovery.

    I ran tophat (v1.0.12) using the default parameters and then cuffdiff (v1.3.0) with the standard mm9 files and NCBIM37 (with added "chr" prefix to contig names; This work is done in mouse).

    I am getting an error from cuffdiff saying that the sam files are not sorted. However, looking at the files I see that they are lexicographically sorted (as expected since they are produced by cuffdiff: < chr1, chr10, chr11,...,chr2, chr3, ... , chrM/X/Y> ).

    I tried to add a header to the sam files , but it did not help.

    Here are the technical details:

    Tophat Command:
    tophat -r 150 -G $REF_FILE -o $OUT mm9 $IN_1 $IN_2

    Cuffdiff command:
    cuffdiff -o $OUT Mus_musculus.NCBIM37.61.chr.gtf $SAMPLE_1/accepted_hits.sam $SAMPLE_2/accepted_hits.sam



    Error from cuffdiff:
    ========
    You are using Cufflinks v1.3.0, which is the most recent release.
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 3//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 5//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [16:44:37] Loading reference annotation.
    [16:44:44] Inspecting maps and determining fragment length distributions.

    Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at chrX:3030137, last one was at chrM:16220
    Cufflinks requires that if your file has SQ records in
    the SAM header that they appear in the same order as the chromosomes names
    in the alignments.
    If there are no SQ records in the header, or if the header is missing,
    the alignments must be sorted lexicographically by chromsome
    name and by position.
    ===============





    Edited sam files and added header:
    @HD VN:1.0 SO:coordinate
    @SQ SN:chr1 LN:197195432
    @SQ SN:chr10 LN:129993255
    @SQ SN:chr11 LN:121843856
    @SQ SN:chr12 LN:121257530
    @SQ SN:chr13 LN:120284312
    @SQ SN:chr14 LN:125194864
    @SQ SN:chr15 LN:103494974
    @SQ SN:chr16 LN:98319150
    @SQ SN:chr17 LN:95272651
    @SQ SN:chr18 LN:90772031
    @SQ SN:chr19 LN:61342430
    @SQ SN:chr2 LN:181748087
    @SQ SN:chr3 LN:159599783
    @SQ SN:chr4 LN:155630120
    @SQ SN:chr5 LN:152537259
    @SQ SN:chr6 LN:149517037
    @SQ SN:chr7 LN:152524553
    @SQ SN:chr8 LN:131738871
    @SQ SN:chr9 LN:124076172
    @SQ SN:chrM LN:16299
    @SQ SN:chrX LN:166650296
    @SQ SN:chrY LN:15902555
    @PG TopHat VN:1.0.130


    Output with headers:
    =============
    You are using Cufflinks v1.3.0, which is the most recent release.
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 3//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 5//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [20:30:45] Loading reference annotation.
    [20:30:54] Inspecting maps and determining fragment length distributions.

    Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at chr10:3001890, last one was at chr1:197184857
    Cufflinks requires that if your file has SQ records in
    the SAM header that they appear in the same order as the chromosomes names
    in the alignments.
    If there are no SQ records in the header, or if the header is missing,
    the alignments must be sorted lexicographically by chromsome
    name and by position.
    ==============


    Would greatly appreciate your help!

  • #2
    I would try to convert your sam files to bam files, sort and convert back to sam files which should add a header (all with samtools) and see if you still have a problem.

    Not saying it would work, but probably worth a try since it would be very easy.
    --------------
    Ethan

    Comment


    • #3
      Thank you so much for the quick reply.

      I'm wondering -- The sam files are already lexicographically sorted. Why should re-sorting make a difference? (is there some other technical issue I'm overlooking?).
      I'm not sure this is a header issue since I tried running with/ without a header (maybe there's something wrong with my header format?).

      Thanks again!

      Comment


      • #4
        1. Because do you really know your file is correctly sorted for sure? That's a lot of lines to read through.
        2. Because if samtools has a problem reading your sam file it might give you a better idea of what is wrong with it.
        3. Because Tophat told you your file wasn't sorted, maybe it's telling the truth.
        4. Because there is a small chance it might solve your problem.
        5. When stuff isn't working, I just throw everything at it I can think of.
        --------------
        Ethan

        Comment


        • #5
          Thanks again for the quick reply. I really appreciate it.

          I went through the file (with a simple script) and verified the chromosome order (it's properly sorted).

          From the error msg one can see that cufflinks complains since he sees "chrX" after "chrM" (which is a bit weird). What I then found is that things work just fine if I remove the chrM lines.

          I'm not crazy about this workaround though. Any inputs?

          Comment


          • #6
            The source of the issue is that you added "chr" to the NCBI reference file. The chr addition is the expect format for UCSC files, which use chrMT not chrM. Ultimately, there is no reason to add chr to the reference file in the majority of situations.

            Comment


            • #7
              I filter out the chrM reads as well because HTseq-count doesn't like it that the GTF file I use (iGenomes) doesn't contain any chrM genes. It seems that chrM messes up more then one RNA-seq workflow. I don't see it as an issue. It's just one line of code that runs pretty quickly to get rid of the chrM reads. Does your GTF file contain any chrM genes. If it does't then it is no loss.

              Glad to be of absolutely no help and suggest you spend some of your time on something that didn't work. Ha ha.
              --------------
              Ethan

              Comment


              • #8
                Originally posted by Jon_Keats View Post
                The source of the issue is that you added "chr" to the NCBI reference file. The chr addition is the expect format for UCSC files, which use chrMT not chrM. Ultimately, there is no reason to add chr to the reference file in the majority of situations.
                Wow, that is just confusing. How hard would it be for NCBI, Ensembl and UCSC to just agree on chromosome names.
                --------------
                Ethan

                Comment


                • #9
                  NCBI and ENSEMBL do, while UCSC is fixated on messing people up with "chr", "MT versus M", and the whole 1-based and 0-based annotation issues (at least in my opinion). One good rule of thumb, is to absolutely stick with one source of annotation and what ever you do, don't mix and match UCSC stuff with other sources.... Sorry just my pet peeve after finding Illumina mis-mapped SNPs back in the day with iGenome because someone used UCSC dpSNP entries which shift the mapping by 1bp and seeing many of these issues.

                  Ethan,

                  on a separate note do you track PF rates on your CHIP-seq runs? We have some odd correlations with IP method.

                  Comment


                  • #10
                    I can say I take a glance but that is about it. Usually, I just look and say that seems reasonable and move on. There's always something more to look at, I suppose.
                    --------------
                    Ethan

                    Comment


                    • #11
                      If you believe sort order is the real issue, running Picards ReorderSam.jar after a SamSort.jar would work.

                      Comment


                      • #12
                        cuffdif problems with chrM and chrX not in sorted order

                        I made a 3 record sam file with 1 chr1, chrX, and chrM record. When chrM comes prior to chrX I get the same sort order problem. Changing chrM to chrMT does not fix the problem.
                        barry

                        Comment


                        • #13
                          Also,
                          If I change chrM to a chrZ and place it after chrX there is no problem.
                          barry

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          31 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X