Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Weird Cufflinks results from odd alignments

    I recently ran some collaborative FASTQ files through my standard Tophat/Cufflinks pipeline, and got some really weird results. The Cufflinks yielded mostly zero-FPKM genes, which started me on this frustrating journey. I started by checking SAMTOOLS FLAGSTAT for one of the BAM-files:

    Code:
    14078966 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    14078966 + 0 mapped (100.00%:-nan%)
    14078966 + 0 paired in sequencing
    7039483 + 0 read1
    7039483 + 0 read2
    72 + 0 properly paired (0.00%:-nan%)
    14078966 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:-nan%)
    1785552 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    I have never seen something like this before. Only 72 paired reads? The Tophat alignment_summary.txt looks like this:

    Code:
    Left reads:
              Input     :   4602804
               Mapped   :   4571790 (99.3% of input)
                of these:    328183 ( 7.2%) have multiple alignments (60211 have >20)
    Right reads:
              Input     :   4602804
               Mapped   :   4571790 (99.3% of input)
                of these:    328183 ( 7.2%) have multiple alignments (60211 have >20)
    99.3% overall read mapping rate.
    
    Aligned pairs:   4571790
         of these:    328183 ( 7.2%) have multiple alignments
                     4571758 (100.0%) are discordant alignments
     0.0% concordant pair alignment rate.
    So the reads are getting mapped, but discordantly? What does that mean, and how can I fix it? This is obviously affecting the Cufflinks analysis greatly. I've also tried aligning with STAR, but the Cufflinks output from that yields similar results, so it seems it is something with the FASTQ files or their processing.

    Some googling turned up that the paired files may be out-of-sync, so I used this python script to re-sync them. The output looks very similar to to what I already have, but I'm currently in the process of aligning the new FASTQ files, but I don't have high hopes of this fixing the issue. [EDIT: it didn't.]

    If it matters, the FASTQ files I recieved were, I think, not the actual raw FASTQ files, but rather BAM->FASTQ converted files (my collaborator is unsure, as she is not a bioinformatician and the bioinformatician who did it is not available). I recieved the FASTQ QC reports and they look good, but I have not had time to run through any QC myself.

    Does anybody know what's wrong here? This collaboration is really a side-project for me, and I'm quite frustrated that there's no many problems with it...

    [EDIT]: I've now also run the alignment for just one of the read pair files, and that gives much better results. So, this could mean that the reads are somehow out of sync, but that the script I used above didn't solve the problem?
    Last edited by ErikFas; 04-21-2016, 04:30 AM.

  • #2
    Can you paste a few lines from your R1 and R2 fastq files? My guess is that something went wrong in the BAM->FASTQ conversion and that's why they are showing up as discordant

    Comment


    • #3
      @ErikFas: This does sound like a classic case of reads having gone out of order. Can you try repair.sh from BBMap on your fastq files and see if that helps any?

      @fanli may also be on to something. If the read headers were changed/messed-up in some way then the re-pair programs may not be able to realign the reads.

      Comment


      • #4
        Originally posted by fanli View Post
        Can you paste a few lines from your R1 and R2 fastq files? My guess is that something went wrong in the BAM->FASTQ conversion and that's why they are showing up as discordant
        First 5 read pairs from each file:

        Code:
        R1:
        @HWUSI-EAS101E:4:FC:2:1:1028:9892_2:N:0:/1
        CCTTGCTCAGCTCACACCGCAGCGTGGCCGTGGCCCCTTCTGTGGCCTCCT
        +
        IIIIIFIIHIGHIFIIGDIIIIIGIIGIEIEHH@FEE+CC2;5;;;?94?5
        @HWUSI-EAS101E:4:FC:2:1:1029:7552_2:N:0:/1
        CTTGTTTGAGGAAGCAGAGAAGAATGAAAAGAAAGGCCTGAAGGACTGGGA
        +
        G:D@DD=13@@12@2;1774EEB<DB>9?>GAAAADB>DB?-EEDGGGG@G
        @HWUSI-EAS101E:4:FC:2:1:1029:10586_2:N:0:/1
        TCCAACTCTACATCACTTGAGTTCTCATCTTCAACCTCTCCTTCTTCAGCT
        +
        GIDIIDIIIGHIHIIG<GGEHIIIGHGIIIIIGHIGIIBIIHGGDIII@GI
        @HWUSI-EAS101E:4:FC:2:1:1029:10648_1:N:0:/1
        CGACAGCTCCTCAACTGCCTCCATGTCATCACCCTGTACAACCGCATCAAG
        +
        HDH:HGHHHHHHHGHHGHHHHHHHHHHHHHHEDFHHHHHHHHHDGHGHGHH
        @HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/1
        CTGCCGGGTCATGGTTGACCACATCCCCGATGGCTGTGACGAGTCTGATGA
        +
        =HGIHGGEG<;B?BBIIG<GIEGGIEGDG<DED>EGABEEGEGDGDBDGG8
        Code:
        R2:
        @HWUSI-EAS101E:4:FC:2:1:1028:9892_2:N:0:/2
        CCTTGCTCAGCTCACACCGCAGCGTGGCCGTGGCCCCTTCTGTGGCCTCCT
        +
        IIIIIFIIHIGHIFIIGDIIIIIGIIGIEIEHH@FEE+CC2;5;;;?94?5
        @HWUSI-EAS101E:4:FC:2:1:1029:7552_2:N:0:/2
        CTTGTTTGAGGAAGCAGAGAAGAATGAAAAGAAAGGCCTGAAGGACTGGGA
        +
        G:D@DD=13@@12@2;1774EEB<DB>9?>GAAAADB>DB?-EEDGGGG@G
        @HWUSI-EAS101E:4:FC:2:1:1029:10586_2:N:0:/2
        TCCAACTCTACATCACTTGAGTTCTCATCTTCAACCTCTCCTTCTTCAGCT
        +
        GIDIIDIIIGHIHIIG<GGEHIIIGHGIIIIIGHIGIIBIIHGGDIII@GI
        @HWUSI-EAS101E:4:FC:2:1:1029:10648_1:N:0:/2
        CGACAGCTCCTCAACTGCCTCCATGTCATCACCCTGTACAACCGCATCAAG
        +
        HDH:HGHHHHHHHGHHGHHHHHHHHHHHHHHEDFHHHHHHHHHDGHGHGHH
        @HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/2
        CTGCCGGGTCATGGTTGACCACATCCCCGATGGCTGTGACGAGTCTGATGA
        +
        =HGIHGGEG<;B?BBIIG<GIEGGIEGDG<DED>EGABEEGEGDGDBDGG8
        I haven't really looked that much at raw FASTQ files before, but if I compare to other data I have (the actual raw data), they have the read pair info (_1 or _2) in the middle of the header. Looking at some of the rows above, it seems that a read can sometimes be _1 in the middle, while being /2 at the end. Could this be the problem?

        I ran the same FASTQ files as above through BBMap's repair.sh:

        Code:
        Executing jgi.SplitPairsAndSingles [rp, in=exp2_1.fastq.gz, in2=exp2_2.fastq.gz, out=exp2.repaired_1.fastq.gz, out2=exp2.repaired_2.fastq.gz, outs=exp2_singletons.fastq.gz, -Xmx12g]
        
        Set INTERLEAVED to false
        Started output stream.
        
        Input:                  	9205608 reads 		469486008 bases.
        Result:                 	9205608 reads (100.00%) 	469486008 bases (100.00%)
        Pairs:                  	9205608 reads (100.00%) 	469486008 bases (100.00%)
        Singletons:             	0 reads (0.00%) 	0 bases (0.00%)
        
        Time:   			16.581 seconds.
        Reads Processed:       9205k 	555.18k reads/sec
        Bases Processed:        469m 	28.31m bases/sec
        The first 5 read pairs are identical to the ones above. If it is the problem as above, then the repair/syncing scripts won't work because they are, as far as they can tell, already synced because of the correct /1 or /2 at the ends, but the actual alignment step takes the _1 or _2 in the middle?
        Last edited by ErikFas; 04-21-2016, 10:26 PM.

        Comment


        • #5
          If you take a look at the FASTQ Illumina read header specification you will agree that someone has changed the read headers in your dataset so they are a mishmash of the two formats used. This does not even appear to be barcoded data (or if it was that barcode has been removed from the header).

          I think you should change the headers back to one of the two types (if this is old data appropriate for the old headers then use those) and then try repair.sh again.

          Comment


          • #6
            Yes indeed, they don't seem to match either format... I'm quite unsure as to what parts I should change, and how. Is it the "_2" part in reads with "/1" at the end, or other things as well? For example, I don't seem to have a flow cell ID, just "FC" with nothing next to it. What part(s) of the header are actually used in the alignments?

            Comment


            • #7
              To me this data seems to be from Illumina v.1.8 so you could take out the /1/2 at the end of the reads and then remove the _ before the _1:N and _2:N. Flowcell ID also appears to have been taken out (leaving a simple FC) but that is not so critical.

              Code:
              @HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/2
              to
              Code:
              @HWUSI-EAS101E:4:FC:2:1:1029:10648 2:N:0:
              If this is barcoded data then add that back in the header like

              Code:
              @HWUSI-EAS101E:4:FC:2:1:1029:10648 2:N:0:ACCGTC
              One reason people do modifications like your data is if they want to preserve the read info in their BAM's. Generally aligners will drop the rest of read header text once they encounter a space in the name.
              Last edited by GenoMax; 04-22-2016, 04:23 AM.

              Comment


              • #8
                Am I blind or do the reads appear identical to me?

                Comment


                • #9
                  Originally posted by fanli View Post
                  Am I blind or do the reads appear identical to me?
                  Ha! Great catch.

                  Someone stuck a /1 and /2 at the end of fastq headers of the same datafile and called it a day. The files also seem to be a mix of Read1 and Read 2, if the _1 and _2 are real.

                  Comment


                  • #10
                    Yeah, they do seem identical... I didn't catch it either, so I did what Genomax said and I'm currently aligning the results - I guess we'll see what happens! But, what if they are identical?

                    Comment


                    • #11
                      It's pointless to redo the alignment if the reads are fundamentally flawed. You should probably go back and see if whoever did the BAM->FASTQ conversion made a mistake.

                      Comment


                      • #12
                        @ErikFas: Since the reads are mixed the results would probably look no better.

                        If you must work with what you have then separate Read 1 and Read 2 data into separate files and start over.

                        But I do agree with @fanli. If you have to mess with the raw data to this extent then it is always better to go back and and get an un-tampered version (for the lack of a better word).

                        Comment


                        • #13
                          Yes, that's what I've trying to do just after I got the first Cufflinks results, actually, but since as the person whom I'm in contact with isn't actually a bioinformatician and the person responsible for the analysis is not available... I'm up **** creek without a paddle in that regard!

                          I guess I'll have to figure out how to separate them by some script and see if that does it. Would you say that the "_1" and "_2" are the correct reads, then, and that I should separate based on that? The reads have to be in the same order in the two files, if I don't misremember, or?

                          Comment


                          • #14
                            I must say that we may not know for sure (until you confirm from the person in the know) that the _1 and _2 are actually represent original Read 1 and Read 2.

                            But desperate times call for similar measures. Following is tested only on the snippet above and may not work with your full file but give it a try.

                            Code:
                            $ grep -A 3 "_1:N" your_file | grep -v "\-\-" > read1_file
                            
                            $ grep -A 3 "_2:N" your_file | grep -v "\-\-" > read2_file
                            Follow this up with removal of the _ and the /1/2 like you did this morning and then pass the two files through repair.sh again. This will ensure that the reads are in the same order in the final files.

                            Let us see if that works.

                            Comment


                            • #15
                              Well, I can't get repair.sh to run, because apparently some of the reads and different read and quality lengths, so I guess something happened in the code you gave. I'm not sure what the second grep in your command does. It greps the lines that don't contains "-"?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin


                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                Yesterday, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              41 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              35 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X