Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi darked89,

    Yes, you can use just the unmapped reads. I have been doing this for a while, and the results of using unmapped reads and the full dataset are virtually the same. As far as I know, HMMSplicer doesnt rely on exon islands so the unspliced reads dont really contribute to the search. Please correct me if I am wrong.

    I suggest you to pool all the unmapped reads from multiple lanes as a single file.

    Comment


    • #17
      Originally posted by proteomania View Post
      Hi darked89,

      Yes, you can use just the unmapped reads. I have been doing this for a while, and the results of using unmapped reads and the full dataset are virtually the same. As far as I know, HMMSplicer doesnt rely on exon islands so the unspliced reads dont really contribute to the search. Please correct me if I am wrong.

      I suggest you to pool all the unmapped reads from multiple lanes as a single file.
      Hi proteomania,

      many thanks for answering.

      One last thing: can mix reads with various lengths? I got 54 & 76bp reads.

      Comment


      • #18
        Thanks for all your feedback! To answer your questions/comments:

        -- The option for output in SAM format is a good one and I'll work on getting that in the next release.
        -- As proteomania suggests, you can use only unmapped reads. HMMSplicer doesn't use exon islands -- in fact, the first step of the program is to remove mapped reads, so using unmapped reads should speed up the processing considerably without affecting the results.
        -- Unfortunately you cannot mix reads of various lengths. Because of the HMM training step, all the reads have to be the same length.

        Comment


        • #19
          Is SAM format output file avaliable now?

          Hmmsplicer got non-canonical junctions from illumina RNA-seq single reads successfully. Now I need to get expression levels for all transcripts. I thought about cufflinks, which requires mapped reads in SAM format file. It is very nice to know that you are working on new version of hmmsplicer to get SAM format file. Hope we can get your new version soon. And any BETA version is fine.

          We appreciate your program very much.

          Thanks,

          Qi

          Comment


          • #20
            Hi,

            This program looks great, but I'm having some trouble getting it to run with my data. I ran the test Pf data and it worked fine.

            My data is half a lane of 100bp single reads from a HiSeq machine, and the reference genome is very small (3Mb).

            Here is what I get:

            python runHMM.py -o ./results -i 101108_SN132_B_s_3_seq_GOE-1.fastq -a 5 -b ref -j 10 -k 200 -p 2 -g ref_genome.fna

            Traceback (most recent call last):
            File "runHMM.py", line 927, in <module>
            sys.exit(main(None))
            File "runHMM.py", line 304, in main
            readLength = verifyQualityType(inputFastq, qualityFormat)
            File "runHMM.py", line 410, in verifyQualityType
            raise hmmErrors.InvalidQuality("Found an illegal quality value of %s in %s (quality string: %s, position: %s).\nAre you sure the quality flag is set right?" % (x, titleStr, qualityStr, i))
            hmmErrors.InvalidQuality: Found an illegal quality value of 70 in HWI-ST132_0411:3:1:1144:1916#CCGGAG/1 (quality string: gfgggggggggggggbggggfgggggeegedgedgeggdebeYfdgegdeeceZcedaaeeeeebb]bdXabYbb`^ac[aP__^ccc^aTV[QbMY]^^, position: 0).
            Are you sure the quality flag is set right?


            I'm not sure why it thinks the quality scores are invalid, I'm pretty sure they're regular Phred scores. Any help would be much appreciated.

            Comment


            • #21
              Quality flag error?

              Hi all,
              I am getting the same error with the first file of paired-end input:

              hmmErrors.InvalidQuality: Found an illegal quality value of 68 in HWI-EAS413_0027:1:1:1131:2813#0/1 (quality string: eeceacdbc\a[Y\a`a`b^]B]^^BBBBBBBBBBB, position: 0).
              Are you sure the quality flag is set right?

              I've commented out the offending line of python code and Bowtie will then run just fine but then HMMSplicer will crash with the following output:

              # reads processed: 22513124
              # reads with at least one reported alignment: 20011702 (88.89%)
              # reads that failed to align: 2501422 (11.11%)
              Reported 20011702 alignments to 1 output stream(s)
              # reads processed: 4741978
              # reads with at least one reported alignment: 1717170 (36.21%)
              # reads that failed to align: 11938 (0.25%)
              # reads with alignments suppressed due to -m: 3012870 (63.54%)
              Reported 40764406 alignments to 1 output stream(s)
              12:02:07 02/07/11: Running bowtie
              12:06:38 02/07/11: Running bowtie-half
              12:21:56 02/07/11: Making genome dictionary
              12:22:52 02/07/11: Making seeds
              12:31:08 02/07/11: Training HMM with subset
              Traceback (most recent call last):
              File "runHMM.py", line 927, in <module>
              sys.exit(main(None))
              File "runHMM.py", line 322, in main
              readLength, numMMinSeed, isLargeGenome, logfile, qualityFormat, deleteTmp)
              File "runHMM.py", line 577, in runHMM
              junctionHMM.trainHMM(seedSelectionFile, configVals.HMM_INITIAL_NAME, trainedHMM, numObs=configVals.HMM_TRAINING_SIZE)
              File "/work/gladstone/b/j/0105/HMMSplicer/hmmSplicer-0.9.5/junctionHMM.py", line 238, in trainHMM
              hmm.multiple_learn(obsList, quals, 100, True)
              File "/work/gladstone/b/j/0105/HMMSplicer/hmmSplicer-0.9.5/hmmWithQuality.py", line 644, in multiple_learn
              Bo[i] = self.B[qualList[i], obsIndices[i]]
              IndexError: invalid index

              I haven't found other example fastq sequence to try this with yet (GEO seems to remove the quality flags all together and just post fasta sequence).

              Best,
              Nathan

              Comment


              • #22
                Re: Quality flag error, and Incorrect junctions found

                Regarding the "illegal quality score" error, it seems that if you tell the program that you are using a different type of quality score it will run. I tried the setting -q t (new solexa format) which caused HMMsplicer to convert the quality format and then run successfully. The conversion step was quite lengthy, so it would be nice to find out why this is necessary.

                However, after getting the program to work I ran into another issue. When HMMsplicer found the introns I was expecting, it annotated them as being longer than they actually are. Essentially, it extended the boundaries around each intron in both 5' and 3' directions. When I tried using a more stringent intron size parameter (ie not allowing the program to look for introns longer than the longest one that is known), it found no junctions at all. I find this very odd because I have manually cut the known introns out of the genome reference file and run bowtie and had reads successfully map over the splice junctions.
                So why does HMMsplicer not find the correct junctions? Any thoughts?

                Also, is there a way to tell it to look for canonical GT-AG introns only?

                Thanks

                Comment


                • #23
                  That worked great! Thanks. The quality data should be in Phred format already, but it ran fine.

                  On a separate point, do you know if it is possible to run without quality scores? To better see what some of the pros and cons for RNASeq versus microarray, I would like to analyze a published dataset that only has the sequence but not quality scores in fasta format.

                  Comment


                  • #24
                    Junction Normalization?

                    I was wondering if anyone has dealt with downstream normalization of junction results between two different biological conditions after obtaining junction alignments from HMMSplicer (or other applications for that matter)?

                    For example, after getting, let's say 50k junctions for runs on sample A and 75k junctions for runs on sample B, how can I best normalize the counts for these junctions when examining relative differences in junction abundance between the two samples. Would this be straight up RPKM, quantile normalization or fitting the results to a distribution? Software recommendations?

                    Thanks,
                    Nathan

                    Comment


                    • #25
                      @nsalomonis :
                      -- I'm glad you resolved the quality flag issue.
                      -- There is currently no way to run HMMSplicer without quality scores. If you really wanted to try using sequence data without quality scores, you could convert fasta to fastq and just use the max quality value for every base. Your results won't be as good as running HMMSplicer with quality values, since the quality values are used to best determine where the splice junction should reside, but it should enable you to do a rough comparison.
                      -- I don't have any good answers off-hand for the normalization question. If you know of any junctions in highly expressed genes which should be equal between the two samples (for example, a housekeeping gene which is known to be equivalent in the two samples and known not to have significant alternative splicing between the samples), you may be able to use this information to normalize. In my experience, it's difficult to normalize because many of the junctions have a very small n (less than 5 reads covering the junction) so random variation in the placement of reads across the mRNA has a huge impact on whether the junction is discovered in a sample or not.

                      @Camg :
                      -- I agree that the conversion step is quite lengthy. It also uses up a lot of disk space. The step is there to ensure proper input/output from Bowtie. As I recall (I'm out on maternity leave and low on sleep, so I don't promise to be 100% right), Bowtie changed it's behavior regarding input/output formats. It used to output with quality scores in the same format as the input, but then it changed to always output with Phred quality scores. By converting to Phred quality scores for every run, HMMSplicer will work regardless of the Bowtie version, thus reducing the chance that someone will try running HMMSplicer with an older version of Bowtie and run into a bug.
                      -- I'm concerned about your comment that the junction edges are incorrect. Are you sure you are interpreting the BED format properly? The start and stop coordinates in the BED file (the second and third columns) refer to the start and stop of the read covering the junction. The left junction edge will be the chromosome start (column 2) plus the size of the first read segment (column 11, part 1). The right junction edge will be the chromosome start plus the second block start (column 12, part 2). I know this is not ideal for interpreting HMMSplicer results as junctions, but BED format is fairly universal and easily loaded into the UCSC genome browser so that's why the output is in this format.
                      -- To force HMMSplicer to only find GT-AG junctions, change SPLICE_SITES in ConfigVals.py as follows:
                      # the splice sites considered 'good'. Add both the splice junction and it's reverse complement ("GT-AG" and "CT-AC")
                      SPLICE_SITES = ["GT-AG", "CT-AC"]

                      I hope this helps!

                      Comment


                      • #26
                        mdimon,

                        Thanks for clarifying about the need for the quality score conversion.

                        I was indeed reading the BED file incorrectly (I'm new to this type of analysis). I assumed that the coordinates were for the intron that was found, I wasn't aware that you had to add the numbers from the second last column. Just out of curiosity, what are the numbers in the last column representing? On the UCSC FAQ for BED format it says that it's the block starts and the description makes it sound like it's referring to the area of contiguous coverage. Is that right?
                        Now that I'm reading the file correctly, it did find the introns I was looking for, as well as a few extra that are probably not true introns. So far this program has worked much better than TopHat and MapSplice for my data (short introns).

                        Thanks so much for your help.

                        Comment


                        • #27
                          Do you have any plans to implement this on Galaxy - it would massively increase the number of users and the amount of experimenting people could do.

                          Comment


                          • #28
                            One quick question, does HMMSplicer identify spliced reads spanning more than 1 junction?

                            Comment


                            • #29
                              Re: Are you sure the quality flag is set right?

                              I was also getting this error when running hmmSplicer with the -q t option on single-end reads from an Illumina HiSeq.
                              ...
                              File "Tools/python/Python-2.7/lib/python2.7/runpy.py", line 72, in _run_code
                              exec code in run_globals
                              File "Tools/hmmSplicer-0.9.5/runHMM.py", line 927, in <module>
                              sys.exit(main(None))
                              File "Tools/hmmSplicer-0.9.5/runHMM.py", line 304, in main
                              readLength = verifyQualityType(inputFastq, qualityFormat)
                              File "Tools/hmmSplicer-0.9.5/runHMM.py", line 410, in verifyQualityType
                              raise hmmErrors.InvalidQuality("Found an illegal quality value of %s in %s (quality string: %s, position: %s).\nAre you sure the quality flag is set right?" % (x, titleStr, qualityStr, i))
                              hmmErrors.InvalidQuality: Found an illegal quality value of 41 in HWI-ST640_0086:5:1101:5832:2153#0/1 (quality string: bbeeeeegggggfghiihfhhhiii, position: 15).
                              Are you sure the quality flag is set right?
                              ...


                              The "fix" was to change line 410 in runHMM.py from:
                              if x > 40 ...
                              to
                              if x > 62 ...

                              I think the cause is quality scores on the HiSeq have an expanded range of up to 126 (I got 62 by subtracting 126 - 64). Though I'm having trouble finding the place where I read this.

                              Comment


                              • #30
                                I am using hmmSplicer-0.9.5 to map reads (SRR066576.fastq) to the Pfalciparum genome (PlasmoDB-9.1) and the software is issuing the following ERROR MESSAGE:
                                'Could not find find any reads in "results/Pfalciparum/tmp/bowtieHalf.fastq"'

                                No alignments
                                /home/roylab/roylab_bin/hmmSplicer-0.9.5/hmmWithQuality.py:658: RuntimeWarning: invalid value encountered in divide
                                pi_bar /= K

                                The program terminates without finding any splice junctions. That is contrary to previously published results.

                                It looks like the software is not generating a file of half-reads for bowtie mapping?

                                Anyone have any insight to this error and steps to correct it?

                                Thanks!
                                Manuel J Torres
                                Evolutionary Genomics Lab
                                San Francisco State University
                                San Francisco CA 94132

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                66 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X