  • Problem with MIRA

    Hello all

    I am new to bioinformatics and Linux, and right now I am starting my "training" with some 454 RNAseq data. The starting data are 454 RNA sequencing reads from 10 different individuals (3 runs each). So far I have converted my fasta/qual files to fastq and then merged the fastq files from the different runs of each individual into a single file, before proceeding with the quality control analysis. All went smoothly and the outputs seem to be OK. Now I want to proceed with the assembly, and to do this I plan to use MIRA as implemented in Geneious. I have uploaded the trimmed/clipped fastq files, and when I select them and try to do an assembly with the default parameters I get the following error message:

    Fatal error (may be due to problems of the input data or parameters):

    ********************************************************************************
    * Some read names were found more than once (see log above). This usually *
    * hints to a serious problem with your input and should really, really be *
    * fixed. You can choose to ignore this error with , but this will *
    * almost certainly lead to problems with result files (ACE and CAF for sure, *
    * maybe also SAM) and probably to other unexpected effects. *

    I have already checked whether I could have put fastq files together more than once, but when I looked at my scripts there were no errors. I tried assembling the files in pairs to see which ones are problematic, and I get this error message with a few of them, so now I need to find out where the problem is and whether I really do have repeated read names in these files (although this shouldn't be the case). I would like to find a script that extracts just the read names from these files, so I can then compare them and check whether there are repeated read names between files, but I cannot find anything useful anywhere. Does anyone know how I can do this, and does anyone have a guess as to why MIRA is reporting this error?

    Thanks in advance

    Olalla
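
A minimal sketch of the kind of script asked for above, assuming plain 4-line-per-record FASTQ (no wrapped sequence or quality lines). The function names `fastq_read_names` and `duplicated_read_names` are made up for illustration. The read name is taken as the first whitespace-delimited token of each header line, which may or may not match what MIRA treats as the name.

```shell
# Print the read name of every record in a 4-line-per-record FASTQ file.
# The name is the first whitespace-delimited token of the header line,
# without the leading "@" (assumption: sequences are not line-wrapped).
fastq_read_names() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

# Print each read name that occurs more than once in the file.
duplicated_read_names() {
    fastq_read_names "$1" | sort | uniq -d
}
```

For example, `duplicated_read_names individual01.fastq` (file name hypothetical) prints nothing if all names in that file are unique.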

  • #2
    I'm wondering whether the trimming process might have clipped the read names - are you able to check that?

    Also, MIRA works better if you do NOT preprocess your data, as MIRA does its own trimming internally.

    HTH

    • #3
      Hello John

      Thanks for your reply. The read names are not clipped, and I guess they shouldn't share the same names. I just need a way to retrieve the list of read names (or at least those matching between files), so that I can then check where the problem is. Any idea which command I can use for that, or whether there is any script/program that does it?

      I will also try to run MIRA on the non preprocessed samples, and see if the results are different.
      Thanks for the suggestion

      Thanks again
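
One possible way to list the names that match between two files is to sort the names from each file and intersect them with `comm`. This is only a sketch: `shared_read_names` is a hypothetical helper, and it assumes 4-line FASTQ records with the read name as the first token of the header.

```shell
# Print read names (first header token, without "@") that occur in BOTH files.
# The body runs in a subshell "( ... )" so the temp-file variables stay local.
shared_read_names() (
    a=$(mktemp) || exit 1
    b=$(mktemp) || exit 1
    awk 'NR % 4 == 1 { print substr($1, 2) }' "$1" | sort -u > "$a"
    awk 'NR % 4 == 1 { print substr($1, 2) }' "$2" | sort -u > "$b"
    comm -12 "$a" "$b"    # -12 keeps only lines common to both sorted lists
    rm -f "$a" "$b"
)
```

Running it on each pair of files, e.g. `shared_read_names ind01.fastq ind02.fastq | wc -l` (file names hypothetical), gives the number of names the pair has in common.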

      • #4
        I'm not sure how to answer. Your read names should not have any overlaps. How did you generate your fastq files? I tend to use mira's sff_extract. But there are lots of good sff extractors out there.
        Last edited by JohnN; 10-28-2014, 07:42 AM. Reason: Added sentences about sff_extract

        • #5
          Well, I started from the fna and qual files. Basically, I have ten individuals, and for each one I have results from three runs, so I first converted from fna/qual to fastq, then concatenated all files from the same individual into a single fastq file, and then did the QC analysis on those concatenated fastq files. As I told you, everything went fine until I got to the assembly step. Unless this is a problem with the MIRA plugin in Geneious, I do not see why I should have repeated read names, as they should be unique strings of characters. I have now extracted just the read names from the files and I am going to compare them in pairs to see whether there really are repeated read names... I will see.

          • #6
            Looking at your steps, the only place where the read names could be messed up could be in your fna/qual to fastq conversion...

            Could you not take the SFF files from the 454 assembly and convert them directly to fastq using sff_extract or another tool?

            • #7
              Yes, that is exactly what I thought, but the scripts are correct (no names messed up), so the only possibility is some mistake when converting the sff files into fna and qual. At the moment I do not have access to these files, as the data are not mine but my supervisor's (I am just doing the analysis at the moment), but I could get them (I guess).

              Thanks a lot for your suggestions

              Olalla

              • #8
                MIRA simply does not handle long header lines. So, rewrite your fastq headers as follows:

                @M00266:130:000000000-A334F:1:1101:15377:1607 1:N:0:5

                to:

                @1:1101:15377:1607/1

                The dropped first part of the header should be common to all sequences in your file.
                I have a parser for that if you need it, but a few sed commands will do the job quickly.
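
A rough sketch of such a rewrite, done here with `awk` rather than `sed` so that only header lines are touched. The helper name `shorten_illumina_headers` is hypothetical, and the sketch assumes 4-line records and Illumina-style headers with exactly the layout shown above. It would not apply to 454-style names, and the shortened names stay unique only if the dropped prefix really is identical for every read in the file.

```shell
# Rewrite headers such as
#   @M00266:130:000000000-A334F:1:1101:15377:1607 1:N:0:5
# to
#   @1:1101:15377:1607/1
# Only header lines (every 4th line, starting at line 1) are changed.
shorten_illumina_headers() {
    awk 'NR % 4 == 1 {
        split($1, id, ":")      # pieces of the first token, split on ":"
        split($2, meta, ":")    # pieces of the second token; meta[1] is the read number
        print "@" id[4] ":" id[5] ":" id[6] ":" id[7] "/" meta[1]
        next
    }
    { print }' "$1"
}
```

Usage would be along the lines of `shorten_illumina_headers reads.fastq > reads.short.fastq`, with both file names being placeholders.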

                • #9
                  Or use:

                  parameters = COMMON_SETTINGS -NW:mrnl=0

                  to parse long read names

                  • #10
                    OK, I think I have finally found the source of the problem.

                    So, the read names in my fastq files look like this:
                    GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_
                    that is, long names with spaces. So I think MIRA is just taking the first 14 characters as the read name (e.g. GIMXFMA02G21Y1), and these are in fact repeated among many of my files (I found common lines when comparing files that contain only these names). However, when I search for common lines in files that contain the full headers (like GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_), the search finds no common lines between any of the files. So I think I should first eliminate the spaces, maybe replacing them with ":". Any suggestion/script on how to do this would be much appreciated.

                    Again, many thanks for your comments and suggestions

                    Olalla
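
The comparison described in this post (full headers versus only the part before the first space) could be summarised in one pass. This is a sketch under the assumption of 4-line FASTQ records, and `header_uniqueness` is a made-up name. If the third number printed is smaller than the second, some reads differ only in the text after the first space.

```shell
# For one or more FASTQ files, print three numbers:
#   <records> <distinct full header lines> <distinct first tokens>
header_uniqueness() {
    awk 'FNR % 4 == 1 {                      # FNR restarts for each input file
        n++
        if (!($0 in full))  { full[$0];  nf++ }   # whole header line
        if (!($1 in first)) { first[$1]; nt++ }   # name up to the first space
    }
    END { print n + 0, nf + 0, nt + 0 }' "$@"
}
```

It can be given several files at once, e.g. `header_uniqueness ind01.fastq ind02.fastq` (hypothetical names), to check whether the names collide across files.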

                    • #11
                      Or just run it with the parameters from my previous post above, and MIRA will accept the long read names.

                      • #12
                        What you could also do is simply:

                        Code:
                        sed -e 's/ /_/g' ./originalFile.fastq > ./formattedFile.fastq
                        This will replace every space with an underscore (or whatever character you prefer).
                        The command doesn't distinguish between header, sequence, comment or quality lines. You should, however, be safe to ignore this, as your sequence and quality lines must not contain any spaces, and for the comment line it doesn't really matter.
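
If you would rather leave the other lines untouched, a variant that only edits header lines might look like the sketch below. It assumes 4-line FASTQ records. `join_header_fields` is a hypothetical name, and the separator defaults to an underscore but can be set to a colon. Whether MIRA then accepts the longer names without the read-name-length parameter mentioned earlier in the thread has not been checked here.

```shell
# Replace spaces with a separator (default "_") on header lines only.
# Usage: join_header_fields FILE [SEPARATOR]
# Note: "&" and "\" are special in awk replacement strings; avoid them as separators.
join_header_fields() {
    awk -v sep="${2:-_}" 'NR % 4 == 1 { gsub(/ /, sep) } { print }' "$1"
}
```

For example, `join_header_fields originalFile.fastq : > formattedFile.fastq` joins the header fields with colons.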

                        • #13
                          Hello John

                          I already did that, but the error message persists... As I told you, I think the problem in my case is with the white spaces. The program seems to stop reading the name at the first white space (after the 14-character string that is common in many cases). So what I need to do now is replace the white spaces in the read names with colons, and then maybe use the option that you suggest so the program ignores long read names.

                          So what I need to find out now is how to replace these white spaces in the read name lines... that is where I am stuck now :/ The problem when you start with Linux is that it takes a lot of time to find adequate commands and/or scripts to do whatever you need.

                          Thanks

                          Olalla

                          • #14
                            Hey whatsoever... thanks for this!! I will try it now

                            • #15
                              The read name GIMXFMA02G21Y1 looks like a Roche 454 read name, but it should be unique and only occur once in your FASTQ file. If you are saying it appears several times then it makes sense that MIRA is complaining about duplicates.
