**JohnN** · 10-27-2014, 10:01 AM

I'm wondering if the trimming process may not have clipped the read names - are you able to check that?

Also, MIRA works better if you do NOT preprocess your data, as MIRA will trim the reads internally itself.

HTH

**Olalla** · 10-27-2014, 05:21 PM

Hello John

Thanks for your reply. The read names are not clipped, and I guess they shouldn't have the same names. I just need a way to retrieve the list of read names (or at least those matching between files), so that I can then check where the problem is. Any idea which command I can use for that, or whether there is a script/program that does it?

I will also try to run MIRA on the non-preprocessed samples, and see if the results are different.
Thanks for the suggestion

Thanks again
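
For the question above about listing the read names shared between files, a minimal Python sketch could look like the following. It assumes plain 4-line-per-record FASTQ (no wrapped sequence lines) and treats the text before the first whitespace in each header as the read name, which is the convention most tools follow. The function names `names_from_lines` and `shared_names` are made up for this illustration.

```python
def names_from_lines(lines):
    """Collect read names from FASTQ lines (assumes 4 lines per record)."""
    names = set()
    for index, line in enumerate(lines):
        if index % 4 == 0:  # header line of each record
            fields = line[1:].split()  # drop the leading '@', split on whitespace
            if fields:
                names.add(fields[0])  # keep only the text before the first space
    return names


def shared_names(path_a, path_b):
    """Return the sorted read names that occur in both FASTQ files."""
    with open(path_a) as handle_a, open(path_b) as handle_b:
        return sorted(names_from_lines(handle_a) & names_from_lines(handle_b))
```

Calling `shared_names("run1.fastq", "run2.fastq")` (hypothetical file names) would then return the names present in both files, which can be printed or written to a text file for inspection.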

**JohnN** · 10-28-2014, 06:25 AM

I'm not sure how to answer. Your read names should not have any overlaps. How did you generate your fastq files? I tend to use mira's sff_extract. But there are lots of good sff extractors out there.

**Olalla** · 10-29-2014, 07:16 AM

Well, I started from the fna and qual files. Basically, I have ten individuals, and for each one I have results from three runs. I first converted from fna to fastq, then concatenated all files from the same individual into a single fastq file, and then did the QC analysis on those concatenated fastq files. As I told you, everything went fine until I got to the assembly step. Unless this is a problem with the MIRA plugin in Geneious, I do not see why I should have repeated read names, as they should be unique strings of characters. For now I have extracted just the read names from the files, and I am going to compare them in pairs to see whether there really are repeated read names... I will see.

**JohnN** · 10-29-2014, 08:25 AM

Looking at your steps, the only place where the read names could have been messed up is your fna/qual to fastq conversion...

Could you not take the SFF files from the 454 assembly and convert them directly to fastq using sff_extract or another tool?

**Olalla** · 10-29-2014, 08:51 AM

Yes, that is exactly what I thought, but the scripts are correct (no names messed up), so the only remaining possibility is some mistake when the sff files were converted into fna and qual. At the moment I do not have access to these files, as the data are not mine but my supervisor's (I am just doing the analysis), but I could get them (I guess).

Thanks a lot for your suggestions

Olalla

**Yves** · 10-30-2014, 08:47 AM

MIRA simply does not handle long header lines. So, rewrite your fastq headers as follows:

@M00266:130:000000000-A334F:1:1101:15377:1607 1:N:0:5

to :

@1:1101:15377:1607/1

The part of the header that is dropped should be common to all sequences in your file.
I have a parser for that if you need it, but a few sed command lines will do the job quickly.
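
As an illustration of the header rewrite described above, here is a minimal Python sketch (not the parser mentioned in the post). It assumes 4-line-per-record FASTQ and headers laid out like the example, with seven colon-separated fields, a space, and a comment starting with the read number. Dropping the prefix is only safe if it really is identical for every read in the file. The names `shorten_header` and `shorten_fastq` are hypothetical.

```python
def shorten_header(header):
    """Rewrite '@INSTR:RUN:FLOWCELL:LANE:TILE:X:Y READ:...' as '@LANE:TILE:X:Y/READ'."""
    text = header.rstrip("\n")
    name, _, comment = text.partition(" ")
    fields = name.lstrip("@").split(":")
    if len(fields) < 4 or not comment:
        return text  # not the expected layout: leave the header unchanged
    read_number = comment.split(":")[0]
    return "@" + ":".join(fields[-4:]) + "/" + read_number


def shorten_fastq(lines):
    """Yield FASTQ lines with only the header lines rewritten (4 lines per record assumed)."""
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield shorten_header(line) + "\n"
        else:
            yield line
```

To convert a file, the generator can be fed an open input handle and its output passed to `writelines` on an output handle.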

**JohnN** · 10-30-2014, 10:03 AM

Or use:

parameters = COMMON_SETTINGS -NW:mrnl=0

to parse long read names

**Olalla** · 10-31-2014, 04:03 AM

Ok, I think that I have finally found the source of the problem.

So, the read names in my fastq files appear as follows:
GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_
That is, long names with spaces. So I think that what MIRA is doing is just taking the first 14 characters as the read name (e.g. GIMXFMA02G21Y1), which are in fact repeated among many of the files that I have (I found common lines when comparing files that contain only these names). However, when I search for common lines in files that contain the full headers (like GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_), the search finds no common lines between any of the files. So I think that I should first eliminate the spaces, maybe replacing them with ":". Any suggestion/script on how to do this would be much appreciated.

Again, many thanks for your comments and suggestions

Olalla

**JohnN** · 10-31-2014, 04:26 AM

Or just run it with the parameters from my previous post above, and MIRA will accept the long read names.

**WhatsOEver** · 10-31-2014, 04:49 AM

What you could also do is simply:

Code:

sed -e 's/ /_/g' ./originalFile.fastq > ./formattedFile.fastq

This will replace every space with an underscore (or whatever character you prefer).
The command doesn't actually distinguish between header, sequence, comment or qual lines. You should, however, be safe to ignore this, as your sequence and qual lines must not contain any spaces, and for the comment line it doesn't really matter.
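
If you would rather touch the header lines only, a minimal Python sketch of the same replacement follows. It assumes 4-line-per-record FASTQ (no wrapped sequence or quality lines), and the names `underscore_headers` and `convert` are invented for this example.

```python
def underscore_headers(lines, replacement="_"):
    """Yield FASTQ lines with spaces replaced in header lines only."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # first line of each 4-line record is the header
            yield line.rstrip("\n").replace(" ", replacement) + "\n"
        else:
            yield line  # sequence, '+' and quality lines pass through untouched


def convert(in_path, out_path, replacement="_"):
    """Write a copy of in_path to out_path with space-free header lines."""
    with open(in_path) as source, open(out_path, "w") as target:
        target.writelines(underscore_headers(source, replacement))
```

Passing `replacement=":"` would give the colon-separated names mentioned earlier in the thread.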

**Olalla** · 10-31-2014, 05:04 AM

Hello John

I already did that, but the error message persists... As I told you, I think the problem in my case is the whitespace. The program seems to stop reading the name at the first white space (after the 14-character string that is common in many cases). So what I need to do now is replace the white spaces in the read names with colons, and then maybe use the option that you suggest so that the program accepts long read names.

So what I need to find now is how to replace these white spaces in the read name lines... that is where I am stuck now :/ The problem when you start with Linux is that it takes a lot of time to find the right commands and/or scripts for whatever you need.

Thanks

Olalla

**Olalla** · 10-31-2014, 05:05 AM

Hey WhatsOEver... thanks for this!! I will try it now.

**maubp** · 11-01-2014, 12:49 PM

The read name GIMXFMA02G21Y1 looks like a Roche 454 read name, but it should be unique and only occur once in your FASTQ file. If you are saying it appears several times then it makes sense that MIRA is complaining about duplicates.
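
One way to check this on a single (for example, concatenated) file is to count how often each name occurs. The sketch below is only an illustration: it assumes 4-line-per-record FASTQ and takes the text before the first whitespace as the read name; `duplicated_names` is a hypothetical helper name.

```python
from collections import Counter


def duplicated_names(lines):
    """Return {read_name: count} for names that occur more than once."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 0:  # header line of each 4-line record
            fields = line[1:].split()  # drop '@', split on whitespace
            if fields:
                counts[fields[0]] += 1
    return {name: count for name, count in counts.items() if count > 1}
```

Passing an open file handle, as in `duplicated_names(open("individual1.fastq"))` with a hypothetical file name, returns an empty dictionary when every name is unique.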

Problem with MIRA
