SEQanswers

Old 06-24-2016, 05:49 AM   #1
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default All sequence bases have the same quality score.

Hi all,
I am doing some analysis on the dataset here:

https://trace.ddbj.nig.ac.jp/DRASear...acc=ERX1434776

Some basic info about the data, so you don't need to follow the link above:
----
Illumina Genome Analyzer IIx paired end sequencing
shotgun sequencing
WGS
Pseudomonas fluorescens
Paired-end
----

When I searched for 'Genome Analyzer IIx', I could find the quality encoding information. However, I have seen that the quality score for every base is '?', e.g.

@ERR1363506.14 226/1
GTCCACTACAGGTCGAAGCCGAAGGCGACGAGTTGCGTGTTTACGCGCCCAATCGTTTTGTTCTCGACTGGGTCAACGAGAAGTACCTGAGCCGCGTGCT
+
????????????????????????????????????????????????????????????????????????????????????????????????????

My question is:
Is it normal to have an identical quality score for all bases?
When I analyze the data, some bioinformatics tools report errors saying that they cannot detect the quality offset or quality encoding. Is the above the cause of these errors?
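To make the second question concrete, here is a minimal sketch of why a range-based guess can fail on this read. It assumes a naive heuristic (characters below ';' imply an offset of 33, characters above 'J' imply an offset of 64); real tools may use different rules, and the function names are made up for illustration.

```python
def summarize_quality(qual: str) -> dict:
    """Report the distinct quality characters and the scores they would imply."""
    codes = sorted({ord(c) for c in qual})
    return {
        "distinct_chars": "".join(chr(c) for c in codes),
        "phred33": [c - 33 for c in codes],  # score if the offset is 33
        "phred64": [c - 64 for c in codes],  # score if the offset is 64
    }


def guess_offset(qual: str) -> str:
    """Naive range-based guess (an assumption, not any specific tool's logic)."""
    lo = min(ord(c) for c in qual)
    hi = max(ord(c) for c in qual)
    if lo < 59:   # below ';': too low for any +64 encoding (Solexa starts at -5)
        return "phred33"
    if hi > 74:   # above 'J': beyond Q41 under +33, so +64 is more plausible
        return "phred64"
    return "ambiguous"


qual = "?" * 100  # the quality line from the read shown above
print(summarize_quality(qual))
print(guess_offset(qual))
```

With only '?' (ASCII 63) present, the string would mean Q30 under an offset of 33 and -1 under an offset of 64, so a heuristic that relies on the observed range has nothing to separate the two cases.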

Thanks.
Old 06-24-2016, 06:25 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,131
Default

This is an odd dataset.

First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.
Old 06-24-2016, 08:15 AM   #3
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default

Quote:
Originally Posted by GenoMax View Post
This is an odd dataset.

First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.

Thanks for your answer.

This data can be found in DRASearch, NCBI SRA, and EBI.
All of these sources show the same strange quality values.
I wasn't able to find contact info for the submitter, but I emailed the EBI help desk and got the following reply:

Quote:
CRAM files are compressed NGS read files. The sequences can be retrieved by
using the reference, but quality scores are quantised into a smaller range in
order to use less space. It looks like the compression on this CRAM file is such
that all quality scores average into the same value. These are probably low
value quality scores, or the quality scores were not available in the first
place.
I would just leave the data, or set the --offset =33 for the tool, just to pass the analysis.
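The quantisation idea in the EBI reply can be sketched as follows. The bin tables here are hypothetical and chosen only for illustration; the thread does not say which scheme, if any, was actually applied to this submission.

```python
def quantize(scores, bins):
    """Replace each Phred score with the representative value of its bin."""
    out = []
    for q in scores:
        for low, high, rep in bins:
            if low <= q <= high:
                out.append(rep)
                break
        else:
            raise ValueError(f"score {q} is not covered by any bin")
    return out


def encode_phred33(scores):
    """Turn Phred scores into a FASTQ quality string using an offset of 33."""
    return "".join(chr(q + 33) for q in scores)


# Hypothetical bin tables: (lowest score, highest score, representative score).
EIGHT_BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
              (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]
ONE_BIN = [(0, 93, 30)]  # extreme case: every score collapses to Q30

original = [2, 12, 25, 31, 38, 40]
print(encode_phred33(quantize(original, EIGHT_BINS)))
print(encode_phred33(quantize(original, ONE_BIN)))
```

In the single-bin case every base ends up as Q30, which is '?' under an offset of 33, matching what the files show. That is consistent with the reply but does not prove it.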
Old 06-24-2016, 08:20 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,131
Default

Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?

Edit: I think the third file likely contains single reads whose mate was discarded during trimming. You can check that possibility by seeing whether the headers in it are absent from the _1 and _2 files.
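The header check suggested here could look roughly like the sketch below. It assumes plain four-line FASTQ records and that the read name is the first whitespace-delimited token of the header (as in `@ERR1363506.14 226/1`); the function names and the tiny in-memory records are made up for illustration.

```python
import io


def read_ids(handle):
    """Collect read names from a four-line-per-record FASTQ stream."""
    ids = set()
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue  # only the first line of each record is a header
        header = line.strip()
        if not header:
            continue  # tolerate a trailing blank line
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1} is not a FASTQ header: {header!r}")
        ids.add(header[1:].split()[0])
    return ids


def shared_with_pairs(single_ids, r1_ids, r2_ids):
    """Names from the third file that also occur in the _1 or _2 file."""
    return single_ids & (r1_ids | r2_ids)


# Tiny in-memory stand-ins for the three files.
r1 = io.StringIO("@ERR0.1 1/1\nACGT\n+\n????\n@ERR0.2 2/1\nACGT\n+\n????\n")
r2 = io.StringIO("@ERR0.1 1/2\nTTGA\n+\n????\n@ERR0.2 2/2\nTTGA\n+\n????\n")
single = io.StringIO("@ERR0.3 3/1\nGGCC\n+\n????\n")

overlap = shared_with_pairs(read_ids(single), read_ids(r1), read_ids(r2))
print("third file holds only unpaired reads" if not overlap else sorted(overlap))
```

For real data, replace the `io.StringIO` objects with `open(...)` calls on the three FASTQ files. An empty overlap would support the idea that the third file holds reads without a mate in the other two.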

Last edited by GenoMax; 06-24-2016 at 08:37 AM.
Old 06-24-2016, 08:31 AM   #5
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default

Quote:
Originally Posted by GenoMax View Post
Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?
Usually, when splitting the .sra files of paired-end reads with fastq-dump from the SRA Toolkit, the --split-3 parameter is used. Its documentation says:

Legacy 3-file splitting for mate-pairs: First 2 biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only 1 biological read is dumpable - it is placed in *.fastq.

So the smaller file, usually called the unmapped sequences, contains the reads whose mate could not be found.

https://www.biostars.org/p/11111/
http://www.ncbi.nlm.nih.gov/Traces/s...ew=toolkit_doc
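The three-file behaviour described in that documentation excerpt can be mimicked with a small sketch. This is not the SRA Toolkit's implementation; `split3` and the spot dictionary are hypothetical, and the logic only follows the quoted sentence.

```python
def split3(spots):
    """Distribute reads the way the quoted --split-3 description puts it.

    spots maps a spot name to the list of its dumpable biological reads.
    """
    out = {"_1": [], "_2": [], "single": []}
    for name, reads in spots.items():
        if len(reads) >= 2:
            # The first two biological reads go to the paired files.
            out["_1"].append((name, reads[0]))
            out["_2"].append((name, reads[1]))
        elif len(reads) == 1:
            # A lone dumpable read goes to the unsuffixed third file.
            out["single"].append((name, reads[0]))
    return out


# Hypothetical spots: two complete pairs and one read without a mate.
spots = {
    "spot.1": ["ACGT", "TTGA"],
    "spot.2": ["GGCC", "AATT"],
    "spot.3": ["CCGG"],
}
for suffix, records in split3(spots).items():
    print(suffix, [name for name, _ in records])
```

Under this reading, a name found in the third file should never appear in the _1 or _2 files.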
Old 06-24-2016, 08:38 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,131
Default

See the edit I just made to the post above.
Old 06-24-2016, 08:45 AM   #7
xiangwulu
Member
 
Location: ireland

Join Date: Apr 2014
Posts: 18
Default

Quote:
Originally Posted by GenoMax View Post
See the edit I just made to the post above.
Saw it.
I think there is no trimming involved at/before that stage. The third file is a collection of unloved ones.
Tags
quality illumina
