**GenoMax** · 06-24-2016, 05:25 AM

This is an odd dataset.

First of all, there are three files for a PE dataset (I thought one was a file for the barcodes/tags, but that does not appear to be the case). The FASTQ headers are non-standard, and then there is the issue of every Q-score being set to ? for the entire dataset in all three files.

You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.
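
One quick way to confirm the "every Q-score is ?" observation is to tally the quality characters in each file. The following is a minimal sketch, assuming a plain 4-line-per-record FASTQ; the function name and the example reads are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def quality_char_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count each quality character across all records of a 4-line FASTQ."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


# Tiny in-memory example standing in for a real file (made-up reads).
example = [
    "@read1\n", "ACGT\n", "+\n", "????\n",
    "@read2\n", "GGCA\n", "+\n", "????\n",
]
counts = quality_char_counts(example)
print(counts)
print(len(counts) == 1)  # True when every base shares one quality character
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.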

**xiangwulu** · 06-24-2016, 07:15 AM

Thanks for your answer.

This data can be found in DRASearch, NCBI SRA, and EBI.
The data from all of these sources have the same strange quality values.
I wasn't able to find contact info for the submitter, but I emailed the EBI help desk and got the following reply:

CRAM files are compressed NGS read files. The sequences can be retrieved by using the reference, but quality scores are quantised into a smaller range in order to use less space. It looks like the compression on this cram file is such that all quality scores average into the same value. These are probably low value quality scores, or the quality scores were not available in the first place.

I would just leave the data as is, or set --offset =33 for the tool, just to get through the analysis.
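
To see what the ? character works out to under a given offset, the arithmetic is just the ASCII code minus the offset. This is a small sketch of that calculation only; the helper name is hypothetical and it is not tied to any particular tool's option.

```python
def phred_score(char: str, offset: int) -> int:
    """Decode one FASTQ quality character using the given ASCII offset."""
    return ord(char) - offset


print(phred_score("?", 33))  # 30
print(phred_score("?", 64))  # -1
```

With an offset of 33 the ? decodes to 30, while an offset of 64 gives a negative number, which is not a valid score, so 33 is the only one of the two offsets that makes sense for this data.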

**GenoMax** · 06-24-2016, 07:20 AM

Ok. So we have an explanation for the Q-scores but ~~what about the presence of 3 files, all of which have the same length sequence data?~~

Edit: I think the third file is likely made up of single reads that had their mate discarded during trimming. You can check that possibility by seeing whether the headers in it are absent from the _1 and _2 files.
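
A rough way to run that header check is to collect the read IDs from each file and look for overlap. The sketch below assumes 4-line FASTQ records and that the first whitespace-delimited token after the @ is the read ID and is spelled the same way in all three files; since the headers in this dataset are non-standard, that parsing may need adjusting. All names and reads here are made up.

```python
from typing import Iterable, Set


def read_ids(fastq_lines: Iterable[str]) -> Set[str]:
    """Collect read IDs from the header line of each 4-line FASTQ record."""
    ids: Set[str] = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0 and line.strip():
            # Assumption: ID is the first token after the leading '@'.
            ids.add(line[1:].split()[0])
    return ids


# Made-up stand-ins for the _1, _2 and third (unsuffixed) files.
file_1 = ["@r1\n", "ACGT\n", "+\n", "????\n", "@r2\n", "GGCA\n", "+\n", "????\n"]
file_2 = ["@r1\n", "TTGA\n", "+\n", "????\n", "@r2\n", "CCAT\n", "+\n", "????\n"]
file_3 = ["@r3\n", "AACC\n", "+\n", "????\n"]

paired_ids = read_ids(file_1) | read_ids(file_2)
shared = read_ids(file_3) & paired_ids
print(shared)  # an empty set means no third-file read also appears in _1 or _2
```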

**xiangwulu** · 06-24-2016, 07:31 AM

Usually, when splitting the .sra files of paired-end reads with fastq-dump from the SRA Toolkit, the --split-3 parameter is used, which does this:

Legacy 3-file splitting for mate-pairs: First 2 biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only 1 biological read is dumpable - it is placed in *.fastq.

So the smaller file, usually called the unmapped sequence, contains the reads whose mate could not be found.
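
The routing rule in that description can be illustrated with a toy example. This is only a sketch of the rule as quoted, not fastq-dump's actual implementation; the function name, the spot names, and the sequences are all hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (spot name, sequence)


def split_three(spots: List[Tuple[str, List[str]]]) -> Dict[str, List[Record]]:
    """Route each spot's dumpable reads into '_1', '_2' or 'single' buckets."""
    out: Dict[str, List[Record]] = {"_1": [], "_2": [], "single": []}
    for name, reads in spots:
        if len(reads) >= 2:
            # First two biological reads go to the paired files.
            out["_1"].append((name, reads[0]))
            out["_2"].append((name, reads[1]))
        elif len(reads) == 1:
            # A lone dumpable read goes to the unsuffixed file.
            out["single"].append((name, reads[0]))
    return out


spots = [("s1", ["ACGT", "TTGA"]), ("s2", ["GGCC"]), ("s3", ["AAAA", "CCCC"])]
result = split_three(spots)
print(result["_1"])      # [('s1', 'ACGT'), ('s3', 'AAAA')]
print(result["_2"])      # [('s1', 'TTGA'), ('s3', 'CCCC')]
print(result["single"])  # [('s2', 'GGCC')]
```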

https://www.biostars.org/p/11111/

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

**GenoMax** · 06-24-2016, 07:38 AM

See the edit I just made to the post above.

**xiangwulu** · 06-24-2016, 07:45 AM

Saw it.
I think there is no trimming involved at/before that stage. The third file is a collection of unloved ones.

All sequence bases have the same quality score.
