Seqanswers Leaderboard Ad

**blancha** · 07-11-2014, 03:26 AM

Your argument that TopHat has changed the quality scores would be more convincing if you posted the same sequence before and after alignment with TopHat.
You've posted two different sequences, with obviously different quality scores.
Can you post the same sequence, before and after alignment with TopHat?

TopHat2 does have optional arguments to handle quality scores in the Solexa scale, incidentally.

--solexa-quals Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.

**kmcarr** · 07-11-2014, 04:20 AM

Originally posted by pengchy View Post

fastq

Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh

after tophat2 mapping, the fastq base quality was changed into:

Code:

FCC22UBACXX:2:1115:5744:8409#ACACGCGG   256     EQ110773        344     3       49M     *       0       0       ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351   CP:i:3745       HI:i:0

also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

Any idea?
Thank you.

Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.

**blancha** · 07-11-2014, 04:58 AM

Very informative answer kmcarr.
Much better than mine.

**pengchy** · 07-11-2014, 06:16 AM

Originally posted by kmcarr View Post

Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.

Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:

Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII

the output bam will be like this:

Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1

Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?

**kmcarr** · 07-11-2014, 12:02 PM

Originally posted by pengchy View Post

Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:

Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII

the output bam will be like this:

Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1

Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?

Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?

**pengchy** · 07-12-2014, 06:13 AM

Originally posted by kmcarr View Post

Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?

Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:

Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi

the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:

Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ

the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:

Code:

/path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz

The "samtools view " of the output bam file give the following base quality like

Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1

Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2.

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.

**kmcarr** · 07-12-2014, 06:32 AM

Originally posted by pengchy View Post

Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:

Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi

the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:

Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ

the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:

Code:

/path/to/tophat2 [COLOR="Red"]--solexa-quals[/COLOR] --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz

The "samtools view " of the output bam file give the following base quality like

Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1

Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2 (No, this is incorrect).

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.

So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.

**pengchy** · 07-12-2014, 06:55 AM

Originally posted by kmcarr View Post

So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.

thank you kmcarr, as you found, I have wrongly set the quality parameter for tophat2.

Your suggestion is very useful and help me a lot.

Best,
Pengcheng

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

base quality encoding changed after tophat2 mapping

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News