SEQanswers

05-04-2021, 02:57 AM   #41
Poshi
Junior Member
 
Location: At home

Join Date: May 2010
Posts: 5
Chastity filter processing

I posted this message as a ticket in the BBMap repository, but since I saw very little activity there I'm cross-posting the same issue here. I hope I'm not bothering anyone.

In Illumina 1.8+ reads, each read is marked as filtered out or not. This is known as the chastity filter. Usually the filtered reads are removed and never used, but sometimes they end up in the FastQ files for some reason.

When using the reformat.sh tool to convert FastQ files to SAM files, there's a parameter that lets us discard reads whose header contains ' 1:Y:' or ' 2:Y:'. But when the reads are not discarded, they are written to the SAM file and this information is lost. This is a bug: the SAM format has a place to keep this information, and the current implementation fills it in incorrectly.

All reads whose chastity filter field is 'Y' should have SAM flag 512 (0x200) set, which means "read fails platform/vendor quality checks". All other reads should have this flag unset. This should also work in the opposite direction: a read with this flag set should produce a FastQ record with a 'Y'.
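A minimal Python sketch of the mapping described above, in both directions. This is not how reformat.sh is implemented; the function names are hypothetical, and it assumes the header comment has the usual four colon-separated fields (read end, filter flag, control bits, index).

```python
import re

# SAM flag 512 (0x200): read fails platform/vendor quality checks.
QCFAIL = 0x200

# Assumed comment layout: <read end>:<Y or N>:<control bits>:<index>
COMMENT_RE = re.compile(r"^(\d+):([YN]):(\d+):(.*)$")


def qcfail_flag(header: str) -> int:
    """Return 0x200 if the FastQ header marks the read as filtered, else 0."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return 0  # no comment field, so nothing is known about the filter
    match = COMMENT_RE.match(parts[1].strip())
    if match is None:
        return 0  # comment is not in the assumed layout
    return QCFAIL if match.group(2) == "Y" else 0


def filter_char(flag: int) -> str:
    """Opposite direction: SAM FLAG value to the FastQ filter character."""
    return "Y" if flag & QCFAIL else "N"
```

The returned bit would be OR-ed into whatever other flag bits the converter already sets for the read.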

Related to this bug, I have another comment. The documentation for the chastityfilter parameter says that it will discard all reads with ' 1:Y:' or ' 2:Y:'. That's good, but what happens to reads with other numbers, like ' 3:Y:'? I have files with this nomenclature, so it would be better to actually parse the fields and discard reads with a 'Y' in the second field, whatever the first field contains.
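As an illustration of field-based filtering, here is a short Python sketch that drops any read with a 'Y' in the second comment field, regardless of the read number. It is a stand-alone example with hypothetical function names, not the chastityfilter code in BBTools, and it assumes plain four-line FastQ records.

```python
def read_fastq(lines):
    """Group a four-line-per-record FastQ stream into tuples.

    A truncated trailing record is silently dropped in this sketch.
    """
    it = (line.rstrip("\n") for line in lines)
    return zip(it, it, it, it)


def is_filtered(header):
    """True if the second comment field is 'Y', whatever the first field is."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False  # no comment: treat the read as passing
    fields = parts[1].split(":")
    return len(fields) >= 2 and fields[1] == "Y"


def drop_filtered(records):
    """Yield only the records that pass the chastity filter."""
    return (rec for rec in records if not is_filtered(rec[0]))
```

For example, `drop_filtered(read_fastq(open("reads.fq")))` would iterate over the passing records of a hypothetical uncompressed file `reads.fq`.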

Has anyone else had these issues? How did you overcome them?
05-04-2021, 06:18 AM   #42
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,132

Quote:
That's good, but what happens to reads with other numbers, like ' 3:Y:'? I have files with this nomenclature
Files with this nomenclature have become available in the last few years, as technologies like 10x create separate files for index reads. While this does not solve the problem permanently, you could temporarily change "3:Y" to "2:Y" and then change it back after running reformat.sh.
05-06-2021, 02:05 AM   #43
Poshi
Junior Member
 
Location: At home

Join Date: May 2010
Posts: 5

Quote:
Originally Posted by GenoMax
Files with this nomenclature have become available in the last few years, as technologies like 10x create separate files for index reads. While this does not solve the problem permanently, you could temporarily change "3:Y" to "2:Y" and then change it back after running reformat.sh.
Sure. In fact, this is not a big deal for me. We decided to ignore this number and generate it in a more standard way (first end -> 1, second end -> 2, first UMI -> 3, second UMI -> 4), regardless of the original numbering.

For context: we are converting FastQ files into unmapped CRAMs for storage, and the FastQ to SAM intermediate conversion is done with reformat.sh. My main issue here is keeping the QC vendor flag in place.

I already have a workaround, as I also have to keep other things like the UMIs (if present) and the barcode. But these bits of information require adding tags, which are optional, so I'm not complaining about them. The QC vendor flag, however, is not optional: it is always there, and leaving it unset means you are assigning a "QC vendor pass" regardless of the information in the input.

In any case, if someone wants to look at how to carry all the information from the FastQ file into a SAM file, the four fields of the comment in the ID line are the candidates:
  • The read end, which could be deduced later if you agree to standardize the output
  • The QC vendor flag, which will be encoded in the FLAG field
  • The control bits, which should be zero and potentially ignored (with no clear place to store them in case it is needed)
  • The index barcode, which should be stored as the BC:Z: tag

The other data present in the FastQ file is already present in the SAM (read name, sequence and qualities).
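To make the field-by-field mapping above concrete, here is a rough Python sketch that turns one FastQ record into an unmapped SAM line. It is only an illustration under stated assumptions, not the reformat.sh implementation: the function name is hypothetical, the comment is assumed to have exactly four colon-separated fields, control bits are discarded, and read ends other than 2 are treated as the first end. Replacing '+' with '-' in dual-index barcodes follows my reading of the SAM optional-tags recommendation for BC and may not suit every pipeline.

```python
def fastq_to_unmapped_sam(header, seq, qual, paired=True):
    """Build one unmapped SAM line from a FastQ record (sketch only)."""
    name, _, comment = header[1:].partition(" ")
    fields = comment.split(":", 3)
    if len(fields) == 4:
        read_end, filtered, _control, index = fields
    else:
        # Assumption: with no usable comment, treat the read as passing.
        read_end, filtered, index = "1", "N", ""

    flag = 0x4  # segment unmapped
    if paired:
        flag |= 0x1 | 0x8  # paired, mate unmapped
        flag |= 0x80 if read_end == "2" else 0x40  # last / first segment
    if filtered == "Y":
        flag |= 0x200  # fails platform/vendor quality checks

    # Unmapped record: RNAME '*', POS 0, MAPQ 0, CIGAR '*', RNEXT '*',
    # PNEXT 0, TLEN 0.
    cols = [name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual]
    if index:
        cols.append("BC:Z:" + index.replace("+", "-"))
    return "\t".join(cols)
```

For a second-end read that failed the filter, the FLAG comes out as 1 + 4 + 8 + 128 + 512 = 653, so the vendor QC information survives the conversion.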

Tags
ascii33, ascii64, bbduk, bbmap, bbtools, fasta, fastq, interleavei33, quality trim, reformat, scarf, subsample
