My first variant calling workflow

Eurioste

Junior Member

Join Date: Jun 2017

Posts: 5

06-30-2017, 08:50 AM

Hello, I'm currently learning how to process NGS data using the Galaxy platform. This is the first time I have worked with NGS data, and I find myself overwhelmed by the abundance of different variant calling workflows and available tools. I have a molecular biology background and I'm learning this on my own through online courses, so I would like some feedback in case I'm making mistakes. While I can code in Python, I need to build this workflow in Galaxy as part of a course.

For the purpose of learning, I was given raw FASTQ reads from an Illumina MiSeq, sequenced as paired ends 125 bp in length. The data are targeted re-sequencing data for a father, mother and child trio. I need to create a workflow to identify polymorphic sites in all three individuals.

I started a workflow based on the references below:

http://folk.uio.no/jonkl/StuffForMBV-INFx410/Articles/AAltmann.pdf

https://www.biomedcentral.com/content/supplementary/1756-0500-7-314-S1.pdf

My current, incomplete attempt is available at the link below. Some steps from the references were skipped for the sake of simplicity. I'm making my best effort to understand what each step really does and why it is used. You can import the workflow into Galaxy for a better view:

https://usegalaxy.org/u/eurioste/w/variant-calling-on-trio

Briefly, the paired-end reads had 10 bp trimmed from the 3' end (based on the FASTQ report; this step is not in the workflow), resulting in high-quality reads of about 140 bp. The paired reads for each individual were aligned to the human_g1k_v37 reference with BWA-MEM, with different read group information assigned to each individual. The resulting BAM for each individual was pre-processed with Picard: sorting, removal of ambiguous reads and duplicates, and updating of mate-pair information. I'm omitting indel realignment and base quality recalibration on purpose. The resulting three BAMs could be used for variant calling, but now I have some questions.
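To make the read group part concrete, here is a minimal Python sketch of how a per-individual BWA-MEM command with its own read group could be assembled outside Galaxy. It only builds the argument list and does not run anything. The sample names, file names and the helper `bwa_mem_command` are hypothetical placeholders, not taken from my workflow; `-R` is the BWA-MEM option for the read group header line.

```python
def bwa_mem_command(sample, ref_fasta, fastq_r1, fastq_r2, platform="ILLUMINA"):
    """Build a `bwa mem` argument list with a sample-specific read group.

    The read group is passed with literal "\\t" separators, which BWA
    converts to tabs when it writes the @RG header line.
    """
    read_group = "\\t".join([
        "@RG",
        f"ID:{sample}",   # read group identifier (assumed: one per individual)
        f"SM:{sample}",   # sample name, used by variant callers to tell samples apart
        f"LB:{sample}",   # library name (assumed: one library per individual)
        f"PL:{platform}", # sequencing platform
    ])
    return ["bwa", "mem", "-R", read_group, ref_fasta, fastq_r1, fastq_r2]


# Hypothetical file names for the trio.
trio = {
    "father": ("father_R1.fastq", "father_R2.fastq"),
    "mother": ("mother_R1.fastq", "mother_R2.fastq"),
    "child": ("child_R1.fastq", "child_R2.fastq"),
}

commands = {
    sample: bwa_mem_command(sample, "human_g1k_v37.fasta", r1, r2)
    for sample, (r1, r2) in trio.items()
}

for sample, cmd in commands.items():
    print(sample, " ".join(cmd))
```

Because each BAM carries its own `SM` tag, the reads should remain attributable to the right individual even if the files are later merged.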

I'm expected to count the number of variants of different types above a certain quality threshold.
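As a rough idea of what I mean by that count, here is a minimal Python sketch that tallies variants by type from VCF text lines, keeping only records with QUAL at or above a threshold. It is a simplification under my own assumptions: the classification looks only at REF/ALT allele lengths, the threshold of 30 is an arbitrary example value, and the function names and example records are made up for illustration.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length (simplified)."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "other"  # symbolic, spanning-deletion or breakend alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variants(vcf_lines, min_qual=30.0):
    """Count ALT alleles per type among records with QUAL >= min_qual."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts, qual = fields[3], fields[4], fields[5]
        if qual == "." or float(qual) < min_qual:
            continue  # missing or low quality
        for alt in alts.split(","):
            if alt != ".":
                counts[classify_variant(ref, alt)] += 1
    return counts


# Made-up example records (not real data).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\t.\t.",
    "1\t200\t.\tAT\tA\t45\t.\t.",
    "1\t300\t.\tC\tCTT\t10\t.\t.",
    "1\t400\t.\tG\tA,GT\t99\t.\t.",
]

print(dict(count_variants(example, min_qual=30.0)))
```

Multi-allelic records are counted once per ALT allele here; counting once per site would be an equally reasonable choice.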

I'm not sure whether it was a good choice to align the data for each individual separately. Is it correct to do variant calling on each individual separately? Can I still merge these BAM files with Picard and then do variant calling, and will they retain the correct alignment information? Or should I merge the reads before the alignment? Can these choices alter the results of the workflow? I've read about converting FASTQ to SAM/BAM and merging the files into an unmapped BAM before alignment and subsequent pre-processing. Do I really need to do that?

Is my workflow actually producing useful data? Please let me know if I'm making a mistake; I'm not sure whether what I did is right. Please explain things in detail, because I'm still unfamiliar with NGS data processing.

Thanks in advance

Eduardo