Variant Analysis and Genome Assembly: Recommended Tools for Next-Level Sequencing Analysis
Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50 base-pairs), which include single nucleotide variants (SNVs) and insertions and deletions (indels); and longer variants (≥50 bp) such as structural variants (SV). Similarly, he divides variant analysis tools into two categories, one tailored for short-read data and another specifically designed to handle long-read data.
One exception to this separation is PRINCESS, a comprehensive variant analysis tool that takes the reads, aligns them using several available tools, and then calls short and long variants while additionally phasing them. PRINCESS can detect haplotype-resolved SNVs, SVs, and methylation events. Mahmoud is a developer of this powerful tool, which has the framework to perform QC and long-read analysis.
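The size-based split Mahmoud describes can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `classify_variant` and the constant `SV_MIN_LENGTH` are hypothetical, and it assumes alleles are given as explicit sequences (as in the REF and ALT columns of a VCF record), not as symbolic alleles such as `<DEL>`.

```python
SV_MIN_LENGTH = 50  # bp; the short/structural boundary used in the article


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant by the length difference between its alleles.

    Assumes explicit allele sequences; symbolic ALT alleles are not handled.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(ref) - len(alt))
    if size >= SV_MIN_LENGTH:
        return "SV"
    if size > 0:
        return "indel"
    # Same-length, multi-base replacement (not covered by the article's split).
    return "substitution"


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base change
    print(classify_variant("A", "ATTT"))          # 3 bp insertion
    print(classify_variant("A" + "C" * 60, "A"))  # 60 bp deletion
```

Real pipelines usually read the size from annotations such as an SV length field when one is present, so treat the allele-length comparison here as an illustration of the threshold and not as a replacement for a caller's own classification.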
Short variants with short reads
Our next recommendation comes from Tang, who suggests using GATK (Genome Analysis Toolkit) for variant analysis. This toolkit is an industry standard for variant discovery and provides a wide range of tools for different variant workflows. In addition, Tang explains that Illumina's analysis platform, DRAGEN (Dynamic Read Analysis for GENomics), is another great tool if one has access to it. The combination of these two resources forms DRAGEN-GATK, which can further streamline and improve the variant analysis process.
Mahmoud recommends two more resources for short variant work using short reads. The first is FreeBayes, a haplotype-based variant detector. It can detect variants in regions with low read coverage and is well-suited for large-scale sequencing projects. The other recommendation is samtools, one of the most widely used toolsets in variant detection workflows. Rather than a single tool, samtools is a collection of programs for processing and analyzing sequence alignment data, supporting operations such as format conversion, sorting, indexing, and filtering. Variant calling itself is performed with its companion program, bcftools.
Short variants with long reads
Beginning with DeepVariant, Mahmoud suggests several tools that can be used with sequencing data generated from long-read instruments. DeepVariant can work with short- and long-read data, and it uses a deep learning-based variant caller that is capable of detecting variants in complex regions. The next tool, Clair, is specifically used for calling variants with single-molecule sequencing data. It is a germline small variant caller that uses pileup data and deep neural networks. The creators of Clair have also more recently released an updated version, Clair3, and a Nanopore-specific variant caller, Clair3-trio, which is designed for trio variant calling.
Two other highly utilized variant callers for long reads are Longshot and Medaka. Longshot uses haplotype information from the long-read data to correctly detect and phase SNVs in diploid genomes. Alternatively, Medaka is an ONT-specific tool designed for creating consensus sequences and variant calls. Users should also note that the diploid variant calling workflow for Medaka has been deprecated and it’s recommended to use Clair3 instead.
Structural variants with short reads
Parliament2 stands as a consensus SV framework that combines multiple top-performing methods to efficiently identify high-quality SVs from short-read DNA sequencing data on a large scale. Another popular tool named DELLY is specifically made for detecting various types of SVs, including deletions, tandem duplications, inversions, and translocations. It utilizes paired-end and split-read data to accurately identify these structural variations.
LUMPY, another commonly employed SV caller, integrates paired-end and split-read evidence and also incorporates read-depth information, enhancing its ability to identify SVs accurately. Finally, Manta is a versatile solution for SV detection that utilizes both paired-end and split-read data to detect a wide range of structural variants, such as deletions, insertions, inversions, and complex rearrangements.
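The consensus idea behind a framework such as Parliament2 can be sketched in a few lines: keep an SV only when several independent callers report roughly the same event. The code below is a simplified illustration under stated assumptions, not Parliament2's actual method. The `SVCall` record, the 100 bp breakpoint tolerance, and the two-caller threshold are all hypothetical choices made for the example.

```python
from typing import List, NamedTuple


class SVCall(NamedTuple):
    caller: str   # name of the tool that produced the call
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "INV", "DUP"


def same_event(a: SVCall, b: SVCall, tolerance: int = 100) -> bool:
    """Two calls match if type and chromosome agree and both breakpoints are close."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= tolerance
        and abs(a.end - b.end) <= tolerance
    )


def consensus_calls(calls: List[SVCall], min_callers: int = 2,
                    tolerance: int = 100) -> List[SVCall]:
    """Greedily cluster matching calls; keep clusters seen by enough distinct callers."""
    clusters: List[List[SVCall]] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            # Compare against the first member, which acts as the representative.
            if same_event(cluster[0], call, tolerance):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [
        cluster[0]
        for cluster in clusters
        if len({c.caller for c in cluster}) >= min_callers
    ]


if __name__ == "__main__":
    calls = [
        SVCall("delly", "chr1", 1000, 2000, "DEL"),
        SVCall("lumpy", "chr1", 1030, 1990, "DEL"),
        SVCall("manta", "chr1", 5000, 6000, "INV"),
    ]
    for call in consensus_calls(calls):
        print(call)
```

In this toy input, the deletion reported by two callers survives and the inversion seen by only one caller is dropped. Production frameworks typically use more careful matching (for example reciprocal overlap) and genotype the merged calls afterwards.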
Structural variants with long reads
The first tool Mahmoud suggests for detecting structural variants from long-read data is Sniffles. There is now a newer version called Sniffles2, which offers a complete redesign with enhanced capabilities for germline SV calling. It also facilitates family and population SV calling on a larger scale and introduces innovative approaches for identifying mosaic SVs. In addition, cuteSV is a long-read-based approach that enables in-depth analysis of the complex signatures of structural variants inferred from read alignments. Originally developed for constructing the syndip benchmark dataset, Dipcall is a reference-based variant-calling pipeline designed for a pair of phased haplotype assemblies. The last resource, pbsv, is a suite of tools for PacBio long-read sequencing data. These tools call and analyze SVs in diploid genomes, with both single-sample and joint (multi-sample) calling provided.
Genome assembly and analysis tools
Assembling genomes involves different tools depending on the read lengths used for the process. True to their name, assemblies from short reads utilize smaller DNA fragments that are generally high in coverage but have a limited ability to resolve complex genomic regions. Conversely, long-read assemblies use longer DNA fragments, which allow for higher resolution of complex genomic regions but typically come with lower coverage.
Short-read assemblies
For short-read genome assemblies, Mahmoud recommends SPAdes, ABySS, Velvet, and SOAPdenovo2. SPAdes is known for its ability to handle diverse sequencing data types and produce high-quality assemblies. ABySS employs a de Bruijn graph approach and is particularly adept at handling large and complex genomes. Velvet stands out for its fast and memory-efficient performance, making it suitable for small to medium-sized genomes. Additionally, SOAPdenovo2 is specifically designed to handle large and complex genomes while aiming to minimize errors during the assembly process. Each of these assemblers offers valuable tools for researchers working with different genomic data types and sizes, catering to various assembly needs.
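To make the de Bruijn graph approach mentioned above more concrete, here is a toy sketch of the core idea: reads are broken into k-mers, each k-mer becomes an edge between its (k-1)-base prefix and suffix, and unambiguous paths through the graph are spelled out as contigs. This is a teaching example only and is not how ABySS or any other production assembler is implemented. The function names are hypothetical, and the sketch ignores sequencing errors, reverse complements, and k-mer coverage.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor.

    Simplification: only out-degree is checked, and the walk stops at any
    node it has already visited so that cycles cannot loop forever.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA"]  # two overlapping toy reads
    graph = build_de_bruijn(reads, k=4)
    print(walk_unambiguous(graph, "ATG"))
```

The two toy reads overlap by four bases, so the walk from `ATG` reconstructs a single sequence spanning both. Repeats longer than k create branching nodes where this simple walk stops, which is one reason short-read assemblies struggle in complex regions.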
Long-read assemblies
There are several influential tools Mahmoud advocates for long-read assembly. Canu is a popular choice that can effectively handle various types of long-read data and produce high-quality assemblies. Shasta, along with its polishing algorithms MarginPolish and HELEN, is a de novo long-read assembler that offers reliable assembly solutions. Specifically designed for long-read data, Flye is a tool recognized for its ability to generate highly accurate assemblies. For metagenome assembly, metaFlye provides a scalable solution using repeat graphs. Lastly, wtdbg2 is a de novo assembler that employs a fuzzy Bruijn graph approach, making it well-suited for handling noisy long-read data.
Attached is a PDF containing links to the websites, GitHub pages, and original publications for each resource. If you use a tool that wasn’t listed in this article, log in and tell us about the tool in the comments below! And don’t forget to read our final article on tool recommendations.
About the Author
Benjamin Atha holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.