Variant Analysis and Genome Assembly: Recommended Tools for Next-Level Sequencing Analysis
Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50 base pairs), which include single nucleotide variants (SNVs) and insertions and deletions (indels); and longer variants (≥50 bp), such as structural variants (SVs). Similarly, he divides variant analysis tools into two categories, one tailored for short-read data and another specifically designed to handle long-read data.
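The size-based split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `classify_variant` and the constant `SV_MIN_LENGTH` are hypothetical, alleles are assumed to be explicit sequences (not symbolic alleles such as `<DEL>`), and variant size is approximated by the difference in allele lengths.

```python
# Threshold from the article: variants of 50 bp or more count as structural variants.
SV_MIN_LENGTH = 50


def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF/ALT allele pair by size (hypothetical helper).

    Assumes explicit sequence alleles; symbolic alleles are not handled.
    """
    if len(ref) == len(alt):
        # Same length: a single-base change is an SNV; longer ones are
        # reported here simply as substitutions.
        return "SNV" if len(ref) == 1 else "substitution"
    size = abs(len(ref) - len(alt))
    return "SV" if size >= SV_MIN_LENGTH else "indel"


print(classify_variant("A", "G"))             # single-base change
print(classify_variant("A", "ATG"))           # 2 bp insertion
print(classify_variant("A" + "C" * 60, "A"))  # 60 bp deletion
```

Real callers apply more nuanced rules (for example, how complex or symbolic alleles are measured), so the threshold check here should be read as a rule of thumb rather than any tool's actual logic.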
One exception to this separation is PRINCESS, a comprehensive variant analysis tool that takes the reads, aligns them using one of several available aligners, and then calls both short and structural variants while also phasing them. PRINCESS can detect haplotype-resolved SNVs, SVs, and methylation events. Mahmoud is a developer of this tool, which also provides a framework for QC and long-read analysis.
Short variants with short reads
Our next recommendation comes from Tang, who suggests using GATK (Genome Analysis Toolkit) for variant analysis. This analysis toolkit is an industry standard for variant discovery, and it provides a wide range of tools for different variant workflows. In addition, Tang explains that Illumina's analysis platform, DRAGEN (Dynamic Read Analysis for GENomics), is another great tool if one has access to it. The combination of these two resources forms DRAGEN-GATK, which can further streamline and improve the variant analysis process.
Mahmoud recommends two more resources for short variant work using short reads. The first is FreeBayes, a haplotype-based variant detector. It can detect variants in regions with low read coverage and is well-suited for large-scale sequencing projects. The other recommendation is samtools, one of the most widely used toolsets in sequencing analysis. Rather than a single tool, samtools is a collection of programs for processing and analyzing DNA sequence alignment data, enabling operations such as format conversion, sorting, indexing, and filtering. Variant calling itself is performed with its companion project, bcftools.
Short variants with long reads
Beginning with DeepVariant, Mahmoud suggests several tools that can be used with sequencing data generated from long-read instruments. DeepVariant is a deep learning-based variant caller that works with both short- and long-read data and is capable of detecting variants in complex regions. The next tool, Clair, is specifically used for calling variants with single-molecule sequencing data. It is a germline small variant caller that uses pileup data and deep neural networks. The creators of Clair have also more recently released an updated version, Clair3, as well as Clair3-trio, a Nanopore-specific variant caller designed for trio variant calling.
Two other highly utilized variant callers for long reads are Longshot and Medaka. Longshot uses haplotype information from the long-read data to correctly detect and phase SNVs in diploid genomes. Alternatively, Medaka is an ONT-specific tool designed for creating consensus sequences and variant calls. Users should also note that the diploid variant calling workflow for Medaka has been deprecated and it’s recommended to use Clair3 instead.
Structural variants with short reads
Parliament2 stands as a consensus SV framework that combines multiple top-performing methods to efficiently identify high-quality SVs from short-read DNA sequencing data on a large scale. Another popular tool named DELLY is specifically made for detecting various types of SVs, including deletions, tandem duplications, inversions, and translocations. It utilizes paired-end and split-read data to accurately identify these structural variations.
LUMPY, a commonly employed SV caller, combines paired-end and split-read signals with read-depth information, enhancing its ability to identify SVs accurately. Finally, Manta is a versatile solution for SV detection that utilizes both paired-end and split-read data to detect a wide range of structural variants, such as deletions, insertions, inversions, and complex rearrangements.
Structural variants with long reads
The first tool Mahmoud suggests for detecting structural variants from long-read data is Sniffles. There is now a newer version called Sniffles2, which offers a complete redesign with enhanced capabilities for germline SV calling. It also facilitates family and population SV calling on a larger scale and introduces new approaches for identifying mosaic SVs. In addition, cuteSV is a long-read-based approach that enables in-depth analysis of the complex signatures of structural variants inferred from read alignments. Dipcall, originally developed for constructing the syndip benchmark dataset, is a reference-based variant-calling pipeline designed for a pair of phased haplotype assemblies. The last resource, PBSV, is a suite of tools for PacBio long-read sequencing data. These tools call and analyze SVs in diploid genomes and support both single-sample and joint (multi-sample) calling.
Genome assembly and analysis tools
Assembling genomes involves different tools depending on the read lengths used for the process. True to their name, assemblies from short reads utilize smaller DNA fragments that are generally high in coverage but have a limited ability to resolve complex genomic regions. Conversely, long-read assemblies use longer DNA fragments, allowing for higher resolution of complex genomic regions but typically have lower coverage.
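Since coverage comes up repeatedly when comparing short- and long-read assemblies, a rough estimate can be computed from the standard relationship coverage ≈ (number of reads × mean read length) / genome size. The sketch below assumes a hypothetical helper name, `expected_coverage`, and the numbers in the usage example are purely illustrative, not taken from the article.

```python
def expected_coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Estimate average sequencing depth (hypothetical helper).

    coverage = total sequenced bases / genome size
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_length / genome_size


# Illustrative numbers only: 600 million 150 bp reads on a 3 Gb genome.
print(expected_coverage(600_000_000, 150, 3_000_000_000))
```

This is an average over the whole genome; actual depth varies by region, which is one reason complex regions can remain hard to resolve even when overall coverage is high.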
Short-read assemblies
For short-read genome assemblies, Mahmoud recommends SPAdes, ABySS, Velvet, and SOAPdenovo2. SPAdes is known for its ability to handle diverse sequencing data types and produce high-quality assemblies. ABySS employs a de Bruijn graph approach and is particularly adept at handling large and complex genomes. Velvet stands out for its fast and memory-efficient performance, making it suitable for small to medium-sized genomes. Additionally, SOAPdenovo2 is specifically designed to handle large and complex genomes while aiming to minimize errors during the assembly process. Each of these assemblers offers valuable tools for researchers working with different genomic data types and sizes, catering to various assembly needs.
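To give a feel for the de Bruijn graph approach mentioned above, here is a toy construction in Python. It is a minimal sketch, not how ABySS or any other assembler is implemented: the function name `build_de_bruijn_graph` is hypothetical, and the example ignores sequencing errors, reverse complements, and k-mer counts, all of which real assemblers must handle.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from reads (hypothetical helper).

    Nodes are (k-1)-mers; each k-mer found in a read adds an edge from
    its prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors[kmer[:-1]].add(kmer[1:])
    # Return plain, sorted lists so the output is deterministic.
    return {node: sorted(nexts) for node, nexts in successors.items()}


graph = build_de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node, nexts in graph.items():
    print(node, "->", nexts)
```

An assembler would then walk paths through a graph like this to reconstruct contigs; overlapping reads share k-mers, so they collapse onto the same edges without needing pairwise read comparisons.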
Long-read assemblies
There are several influential tools Mahmoud advocates for long-read assembly. Canu is a popular choice that can effectively handle various types of long-read data and produce high-quality assemblies. Shasta, along with its polishing algorithms MarginPolish and HELEN, is a de novo long-read assembler that offers reliable assembly solutions. Specifically designed for long-read data, Flye is a tool recognized for its ability to generate highly accurate assemblies. For metagenome assembly, metaFlye provides a scalable solution using repeat graphs. Lastly, wtdbg2 is a de novo assembler that employs a fuzzy Bruijn graph approach, making it well-suited for handling long-read data.
Attached is a PDF containing links to the websites, GitHub pages, and original publications for each resource. If you use a tool that wasn’t listed in this article, log in and tell us about the tool in the comments below! And don’t forget to read our final article on tool recommendations.
About the Author
Benjamin Atha holds a B.A. in biology from Hood College and an M.S. in biological sciences from Towson University. With over 9 years of hands-on laboratory experience, he's well-versed in next-generation sequencing systems. Ben is currently the editor for SEQanswers.