TaxonSampler

A web platform for taxonomic sampling and genome data integration

Overview

TaxonSampler is an open-source web platform that integrates taxonomic classifications from the Catalogue of Life (COL) with genome assembly metadata from the NCBI Datasets API.

The tool addresses a common bottleneck in large-scale genomic analyses: the manual and error-prone process of selecting species with available genome assemblies that meet specific quality criteria, while maintaining taxonomic representativeness across clades of interest.

Key Features

Taxonomic Integration

Imports and reconciles COL and NCBI taxonomies with 26+ ranks

Genome Metadata

Assembly level, contig N50, genome size, GC content, annotation

Sampling Wizard

Three-step guided workflow with scope, filters, and export

Multi-format Export

JSON, TXT, Newick, and XLSX with genome metadata

Motivation

Comparative genomic analyses frequently require assembling a representative subset of species from broad taxonomic groups. Two competing objectives must be balanced:

Genome Quality

Assemblies must meet minimum standards for contiguity (contig N50), completeness (assembly level), and annotation.

Taxonomic Coverage

Selected species should span the phylogenetic breadth of the target clade to avoid systematic sampling bias.

Performing this selection manually is time-consuming and prone to inconsistencies, particularly when hundreds or thousands of candidate species are involved. TaxonSampler automates the selection, from defining the taxonomic scope through quality filtering to export.

Methodology

Taxonomy Construction

The taxonomic hierarchy is constructed as a trie (prefix tree) from database records, then serialized to a D3.js-compatible JSON format. This allows:

  • Efficient prefix-based searches across millions of species
  • Automatic merging of manual entries with COL data
  • Intermediate rank resolution for incomplete classifications
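
The trie construction and serialization described above can be sketched as follows. This is a minimal illustration, not TaxonSampler's actual code: it assumes each database record is a dict mapping rank names to taxon names, and the function names (build_trie, subtree, to_d3), the three-rank list, and the rule of simply skipping a missing rank are all hypothetical. The output uses the nested name/children shape that d3.hierarchy reads by default.

```python
import json

# Assumed rank order; the real platform handles many more ranks.
RANKS = ("family", "genus", "species")

def build_trie(records, ranks=RANKS):
    """Insert each record's lineage into a nested-dict trie keyed by taxon name."""
    root = {}
    for record in records:
        node = root
        for rank in ranks:
            name = record.get(rank)
            if not name:
                # Incomplete classification: skip the missing rank (an assumption).
                continue
            node = node.setdefault(name, {})
    return root

def subtree(trie, lineage):
    """Follow a lineage prefix and return the subtree below it, or None."""
    node = trie
    for name in lineage:
        node = node.get(name)
        if node is None:
            return None
    return node

def to_d3(name, children):
    """Convert a trie node into a {"name": ..., "children": [...]} dict."""
    node = {"name": name}
    if children:
        node["children"] = [to_d3(child, children[child]) for child in sorted(children)]
    return node

records = [
    {"family": "Felidae", "genus": "Panthera", "species": "Panthera leo"},
    {"family": "Felidae", "genus": "Panthera", "species": "Panthera tigris"},
    {"family": "Felidae", "genus": "Felis", "species": "Felis catus"},
    {"family": "Canidae", "species": "Canis lupus"},  # genus deliberately missing
]

trie = build_trie(records)
d3_tree = to_d3("root", trie)
d3_json = json.dumps(d3_tree, indent=2)
```

Because shared lineage prefixes are stored once, records from different sources that agree on their higher ranks merge into the same branch, and subtree(trie, ["Felidae", "Panthera"]) returns every species recorded under that genus.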

Sampling Strategies

TaxonSampler supports multiple sampling approaches:

Strategy             | Description                              | Use Case
Quality Filtering    | Apply hard thresholds on genome metrics  | High-quality phylogenomics
Taxonomic Diversity  | Select N species per family/genus        | Broad evolutionary surveys
Weighted Scoring     | Rank genomes by combined quality metrics | Best-available selection
Hybrid               | Filter first, then maximize diversity    | Balanced datasets
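
As a rough illustration of how these strategies combine, the sketch below applies hard thresholds, scores the survivors, and keeps the top N per family (the Hybrid row). Everything in it is an assumption made for the example: the Genome fields, the default thresholds, the score weights, and the placeholder species and family names are hypothetical and do not come from TaxonSampler or from real assemblies. Only the four assembly-level names follow NCBI's terminology.

```python
from collections import defaultdict
from dataclasses import dataclass

# NCBI assembly levels, ordered from least to most complete.
LEVEL_RANK = {"Contig": 0, "Scaffold": 1, "Chromosome": 2, "Complete Genome": 3}

@dataclass(frozen=True)
class Genome:
    species: str
    family: str
    assembly_level: str
    contig_n50: int
    annotated: bool

def passes_filters(g, min_level="Scaffold", min_n50=1_000_000, require_annotation=False):
    """Quality Filtering: hard thresholds on genome metrics."""
    return (
        LEVEL_RANK[g.assembly_level] >= LEVEL_RANK[min_level]
        and g.contig_n50 >= min_n50
        and (g.annotated or not require_annotation)
    )

def score(g):
    """Weighted Scoring: combine metrics into one value in [0, 1] (weights are arbitrary)."""
    level = LEVEL_RANK[g.assembly_level] / 3
    contiguity = min(g.contig_n50 / 10_000_000, 1.0)
    return 0.4 * level + 0.4 * contiguity + 0.2 * (1.0 if g.annotated else 0.0)

def hybrid_sample(genomes, per_family=1, **filters):
    """Hybrid: filter first, then keep the best-scoring genomes in each family."""
    by_family = defaultdict(list)
    for g in genomes:
        if passes_filters(g, **filters):
            by_family[g.family].append(g)
    picked = []
    for family in sorted(by_family):
        ranked = sorted(by_family[family], key=score, reverse=True)
        picked.extend(ranked[:per_family])
    return picked

# Placeholder records with made-up values, for illustration only.
genomes = [
    Genome("sp_a", "Fam1", "Chromosome", 5_000_000, True),
    Genome("sp_b", "Fam1", "Chromosome", 9_000_000, True),
    Genome("sp_c", "Fam1", "Contig", 200_000, False),
    Genome("sp_d", "Fam2", "Scaffold", 2_000_000, False),
    Genome("sp_e", "Fam2", "Contig", 500_000, True),
]

picked = hybrid_sample(genomes, per_family=1)
picked_species = [g.species for g in picked]
```

With these placeholder values, sp_c and sp_e fail the thresholds, sp_b outranks sp_a within Fam1, and sp_d is the only Fam2 candidate left, so each family contributes exactly one genome.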

References & Data Sources

Catalogue of Life (COL)

A comprehensive global index of species. Provides taxonomic classifications via the ChecklistBank API.

catalogueoflife.org

NCBI Datasets API

Programmatic access to genome assembly metadata from GenBank and RefSeq. Assembly statistics updated continuously.

ncbi.nlm.nih.gov/datasets

ETE Toolkit

Python framework for phylogenetic tree analysis. Used for Newick format export.

etetoolkit.org
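
To show what a Newick export of a taxonomic hierarchy looks like, here is a minimal pure-Python sketch. It is a hand-written stand-in, not the ETE-based export the platform uses: the function names (quote_label, to_newick) and the name/children input shape are assumptions chosen for the example.

```python
import re

def quote_label(name):
    """Quote a Newick label if it contains whitespace or reserved punctuation."""
    if re.search(r"[\s()\[\]':;,]", name):
        # Inside a quoted label, a single quote is written as two single quotes.
        return "'" + name.replace("'", "''") + "'"
    return name

def to_newick(node):
    """Serialize a {"name": ..., "children": [...]} dict as a Newick subtree."""
    label = quote_label(node["name"])
    children = node.get("children", [])
    if not children:
        return label
    return "(" + ",".join(to_newick(child) for child in children) + ")" + label

tree = {
    "name": "Carnivora",
    "children": [
        {"name": "Felidae", "children": [{"name": "Panthera leo"}, {"name": "Felis catus"}]},
        {"name": "Canidae", "children": [{"name": "Canis lupus"}]},
    ],
}

# A complete Newick tree ends with a semicolon.
newick = to_newick(tree) + ";"
print(newick)  # (('Panthera leo','Felis catus')Felidae,('Canis lupus')Canidae)Carnivora;
```

Internal nodes keep their taxon names, so the rank structure survives the export. A purely taxonomic tree like this one carries no branch lengths.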

Version

TaxonSampler: 1.0.0
Last Updated: March 2026
License: MIT

Technology Stack

Python 3.13, Django 6.0, PostgreSQL, Redis, D3.js v5, Tabler UI, Celery 5.4, ETE3

Citation

If you use TaxonSampler in your research, please cite:

Contact

For questions, bug reports, or feature requests, please open an issue on GitHub or contact the maintainers directly.