A web platform for taxonomic sampling and genome data integration
TaxonSampler is an open-source web platform that integrates taxonomic classifications from the Catalogue of Life (COL) with genome assembly metadata from the NCBI Datasets API.
The tool addresses a common bottleneck in large-scale genomic analyses: the manual and error-prone process of selecting species with available genome assemblies that meet specific quality criteria, while maintaining taxonomic representativeness across clades of interest.
Imports and reconciles COL and NCBI taxonomies across 26+ ranks
Genome metrics: assembly level, contig N50, genome size, GC content, and annotation
Three-step guided workflow: scope, filters, and export
Export formats: JSON, TXT, Newick, and XLSX with genome metadata
Comparative genomic analyses frequently require assembling a representative subset of species from broad taxonomic groups. Two competing objectives must be balanced:
Assemblies must meet minimum standards for contiguity (contig N50), completeness (assembly level), and annotation.
Selected species should span the phylogenetic breadth of the target clade to avoid systematic sampling bias.
Performing this selection manually is time-consuming and prone to inconsistencies, particularly when hundreds or thousands of candidate species are involved. TaxonSampler automates it end to end, from taxonomy import through quality filtering to export.
The taxonomic hierarchy is constructed as a trie (prefix tree) from database records, then serialized to a D3.js-compatible JSON format.
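The trie construction and serialization can be illustrated with a minimal sketch. It is not TaxonSampler's actual code: the rank list, the record layout, and the function names `build_trie` and `to_d3` are assumptions made for illustration. The output uses the nested `name`/`children` shape that `d3.hierarchy` reads by default.

```python
import json

# Hypothetical rank order and record layout; the real schema is not given here.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_trie(records):
    """Insert each record's lineage into a nested-dict trie keyed by taxon name."""
    root = {}
    for record in records:
        node = root
        for rank in RANKS:
            name = record.get(rank)
            if not name:  # skip ranks that this record does not define
                continue
            entry = node.setdefault(name, {"rank": rank, "children": {}})
            node = entry["children"]
    return root


def to_d3(name, rank, children):
    """Convert a trie node into a {"name", "rank", "children"} dictionary."""
    node = {"name": name, "rank": rank}
    if children:  # leaves carry no "children" key
        node["children"] = [
            to_d3(child_name, child["rank"], child["children"])
            for child_name, child in sorted(children.items())
        ]
    return node


records = [
    {"kingdom": "Animalia", "family": "Hominidae", "genus": "Homo", "species": "Homo sapiens"},
    {"kingdom": "Animalia", "family": "Hominidae", "genus": "Pan", "species": "Pan troglodytes"},
    {"kingdom": "Animalia", "family": "Felidae", "genus": "Felis", "species": "Felis catus"},
]

tree = to_d3("root", "root", build_trie(records))
print(json.dumps(tree, indent=2))
```

Because lineages that share a prefix share trie nodes, each taxon appears once in the output regardless of how many species sit beneath it.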
TaxonSampler supports multiple sampling approaches:
| Strategy | Description | Use Case |
|---|---|---|
| Quality Filtering | Apply hard thresholds on genome metrics | High-quality phylogenomics |
| Taxonomic Diversity | Select N species per family/genus | Broad evolutionary surveys |
| Weighted Scoring | Rank genomes by combined quality metrics | Best-available selection |
| Hybrid | Filter first, then maximize diversity | Balanced datasets |
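
The hybrid strategy in the table combines the other three, and a short sketch may make the order of operations concrete. Everything here is an assumption made for illustration rather than TaxonSampler's implementation: the field names, the default thresholds, the score weights, and the placeholder records. Only the assembly level names follow NCBI's vocabulary.

```python
from collections import defaultdict

# Ordinal ranking of NCBI assembly levels, lowest to highest.
LEVEL_RANK = {"Contig": 0, "Scaffold": 1, "Chromosome": 2, "Complete Genome": 3}


def passes_filters(genome, min_level="Scaffold", min_contig_n50=100_000,
                   require_annotation=False):
    """Quality filtering: hard thresholds on genome metrics."""
    if LEVEL_RANK[genome["assembly_level"]] < LEVEL_RANK[min_level]:
        return False
    if genome["contig_n50"] < min_contig_n50:
        return False
    if require_annotation and not genome["annotated"]:
        return False
    return True


def score(genome, max_n50):
    """Weighted scoring: combine normalized metrics (weights are illustrative)."""
    level = LEVEL_RANK[genome["assembly_level"]] / 3
    n50 = genome["contig_n50"] / max_n50
    annotated = 1.0 if genome["annotated"] else 0.0
    return 0.4 * level + 0.4 * n50 + 0.2 * annotated


def hybrid_sample(genomes, per_family=1, **filters):
    """Hybrid: filter first, then keep the best-scoring genomes in each family."""
    kept = [g for g in genomes if passes_filters(g, **filters)]
    if not kept:
        return []
    max_n50 = max(g["contig_n50"] for g in kept)
    by_family = defaultdict(list)
    for g in kept:
        by_family[g["family"]].append(g)
    selected = []
    for family in sorted(by_family):
        ranked = sorted(by_family[family], key=lambda g: score(g, max_n50), reverse=True)
        selected.extend(ranked[:per_family])
    return selected


# Placeholder records, not real assemblies.
genomes = [
    {"species": "sp_A", "family": "Fam1", "assembly_level": "Chromosome", "contig_n50": 5_000_000, "annotated": True},
    {"species": "sp_B", "family": "Fam1", "assembly_level": "Scaffold", "contig_n50": 800_000, "annotated": False},
    {"species": "sp_C", "family": "Fam2", "assembly_level": "Contig", "contig_n50": 2_000_000, "annotated": True},
    {"species": "sp_D", "family": "Fam2", "assembly_level": "Scaffold", "contig_n50": 150_000, "annotated": False},
    {"species": "sp_E", "family": "Fam3", "assembly_level": "Scaffold", "contig_n50": 50_000, "annotated": True},
]

picked = hybrid_sample(genomes, per_family=1)
print([g["species"] for g in picked])
```

In this toy input, Fam3 drops out entirely because its only genome fails the contig N50 threshold. This is the trade-off between the two objectives: strict filters can remove whole clades, so thresholds may need loosening when representativeness matters more.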
Catalogue of Life (catalogueoflife.org): The most comprehensive and authoritative global index of species. Provides taxonomic classifications via the ChecklistBank API.
NCBI Datasets (ncbi.nlm.nih.gov/datasets): Programmatic access to genome assembly metadata from GenBank and RefSeq. Assembly statistics are updated continuously.
ETE Toolkit (etetoolkit.org): Python framework for phylogenetic tree analysis. Used for Newick format export.
If you use TaxonSampler in your research, please cite:
For questions, bug reports, or feature requests, please open an issue on GitHub or contact the maintainers directly.