Learn the fundamentals of navigating TaxonSampler and understanding its core workflow for taxonomic sampling.
In the header menu, click Taxonomy to open the interactive hierarchical tree view. This displays the complete taxonomic classification imported from the Catalogue of Life (COL), showing only species that have genome assemblies in NCBI.
Click on any node to expand its children. The tree follows the standard Linnaean hierarchy (Domain → Kingdom → Phylum → Class → Order → Family → Genus → Species). Use the search bar to locate specific taxa by scientific name, or the Advanced Search modal to build complex filter queries.
Use the Fit button to recentre the view, Collapse all to reset the tree, and the fullscreen button (bottom-right) for an expanded workspace. Pan with click-drag and zoom with the scroll wheel.
Double-click on a species node to open a detail modal showing: COL ID, NCBI Taxonomy ID (TaxID), full classification path (Domain → Species), and linked genome assemblies with accession numbers, assembly level, scaffold N50, and RefSeq category.
The Filters panel on the right side of the Taxonomy page guides you through a three-step wizard to configure and execute your sampling. Here's how each step works.
Click a node in the tree, then press "Use selected node" in the Filters panel. This defines the root taxon — all descendant species will be candidates. The scope label and species count update instantly.
Within your scope, you can add specific clades as targets with the "Add selected node" button. This restricts sampling to only species belonging to those subgroups. Target chips appear below and can be removed individually or cleared with the broom button.
Click Next to advance to Step 2.
Enter the maximum number of species in your final sample. Leave empty to include all species. The "Available" count shows how many species exist within your scope.
Use the dual slider (D–K–P–C–O–F–G–S) to choose the taxonomic range for grouping. The left handle sets the highest rank and the right handle the lowest. For example, Family → Species means quota allocation groups species by family.
Select one of four strategies: Natural order, Quality Random, Stratified Proportional (default), or Balanced Hierarchical. See Tutorial 3 for details on each.
Click "Run sampling". Results appear immediately in the Selection tab with a species table showing clade, quality score, assembly level, and more. The Clades allocation panel shows the distribution (Available / Quota / Sampled per clade).
Select Scoring (weighted composite score, picks best assembly per species) or Hard filtering (strict thresholds, removes species whose best assembly doesn't qualify).
Check/uncheck assembly levels (Complete Genome, Chromosome, Scaffold) and optionally set a minimum coverage depth. The "Assembly overview" panel updates with species count, average score, and average coverage.
Click "Apply filter & continue" to refine results, or "Skip filtering" to keep all assemblies. The Selection tab updates accordingly.
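The difference between the two filter modes can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not TaxonSampler's own code: the record fields (`accession`, `level`, `coverage`, `score`), the function names, and the sample records are all assumptions.

```python
from typing import Dict, List, Set

# Hypothetical assembly record; the field names are assumptions for illustration.
Assembly = Dict[str, object]


def scoring_mode(assemblies_by_species: Dict[str, List[Assembly]]) -> Dict[str, Assembly]:
    """Keep every species, represented by its highest-scoring assembly."""
    return {
        species: max(assemblies, key=lambda a: a["score"])
        for species, assemblies in assemblies_by_species.items()
        if assemblies
    }


def hard_filter_mode(
    assemblies_by_species: Dict[str, List[Assembly]],
    allowed_levels: Set[str],
    min_coverage: float = 0.0,
) -> Dict[str, Assembly]:
    """Drop any species whose best assembly fails the thresholds."""
    best = scoring_mode(assemblies_by_species)
    return {
        species: a
        for species, a in best.items()
        if a["level"] in allowed_levels and a["coverage"] >= min_coverage
    }


# Made-up records, only to show the two behaviours side by side.
example = {
    "Species A": [
        {"accession": "asm-1", "level": "Chromosome", "coverage": 60.0, "score": 0.9},
        {"accession": "asm-2", "level": "Scaffold", "coverage": 30.0, "score": 0.5},
    ],
    "Species B": [
        {"accession": "asm-3", "level": "Contig", "coverage": 20.0, "score": 0.4},
    ],
}
kept_by_scoring = scoring_mode(example)
kept_by_filter = hard_filter_mode(
    example, {"Complete Genome", "Chromosome", "Scaffold"}, min_coverage=25.0
)
```

In this sketch Scoring keeps both species (each with its best assembly), while Hard filtering drops Species B because its only assembly is Contig-level and below the coverage threshold.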
TaxonSampler offers four sampling strategies in Step 2. Each distributes the sample differently across the clades within your taxonomic range.
| Strategy | Description | Best for |
|---|---|---|
| Natural order | Selects species in natural accession order (as stored in the database) | Simple lists, reproducible baseline |
| Quality Random | Random selection within balanced quotas, prioritising species with higher-quality genomes | Balanced coverage with quality preference |
| Stratified Proportional | Quota per clade is proportional to clade size (species-rich clades contribute more) | Fair representation reflecting natural diversity |
| Balanced Hierarchical | Equal quota per clade regardless of size | Preventing overrepresentation of species-rich groups |
All strategies distribute quotas across clades defined by the Taxonomic range slider. For example, if your scope is Mammalia and you set the range to Family → Species, each mammalian family becomes a group that receives a quota.
Stratified Proportional assigns each clade a quota proportional to its species count (e.g., a family with 200 species gets 10× the quota of a family with 20). Balanced Hierarchical gives every clade the same quota regardless of size, so species-poor lineages are represented as well as species-rich ones.
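As a rough illustration of how the two allocations differ, here is a minimal Python sketch. The rounding rule (largest remainder) and the way spare slots are redistributed when a clade is too small are assumptions; the source does not describe how TaxonSampler handles either case, and the function names are hypothetical.

```python
from typing import Dict


def proportional_quotas(clade_sizes: Dict[str, int], total: int) -> Dict[str, int]:
    """Quota proportional to clade size, rounded with the largest-remainder rule."""
    available = sum(clade_sizes.values())
    total = min(total, available)
    if total == 0:
        return {clade: 0 for clade in clade_sizes}
    # Integer division keeps the arithmetic exact: (whole slots, remainder).
    shares = {c: divmod(total * n, available) for c, n in clade_sizes.items()}
    quotas = {c: whole for c, (whole, _) in shares.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining slots to the clades with the largest remainders.
    by_remainder = sorted(shares, key=lambda c: shares[c][1], reverse=True)
    for clade in by_remainder[:leftover]:
        quotas[clade] += 1
    return quotas


def balanced_quotas(clade_sizes: Dict[str, int], total: int) -> Dict[str, int]:
    """Equal quota per clade, capped by clade size; spare slots are redistributed."""
    quotas = {clade: 0 for clade in clade_sizes}
    remaining = min(total, sum(clade_sizes.values()))
    while remaining > 0:
        open_clades = [c for c, n in clade_sizes.items() if quotas[c] < n]
        share = max(1, remaining // len(open_clades))
        for clade in open_clades:
            if remaining == 0:
                break
            give = min(share, clade_sizes[clade] - quotas[clade], remaining)
            quotas[clade] += give
            remaining -= give
    return quotas


families = {"Large": 200, "Small": 20}
print(proportional_quotas(families, 22))  # Large gets 10x the quota of Small
print(balanced_quotas(families, 22))      # both families get the same quota
```

With a sample size of 22, the proportional sketch allocates 20 and 2, while the balanced sketch allocates 11 and 11.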
Quality Random works like Balanced Hierarchical, but within each quota it prefers species with better genome quality scores. It is ideal when you want both broad coverage and high data quality.
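One way to picture "random, but preferring higher quality" for the Quality Random strategy is weighted sampling without replacement inside a single clade. The sketch below is an assumption about how such a preference could work, not a description of TaxonSampler's algorithm; the function name, the seed parameter, and the example scores are hypothetical.

```python
import random
from typing import Dict, List


def quality_weighted_pick(scores: Dict[str, float], quota: int, seed: int = 0) -> List[str]:
    """Draw up to `quota` species without replacement, weighting draws by score."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    pool = dict(scores)
    picked: List[str] = []
    while pool and len(picked) < quota:
        names = list(pool)
        # A tiny floor keeps zero-score species drawable and the weights valid.
        weights = [max(pool[name], 1e-9) for name in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picked.append(choice)
        del pool[choice]
    return picked


# Made-up quality scores for one clade.
clade_scores = {"sp1": 0.95, "sp2": 0.80, "sp3": 0.30, "sp4": 0.10}
sample = quality_weighted_pick(clade_scores, quota=2, seed=42)
```

Every species can still be drawn, but those with higher scores are more likely to fill the quota.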
After running sampling, check the Clades allocation panel in the Selection tab. It shows Available / Quota / Sampled per clade so you can verify the distribution meets your needs.
TaxonSampler computes two quality scores for each genome assembly. Understanding them helps you interpret the results in the Selection table.
The first score is computed during sampling and ranks species by overall genome quality:
| Metric | Weight | Description |
|---|---|---|
| Assembly Level | 20% | Complete Genome > Chromosome > Scaffold > Contig |
| RefSeq Category | 15% | Reference genome > Representative > na |
| NCBI Quality Score | 20% | NCBI's own assembly quality assessment |
| Scaffold N50 | 15% | Higher N50 = better contiguity |
| Coverage | 10% | Sequencing depth |
| BUSCO Complete | 15% | Gene content completeness assessment |
| Gene Annotation | 5% | Whether genes have been annotated |
The second score is computed during assembly filtering and focuses on assembly characteristics:
| Metric | Weight | Description |
|---|---|---|
| Coverage | 35% | Sequencing depth |
| Scaffold N50 | 30% | Assembly contiguity |
| Quality Score | 35% | NCBI quality assessment |
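Both scores are weighted sums, which can be written down compactly. In the Python sketch below the weights are copied from the two tables above; the metric key names are illustrative, and the assumption that each metric has already been normalised to the range 0 to 1 (and that a missing metric counts as 0) is mine, since the normalisation is not described here.

```python
from typing import Dict

# Weights from the sampling score table; key names are illustrative.
SAMPLING_WEIGHTS: Dict[str, float] = {
    "assembly_level": 0.20,
    "refseq_category": 0.15,
    "ncbi_quality_score": 0.20,
    "scaffold_n50": 0.15,
    "coverage": 0.10,
    "busco_complete": 0.15,
    "gene_annotation": 0.05,
}

# Weights from the assembly filtering score table.
FILTER_WEIGHTS: Dict[str, float] = {
    "coverage": 0.35,
    "scaffold_n50": 0.30,
    "quality_score": 0.35,
}


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of metrics assumed to be normalised to 0..1 beforehand."""
    return sum(weight * metrics.get(name, 0.0) for name, weight in weights.items())


# An assembly that is perfect on every metric except BUSCO completeness.
perfect = {name: 1.0 for name in SAMPLING_WEIGHTS}
no_busco = dict(perfect, busco_complete=0.0)
print(round(weighted_score(no_busco, SAMPLING_WEIGHTS), 2))
```

Because each set of weights sums to 100%, a score computed this way stays between 0 and 1 as long as the inputs do.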
In the Selection table, scores are displayed with colour-coded badges.
Once you have a selection in the Selection tab, TaxonSampler offers multiple export formats and NCBI download scripts.
Full sampling configuration and results in machine-readable format. Ideal for reproducibility — you can re-import this JSON file later using the Import button to recreate the exact same selection.
Full metadata spreadsheet with species names, accession numbers, quality metrics, clade assignments, and taxonomy. Perfect for the methods section of a paper.
Plain text species list, one per line. Useful for quick sharing or as input for other tools.
Phylogenetic tree in Newick format based on the taxonomic hierarchy. Can be opened in FigTree, iTOL, or ETE3 for visualisation.
Copies the species list directly, ready to paste into documents or scripts.
TaxonSampler generates shell scripts that batch-download data from NCBI for all species in your selection:
Downloads genome assembly FASTA files for all selected accessions using the NCBI Datasets CLI.
Downloads predicted protein sequences (FASTA amino acid).
Downloads annotated GenBank flat files.
Combined script that downloads genomes, proteomes, and GenBank files in one run.
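The generated scripts themselves are not reproduced here. As a hedged sketch of what a combined download could look like, the Python below assembles an NCBI Datasets CLI command line. The `datasets download genome accession` subcommand with `--include` and `--filename` follows the NCBI Datasets documentation as I understand it; the helper name, the output file name, and the accessions are placeholders.

```python
import shlex
from typing import List, Sequence


def datasets_download_command(
    accessions: Sequence[str], include: Sequence[str], filename: str
) -> List[str]:
    """Build the argument list for `datasets download genome accession ...`."""
    return [
        "datasets", "download", "genome", "accession", *accessions,
        "--include", ",".join(include),  # e.g. genome FASTA, proteins, GenBank flat file
        "--filename", filename,          # name of the zip archive to write
    ]


# Placeholder accessions, not real assemblies.
accessions = ["GCF_000000001.1", "GCF_000000002.1"]
command = datasets_download_command(
    accessions, ["genome", "protein", "gbff"], "selection.zip"
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the `datasets` tool is installed.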
After sampling, the Newick tab (next to Selection) shows a visual tree of selected species generated with ETE3. You can download it as SVG (for figures) or Newick (for phylogenetic software).
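To show how a Newick string can be derived from a taxonomic hierarchy, here is a small pure-Python sketch. It is not the code TaxonSampler uses (which relies on ETE3): the function name and the input layout (one root-to-species list per species) are assumptions, and spaces are replaced with underscores because unquoted Newick labels cannot contain spaces.

```python
from typing import Dict, List


def lineages_to_newick(lineages: List[List[str]]) -> str:
    """Merge root-to-leaf lineages into a nested dict, then serialise it as Newick."""
    tree: Dict[str, dict] = {}
    for path in lineages:
        node = tree
        for name in path:
            node = node.setdefault(name, {})

    def render(name: str, children: Dict[str, dict]) -> str:
        label = name.replace(" ", "_")
        if not children:
            return label  # leaf: a species
        inner = ",".join(render(n, c) for n, c in children.items())
        return f"({inner}){label}"  # internal node keeps its taxon name

    roots = [render(name, children) for name, children in tree.items()]
    body = roots[0] if len(roots) == 1 else "(" + ",".join(roots) + ")"
    return body + ";"


# Toy hierarchy with generic names.
newick = lineages_to_newick([
    ["Root", "A", "a1"],
    ["Root", "A", "a2"],
    ["Root", "B", "b1"],
])
print(newick)  # ((a1,a2)A,(b1)B)Root;
```

A string like this carries internal node names, so a reader that supports them is needed. In ETE3, loading with `Tree(newick, format=1)` should preserve them.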
A complete end-to-end example: selecting high-quality beetle genomes for a comparative genomics study of Coleoptera metabolism.
You need ~100 beetle genomes for a pan-Coleoptera gene family analysis. Requirements:
Open Taxonomy from the header menu, use the search bar to find Coleoptera, and click the node to select it.
In the Filters panel, click "Use selected node". The scope updates to Coleoptera and shows the total available species. Optionally add targets for specific families of interest.
Click "Run sampling". Switch to the Selection tab to inspect results. Check the Clades allocation panel to verify families are represented.
Check the updated Selection table. If the count dropped too much, use Reset and try again with relaxed filters (e.g., allow Scaffold-level, lower coverage). You can also manually add individual species using the "Add species from scope" search box.
A dataset of ~100 high-quality beetle genomes distributed across beetle families, with full metadata for each assembly including accession numbers ready for batch download from NCBI.