Learn the fundamentals of navigating TaxonSampler and understanding its core workflow for taxonomic sampling.

Steps
1
Access the Taxonomy Tree

In the header menu, click Taxonomy to open the interactive hierarchical tree view. The tree displays the taxonomic classification imported from the Catalogue of Life (COL), restricted to species that have at least one genome assembly in NCBI.

2
Navigate the Taxonomic Hierarchy

Click on any node to expand its children. The tree follows the standard Linnaean hierarchy (Domain → Kingdom → Phylum → Class → Order → Family → Genus → Species). Use the search bar to locate specific taxa by scientific name, or the Advanced Search modal to build complex filter queries.

3
Use Tree Controls

Use the Fit button to recentre the view, Collapse all to reset the tree, and the fullscreen button (bottom-right) for an expanded workspace. Pan with click-drag and zoom with the scroll wheel.

4
View Taxon Details

Double-click on a species node to open a detail modal showing: COL ID, NCBI Taxonomy ID (TaxID), full classification path (Domain → Species), and linked genome assemblies with accession numbers, assembly level, scaffold N50, and RefSeq category.

Note: TaxonSampler only displays species with at least one genome assembly deposited in NCBI. Species without sequenced genomes are not shown in the tree.

Node labels carry coloured badges indicating COL status: A (Accepted), S (Synonym), M (Manual match).

The Filters panel on the right side of the Taxonomy page guides you through a three-step wizard to configure and execute your sampling. Here's how each step works.

Step 1 — Scope & Targets
1
Set Your Scope

Click a node in the tree, then press "Use selected node" in the Filters panel. This defines the root taxon — all descendant species will be candidates. The scope label and species count update instantly.

2
Add Targets (Optional)

Within your scope, you can add specific clades as targets with the "Add selected node" button. This restricts sampling to only species belonging to those subgroups. Target chips appear below and can be removed individually or cleared with the broom button.

3
Proceed

Click Next to advance to Step 2.


Step 2 — Sampling Configuration
4
Set Max Sample Size

Enter the maximum number of species in your final sample. Leave empty to include all species. The "Available" count shows how many species exist within your scope.

5
Adjust Taxonomic Range

Use the dual slider (D–K–P–C–O–F–G–S) to choose the taxonomic range for grouping. The left handle sets the highest rank and the right handle the lowest. For example, Family → Species means that species are grouped by family and each family receives its own quota.

6
Choose a Strategy

Select one of four strategies: Natural order, Quality Random, Stratified Proportional (default), or Balanced Hierarchical. See Tutorial 3 for details on each.

7
Run Sampling

Click "Run sampling". Results appear immediately in the Selection tab with a species table showing clade, quality score, assembly level, and more. The Clades allocation panel shows the distribution (Available / Quota / Sampled per clade).
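
The grouping implied by the taxonomic range slider can be sketched in a few lines of Python. This is only an illustration: the rank names follow the hierarchy used by the tree, but the helper functions `rank_range` and `group_by_rank` and the record layout are assumptions made for this example, not TaxonSampler's internal data model.

```python
# Ranks in slider order (D-K-P-C-O-F-G-S), highest to lowest.
RANKS = ["domain", "kingdom", "phylum", "class",
         "order", "family", "genus", "species"]


def rank_range(highest, lowest):
    """Return the ranks between the two slider handles, inclusive."""
    i, j = RANKS.index(highest), RANKS.index(lowest)
    if i > j:
        raise ValueError("the left handle must not be below the right handle")
    return RANKS[i:j + 1]


def group_by_rank(species, rank):
    """Group species names by the value of one rank in their lineage."""
    groups = {}
    for sp in species:
        groups.setdefault(sp["lineage"][rank], []).append(sp["name"])
    return groups


# Illustrative records (hypothetical layout).
species = [
    {"name": "Tribolium castaneum", "lineage": {"family": "Tenebrionidae"}},
    {"name": "Tenebrio molitor", "lineage": {"family": "Tenebrionidae"}},
    {"name": "Harmonia axyridis", "lineage": {"family": "Coccinellidae"}},
]

selected = rank_range("family", "species")    # Family -> Species
groups = group_by_rank(species, selected[0])  # one group per family
for clade, members in groups.items():
    print(clade, len(members))
```

With a Family → Species range, the highest rank in the range (family) defines the clades that later receive quotas.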


Step 3 — Assembly Filtering (Optional)
8
Choose Filtering Mode

Select Scoring (weighted composite score, picks best assembly per species) or Hard filtering (strict thresholds, removes species whose best assembly doesn't qualify).

9
Configure Filters

Check/uncheck assembly levels (Complete Genome, Chromosome, Scaffold) and optionally set a minimum coverage depth. The "Assembly overview" panel updates with species count, average score, and average coverage.

10
Apply or Skip

Click "Apply filter & continue" to refine results, or "Skip filtering" to keep all assemblies. The Selection tab updates accordingly.

Locked state: After running a sampling, scope and targets are locked. Use the Reset button in the green alert banner to start over with a new scope.
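
The difference between the two filtering modes in Step 3 can be sketched as follows. The field names (`level`, `coverage`, `score`), the assembly identifiers, and both function names are hypothetical; the sketch only mirrors the behaviour described above (Scoring keeps the best assembly per species, Hard filtering drops species whose best assembly fails a threshold) and is not TaxonSampler's actual code.

```python
def pick_best(assemblies):
    """Return the assembly with the highest composite score."""
    return max(assemblies, key=lambda a: a["score"])


def scoring_mode(species_to_assemblies):
    """Keep every species, represented by its best-scoring assembly."""
    return {sp: pick_best(asms) for sp, asms in species_to_assemblies.items()}


def hard_filter_mode(species_to_assemblies, allowed_levels, min_coverage=None):
    """Drop species whose best assembly fails any threshold."""
    kept = {}
    for sp, asms in species_to_assemblies.items():
        best = pick_best(asms)
        if best["level"] not in allowed_levels:
            continue
        if min_coverage is not None and best["coverage"] < min_coverage:
            continue
        kept[sp] = best
    return kept


# Made-up example data; identifiers are placeholders, not real accessions.
data = {
    "Species A": [
        {"id": "asm-1", "level": "Chromosome", "coverage": 45.0, "score": 0.72},
        {"id": "asm-2", "level": "Scaffold", "coverage": 80.0, "score": 0.41},
    ],
    "Species B": [
        {"id": "asm-3", "level": "Scaffold", "coverage": 60.0, "score": 0.55},
    ],
    "Species C": [
        {"id": "asm-4", "level": "Complete Genome", "coverage": 20.0, "score": 0.80},
    ],
}

print(sorted(scoring_mode(data)))
print(sorted(hard_filter_mode(data, {"Complete Genome", "Chromosome"}, 30)))
```

In this example Scoring keeps all three species, while Hard filtering keeps only Species A: Species B fails the assembly-level check and Species C fails the coverage threshold.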

TaxonSampler offers four sampling strategies in Step 2. Each distributes the sample differently across the clades within your taxonomic range.

Strategy Comparison
Strategy | Description | Best for
Natural order | Selects species in natural accession order (as stored in the database) | Simple lists, reproducible baseline
Quality Random | Random selection within balanced quotas, prioritising species with higher-quality genomes | Balanced coverage with quality preference
Stratified Proportional | Quota per clade is proportional to clade size (species-rich clades contribute more) | Fair representation reflecting natural diversity
Balanced Hierarchical | Equal quota per clade regardless of size | Preventing overrepresentation of species-rich groups
How to Choose
1
Defining the Taxonomic Range

All strategies distribute quotas across clades defined by the Taxonomic range slider. For example, if your scope is Mammalia and you set the range to Family → Species, each mammalian family becomes a group that receives a quota.

2
Proportional vs Balanced

Stratified Proportional assigns quota proportional to each clade's species count (e.g., a family with 200 species gets 10× the quota of a family with 20). Balanced Hierarchical gives each clade the same quota regardless, ensuring rare lineages are equally represented.

3
Quality Random

Like Balanced Hierarchical, it gives each clade an equal quota, but within each quota it selects randomly with a preference for species with higher genome quality scores. Ideal when you want both broad coverage and high data quality.

4
Review Allocation

After running sampling, check the Clades allocation panel in the Selection tab. It shows Available / Quota / Sampled per clade so you can verify the distribution meets your needs.

Tip: Start with Stratified Proportional (the default). If you find that species-rich clades dominate your sample, switch to Balanced Hierarchical for equal representation.
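
To make the difference between proportional and balanced quotas concrete, here is a minimal sketch. It assumes a largest-remainder rule for rounding proportional quotas and a round-robin rule (capped at clade size) for balanced quotas; TaxonSampler's actual rounding and tie-breaking are not documented here and may differ. The function names and clade names are hypothetical.

```python
def proportional_quotas(clade_sizes, sample_size):
    """Quota proportional to clade size (largest-remainder rounding)."""
    total = sum(clade_sizes.values())
    if total == 0:
        return {c: 0 for c in clade_sizes}
    sample_size = min(sample_size, total)
    exact = {c: sample_size * n / total for c, n in clade_sizes.items()}
    quotas = {c: int(x) for c, x in exact.items()}
    leftover = sample_size - sum(quotas.values())
    # Hand out the remaining slots to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_fraction[:leftover]:
        quotas[c] += 1
    return quotas


def balanced_quotas(clade_sizes, sample_size):
    """Equal quota per clade, capped at the number of available species."""
    quotas = {c: 0 for c in clade_sizes}
    remaining = min(sample_size, sum(clade_sizes.values()))
    while remaining > 0:
        for c, n in clade_sizes.items():
            if remaining == 0:
                break
            if quotas[c] < n:  # skip clades that are already exhausted
                quotas[c] += 1
                remaining -= 1
    return quotas


sizes = {"Family A": 200, "Family B": 20, "Family C": 5}
print(proportional_quotas(sizes, 45))  # {'Family A': 40, 'Family B': 4, 'Family C': 1}
print(balanced_quotas(sizes, 45))      # {'Family A': 20, 'Family B': 20, 'Family C': 5}
```

In the proportional result, the 200-species family receives 10× the quota of the 20-species family, as described above. In the balanced result, the smallest family is capped at its 5 available species and the unused slots are shared equally between the other two.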

TaxonSampler computes two quality scores for each genome assembly. Understanding them helps you interpret the results in the Selection table.

Video: Understanding Quality Scores (~7 min)
Quality Score (Step 2)

Computed during sampling, this score ranks species by overall genome quality:

Metric | Weight | Description
Assembly Level | 20% | Complete Genome > Chromosome > Scaffold > Contig
RefSeq Category | 15% | Reference genome > Representative > na
NCBI Quality Score | 20% | NCBI's own assembly quality assessment
Scaffold N50 | 15% | Higher N50 = better contiguity
Coverage | 10% | Sequencing depth
BUSCO Complete | 15% | Gene content completeness assessment
Gene Annotation | 5% | Whether genes have been annotated
Assembly Score (Step 3)

Computed during assembly filtering, this score focuses on assembly characteristics:

Metric | Weight | Description
Coverage | 35% | Sequencing depth
Scaffold N50 | 30% | Assembly contiguity
NCBI Quality Score | 35% | NCBI quality assessment
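
Both scores are weighted sums, which the following sketch illustrates. The weights are taken from the two tables above. Everything else is an assumption for illustration: the metric key names are hypothetical, and each metric is assumed to have already been normalised to the range 0 to 1 (how TaxonSampler normalises raw values such as N50 or coverage is not described here).

```python
# Weights from the Quality Score (Step 2) table.
QUALITY_WEIGHTS = {
    "assembly_level": 0.20,
    "refseq_category": 0.15,
    "ncbi_quality": 0.20,
    "scaffold_n50": 0.15,
    "coverage": 0.10,
    "busco_complete": 0.15,
    "gene_annotation": 0.05,
}

# Weights from the Assembly Score (Step 3) table.
ASSEMBLY_WEIGHTS = {
    "coverage": 0.35,
    "scaffold_n50": 0.30,
    "ncbi_quality": 0.35,
}


def weighted_score(metrics, weights):
    """Weighted sum of metrics that are assumed to be normalised to 0..1.

    Metrics missing from the input count as 0 (an assumption).
    """
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items())


# Hypothetical, already-normalised metric values for one assembly.
metrics = {"coverage": 0.5, "scaffold_n50": 1.0, "ncbi_quality": 0.0}
print(round(weighted_score(metrics, ASSEMBLY_WEIGHTS), 3))  # 0.475
```

Because each set of weights sums to 100%, an assembly with every metric at its maximum scores 1.0 under this scheme.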
Score Colour Coding

In the Selection table, scores are displayed with colour-coded badges:

≥ 0.6: High
≥ 0.3 and < 0.6: Medium
< 0.3: Low
Tip: The Quality Random strategy uses the Quality Score to preferentially pick species with better genomes. If you need the highest-quality assemblies regardless of taxonomic balance, sort by Quality in the Selection table after sampling.
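
If you post-process an exported table, the badge thresholds can be reproduced with a small helper. The function name `score_badge` is hypothetical; only the thresholds (0.6 and 0.3) come from the colour coding described above.

```python
def score_badge(score):
    """Map a score in 0..1 to the badge label used in the Selection table."""
    if score >= 0.6:
        return "High"
    if score >= 0.3:
        return "Medium"
    return "Low"


for value in (0.82, 0.45, 0.10):
    print(value, score_badge(value))
```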

Once you have a selection in the Selection tab, TaxonSampler offers multiple export formats and NCBI download scripts.

Video: Exporting Results & Downloading Genomes (~6 min)
Export Formats
1
JSON

Full sampling configuration and results in machine-readable format. Ideal for reproducibility — you can re-import this JSON file later using the Import button to recreate the exact same selection.

2
Excel (.xlsx)

Full metadata spreadsheet with species names, accession numbers, quality metrics, clade assignments, and taxonomy. Perfect for the methods section of a paper.

3
List (.txt)

Plain text species list, one per line. Useful for quick sharing or as input for other tools.

4
Newick (species)

Tree in Newick format whose topology follows the taxonomic hierarchy (a taxonomy-based tree rather than one inferred from sequence data). Can be opened in FigTree, iTOL, or ETE3 for visualisation.

5
Copy to Clipboard

Copies the species list directly, ready to paste into documents or scripts.
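
To show what a taxonomy-based Newick tree looks like, the sketch below builds one from a list of lineages using plain Python. It is an illustration only: the functions `build_tree` and `to_newick` are hypothetical, the output has no branch lengths, and the exact labels and node ordering in TaxonSampler's own export may differ. Spaces in names are replaced by underscores, the usual convention for unquoted Newick labels.

```python
def build_tree(lineages):
    """Turn lineages (lists of names, root first) into nested dicts."""
    root = {}
    for lineage in lineages:
        node = root
        for name in lineage:
            node = node.setdefault(name, {})
    return root


def render(name, children):
    """Render one node and its descendants as a Newick fragment."""
    label = name.replace(" ", "_")
    if not children:
        return label
    inner = ",".join(render(n, c) for n, c in sorted(children.items()))
    return f"({inner}){label}"


def to_newick(tree):
    """Serialise the nested dicts as a Newick string with internal labels."""
    parts = [render(n, c) for n, c in sorted(tree.items())]
    body = parts[0] if len(parts) == 1 else "(" + ",".join(parts) + ")"
    return body + ";"


lineages = [
    ["Coleoptera", "Tenebrionidae", "Tribolium castaneum"],
    ["Coleoptera", "Tenebrionidae", "Tenebrio molitor"],
    ["Coleoptera", "Coccinellidae", "Harmonia axyridis"],
]
print(to_newick(build_tree(lineages)))
# ((Harmonia_axyridis)Coccinellidae,(Tenebrio_molitor,Tribolium_castaneum)Tenebrionidae)Coleoptera;
```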

NCBI Download Scripts

TaxonSampler generates shell scripts that batch-download data from NCBI for all species in your selection:

Genomes (FASTA)

Downloads genome assembly FASTA files for all selected accessions using the NCBI Datasets command-line tool (datasets).

Proteomes (FAA)

Downloads predicted protein sequences in amino-acid FASTA format (.faa).

GenBank (GBFF)

Downloads annotated GenBank flat files.

All Data

Combined script that downloads genomes, proteomes, and GenBank files in one run.
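
For orientation, the sketch below assembles the kind of NCBI Datasets command such a script could run. It is an assumption about what the generated scripts might contain, not a copy of them. It relies on the `datasets download genome accession` subcommand with its `--inputfile`, `--include`, and `--filename` options; the file names and the `INCLUDE` mapping are hypothetical. The code only builds and prints the command, so nothing is downloaded.

```python
import shlex

# Hypothetical mapping from script type to `--include` values.
INCLUDE = {
    "genomes": ["genome"],                   # genomic FASTA
    "proteomes": ["protein"],                # protein FASTA
    "genbank": ["gbff"],                     # GenBank flat files
    "all": ["genome", "protein", "gbff"],
}


def datasets_command(accession_file, kind, out_zip):
    """Build the argument list for one batch download.

    accession_file: text file with one assembly accession per line.
    kind: one of the keys of INCLUDE.
    out_zip: name of the zip archive that datasets will write.
    """
    return [
        "datasets", "download", "genome", "accession",
        "--inputfile", accession_file,
        "--include", ",".join(INCLUDE[kind]),
        "--filename", out_zip,
    ]


cmd = datasets_command("accessions.txt", "all", "beetles.zip")
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True)
# on a machine where the datasets tool is installed.
```

Building the command as a list rather than a single string avoids shell-quoting problems if a file name contains spaces.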

Newick Tab

After sampling, the Newick tab (next to Selection) shows a visual tree of selected species generated with ETE3. You can download it as SVG (for figures) or Newick (for phylogenetic software).

Reproducibility: Always export the JSON configuration alongside your data files. This captures your exact scope, targets, strategy, taxonomic range, and filters — enabling anyone to reproduce your sampling.

A complete end-to-end example: selecting high-quality beetle genomes for a comparative genomics study of Coleoptera metabolism.

Video: Complete Research Workflow: Coleoptera Case Study (~15 min)
Scenario

You need ~100 beetle genomes for a pan-Coleoptera gene family analysis. Requirements:

  • Chromosome-level or complete assemblies preferred
  • Minimum sequencing coverage 30×
  • Broad representation across beetle families
  • Annotated genomes preferred
Workflow
1
Navigate to Coleoptera

In the header menu, click Taxonomy, use the search bar to find Coleoptera, and click the node to select it.

2
Set as Scope (Step 1)

In the Filters panel, click "Use selected node". The scope updates to Coleoptera and shows the total available species. Optionally add targets for specific families of interest.

3
Configure Sampling (Step 2)
  • Max sample size: 100
  • Taxonomic range: Family → Species (drag handles on the slider)
  • Strategy: Stratified Proportional (each family contributes in proportion to its species count; switch to Balanced Hierarchical if large families dominate)
4
Run Sampling

Click "Run sampling". Switch to the Selection tab to inspect results. Check the Clades allocation panel to verify families are represented.

5
Apply Assembly Filters (Step 3)
  • Uncheck Scaffold (keep only Complete Genome and Chromosome)
  • Set Min. coverage: 30
  • Click "Apply filter & continue"
6
Review and Adjust

Check the updated Selection table. If the count dropped too much, use Reset and try again with relaxed filters (e.g., allow Scaffold-level, lower coverage). You can also manually add individual species using the "Add species from scope" search box.

7
Export
  • Excel: Full metadata for the methods section
  • Newick: Phylogenetic tree for supplementary figures
  • JSON: Sampling configuration for reproducibility
  • Download script (Genomes): Batch-download all FASTA files from NCBI
Reproducibility: Include the exported JSON file as supplementary material in your publication. Anyone can import it into TaxonSampler to reproduce your exact dataset selection.
Expected Output

A dataset of ~100 high-quality beetle genomes distributed across beetle families, with full metadata for each assembly including accession numbers ready for batch download from NCBI.