Tutorials

Learn the fundamentals of navigating TaxonSampler and understanding its core workflow for taxonomic sampling.

Steps

1. Access the Taxonomy Tree

In the header menu, click Taxonomy to open the interactive hierarchical tree view. The tree displays the taxonomic classification imported from the Catalogue of Life (COL), restricted to species that have at least one genome assembly in NCBI.

2. Navigate the Taxonomic Hierarchy

Click on any node to expand its children. The tree follows the standard Linnaean hierarchy (Domain → Kingdom → Phylum → Class → Order → Family → Genus → Species). Use the search bar to locate specific taxa by scientific name, or the Advanced Search modal to build complex filter queries.

3. Use Tree Controls

Use the Fit button to recentre the view, Collapse all to reset the tree, and the fullscreen button (bottom-right) for an expanded workspace. Pan with click-drag and zoom with the scroll wheel.

4. View Taxon Details

Double-click on a species node to open a detail modal showing: COL ID, NCBI Taxonomy ID (TaxID), full classification path (Domain → Species), and linked genome assemblies with accession numbers, assembly level, scaffold N50, and RefSeq category.

Note: TaxonSampler displays only species with at least one genome assembly deposited in NCBI; species without sequenced genomes do not appear in the tree. Node labels carry coloured badges indicating COL status: A (Accepted), S (Synonym), M (Manual match).

The Filters panel on the right side of the Taxonomy page guides you through a three-step wizard to configure and execute your sampling. Here's how each step works.

Step 1 — Scope & Targets

1. Set Your Scope

Click a node in the tree, then press "Use selected node" in the Filters panel. This defines the root taxon — all descendant species will be candidates. The scope label and species count update instantly.

2. Add Targets (Optional)

Within your scope, you can add specific clades as targets with the "Add selected node" button. This restricts sampling to only species belonging to those subgroups. Target chips appear below and can be removed individually or cleared with the broom button.

3. Proceed

Click Next to advance to Step 2.
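The effect of scope and targets on the candidate pool can be sketched in a few lines of Python. This is a minimal illustration, not TaxonSampler's implementation: the `Species` record, its `lineage` layout, and the `candidates` function are hypothetical names, and the three species are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    name: str
    lineage: tuple  # taxon names from Domain down to Genus (hypothetical layout)


def candidates(species, scope, targets=()):
    """Return species under `scope`; if targets are given, keep only those under a target."""
    in_scope = [s for s in species if scope in s.lineage]
    if not targets:
        return in_scope
    return [s for s in in_scope if any(t in s.lineage for t in targets)]


INSECTA = ("Eukaryota", "Animalia", "Arthropoda", "Insecta")
pool = [
    Species("Tribolium castaneum", INSECTA + ("Coleoptera", "Tenebrionidae", "Tribolium")),
    Species("Dendroctonus ponderosae", INSECTA + ("Coleoptera", "Curculionidae", "Dendroctonus")),
    Species("Apis mellifera", INSECTA + ("Hymenoptera", "Apidae", "Apis")),
]

scope_only = candidates(pool, "Coleoptera")
with_target = candidates(pool, "Coleoptera", targets=["Curculionidae"])
```

With the scope alone, both beetles are candidates; adding Curculionidae as a target narrows the pool to the one species in that family.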

Step 2 — Sampling Configuration

4. Set Max Sample Size

Enter the maximum number of species in your final sample. Leave empty to include all species. The "Available" count shows how many species exist within your scope.

5. Adjust Taxonomic Range

Use the dual slider (D–K–P–C–O–F–G–S, one letter per rank from Domain to Species) to choose the taxonomic range for grouping. The left handle sets the highest rank and the right handle the lowest. For example, Family → Species groups species by family, so each family receives its own quota.

6. Choose a Strategy

Select one of four strategies: Natural order, Quality Random, Stratified Proportional (default), or Balanced Hierarchical. See Tutorial 3 (Sampling Strategies) for details on each.

7. Run Sampling

Click "Run sampling". Results appear immediately in the Selection tab with a species table showing clade, quality score, assembly level, and more. The Clades allocation panel shows the distribution (Available / Quota / Sampled per clade).
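To make the grouping behind the taxonomic range concrete, the sketch below counts the species available per clade at the rank chosen with the left slider handle. It assumes each species carries a full eight-rank lineage tuple; `RANKS`, `clade_sizes`, and the toy lineages are illustrative names, not part of TaxonSampler.

```python
from collections import Counter

RANKS = ["Domain", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def clade_sizes(lineages, group_rank):
    """Count available species per clade at `group_rank` (the left slider handle)."""
    idx = RANKS.index(group_rank)
    return Counter(lineage[idx] for lineage in lineages)


CARNIVORA = ("Eukaryota", "Animalia", "Chordata", "Mammalia", "Carnivora")
lineages = [
    CARNIVORA + ("Felidae", "Felis", "Felis catus"),
    CARNIVORA + ("Felidae", "Panthera", "Panthera leo"),
    CARNIVORA + ("Canidae", "Canis", "Canis lupus"),
]

available = clade_sizes(lineages, "Family")
```

Here `available` holds two species for Felidae and one for Canidae, which corresponds to the Available column of the Clades allocation panel for a Family → Species range.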

Step 3 — Assembly Filtering (Optional)

8. Choose Filtering Mode

Select Scoring (weighted composite score, picks best assembly per species) or Hard filtering (strict thresholds, removes species whose best assembly doesn't qualify).

9. Configure Filters

Check/uncheck assembly levels (Complete Genome, Chromosome, Scaffold) and optionally set a minimum coverage depth. The "Assembly overview" panel updates with species count, average score, and average coverage.

10. Apply or Skip

Click "Apply filter & continue" to refine results, or "Skip filtering" to keep all assemblies. The Selection tab updates accordingly.

Locked state: After you run sampling, the scope and targets are locked. To start over with a new scope, use the Reset button in the green alert banner.
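The difference between the two filtering modes in Step 3 can be sketched as follows. This is an illustration under assumptions: the record keys (`id`, `level`, `coverage`, `score`) and both function names are hypothetical, and the data is invented for the example.

```python
def pick_best(assemblies_by_species, score):
    """Scoring mode: keep the highest-scoring assembly for every species."""
    return {name: max(asms, key=score) for name, asms in assemblies_by_species.items()}


def hard_filter(best_by_species, allowed_levels, min_coverage=None):
    """Hard filtering: drop species whose best assembly fails any threshold."""
    kept = {}
    for name, asm in best_by_species.items():
        if asm["level"] not in allowed_levels:
            continue
        if min_coverage is not None and asm["coverage"] < min_coverage:
            continue
        kept[name] = asm
    return kept


data = {
    "Species A": [
        {"id": "a1", "level": "Scaffold", "coverage": 80.0, "score": 0.41},
        {"id": "a2", "level": "Chromosome", "coverage": 45.0, "score": 0.72},
    ],
    "Species B": [
        {"id": "b1", "level": "Scaffold", "coverage": 25.0, "score": 0.35},
    ],
}

best = pick_best(data, score=lambda asm: asm["score"])
kept = hard_filter(best, {"Complete Genome", "Chromosome"}, min_coverage=30)
```

Scoring keeps both species and only chooses which assembly represents each one. Hard filtering then removes Species B, because its best assembly is Scaffold-level and below the coverage threshold.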

TaxonSampler offers four sampling strategies in Step 2. Each distributes the sample differently across the clades within your taxonomic range.

Strategy Comparison

Strategy	Description	Best for
Natural order	Selects species in natural accession order (as stored in the database)	Simple lists, reproducible baseline
Quality Random	Random selection within balanced quotas, prioritising species with higher-quality genomes	Balanced coverage with quality preference
Stratified Proportional	Quota per clade is proportional to clade size (species-rich clades contribute more)	Fair representation reflecting natural diversity
Balanced Hierarchical	Equal quota per clade regardless of size	Preventing overrepresentation of species-rich groups
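The two quota-based rows of the table can be illustrated with a short sketch. The rounding rule (largest remainder) and the way leftover slots are handed out are assumptions made for this example; the source does not say how TaxonSampler rounds quotas. The sketch also assumes the sample size does not exceed the number of available species.

```python
def proportional_quotas(available, sample_size):
    """Quota proportional to clade size, using largest-remainder rounding."""
    total = sum(available.values())
    exact = {c: sample_size * n / total for c, n in available.items()}
    quotas = {c: int(x) for c, x in exact.items()}
    leftover = sample_size - sum(quotas.values())
    # Hand remaining slots to the clades with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_fraction[:leftover]:
        quotas[c] += 1
    return quotas


def balanced_quotas(available, sample_size):
    """Equal quota per clade, capped by the number of species the clade has."""
    base, extra = divmod(sample_size, len(available))
    quotas = {}
    for i, (c, n) in enumerate(available.items()):
        # The first `extra` clades take one additional slot; capped slots are not redistributed.
        quotas[c] = min(n, base + (1 if i < extra else 0))
    return quotas


available = {"Family A": 200, "Family B": 20, "Family C": 20}
proportional = proportional_quotas(available, 24)
balanced = balanced_quotas(available, 24)
```

For a sample of 24, the proportional rule gives Family A 20 slots and the two small families 2 each, while the balanced rule gives every family 8.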

How to Choose

1. Defining the Taxonomic Range

All strategies distribute quotas across clades defined by the Taxonomic range slider. For example, if your scope is Mammalia and you set the range to Family → Species, each mammalian family becomes a group that receives a quota.

2. Proportional vs Balanced

Stratified Proportional assigns quota proportional to each clade's species count (e.g., a family with 200 species gets 10× the quota of a family with 20). Balanced Hierarchical gives each clade the same quota regardless, ensuring rare lineages are equally represented.

3. Quality Random

Like Balanced Hierarchical, Quality Random gives each clade an equal quota, but within each quota it prefers species with better genome quality scores. Ideal when you want both broad coverage and high data quality.

4. Review Allocation

After running sampling, check the Clades allocation panel in the Selection tab. It shows Available / Quota / Sampled per clade so you can verify the distribution meets your needs.

Tip: Start with Stratified Proportional (the default). If you find that species-rich clades dominate your sample, switch to Balanced Hierarchical for equal representation.
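One way to read "random selection that prefers higher-quality genomes" is weighted sampling without replacement inside each clade's quota. The sketch below uses the Efraimidis-Spirakis method for that; it is a substitute technique chosen for illustration, since the source does not describe how Quality Random draws its sample. The function name and scores are hypothetical.

```python
import random


def quality_weighted_pick(scores, quota, rng):
    """Pick up to `quota` species without replacement, favouring higher scores.

    Each species gets the key u ** (1 / weight), with u uniform in [0, 1);
    the species with the largest keys are selected.
    """
    keyed = []
    for name, score in scores.items():
        weight = max(score, 1e-9)  # avoid division by zero for a score of 0
        keyed.append((rng.random() ** (1.0 / weight), name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:quota]]


clade_scores = {"sp1": 0.9, "sp2": 0.7, "sp3": 0.5, "sp4": 0.1}
picked = quality_weighted_pick(clade_scores, quota=2, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the draw repeatable. Low-scoring species can still be picked, only less often than high-scoring ones.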

TaxonSampler computes two quality scores for each genome assembly. Understanding them helps you interpret the results in the Selection table.

Video: Understanding Quality Scores (~7 min)

Quality Score (Step 2)

Computed during sampling, this score ranks species by overall genome quality:

Metric	Weight	Description
Assembly Level	20%	Complete Genome > Chromosome > Scaffold > Contig
RefSeq Category	15%	Reference genome > Representative > na
NCBI Quality Score	20%	NCBI's own assembly quality assessment
Scaffold N50	15%	Higher N50 = better contiguity
Coverage	10%	Sequencing depth
BUSCO Complete	15%	Gene content completeness assessment
Gene Annotation	5%	Whether genes have been annotated
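The weights in the table combine into a single score as a weighted sum. The sketch below assumes every metric has already been normalised to the range 0 to 1; how TaxonSampler normalises values such as Scaffold N50 or coverage is not stated here, and the dictionary keys are hypothetical names.

```python
QUALITY_WEIGHTS = {
    "assembly_level": 0.20,
    "refseq_category": 0.15,
    "ncbi_quality": 0.20,
    "scaffold_n50": 0.15,
    "coverage": 0.10,
    "busco_complete": 0.15,
    "gene_annotation": 0.05,
}


def weighted_score(metrics, weights):
    """Weighted sum of metrics normalised to [0, 1]; missing metrics count as 0."""
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items())


perfect = weighted_score({name: 1.0 for name in QUALITY_WEIGHTS}, QUALITY_WEIGHTS)
level_only = weighted_score({"assembly_level": 1.0}, QUALITY_WEIGHTS)
```

Because the weights sum to 100%, an assembly that is perfect on every metric scores 1.0, and one that is perfect only on Assembly Level scores 0.20.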

Assembly Score (Step 3)

Computed during assembly filtering, this score focuses on assembly characteristics:

Metric	Weight	Description
Coverage	35%	Sequencing depth
Scaffold N50	30%	Assembly contiguity
NCBI Quality Score	35%	NCBI's own assembly quality assessment
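As a worked example of the Assembly Score weights, the sketch below scores one hypothetical assembly. It assumes all three inputs are already normalised to the range 0 to 1; the function name and the input values are invented for illustration.

```python
def assembly_score(coverage, scaffold_n50, ncbi_quality):
    """Weighted sum with the Step 3 weights: 35% coverage, 30% N50, 35% NCBI quality."""
    return 0.35 * coverage + 0.30 * scaffold_n50 + 0.35 * ncbi_quality


# 0.35 * 0.8 + 0.30 * 0.5 + 0.35 * 0.9 = 0.28 + 0.15 + 0.315
example = assembly_score(coverage=0.8, scaffold_n50=0.5, ncbi_quality=0.9)
```

The result is approximately 0.745, which would fall in the High band of the colour coding.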

Score Colour Coding

In the Selection table, scores are displayed with colour-coded badges:

≥ 0.6: High
0.3 to < 0.6: Medium
< 0.3: Low
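The badge thresholds translate directly into a small helper, sketched here with a hypothetical function name:

```python
def score_badge(score):
    """Map a score to its badge label using the 0.6 and 0.3 thresholds."""
    if score >= 0.6:
        return "High"
    if score >= 0.3:
        return "Medium"
    return "Low"


labels = [score_badge(s) for s in (0.82, 0.6, 0.45, 0.3, 0.12)]
```

Scores exactly on a threshold take the higher band, so 0.6 is High and 0.3 is Medium.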

Tip: The Quality Random strategy uses the Quality Score to preferentially pick species with better genomes. If you need the highest-quality assemblies regardless of taxonomic balance, sort by Quality in the Selection table after sampling.

Once you have a selection in the Selection tab, TaxonSampler offers multiple export formats and NCBI download scripts.

Video: Exporting Results & Downloading Genomes (~6 min)

Export Formats

1. JSON

Full sampling configuration and results in machine-readable format. Ideal for reproducibility — you can re-import this JSON file later using the Import button to recreate the exact same selection.

2. Excel (.xlsx)

Full metadata spreadsheet with species names, accession numbers, quality metrics, clade assignments, and taxonomy. Perfect for the methods section of a paper.

3. List (.txt)

Plain text species list, one per line. Useful for quick sharing or as input for other tools.

4. Newick (species)

Phylogenetic tree in Newick format based on the taxonomic hierarchy. Can be opened in FigTree, iTOL, or ETE3 for visualisation.

5. Copy to Clipboard

Copies the species list directly, ready to paste into documents or scripts.
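When the List (.txt) export feeds another tool, reading it back takes only a few lines. The sketch below relies only on the one-species-per-line layout described above; the function name and sample text are illustrative.

```python
def read_species_list(text):
    """Parse a one-species-per-line list, trimming whitespace and skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


exported = "Tribolium castaneum\nDendroctonus ponderosae\n\n"
species = read_species_list(exported)
```

For a file on disk, pass the file's contents, for example `read_species_list(open(path, encoding="utf-8").read())`.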

NCBI Download Scripts

TaxonSampler generates shell scripts that batch-download data from NCBI for all species in your selection:

Genomes (FASTA)

Downloads genome assembly FASTA files for all selected accessions using the NCBI Datasets command-line tool (datasets).

Proteomes (FAA)

Downloads predicted protein sequences (FASTA amino acid).

GenBank (GBFF)

Downloads annotated GenBank flat files.

All Data

Combined script that downloads genomes, proteomes, and GenBank files in one run.
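For orientation, the sketch below builds a command line for the NCBI Datasets CLI that matches the four script types. The scripts TaxonSampler actually generates may look different; the mapping names, the function, and the file names are assumptions. The flags follow the `datasets download genome accession` subcommand and are worth confirming with its `--help` output for your installed version.

```python
import shlex

# Script type -> value for the CLI's --include option.
INCLUDE = {"genomes": "genome", "proteomes": "protein", "genbank": "gbff"}


def datasets_command(accessions_file, kinds, out_zip="ncbi_dataset.zip"):
    """Build one `datasets` call that reads accessions from a file, one per line."""
    include = ",".join(INCLUDE[k] for k in kinds)
    args = [
        "datasets", "download", "genome", "accession",
        "--inputfile", accessions_file,
        "--include", include,
        "--filename", out_zip,
    ]
    return shlex.join(args)


genomes_only = datasets_command("accessions.txt", ["genomes"])
all_data = datasets_command("accessions.txt", ["genomes", "proteomes", "genbank"])
```

The "All Data" case simply requests all three file types in a single download.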

Newick Tab

After sampling, the Newick tab (next to Selection) shows a visual tree of selected species generated with ETE3. You can download it as SVG (for figures) or Newick (for phylogenetic software).
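A Newick string derived from a taxonomic hierarchy can be sketched without any external library. TaxonSampler itself uses ETE3 for its tree; the functions and the shortened lineages below are a hand-rolled illustration of the format, not the application's code.

```python
def build_tree(lineages):
    """Nest lineages (root-to-species name tuples) into a dict of dicts."""
    root = {}
    for lineage in lineages:
        node = root
        for name in lineage:
            node = node.setdefault(name, {})
    return root


def to_newick(tree):
    """Serialise nested dicts as Newick, labelling internal nodes and leaves."""
    parts = []
    for name, children in tree.items():
        label = name.replace(" ", "_")  # unquoted Newick labels cannot contain spaces
        parts.append(f"({to_newick(children)}){label}" if children else label)
    return ",".join(parts)


lineages = [
    ("Coleoptera", "Tenebrionidae", "Tribolium castaneum"),
    ("Coleoptera", "Curculionidae", "Dendroctonus ponderosae"),
]
newick = to_newick(build_tree(lineages)) + ";"
```

The resulting string is `((Tribolium_castaneum)Tenebrionidae,(Dendroctonus_ponderosae)Curculionidae)Coleoptera;`. A tree built this way carries topology only, with no branch lengths.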

Reproducibility: Always export the JSON configuration alongside your data files. This captures your exact scope, targets, strategy, taxonomic range, and filters — enabling anyone to reproduce your sampling.

A complete end-to-end example: selecting high-quality beetle genomes for a comparative genomics study of Coleoptera metabolism.

Video: Complete Research Workflow, Coleoptera Case Study (~15 min)

Scenario

You need ~100 beetle genomes for a pan-Coleoptera gene family analysis. Requirements:

Chromosome-level or complete assemblies preferred
Minimum sequencing coverage 30×
Broad representation across beetle families
Annotated genomes preferred

Workflow

1. Navigate to Coleoptera

Open Taxonomy from the header menu, use the search bar to find Coleoptera, and click the node to select it.

2. Set as Scope (Step 1)

In the Filters panel, click "Use selected node". The scope updates to Coleoptera and shows the total available species. Optionally add targets for specific families of interest.

3. Configure Sampling (Step 2)

Max sample size: 100
Taxonomic range: Family → Species (drag handles on the slider)
Strategy: Stratified Proportional (to fairly represent each family)

4. Run Sampling

Click "Run sampling". Switch to the Selection tab to inspect results. Check the Clades allocation panel to verify families are represented.

5. Apply Assembly Filters (Step 3)

Uncheck Scaffold (keep only Complete Genome and Chromosome)
Set Min. coverage: 30
Click "Apply filter & continue"

6. Review and Adjust

Check the updated Selection table. If the count dropped too much, use Reset and try again with relaxed filters (e.g., allow Scaffold-level, lower coverage). You can also manually add individual species using the "Add species from scope" search box.

7. Export

Excel: Full metadata for the methods section
Newick: Phylogenetic tree for supplementary figures
JSON: Sampling configuration for reproducibility
Download script (Genomes): Batch-download all FASTA files from NCBI

Reproducibility: Include the exported JSON file as supplementary material in your publication. Anyone can import it into TaxonSampler to reproduce your exact dataset selection.

Expected Output

A dataset of ~100 high-quality beetle genomes distributed across beetle families, with full metadata for each assembly including accession numbers ready for batch download from NCBI.

Contents

1. Getting Started with TaxonSampler (Beginner)
2. The 3-Step Sampling Wizard (Beginner)
3. Sampling Strategies (Intermediate)
4. Understanding Quality Scores (Intermediate)
5. Export & Download (Beginner)
6. Full Research Workflow Example (Advanced)