tidyclust 0.3.0

I’m thrilled to announce that tidyclust 0.3.0 is now on CRAN. tidyclust allows you to fit, interact with, and evaluate unsupervised clustering models inside the tidymodels framework. You can install it with:

install.packages("tidyclust")

This release brings three new families of clustering models, closer alignment with the rest of tidymodels, and a number of smaller improvements. This post covers the highlights: new models and engines, tighter integration with tidymodels, and updated parallel processing support. You can read the full list of changes in the release notes.

To get started, load the tidymodels and tidyclust packages:

library(tidymodels)
library(tidyclust)

For the examples, we will use the penguins data set. We impute missing values and normalize the predictors because distance- and density-based methods are sensitive to scale and missingness.

data(penguins, package = "modeldata")

rec_spec <- recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                   data = penguins) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

New models and engines

Until now, tidyclust supported partition-based methods (k_means()) and hierarchical clustering (hier_clust()). This release expands that considerably with three new model specifications.

db_clust() fits density-based clustering models (DBSCAN), with engines for both "dbscan" and "hdbscan". Density-based methods are well suited to finding clusters of arbitrary shape and identifying outliers as noise.

To use tidyclust, we first make a specification for the clustering method, combine it with the recipe, and fit the workflow as usual.

db_spec <- db_clust(radius = 0.8, min_points = 5) |>
  set_engine("dbscan")

db_fit <- workflow(rec_spec, db_spec) |>
  fit(data = penguins)

db_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: db_clust()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_impute_mean()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
DBSCAN clustering for 344 objects.
Parameters: eps = 0.8, minPts = 5
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 5 noise points.

  0   1   2 
  5 218 121 

Available fields: cluster, eps, minPts, metric, borderPoints

Notice the 5 noise points, listed under 0 in the output above (the Outlier level): points in low-density regions aren’t forced into a cluster. This is also the first of the new models where you don’t specify the number of clusters up front.
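To see how the observations split between clusters and noise, one option is to tabulate the cluster assignments. This is a minimal sketch that refits the model so it runs on its own. It assumes extract_cluster_assignment() works on a db_clust() workflow the same way it does for the other model types, and it makes no claim about the exact labels that appear in .cluster.

```r
library(tidymodels)
library(tidyclust)

data(penguins, package = "modeldata")

# Same preprocessing as in the post: impute, then normalize
rec_spec <- recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                   data = penguins) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

db_spec <- db_clust(radius = 0.8, min_points = 5) |>
  set_engine("dbscan")

db_fit <- workflow(rec_spec, db_spec) |>
  fit(data = penguins)

# Assumption: extract_cluster_assignment() has a method for db_clust() fits
db_assignments <- extract_cluster_assignment(db_fit)

# One row per label, including whatever label the noise points receive
count(db_assignments, .cluster)
```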

gm_clust() fits Gaussian mixture models via the "mclust" engine. Rather than assigning each observation to a single cluster, mixture models describe the data as a combination of Gaussian components.

gm_spec <- gm_clust(num_clusters = 3) |> 
  set_engine("mclust")

gm_fit <- workflow(rec_spec, gm_spec) |>
  fit(data = penguins)

tidy(gm_fit)

# A tibble: 3 × 8
  component  size proportion variance mean.bill_length_mm mean.bill_depth_mm
      <int> <int>      <dbl>    <dbl>               <dbl>              <dbl>
1         1   160      0.458    0.288              -0.892              0.524
2         2    61      0.184    0.288               0.938              0.834
3         3   123      0.358    0.288               0.658             -1.10 
# ℹ 2 more variables: mean.flipper_length_mm <dbl>, mean.body_mass_g <dbl>

Each row summarizes one of the fitted Gaussian components: its size, mixing proportion, and the per-predictor means that describe where it sits.
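For reference, the standard form of a Gaussian mixture with K components (here K = 3) can be written as below. The symbols are conventional notation rather than anything taken from tidyclust: the mixing weight \pi_k corresponds to the proportion column and the mean vector \mu_k to the mean.* columns in the output above.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

As a quick check against the table, the three reported proportions satisfy the constraint: 0.458 + 0.184 + 0.358 = 1.000.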

mean_shift() fits mean shift models, which iteratively shift observations toward regions of high density and determine the number of clusters automatically. Engines "LPCM" and "meanShiftR" are supported.

ms_spec <- mean_shift(bandwidth = 0.2) |> 
  set_engine("LPCM")

ms_fit <- workflow(rec_spec, ms_spec) |>
  fit(data = penguins)

extract_cluster_assignment(ms_fit)

# A tibble: 344 × 1
   .cluster 
   <fct>    
 1 Cluster_1
 2 Cluster_1
 3 Cluster_1
 4 Cluster_1
 5 Cluster_1
 6 Cluster_1
 7 Cluster_1
 8 Cluster_1
 9 Cluster_1
10 Cluster_1
# ℹ 334 more rows

This returns a cluster label for each observation, with the number of clusters discovered automatically rather than set in advance, just as with db_clust().

Each of these models comes with its respective dials parameters for tuning.

Closer alignment with tidymodels

A major theme of this release is removing the seams between tidyclust and the rest of tidymodels. tidyclust was previously written as a slightly modified copy of the internals of the tune package. We took some time to align the internals of tune so that tidyclust and tune can share the same code paths. This gives us a smaller maintenance burden, and new features in tune should become available in tidyclust more easily.

The biggest change is that finalize_model_tidyclust() and finalize_workflow_tidyclust() are now deprecated. The corresponding functions in tune, tune::finalize_model() and tune::finalize_workflow(), now support cluster_spec objects natively, so there is no longer a need for tidyclust-specific variants.
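As a rough illustration of what this looks like in practice, the sketch below finalizes a k_means() workflow with tune's own finalize_workflow(). It assumes a version of tune recent enough to accept cluster_spec objects, and the choice of 3 clusters is a placeholder rather than the result of an actual tuning run.

```r
library(tidymodels)
library(tidyclust)

data(penguins, package = "modeldata")

rec_spec <- recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                   data = penguins) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

# A k-means specification with the number of clusters left to tune
km_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")

km_wf <- workflow(rec_spec, km_spec)

# Placeholder value; in practice this would come from tuning results
chosen <- tibble(num_clusters = 3)

# tune's finalizer, applied directly to a workflow holding a cluster_spec
final_wf <- finalize_workflow(km_wf, chosen)

set.seed(1234)
final_fit <- fit(final_wf, data = penguins)

final_centroids <- extract_centroids(final_fit)
final_centroids
```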

Parallel processing in tune_cluster() now supports the mirai package in addition to future. This matches the parallelism approach used across tidymodels. As part of this change, the foreach package is no longer supported for parallel processing; use future or mirai instead. The .config column produced by tune_cluster() has also changed from the Preprocessor{num}_Model{num} pattern to pre{num}_mod{num}_post{num} to align with updates in tune.
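A minimal sketch of tuning in parallel with a future backend might look like the following. The worker count, the fold count, and the grid of candidate values are arbitrary choices for illustration. The commented-out mirai::daemons() call is the usual way to start mirai workers and is shown as an assumed alternative, not something tested here.

```r
library(tidymodels)
library(tidyclust)
library(future)

data(penguins, package = "modeldata")

rec_spec <- recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                   data = penguins) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

km_wf <- workflow(rec_spec, k_means(num_clusters = tune()))

set.seed(1234)
folds <- vfold_cv(penguins, v = 5)
grid <- tibble(num_clusters = 2:5)

# Register a future backend; 2 workers is an arbitrary choice
plan(multisession, workers = 2)
# Assumed alternative with mirai: mirai::daemons(2)

tune_res <- tune_cluster(
  km_wf,
  resamples = folds,
  grid = grid,
  metrics = cluster_metric_set(sse_within_total, sse_ratio)
)

# Return to sequential processing when done
plan(sequential)

# The .config column here follows the new naming pattern
tune_metrics <- collect_metrics(tune_res)
tune_metrics
```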

A few other improvements worth calling out:

A new “Getting started with tidyclust” vignette.
butcher support for cluster_fit objects, so you can strip training data and environment references from fitted models before saving them.
extract_cluster_assignment(), extract_centroids(), and predict() gain a labels argument for supplying custom cluster labels.
hier_clust() gains a dist_fun argument for specifying a custom distance function.
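To illustrate the butcher support mentioned in the list above, here is a small sketch that fits a k-means model directly (without a workflow) and passes it through butcher::butcher(). The data preparation is simplified for brevity (no normalization), and how much smaller the result gets will depend on the model, so no sizes are claimed here.

```r
library(tidymodels)
library(tidyclust)
library(butcher)

data(penguins, package = "modeldata")

# Keep the numeric measurements and drop incomplete rows
penguins_num <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  na.omit()

set.seed(1234)
km_fit <- k_means(num_clusters = 3) |>
  set_engine("stats") |>
  fit(~ ., data = penguins_num)

# Strip components that are not needed after fitting
small_fit <- butcher(km_fit)

# Compare the in-memory footprint before and after
object.size(km_fit)
object.size(small_fit)
```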