@stephaniereinders
Member

Data objects, functions, test fixtures, and tests are now all updated so that cluster columns are sorted numerically.

I uploaded the R scripts for handwriterRF to Claude and asked for a function map. The result is an HTML file that I added to the data-raw folder. The file won't be included in the package.
…er_profiles()`

`compare_documents()` now does the following:

1. handles NULL values
2. handles samples in different folders with the same name
3. creates project directory
4. runs basic checks
5. copies samples to project directory
6. estimates the writer profiles
7. compares the writer profiles with `compare_writer_profiles()`
8. cleans up project directory

This reduces repetitive code.
`get_cluster_cols()` is used by functions other than those in the distances.R script, so move it to utils.R.
Some functions currently grab label columns by name, e.g., "docname", "writer", etc. But input data frames might have a variety of label columns, so `get_label_cols()` is an internal helper that grabs all columns that do not start with "cluster".
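
A minimal sketch of the prefix-based selection described above. The function name `get_label_cols_sketch()`, the returned data frame, and the example columns are assumptions for illustration, not the package's actual implementation:

```r
# Hypothetical sketch: keep every column whose name does not start with "cluster".
# Returning a data frame (rather than column names) is an assumption.
get_label_cols_sketch <- function(df) {
  is_cluster <- startsWith(colnames(df), "cluster")
  df[, !is_cluster, drop = FALSE]
}

# Example input with a label column ("session") that a name-based lookup
# for "docname" and "writer" alone would miss.
df <- data.frame(
  docname = c("w0001_s01", "w0002_s01"),
  writer = c("w0001", "w0002"),
  session = c("s01", "s01"),
  cluster1 = c(1, 4),
  cluster2 = c(6, 2)
)

label_cols <- get_label_cols_sketch(df)
print(colnames(label_cols))
```

Selecting by exclusion means any new label column is picked up without changing the helper.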
…ols()` and `get_label_cols()`

Change `split_clusters_and_labels()` to call `get_cluster_cols()` and `get_label_cols()`, and update `get_single_dist()` and `absolute_dist()` to call `split_clusters_and_labels()`.
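
The layering described in this commit could look roughly like the sketch below. All `*_sketch` names, the list return value, and the element-wise absolute difference are assumptions for illustration; the real functions in the package may differ in signature and behavior:

```r
# Hypothetical helpers: split columns by the "cluster" name prefix.
get_cluster_cols_sketch <- function(df) {
  df[, startsWith(colnames(df), "cluster"), drop = FALSE]
}
get_label_cols_sketch <- function(df) {
  df[, !startsWith(colnames(df), "cluster"), drop = FALSE]
}

# Hypothetical splitter that delegates to the two helpers.
split_clusters_and_labels_sketch <- function(df) {
  list(
    clusters = get_cluster_cols_sketch(df),
    labels = get_label_cols_sketch(df)
  )
}

df <- data.frame(
  docname = c("w0001_s01", "w0002_s01"),
  writer = c("w0001", "w0002"),
  cluster1 = c(1, 4),
  cluster2 = c(6, 2)
)

# A distance function can then work on the cluster part only.
d1 <- split_clusters_and_labels_sketch(df[1, ])
d2 <- split_clusters_and_labels_sketch(df[2, ])
abs_diff <- abs(unlist(d1$clusters) - unlist(d2$clusters))
print(abs_diff)
```

Routing every distance function through one splitter keeps the column-selection rule in a single place.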
…numerically

Fix train, validation, test, random_forest, and ref_scores so that cluster columns are sorted numerically.

Fix `get_cluster_cols()` to sort cluster columns numerically.
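
For context, sorting column names as plain strings places `cluster10` before `cluster2`. A minimal sketch of a numeric sort follows, assuming cluster columns are named `cluster` followed by an integer; `get_cluster_cols_sorted_sketch()` is a hypothetical name and not the package's actual code:

```r
# Hypothetical sketch: select cluster columns and order them by the integer
# that follows the "cluster" prefix. Assumes names like "cluster1", "cluster10".
get_cluster_cols_sorted_sketch <- function(df) {
  cluster_names <- colnames(df)[startsWith(colnames(df), "cluster")]
  cluster_nums <- as.integer(sub("^cluster", "", cluster_names))
  df[, cluster_names[order(cluster_nums)], drop = FALSE]
}

df <- data.frame(
  docname = "w0001_s01",
  cluster10 = 0.1,
  cluster2 = 0.3,
  cluster1 = 0.6
)

sorted_clusters <- get_cluster_cols_sorted_sketch(df)
print(colnames(sorted_clusters))
```

A consistent column order matters when objects such as the training data and the reference scores must line up column by column.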

Create new test fixtures with updated data objects and functions.
@stephaniereinders linked an issue on Aug 10, 2025 that may be closed by this pull request
@stephaniereinders merged commit 3dfa492 into main on Aug 10, 2025
0 of 6 checks passed
@stephaniereinders deleted the 43-sort-cluster-cols branch on August 10, 2025 21:04

Development

Successfully merging this pull request may close these issues.

Retrain model with cluster columns sorted numerically
