[WIP] Updates relevant to the workflow/data pipeline #90

Draft · wants to merge 1 commit into base: `main`
9 changes: 9 additions & 0 deletions codecov.yml
@@ -0,0 +1,9 @@
coverage:
  status:
    project:
      default:
        target: 95%
    patch:
      default:
        target: 95%
  range: "90...95"
2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,7 +1,9 @@
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DuckDB = "d2f5444f-75bc-4fdf-ac35-56f514c445e1"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
TulipaClustering = "314fac8b-c762-4aa3-9d12-851379729163"

[compat]
169 changes: 169 additions & 0 deletions docs/src/10-tutorial.md
@@ -0,0 +1,169 @@
# Tutorial

## Explanation

To simplify, let's consider a single profile, for a single year.
Let's denote it as $p_i$, where $i = 1,\dots,N$.
The clustering process consists of:

1. Split the `N` timesteps into (let's assume equal) _periods_ of size `m = period_duration`.
   We can then reindex $p_i$ as

   $$p_{j,k}, \qquad \text{where} \qquad j = 1,\dots,m, \quad k = 1,\dots,N/m.$$

2. Compute `num_rps` representative periods

   $$r_{j,\ell}, \qquad \text{where} \qquad j = 1,\dots,m, \quad \ell = 1,\dots,\text{num\_rps}.$$

3. During the computation of the representative periods, we also obtain weights
   $w_{k,\ell}$ between each period $k$ and each representative period $\ell$,
   such that

   $$p_{j,k} = \sum_{\ell = 1}^{\text{num\_rps}} r_{j,\ell} \, w_{k,\ell}, \qquad \forall j = 1,\dots,m, \quad k = 1,\dots,N/m.$$
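
To make this concrete, here is a minimal Julia sketch with made-up numbers; everything below is illustrative and not part of the package API:

```julia
# Hypothetical sizes: N = 72 timesteps split into K = 3 periods of m = 24
N, m = 72, 24
K = N ÷ m

p = rand(N)           # the profile p_i
P = reshape(p, m, K)  # P[j, k] = p_{j,k}

# Suppose clustering produced num_rps = 2 representative periods R[j, ℓ]
# and weights W[k, ℓ]; the sum above is then the matrix product R * W'
R = rand(m, 2)                    # m × num_rps
W = [1.0 0.0; 0.0 1.0; 0.5 0.5]   # K × num_rps
P_approx = R * W'                 # approximates P
```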

## High-level API/DuckDB API

!!! note "High-level API"
    This tutorial focuses on the highest level of the API, which requires the
    use of a DuckDB connection.

The high-level API of TulipaClustering focuses on its use as part of the [Tulipa workflow](@ref TODO).
This API consists of three main functions: [`transform_wide_to_long!`](@ref), [`cluster!`](@ref), and [`dummy_cluster!`](@ref).
In this tutorial we'll use all three.

Normally, you will already have a DuckDB connection from the larger Tulipa
workflow, so here we create a temporary connection with fake data to show an
example of the workflow. You can look into the source code of this
documentation to see how this fake data is created.

```@setup duckdb_example
using DuckDB

connection = DBInterface.connect(DuckDB.DB)

# Create 28 daily periods (24 timesteps each) of fake availability, solar,
# and demand profiles for the year 2030
DuckDB.query(
    connection,
    "CREATE TABLE profiles_wide AS
    SELECT
        2030 AS year,
        i + 24 * (p - 1) AS timestep,
        4 + 0.3 * cos(4 * 3.14 * i / 24) + random() * 0.2 AS avail,
        solar_rand * greatest(0, (5 + random()) * cos(2 * 3.14 * (i - 12.5) / 24)) AS solar,
        3.6 + 3.6 * sin(3.14 * i / 24) ^ 2 * (1 + 0.3 * random()) AS demand
    FROM
        generate_series(1, 24) AS _timestep(i)
    CROSS JOIN (
        SELECT p, RANDOM() AS solar_rand
        FROM generate_series(1, 7 * 4) AS _period(p)
    )
    ORDER BY timestep
    ",
)
```

Here are the tables in that connection:

```@example duckdb_example
using DataFrames, DuckDB

# Small helper to run a query and show the result as a DataFrame
nice_query(str) = DataFrame(DuckDB.query(connection, str))
nice_query("show tables")
```

And here are the first rows of `profiles_wide`:

```@example duckdb_example
nice_query("from profiles_wide limit 10")
```

And finally, this is the plot of the data:

```@example duckdb_example
using Plots

table = DuckDB.query(connection, "from profiles_wide")
plot(size=(800, 400))
timestep = [row.timestep for row in table]
for profile_name in (:avail, :solar, :demand)
    value = [row[profile_name] for row in table]
    plot!(timestep, value, lab=string(profile_name))
end
plot!()
```

## Transform a wide profiles table into a long table

!!! warning "Required"
The long table format is a requirement of TulipaClustering, even for the dummy clustering example.

In this context, a wide table is a table where each profile occupies its own column. A long table is a table where the profile names are stacked in a single column, with the corresponding values in a separate column.
Given the name of the source table (in this case, `profiles_wide`), we can create a long table with the following call:

```@example duckdb_example
using TulipaClustering

transform_wide_to_long!(connection, "profiles_wide", "input_profiles")

nice_query("FROM input_profiles LIMIT 10")
```

The name `input_profiles` was chosen to conform with the expectations of the `TulipaEnergyModel.jl` format.
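
The target column names are also configurable through keyword arguments.
Here is a sketch of the same call with the default keyword arguments spelled out:

```julia
transform_wide_to_long!(
    connection,
    "profiles_wide",
    "input_profiles_explicit";  # hypothetical table name, to avoid clashing with the one above
    exclude_columns = ["year", "timestep"],  # columns kept as-is (not unpivoted)
    name_column = "profile_name",            # new column holding the old column names
    value_column = "value",                  # new column holding the values
)
```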

## Dummy Clustering

A dummy clustering essentially skips the clustering itself and creates the tables necessary for the next steps in the Tulipa workflow.

```@example duckdb_example
# Remove any existing clustering tables before recomputing them
for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = dummy_cluster!(connection)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```

## Clustering

We can perform a real clustering by using the [`cluster!`](@ref) function with two extra arguments (see [Explanation](@ref) for their deeper meaning):

- `period_duration`: the duration of each period after the split;
- `num_rps`: the number of representative periods to compute.

```@example duckdb_example
period_duration = 24
num_rps = 3

# Remove the tables created by the dummy clustering above
for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = cluster!(connection, period_duration, num_rps)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```
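
As a quick sanity check, the mapping weights of each period should sum to (approximately) 1, since the default `weight_type` of [`cluster!`](@ref) is `:convex`. A sketch of that check, assuming the mapping table has `period` and `weight` columns as in the TulipaEnergyModel format:

```julia
nice_query(
    "SELECT period, SUM(weight) AS total_weight
    FROM cluster_rep_periods_mapping
    GROUP BY period
    ORDER BY period
    LIMIT 5",
)
```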

## [TODO](@id TODO)

- [ ] Link to TulipaWorkflow
2 changes: 2 additions & 0 deletions src/TulipaClustering.jl
@@ -12,8 +12,10 @@ using SparseArrays
using Statistics

include("structures.jl")
include("data-validation.jl")
include("io.jl")
include("weight_fitting.jl")
include("cluster.jl")
include("convenience.jl")

end
162 changes: 162 additions & 0 deletions src/convenience.jl
@@ -0,0 +1,162 @@
export cluster!, dummy_cluster!, transform_wide_to_long!

"""
cluster!(
connection,
period_duration,
num_rps;
input_profile_table_name = "input.profiles",
database_schema = "",
kwargs...,
)

Convenience function to cluster the table named in `input_profile_table_name`
using `period_duration` and `num_rps`. The resulting tables
`profiles_rep_periods`, `rep_periods_mapping`, and
`rep_periods_data` are loaded into `connection` in the `database_schema`, if
given, and enriched with `year` information.

This function extract the table, then calls [`split_into_periods!`](@ref),
[`find_representative_periods`](@ref), [`fit_rep_period_weights!`](@ref), and
finally `write_clustering_result_to_tables`.
"""
function cluster!(
    connection,
    period_duration,
    num_rps;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    database_schema = "cluster",
    drop_incomplete_last_period::Bool = false,
    method::Symbol = :k_means,
    distance::SemiMetric = SqEuclidean(),
    weight_type::Symbol = :convex,
    tol::Float64 = 1e-2,
    niters::Int = 100,
    learning_rate::Float64 = 0.001,
    adaptive_grad::Bool = false,
)
    # Create the output schema, if given, and use it to prefix the output tables
    prefix = ""
    if database_schema != ""
        DBInterface.execute(connection, "CREATE SCHEMA IF NOT EXISTS $database_schema")
        prefix = "$database_schema."
    end
    validate_data!(
        connection;
        input_database_schema,
        table_names = Dict("profiles" => input_profile_table_name),
    )

    if input_database_schema != ""
        input_profile_table_name = "$input_database_schema.$input_profile_table_name"
    end

    # Cluster the profiles: split into periods, find the representative
    # periods, then fit the weights of each period
    df = DuckDB.query(connection, "SELECT * FROM $input_profile_table_name") |> DataFrame
    split_into_periods!(df; period_duration)
    clusters =
        find_representative_periods(df, num_rps; drop_incomplete_last_period, method, distance)
    fit_rep_period_weights!(clusters; weight_type, tol, niters, learning_rate, adaptive_grad)

    # Replace any existing output tables with the new clustering results
    for table_name in
        ("rep_periods_data", "rep_periods_mapping", "profiles_rep_periods", "timeframe_data")
        DuckDB.query(connection, "DROP TABLE IF EXISTS $prefix$table_name")
    end
    write_clustering_result_to_tables(connection, clusters; database_schema)

    return clusters
end

"""
dummy_cluster!(connection)

Convenience function to create the necessary columns and tables when clustering
is not required.

This is essentially creating a single representative period with the size of
the whole profile.
See [`cluster!`](@ref) for more details of what is created.
"""
function dummy_cluster!(
    connection;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    kwargs...,
)
    table_name = if input_database_schema != ""
        "$input_database_schema.$input_profile_table_name"
    else
        input_profile_table_name
    end
    # A single representative period covering the whole profile, i.e., its
    # duration is the largest timestep in the table
    period_duration = only([
        row.max_timestep for row in
        DuckDB.query(connection, "SELECT MAX(timestep) AS max_timestep FROM $table_name")
    ])
    cluster!(connection, period_duration, 1; kwargs...)
end

"""
transform_wide_to_long!(
connection,
wide_table_name,
long_table_name;
)

Convenience function to convert a table in wide format to long format using DuckDB.
Originally aimed at converting a profile table like the following:

| year | timestep | name1 | name2 | ⋯ | name2 |
| ---- | -------- | ----- | ----- | -- | ----- |
| 2030 | 1 | 1.0 | 2.5 | ⋯ | 0.0 |
| 2030 | 2 | 1.5 | 2.6 | ⋯ | 0.0 |
| 2030 | 3 | 2.0 | 2.6 | ⋯ | 0.0 |

To a table like the following:

| year | timestep | profile_name | value |
| ---- | -------- | ------------ | ----- |
| 2030 | 1 | name1 | 1.0 |
| 2030 | 2 | name1 | 1.5 |
| 2030 | 3 | name1 | 2.0 |
| 2030 | 1 | name2 | 2.5 |
| 2030 | 2 | name2 | 2.6 |
| 2030 | 3 | name2 | 2.6 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 2030 | 1 | name3 | 0.0 |
| 2030 | 2 | name3 | 0.0 |
| 2030 | 3 | name3 | 0.0 |

This conversion is done using the `UNPIVOT` SQL command from DuckDB.

## Keyword arguments

- `exclude_columns = ["year", "timestep"]`: Which tables to exclude from the conversion
- `name_column = "profile_name"`: Name of the new column that contains the names of the old columns
- `value_column = "value"`: Name of the new column that holds the values from the old columns
"""
function transform_wide_to_long!(
    connection,
    wide_table_name,
    long_table_name;
    exclude_columns = ["year", "timestep"],
    name_column = "profile_name",
    value_column = "value",
)
    @assert length(exclude_columns) > 0
    exclude_str = join(exclude_columns, ", ")
    DuckDB.query(
        connection,
        "CREATE TABLE $long_table_name AS
        UNPIVOT $wide_table_name
        ON COLUMNS(* EXCLUDE ($exclude_str))
        INTO
            NAME $name_column
            VALUE $value_column
        ORDER BY $name_column, $exclude_str
        ",
    )

    return
end