[WIP] Updates relevant to the workflow/data pipeline #90

Draft · wants to merge 1 commit into base: `main`
9 changes: 9 additions & 0 deletions codecov.yml
@@ -0,0 +1,9 @@
coverage:
  status:
    project:
      default:
        target: 95%
    patch:
      default:
        target: 95%
  range: "90...95"
2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,7 +1,9 @@
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DuckDB = "d2f5444f-75bc-4fdf-ac35-56f514c445e1"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
TulipaClustering = "314fac8b-c762-4aa3-9d12-851379729163"

[compat]
169 changes: 169 additions & 0 deletions docs/src/10-tutorial.md
@@ -0,0 +1,169 @@
# Tutorial

## Explanation

To simplify, let's consider a single profile, for a single year.
Let's denote it as $p_i$, where $i = 1,\dots,N$.
The clustering process consists of:

1. Split the `N` timesteps into (let's assume equal) _periods_ of size `m = period_duration`.
   We can then reindex $p_i$ as

   $$p_{j,k}, \qquad \text{where} \qquad j = 1,\dots,m, \quad k = 1,\dots,N/m.$$

2. Compute `num_rps` representative periods

   $$r_{j,\ell}, \qquad \text{where} \qquad j = 1,\dots,m, \quad \ell = 1,\dots,\text{num\_rps}.$$

3. During the computation of the representative periods, we also obtain weights
   $w_{k,\ell}$ between each period $k$ and each representative period $\ell$,
   such that

   $$p_{j,k} = \sum_{\ell = 1}^{\text{num\_rps}} r_{j,\ell} \, w_{k,\ell}, \qquad \forall j = 1,\dots,m, \quad k = 1,\dots,N/m.$$
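
To make this concrete, here is a minimal Julia sketch with made-up numbers; everything below is illustrative and not part of the package API:

```julia
# Hypothetical sizes: N = 72 timesteps split into K = 3 periods of m = 24
N, m = 72, 24
K = N ÷ m

p = rand(N)           # the profile p_i
P = reshape(p, m, K)  # P[j, k] = p_{j,k}

# Suppose clustering produced num_rps = 2 representative periods R[j, ℓ]
# and weights W[k, ℓ]; the sum above is then the matrix product R * W'
R = rand(m, 2)                    # m × num_rps
W = [1.0 0.0; 0.0 1.0; 0.5 0.5]   # K × num_rps
P_approx = R * W'                 # approximates P
```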

## High-level API/DuckDB API

!!! note "High-level API"
    This tutorial focuses on the highest level of the API, which requires the
    use of a DuckDB connection.

The high-level API of TulipaClustering focuses on its use as part of the [Tulipa workflow](@ref TODO).
This API consists of three main functions: [`transform_wide_to_long!`](@ref), [`cluster!`](@ref), and [`dummy_cluster!`](@ref).
In this tutorial we'll use all three.

Normally, you will already have a DuckDB connection from the larger Tulipa
workflow, so here we create a temporary connection with fake data to show an
example of the workflow. You can look into the source code of this
documentation to see how this fake data is created.

```@setup duckdb_example
using DuckDB

connection = DBInterface.connect(DuckDB.DB)

# Create 28 daily periods (24 timesteps each) of fake availability, solar,
# and demand profiles for the year 2030
DuckDB.query(
    connection,
    "CREATE TABLE profiles_wide AS
    SELECT
        2030 AS year,
        i + 24 * (p - 1) AS timestep,
        4 + 0.3 * cos(4 * 3.14 * i / 24) + random() * 0.2 AS avail,
        solar_rand * greatest(0, (5 + random()) * cos(2 * 3.14 * (i - 12.5) / 24)) AS solar,
        3.6 + 3.6 * sin(3.14 * i / 24) ^ 2 * (1 + 0.3 * random()) AS demand
    FROM
        generate_series(1, 24) AS _timestep(i)
    CROSS JOIN (
        SELECT p, RANDOM() AS solar_rand
        FROM generate_series(1, 7 * 4) AS _period(p)
    )
    ORDER BY timestep
    ",
)
```

Here are the tables in that connection:

```@example duckdb_example
using DataFrames, DuckDB

# Small helper to run a query and show the result as a DataFrame
nice_query(str) = DataFrame(DuckDB.query(connection, str))
nice_query("show tables")
```

And here are the first rows of `profiles_wide`:

```@example duckdb_example
nice_query("from profiles_wide limit 10")
```

And finally, this is the plot of the data:

```@example duckdb_example
using Plots

table = DuckDB.query(connection, "from profiles_wide")
plot(size=(800, 400))
timestep = [row.timestep for row in table]
for profile_name in (:avail, :solar, :demand)
    value = [row[profile_name] for row in table]
    plot!(timestep, value, lab=string(profile_name))
end
plot!()
```

## Transform a wide profiles table into a long table

!!! warning "Required"
The long table format is a requirement of TulipaClustering, even for the dummy clustering example.

In this context, a wide table is a table where each profile occupies its own column. A long table is a table where the profile names are stacked in a single column, with the corresponding values in a separate column.
Given the name of the source table (in this case, `profiles_wide`), we can create a long table with the following call:

```@example duckdb_example
using TulipaClustering

transform_wide_to_long!(connection, "profiles_wide", "input_profiles")

nice_query("FROM input_profiles LIMIT 10")
```

The name `input_profiles` was chosen to conform with the expectations of the `TulipaEnergyModel.jl` format.
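
The target column names are also configurable through keyword arguments.
Here is a sketch of the same call with the default keyword arguments spelled out:

```julia
transform_wide_to_long!(
    connection,
    "profiles_wide",
    "input_profiles_explicit";  # hypothetical table name, to avoid clashing with the one above
    exclude_columns = ["year", "timestep"],  # columns kept as-is (not unpivoted)
    name_column = "profile_name",            # new column holding the old column names
    value_column = "value",                  # new column holding the values
)
```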

## Dummy Clustering

A dummy clustering essentially skips the clustering itself and creates the tables necessary for the next steps in the Tulipa workflow.

```@example duckdb_example
# Remove any existing clustering tables before recomputing them
for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = dummy_cluster!(connection)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```

## Clustering

We can perform a real clustering by using the [`cluster!`](@ref) function with two extra arguments (see [Explanation](@ref) for their deeper meaning):

- `period_duration`: the duration of each period after the split;
- `num_rps`: the number of representative periods to compute.

```@example duckdb_example
period_duration = 24
num_rps = 3

# Remove the tables created by the dummy clustering above
for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = cluster!(connection, period_duration, num_rps)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```
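
As a quick sanity check, the mapping weights of each period should sum to (approximately) 1, since the default `weight_type` of [`cluster!`](@ref) is `:convex`. A sketch of that check, assuming the mapping table has `period` and `weight` columns as in the TulipaEnergyModel format:

```julia
nice_query(
    "SELECT period, SUM(weight) AS total_weight
    FROM cluster_rep_periods_mapping
    GROUP BY period
    ORDER BY period
    LIMIT 5",
)
```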

## [TODO](@id TODO)

- [ ] Link to TulipaWorkflow
2 changes: 2 additions & 0 deletions src/TulipaClustering.jl
@@ -12,8 +12,10 @@ using SparseArrays
using Statistics

include("structures.jl")
include("data-validation.jl")
include("io.jl")
include("weight_fitting.jl")
include("cluster.jl")
include("convenience.jl")

end
162 changes: 162 additions & 0 deletions src/convenience.jl
@@ -0,0 +1,162 @@
export cluster!, dummy_cluster!, transform_wide_to_long!

"""
cluster!(
connection,
period_duration,
num_rps;
input_profile_table_name = "input.profiles",
database_schema = "",
kwargs...,
)

Convenience function to cluster the table named in `input_profile_table_name`
using `period_duration` and `num_rps`. The resulting tables
`profiles_rep_periods`, `rep_periods_mapping`, and
`rep_periods_data` are loaded into `connection` in the `database_schema`, if
given, and enriched with `year` information.

This function extract the table, then calls [`split_into_periods!`](@ref),
[`find_representative_periods`](@ref), [`fit_rep_period_weights!`](@ref), and
finally `write_clustering_result_to_tables`.
"""
function cluster!(
    connection,
    period_duration,
    num_rps;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    database_schema = "cluster",
    drop_incomplete_last_period::Bool = false,
    method::Symbol = :k_means,
    distance::SemiMetric = SqEuclidean(),
    weight_type::Symbol = :convex,
    tol::Float64 = 1e-2,
    niters::Int = 100,
    learning_rate::Float64 = 0.001,
    adaptive_grad::Bool = false,
)
    # Create the output schema, if given, and use it to prefix the output tables
    prefix = ""
    if database_schema != ""
        DBInterface.execute(connection, "CREATE SCHEMA IF NOT EXISTS $database_schema")
        prefix = "$database_schema."
    end
    validate_data!(
        connection;
        input_database_schema,
        table_names = Dict("profiles" => input_profile_table_name),
    )

    if input_database_schema != ""
        input_profile_table_name = "$input_database_schema.$input_profile_table_name"
    end

    # Cluster the profiles: split into periods, find the representative
    # periods, then fit the weights of each period
    df = DuckDB.query(connection, "SELECT * FROM $input_profile_table_name") |> DataFrame
    split_into_periods!(df; period_duration)
    clusters =
        find_representative_periods(df, num_rps; drop_incomplete_last_period, method, distance)
    fit_rep_period_weights!(clusters; weight_type, tol, niters, learning_rate, adaptive_grad)

    # Replace any existing output tables with the new clustering results
    for table_name in
        ("rep_periods_data", "rep_periods_mapping", "profiles_rep_periods", "timeframe_data")
        DuckDB.query(connection, "DROP TABLE IF EXISTS $prefix$table_name")
    end
    write_clustering_result_to_tables(connection, clusters; database_schema)

    return clusters
end

"""
dummy_cluster!(connection)

Convenience function to create the necessary columns and tables when clustering
is not required.

This is essentially creating a single representative period with the size of
the whole profile.
See [`cluster!`](@ref) for more details of what is created.
"""
function dummy_cluster!(
    connection;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    kwargs...,
)
    table_name = if input_database_schema != ""
        "$input_database_schema.$input_profile_table_name"
    else
        input_profile_table_name
    end
    # A single representative period covering the whole profile, i.e., its
    # duration is the largest timestep in the table
    period_duration = only([
        row.max_timestep for row in
        DuckDB.query(connection, "SELECT MAX(timestep) AS max_timestep FROM $table_name")
    ])
    cluster!(connection, period_duration, 1; kwargs...)
end

"""
transform_wide_to_long!(
connection,
wide_table_name,
long_table_name;
)

Convenience function to convert a table in wide format to long format using DuckDB.
Originally aimed at converting a profile table like the following:

| year | timestep | name1 | name2 | ⋯ | name2 |
| ---- | -------- | ----- | ----- | -- | ----- |
| 2030 | 1 | 1.0 | 2.5 | ⋯ | 0.0 |
| 2030 | 2 | 1.5 | 2.6 | ⋯ | 0.0 |
| 2030 | 3 | 2.0 | 2.6 | ⋯ | 0.0 |

To a table like the following:

| year | timestep | profile_name | value |
| ---- | -------- | ------------ | ----- |
| 2030 | 1 | name1 | 1.0 |
| 2030 | 2 | name1 | 1.5 |
| 2030 | 3 | name1 | 2.0 |
| 2030 | 1 | name2 | 2.5 |
| 2030 | 2 | name2 | 2.6 |
| 2030 | 3 | name2 | 2.6 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 2030 | 1 | name3 | 0.0 |
| 2030 | 2 | name3 | 0.0 |
| 2030 | 3 | name3 | 0.0 |

This conversion is done using the `UNPIVOT` SQL command from DuckDB.

## Keyword arguments

- `exclude_columns = ["year", "timestep"]`: Which tables to exclude from the conversion
- `name_column = "profile_name"`: Name of the new column that contains the names of the old columns
- `value_column = "value"`: Name of the new column that holds the values from the old columns
"""
function transform_wide_to_long!(
    connection,
    wide_table_name,
    long_table_name;
    exclude_columns = ["year", "timestep"],
    name_column = "profile_name",
    value_column = "value",
)
    @assert length(exclude_columns) > 0
    exclude_str = join(exclude_columns, ", ")
    DuckDB.query(
        connection,
        "CREATE TABLE $long_table_name AS
        UNPIVOT $wide_table_name
        ON COLUMNS(* EXCLUDE ($exclude_str))
        INTO
            NAME $name_column
            VALUE $value_column
        ORDER BY $name_column, $exclude_str
        ",
    )

    return
end