Commit 78ad2b1

[WIP] Updates relevant to the workflow/data pipeline

1 parent b8a5a8b commit 78ad2b1
File tree

13 files changed: +715 −12 lines

codecov.yml

Lines changed: 9 additions & 0 deletions

```yaml
coverage:
  status:
    project:
      default:
        target: 95%
    patch:
      default:
        target: 95%
  range: "90...95"
```

docs/Project.toml

Lines changed: 2 additions & 0 deletions

```diff
 [deps]
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+DuckDB = "d2f5444f-75bc-4fdf-ac35-56f514c445e1"
 LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"
+Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
 TulipaClustering = "314fac8b-c762-4aa3-9d12-851379729163"

 [compat]
```

docs/src/10-tutorial.md

Lines changed: 169 additions & 0 deletions

# Tutorial

## Explanation

To simplify, let's consider a single profile, for a single year.
Let's denote it as $p_i$, where $i = 1,\dots,N$.
The clustering process consists of:

1. Split the `N` timesteps into (let's assume equal) _periods_ of size `m = period_duration`.
   We can rename $p_i$ as

   $$p_{j,k}, \qquad \text{where} \qquad j = 1,\dots,m, \quad k = 1,\dots,N/m.$$

2. Compute `num_rps` _representative periods_

   $$r_{j,\ell}, \qquad \text{where} \qquad j = 1,\dots,m, \quad \ell = 1,\dots,\text{num\_rps}.$$

3. During the computation of the representative periods, we also obtain a weight
   $w_{k,\ell}$ between each period $k$ and each representative period $\ell$,
   such that

   $$p_{j,k} = \sum_{\ell = 1}^{\text{num\_rps}} r_{j,\ell} \ w_{k,\ell}, \qquad \forall j = 1,\dots,m, \quad k = 1,\dots,N/m.$$

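The steps above can be checked on a toy example. Plain Python is used here as a language-neutral sketch (this is not TulipaClustering code; all names are illustrative):

```python
# Toy check of the reconstruction identity
#   p[j,k] = sum_l r[j,l] * w[k,l]
# from the explanation above. Illustrative only; not TulipaClustering code.

N, m = 6, 2                        # total timesteps and period_duration
p = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Step 1: split into N/m periods of size m; periods[k][j] plays the role of p_{j,k}.
num_periods = N // m
periods = [p[k * m:(k + 1) * m] for k in range(num_periods)]

# Step 2: pick representative periods. In this trivial case every period
# represents itself (num_rps = N/m), so the weight matrix is the identity.
reps = periods
w = [[1.0 if k == l else 0.0 for l in range(num_periods)] for k in range(num_periods)]

# Step 3: check the identity for every (j, k).
for k in range(num_periods):
    for j in range(m):
        reconstructed = sum(reps[l][j] * w[k][l] for l in range(num_periods))
        assert reconstructed == periods[k][j]
```

In a real clustering, `num_rps` is smaller than `N/m`, so the weights can in general only approximate the identity; fitting them is what the weight-fitting step is for.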
## High level API/DuckDB API

!!! note "High level API"
    This tutorial focuses on the highest level of the API, which requires the
    use of a DuckDB connection.

The high-level API of TulipaClustering focuses on using TulipaClustering as part of the [Tulipa workflow](@ref TODO).
This API consists of three main functions: [`transform_wide_to_long!`](@ref), [`cluster!`](@ref), and [`dummy_cluster!`](@ref).
In this tutorial we'll use all three.

Normally, you will already have a DuckDB connection from the larger Tulipa workflow,
so here we create a temporary connection with fake data to show an example
of the workflow. You can look into the source code of this documentation to see
how to create this fake data.

```@setup duckdb_example
using DuckDB

connection = DBInterface.connect(DuckDB.DB)
DuckDB.query(
    connection,
    "CREATE TABLE profiles_wide AS
    SELECT
        2030 AS year,
        i + 24 * (p - 1) AS timestep,
        4 + 0.3 * cos(4 * 3.14 * i / 24) + random() * 0.2 AS avail,
        solar_rand * greatest(0, (5 + random()) * cos(2 * 3.14 * (i - 12.5) / 24)) AS solar,
        3.6 + 3.6 * sin(3.14 * i / 24) ^ 2 * (1 + 0.3 * random()) AS demand,
    FROM generate_series(1, 24) AS _timestep(i)
    CROSS JOIN (
        SELECT p, RANDOM() AS solar_rand
        FROM generate_series(1, 7 * 4) AS _period(p)
    )
    ORDER BY timestep
    ",
)
```

Here is the content of that connection:

```@example duckdb_example
using DataFrames, DuckDB

nice_query(str) = DataFrame(DuckDB.query(connection, str))
nice_query("show tables")
```

And here are the first rows of `profiles_wide`:

```@example duckdb_example
nice_query("from profiles_wide limit 10")
```

And finally, this is a plot of the data:

```@example duckdb_example
using Plots

table = DuckDB.query(connection, "from profiles_wide")
plot(size = (800, 400))
timestep = [row.timestep for row in table]
for profile_name in (:avail, :solar, :demand)
    value = [row[profile_name] for row in table]
    plot!(timestep, value, lab = string(profile_name))
end
plot!()
```

## Transform a wide profiles table into a long table

!!! warning "Required"
    The long table format is a requirement of TulipaClustering, even for the dummy clustering example.

In this context, a wide table is a table where each profile occupies its own column. A long table is a table where the profile names are stacked in one column and the corresponding values sit in another.
Given the name of the source table (in this case, `profiles_wide`), we can create a long table with the following call:

```@example duckdb_example
using TulipaClustering

transform_wide_to_long!(connection, "profiles_wide", "input_profiles")

nice_query("FROM input_profiles LIMIT 10")
```

The name `input_profiles` was chosen to conform with the expectations of the `TulipaEnergyModel.jl` format.

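Conceptually, the wide-to-long reshaping is independent of DuckDB. The plain-Python sketch below (illustrative only; the rows are made up) performs the same unpivot that `transform_wide_to_long!` delegates to DuckDB's `UNPIVOT`:

```python
# Unpivot a wide profiles table (one column per profile) into a long table
# (profile_name/value columns). This mirrors what transform_wide_to_long!
# does via DuckDB's UNPIVOT; the rows here are made up for illustration.

wide = [
    {"year": 2030, "timestep": 1, "avail": 4.1, "solar": 0.0, "demand": 3.7},
    {"year": 2030, "timestep": 2, "avail": 4.3, "solar": 0.2, "demand": 3.9},
]
exclude_columns = ["year", "timestep"]  # kept as-is, like the keyword argument

long = [
    {**{c: row[c] for c in exclude_columns}, "profile_name": name, "value": value}
    for row in wide
    for name, value in row.items()
    if name not in exclude_columns
]
# DuckDB's version ends with ORDER BY profile_name, year, timestep
long.sort(key=lambda r: (r["profile_name"], r["year"], r["timestep"]))

assert len(long) == len(wide) * 3  # one long row per (row, profile) pair
assert long[0] == {"year": 2030, "timestep": 1, "profile_name": "avail", "value": 4.1}
```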
## Dummy Clustering

A dummy clustering essentially skips the clustering computation and just creates the tables required by the next steps in the Tulipa workflow.

```@example duckdb_example
for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = dummy_cluster!(connection)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```

## Clustering

We can perform an actual clustering by calling the [`cluster!`](@ref) function with two extra arguments (see [Explanation](@ref) for their precise meaning):

- `period_duration`: the length of each period after splitting;
- `num_rps`: the number of representative periods to compute.

```@example duckdb_example
period_duration = 24
num_rps = 3

for table_name in (
    "cluster_rep_periods_data",
    "cluster_rep_periods_mapping",
    "cluster_profiles_rep_periods",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = cluster!(connection, period_duration, num_rps)

nice_query("FROM cluster_rep_periods_data LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_rep_periods_mapping LIMIT 5")
```

```@example duckdb_example
nice_query("FROM cluster_profiles_rep_periods LIMIT 5")
```
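To build intuition for what `cluster!` computes, here is a minimal, self-contained k-means sketch in plain Python: each period is a point, and the `num_rps` centroids play the role of representative periods. This illustrates the general technique only, not the package's implementation (which is configurable via `method` and `distance`):

```python
# Minimal k-means over periods: each period (here a "day" of 4 timesteps) is
# a point, and the num_rps centroids act as representative periods.
import random

random.seed(42)
period_duration, num_rps = 4, 2

# 20 synthetic periods: ten "low" days around 1.0 and ten "high" days around 5.0
periods = [[1.0 + 0.1 * random.random() for _ in range(period_duration)] for _ in range(10)]
periods += [[5.0 + 0.1 * random.random() for _ in range(period_duration)] for _ in range(10)]

def sqeuclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Start one centroid in each regime, then alternate assignment/update steps.
centroids = [periods[0][:], periods[10][:]]
labels = [0] * len(periods)
for _ in range(10):
    # assignment step: each period goes to its nearest centroid
    labels = [min(range(num_rps), key=lambda c: sqeuclidean(p, centroids[c])) for p in periods]
    # update step: each centroid becomes the mean of its members
    for c in range(num_rps):
        members = [p for p, lab in zip(periods, labels) if lab == c]
        if members:
            centroids[c] = [sum(col) / len(members) for col in zip(*members)]

assert labels[:10] == [0] * 10  # low days -> representative period 0
assert labels[10:] == [1] * 10  # high days -> representative period 1
```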

## [TODO](@id TODO)

- [ ] Link to TulipaWorkflow

src/TulipaClustering.jl

Lines changed: 2 additions & 0 deletions

```diff
@@ -12,8 +12,10 @@ using SparseArrays
 using Statistics

 include("structures.jl")
+include("data-validation.jl")
 include("io.jl")
 include("weight_fitting.jl")
 include("cluster.jl")
+include("convenience.jl")

 end
```

src/convenience.jl

Lines changed: 162 additions & 0 deletions

```julia
export cluster!, dummy_cluster!, transform_wide_to_long!

"""
    cluster!(
        connection,
        period_duration,
        num_rps;
        input_database_schema = "input",
        input_profile_table_name = "profiles",
        database_schema = "cluster",
        kwargs...,
    )

Convenience function to cluster the table named in `input_profile_table_name`
using `period_duration` and `num_rps`. The resulting tables
`profiles_rep_periods`, `rep_periods_mapping`, and
`rep_periods_data` are loaded into `connection` in the `database_schema`, if
given, and enriched with `year` information.

This function extracts the table, then calls [`split_into_periods!`](@ref),
[`find_representative_periods`](@ref), [`fit_rep_period_weights!`](@ref), and
finally `write_clustering_result_to_tables`.
"""
function cluster!(
    connection,
    period_duration,
    num_rps;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    database_schema = "cluster",
    drop_incomplete_last_period::Bool = false,
    method::Symbol = :k_means,
    distance::SemiMetric = SqEuclidean(),
    weight_type::Symbol = :convex,
    tol::Float64 = 1e-2,
    niters::Int = 100,
    learning_rate::Float64 = 0.001,
    adaptive_grad::Bool = false,
)
    prefix = ""
    if database_schema != ""
        DBInterface.execute(connection, "CREATE SCHEMA IF NOT EXISTS $database_schema")
        prefix = "$database_schema."
    end
    validate_data!(
        connection;
        input_database_schema,
        table_names = Dict("profiles" => input_profile_table_name),
    )

    if input_database_schema != ""
        input_profile_table_name = "$input_database_schema.$input_profile_table_name"
    end
    df = DuckDB.query(connection, "SELECT * FROM $input_profile_table_name") |> DataFrame
    split_into_periods!(df; period_duration)
    clusters =
        find_representative_periods(df, num_rps; drop_incomplete_last_period, method, distance)
    fit_rep_period_weights!(clusters; weight_type, tol, niters, learning_rate, adaptive_grad)

    for table_name in
        ("rep_periods_data", "rep_periods_mapping", "profiles_rep_periods", "timeframe_data")
        DuckDB.query(connection, "DROP TABLE IF EXISTS $prefix$table_name")
    end
    write_clustering_result_to_tables(connection, clusters; database_schema)

    return clusters
end

"""
    dummy_cluster!(connection; kwargs...)

Convenience function to create the necessary columns and tables when clustering
is not required.

This essentially creates a single representative period with the size of
the whole profile.
See [`cluster!`](@ref) for more details of what is created.
"""
function dummy_cluster!(
    connection;
    input_database_schema = "input",
    input_profile_table_name = "profiles",
    kwargs...,
)
    table_name = if input_database_schema != ""
        "$input_database_schema.$input_profile_table_name"
    else
        input_profile_table_name
    end
    period_duration = only([
        row.max_timestep for
        row in DuckDB.query(connection, "SELECT MAX(timestep) AS max_timestep FROM $table_name")
    ])
    cluster!(connection, period_duration, 1; kwargs...)
end

"""
    transform_wide_to_long!(
        connection,
        wide_table_name,
        long_table_name;
        kwargs...,
    )

Convenience function to convert a table in wide format to long format using DuckDB.
Originally aimed at converting a profile table like the following:

| year | timestep | name1 | name2 | ⋯ | name3 |
| ---- | -------- | ----- | ----- | - | ----- |
| 2030 | 1        | 1.0   | 2.5   | ⋯ | 0.0   |
| 2030 | 2        | 1.5   | 2.6   | ⋯ | 0.0   |
| 2030 | 3        | 2.0   | 2.6   | ⋯ | 0.0   |

To a table like the following:

| year | timestep | profile_name | value |
| ---- | -------- | ------------ | ----- |
| 2030 | 1        | name1        | 1.0   |
| 2030 | 2        | name1        | 1.5   |
| 2030 | 3        | name1        | 2.0   |
| 2030 | 1        | name2        | 2.5   |
| 2030 | 2        | name2        | 2.6   |
| 2030 | 3        | name2        | 2.6   |
| ⋮    | ⋮        | ⋮            | ⋮     |
| 2030 | 1        | name3        | 0.0   |
| 2030 | 2        | name3        | 0.0   |
| 2030 | 3        | name3        | 0.0   |

This conversion is done using the `UNPIVOT` SQL command from DuckDB.

## Keyword arguments

- `exclude_columns = ["year", "timestep"]`: Which columns to exclude from the conversion
- `name_column = "profile_name"`: Name of the new column that contains the names of the old columns
- `value_column = "value"`: Name of the new column that holds the values from the old columns
"""
function transform_wide_to_long!(
    connection,
    wide_table_name,
    long_table_name;
    exclude_columns = ["year", "timestep"],
    name_column = "profile_name",
    value_column = "value",
)
    @assert length(exclude_columns) > 0
    exclude_str = join(exclude_columns, ", ")
    DuckDB.query(
        connection,
        "CREATE TABLE $long_table_name AS
        UNPIVOT $wide_table_name
        ON COLUMNS(* EXCLUDE ($exclude_str))
        INTO
            NAME $name_column
            VALUE $value_column
        ORDER BY $name_column, $exclude_str
        ",
    )

    return
end
```
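For intuition about the weight-fitting step (`fit_rep_period_weights!` with `weight_type = :convex`): for each period one seeks nonnegative weights over the representative periods that sum to one and minimize the reconstruction error. The sketch below, in plain Python, solves this with projected gradient descent onto the probability simplex; it is an illustrative stand-in, not the package's solver, and all names are made up:

```python
# Fit convex weights w (w >= 0, sum(w) == 1) so that the weighted combination
# of representative periods approximates a given period, via projected
# gradient descent. Illustrative sketch; not TulipaClustering's solver.

def project_to_simplex(w):
    # Euclidean projection onto the probability simplex (sort-based algorithm).
    u = sorted(w, reverse=True)
    css, rho, rho_css = 0.0, 0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui + (1.0 - css) / i > 0:
            rho, rho_css = i, css
    theta = (1.0 - rho_css) / rho
    return [max(x + theta, 0.0) for x in w]

def fit_convex_weights(period, reps, niters=200, learning_rate=0.05):
    n_rps, m = len(reps), len(period)
    w = [1.0 / n_rps] * n_rps  # start from uniform weights
    for _ in range(niters):
        recon = [sum(w[l] * reps[l][j] for l in range(n_rps)) for j in range(m)]
        resid = [r - p for r, p in zip(recon, period)]
        # gradient of 0.5 * ||recon - period||^2 with respect to w
        grad = [sum(resid[j] * reps[l][j] for j in range(m)) for l in range(n_rps)]
        w = project_to_simplex([wi - learning_rate * g for wi, g in zip(w, grad)])
    return w

reps = [[1.0, 1.0], [3.0, 3.0]]   # two representative periods of length 2
w = fit_convex_weights([1.5, 1.5], reps)
assert abs(w[0] - 0.75) < 1e-3 and abs(w[1] - 0.25) < 1e-3
```

The convexity constraint is what makes the reconstructed profile an interpolation of the representative periods rather than an arbitrary linear combination.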
