Commit 4f5d6d1: Merge pull request #3232 from alphagov/adr-009
ADR-009 - change link_sets pkey to content_id

1 file changed: +307 -0

# Decision Record: Change link_sets primary key from serial to content_id

## Context

Publishing API has the concept of "link sets".

A link set comprises all the links belonging to a particular content_id. Links can have various link types and
positions, but if they belong to the same content_id, they're part of the same link set.

Link sets are exposed through the API as a resource - you can get the link set of a content_id
using [GET /v2/links/:content_id](https://github.com/alphagov/publishing-api/blob/main/docs/api.md#get-v2linkscontent_id),
and you can update it
using [PATCH /v2/links/:content_id](https://github.com/alphagov/publishing-api/blob/main/docs/api.md#patch-v2linkscontent_id).

Because there's a potential race condition between getting and updating a link set, link sets have a lock version
column. For example, suppose two processes fetch a link set which looks like this:

```json
{
  "links": {
    "organisations": [
      "591436ab-c2ae-416f-a3c5-1901d633fbfb"
    ]
  }
}
```

If the first process attempts to add `492b07ba-1a83-4163-9d1f-37fa8a088b77` as an organisations link, and the second
process attempts to add `489743df-25bd-45b3-b799-5bfcab131cce`, the request that arrives later should fail, because it's
not updating the latest version of the resource.

To enable this, link_sets have a stale_lock_version column, which is incremented on every update.
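The shape of that check, sketched in SQL (the real logic lives in the Rails layer; this just illustrates the compare-and-swap on the version column):

```sql
-- Each process reads the current lock version along with the links.
select stale_lock_version from link_sets where content_id = ?;

-- An update only succeeds if the version is unchanged since the read.
-- If a concurrent update got there first, zero rows are affected and
-- the request is rejected as a conflict.
update link_sets
set stale_lock_version = stale_lock_version + 1
where content_id = ? and stale_lock_version = ?;
```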

## Problem

The current database schema has link sets as a separate table, which the links table references by foreign key.

```mermaid
erDiagram
    Link {
        serial id PK
        integer link_set_id FK
    }
    Link }o--o| LinkSet: ""
    LinkSet {
        serial id PK
        uuid content_id
        integer stale_lock_version
    }
```

Although this is nicely normalized, the presence of the link_sets table necessitates an additional join when following
the links of a document. Assuming we already know the content_id we're starting from, and the link_type we're following,
we need to do:

```sql
select links.*
from link_sets
join links on link_sets.id = links.link_set_id
where link_sets.content_id = ?
  and links.link_type = ?;
```

While postgres is usually pretty good at using appropriate indexes to join these two tables together, there are
occasions when a query we would like to write can't be planned efficiently. For example, if we knew the edition id and
content id of a particular edition, and we wanted to find both link set and edition links, we might want to do:

```sql
select links.* from link_sets
join links on link_sets.id = links.link_set_id
where (link_sets.content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d' or links.edition_id = 6488459)
  and links.link_type = 'organisations';
```

However, because content_id and edition_id are on different tables, postgres can't combine the two indexes.
Instead, it will come up with a rubbish query plan which takes tens of seconds to
execute. [As described in this article](https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/), this
can be worked around by doing a UNION ALL instead of an OR, but this increases the complexity of the query
significantly.
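Following that article's approach, the OR query above would become something like this (a sketch, assuming link set links and edition links are mutually exclusive so the two branches don't overlap):

```sql
-- Each branch can use its own index, at the cost of a longer query.
select links.*
from link_sets
join links on link_sets.id = links.link_set_id
where link_sets.content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d'
  and links.link_type = 'organisations'
union all
select links.*
from links
where links.edition_id = 6488459
  and links.link_type = 'organisations';
```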

The link_sets table has two columns with unique constraints - the id column (which is a serial) and the content_id
column (which is a uuid). This means there's a one-to-one equivalence between id and content_id.

The problem is that the links table refers to link_sets with a foreign key to its id. If it used a foreign
key to content_id instead, we would be able to avoid the join in link expansion situations, and postgres would be able
to use its [clever index combination techniques](https://www.postgresql.org/docs/current/indexes-bitmap-scans.html)
(BitmapOr specifically) to come up with efficient plans.

## Solution

Glossing over the steps to make this change, which are below, the proposed solution is that we would replace the
link_set_id column on links with a link_set_content_id column, and a corresponding foreign key. So we'd have:

```mermaid
erDiagram
    Link {
        serial id PK
        uuid link_set_content_id FK
    }
    Link }o--o| LinkSet: ""
    LinkSet {
        uuid content_id PK
        integer stale_lock_version
    }
```

Link sets would still exist as their own model and table, as we need them for the lock version we use for optimistic
locking.

Because the links table would now have a column with the content_id in it, we could avoid joining link_sets when doing
link expansion, instead doing something like:

```sql
select * from links
where (links.link_set_content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d' or links.edition_id = 6488459)
  and links.link_type = 'organisations';
```

Even though this looks very similar to the query above, it's much more performant. This is for two reasons:

- We avoid the unnecessary join to link_sets
- Both the link_set_content_id and edition_id columns are on the same table, so if they're both indexed postgres can do
  a BitmapOr on the indexes to efficiently find links where one or the other is true.

It would also simplify the code in publishing-api somewhat, by removing most of the places where we need
to call [joins(:link_set)](https://github.com/alphagov/publishing-api/blob/e4b7b185346a1242cb85dda4379ba22318ff0ab5/app/queries/links.rb#L107).

## Steps to implement

We'll need to coordinate changes to the database schema and changes to the code to avoid causing downtime.

We also need to make sure that the [gov-graph](https://docs.publishing.service.gov.uk/repos/govuk-knowledge-graph-gcp.html#content)
code is updated so that it doesn't break during the migration, and so that it eventually uses the new column.

### 1 - Add the new column

First, we'll need to add the new column:

```ruby
class AddLinkSetContentIdToLinks < ActiveRecord::Migration[8.0]
  def change
    add_column :links, :link_set_content_id, :uuid
    add_foreign_key :links, :link_sets, column: :link_set_content_id, primary_key: :content_id
  end
end
```

We can add the foreign key constraint immediately, because we never want to allow values in this column which don't
exist in the link_sets table.

At this point the old link_set_id column is still there, and the code will still be using it. The new column will
initially be all nulls.

### 2 - Update the code to write to both columns

Whenever new links are added, we'll need to write both the link_set_id column and the link_set_content_id column. This
is the first step towards fully populating the new column.

For example,
in [the patch link sets code](https://github.com/alphagov/publishing-api/blob/9447ded9df0678167735f433ecc2acba2c1be6ef/app/commands/v2/patch_link_set.rb#L21)
we would need to write the link_set's content_id into the link_set_content_id column on each link we create. There
might be a few other places in the code where we create links, or if we're lucky this might be the only one.
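The change there is roughly this shape (a sketch only; the surrounding method and attribute names are simplified from the real code):

```ruby
# Sketch: when creating each link, populate the new denormalised column
# alongside the existing link_set_id foreign key (which the association
# sets automatically).
link_set.links.create!(
  link_type: link_type,
  target_content_id: target_content_id,
  # New during the migration: a copy of the owning link set's content_id.
  link_set_content_id: link_set.content_id,
)
```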

### 3 - Populate the new column

We should be able to do something like this:

```sql
update links
set link_set_content_id = link_sets.content_id
from link_sets
where links.link_set_id is not null and links.link_set_id = link_sets.id;
```

On my local database, this did `5,349,185 rows affected in 15 m 3 s 918 ms`. I'm not sure how much locking happened
during this window - we might want to populate the links in batches if we're worried about blocking other queries.
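If we are worried, the backfill could instead run in primary-key batches, so each statement only touches a slice of the table. A sketch (the batch bounds are illustrative, and would be advanced in a loop):

```sql
-- Backfill one slice of the links table; repeat with the next id range
-- until the whole table is covered.
update links
set link_set_content_id = link_sets.content_id
from link_sets
where links.link_set_id = link_sets.id
  and links.id >= 1 and links.id < 100001;
```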

### 4 - Add indexes

There are three indexes on link_set_id in the current schema:

* index_links_on_link_set_id
* index_links_on_link_set_id_and_link_type
* index_links_on_link_set_id_and_target_content_id

The statistics on index scan usage suggest these are all used:

```shell
kubectl exec deploy/publishing-api -- rake pg_extras:index_scans | awk 'NR<7 || /index_links_on_link_set_id/ {print} END {print}'
```
```
+------------+--------+---------------------------------------------------+------------+-------------+
|                              Number of scans performed on indexes                                   |
+------------+--------+---------------------------------------------------+------------+-------------+
| schemaname | table  | index                                             | index_size | index_scans |
+------------+--------+---------------------------------------------------+------------+-------------+
| public     | links  | index_links_on_link_set_id_and_target_content_id  | 909 MB     | 10190392530 |
| public     | links  | index_links_on_link_set_id_and_link_type          | 683 MB     | 1233485761  |
| public     | links  | index_links_on_link_set_id                        | 567 MB     | 6082986136  |
+------------+--------+---------------------------------------------------+------------+-------------+
```

So it makes sense to create the same indexes on link_set_content_id:

```ruby
add_index :links, [:link_set_content_id], algorithm: :concurrently
add_index :links, [:link_set_content_id, :link_type], algorithm: :concurrently
add_index :links, [:link_set_content_id, :target_content_id], algorithm: :concurrently
```

Adding these indexes concurrently will take around twice as long, but will avoid locking the table. (Note that a
migration using `algorithm: :concurrently` needs `disable_ddl_transaction!`.)

Note that these indexes will be larger than the indexes on link_set_id, because uuids are bigger than serial ids.

### 5 - Check that the column is completely populated

As a sanity check, we can verify that all links that have a link_set_id also have the correct value for
link_set_content_id (using `is distinct from` so that nulls count as mismatches):

```sql
select count(*) from links
join link_sets on link_sets.id = links.link_set_id
where link_sets.content_id is distinct from links.link_set_content_id;
```

This should return zero. Note this query may take several minutes, as it needs to scan the whole links table.

### 6 - Update the code to use the new link_set_content_id column

This will be the biggest change to the publishing-api codebase. Everywhere we join links and link_sets, we'll need
to decide whether to remove the join (because we only needed content_id from link_sets, and we can now use
link_set_content_id), or to update the join to use the new column.

We should also update the LinkSet model so that it treats content_id as its primary key instead of id, and make sure
that we update any situations where we're doing something like:

```ruby
LinkSet.find(id)
```

to

```ruby
LinkSet.find(content_id)
```
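The model change itself is small. A sketch, assuming nothing else in the models overrides the keys:

```ruby
class LinkSet < ApplicationRecord
  # Treat the natural key as the primary key; find, associations and
  # destroys then all go through content_id.
  self.primary_key = "content_id"

  has_many :links, foreign_key: :link_set_content_id
end

class Link < ApplicationRecord
  # optional because edition links have no link set.
  belongs_to :link_set, foreign_key: :link_set_content_id, optional: true
end
```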

At this point, the code is using the new column, and we should get all the performance benefits it brings. The rest of
the process is really cleanup, but it's worth doing to avoid leaving things in a confusing state.

### 7 - Drop indexes and constraints on link_set_id

The indexes shouldn't be used anymore, so dropping them should be safe. We can also drop the old foreign key constraint
from links to link_sets, as we don't care about enforcing it anymore.
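A sketch of the corresponding migration (class name illustrative; index names as listed in step 4):

```ruby
class DropLinkSetIdIndexesAndConstraints < ActiveRecord::Migration[8.0]
  def change
    remove_index :links, name: :index_links_on_link_set_id
    remove_index :links, name: :index_links_on_link_set_id_and_link_type
    remove_index :links, name: :index_links_on_link_set_id_and_target_content_id
    # Stop enforcing the old relationship; the column itself goes in step 9.
    remove_foreign_key :links, :link_sets, column: :link_set_id
  end
end
```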

### 8 - Stop writing to the link_set_id column

At this point, we can update the code not to populate the link_set_id column on the links table, so only the
link_set_content_id column is used. We need to be sure that we've done step 6 properly by this point, and that the code
isn't using link_set_id anywhere, as stopping writes to it could otherwise cause confusing bugs.

### 9 - Remove the link_set_id column from links

🔪

### 10 - Remove the id column from link_sets

Assuming we've already removed all the constraints and indexes that use id, it should now be safe enough to do this:

```sql
-- 1. Drop the original primary key
ALTER TABLE link_sets DROP CONSTRAINT link_sets_pkey;

-- 2. Rename the existing index
ALTER INDEX index_link_sets_on_content_id RENAME TO link_sets_pkey;

-- 3. Create the new primary key using the existing index
ALTER TABLE link_sets ADD PRIMARY KEY USING INDEX link_sets_pkey;

-- 4. Drop the old column
ALTER TABLE link_sets DROP COLUMN id;
```

We could run these in a rails migration using the execute method, though it might be better to just do them in a SQL
console.

Probably best to update the schema.rb file at this point too.
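If we did use a migration, it might look like this (a sketch; `up` only, as the change isn't cleanly reversible):

```ruby
class MoveLinkSetsPrimaryKeyToContentId < ActiveRecord::Migration[8.0]
  def up
    execute "ALTER TABLE link_sets DROP CONSTRAINT link_sets_pkey"
    execute "ALTER INDEX index_link_sets_on_content_id RENAME TO link_sets_pkey"
    execute "ALTER TABLE link_sets ADD PRIMARY KEY USING INDEX link_sets_pkey"
    execute "ALTER TABLE link_sets DROP COLUMN id"
  end
end
```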

### 11 - Update documentation

We should update the documentation to reflect the new column names, and to remove any references to the old column.
For example, [this diagram](https://docs.publishing.service.gov.uk/repos/publishing-api/model.html#diagram) should be
updated to make it clear that the relationship between Link and content_id is direct, rather than through LinkSet.

### 12 - Celebrate 🎉

Link expansion is now easier to do! We need fewer joins, and we can get more efficient query plans! Hooray!

## Consequences

### Positive

- Link expansion SQL queries will be simpler (requiring fewer joins) and faster (making better use of indexes)
- The publishing-api codebase will be simpler, as we'll be able to remove a lot of the join logic

### Negative

- Index size will increase, as the new indexes will be larger than the old ones
- We'll have to make the changes in a number of steps, some of which may be somewhat risky. This should be manageable
  with careful planning and testing.