# Decision Record: Change link_sets primary key from serial to content_id

## Context

Publishing API has the concept of "link sets".

A link set comprises all the links belonging to a particular content_id. Links can have various link types and
positions, but if they belong to the same content_id, they're part of the same link set.

Link sets are exposed through the API as a resource - you can get the link set of a content_id
using [GET /v2/links/:content_id](https://github.com/alphagov/publishing-api/blob/main/docs/api.md#get-v2linkscontent_id),
and you can update them
using [PATCH /v2/links/:content_id](https://github.com/alphagov/publishing-api/blob/main/docs/api.md#patch-v2linkscontent_id).

Because there's a potential race condition between getting and updating a link set, link sets have a lock version
column. For example, suppose two processes fetch a link set which looks like this:

```json
{
  "links": {
    "organisations": [
      "591436ab-c2ae-416f-a3c5-1901d633fbfb"
    ]
  }
}
```

If the first process then attempts to add `492b07ba-1a83-4163-9d1f-37fa8a088b77` as an organisations link, and the
second process attempts to add `489743df-25bd-45b3-b799-5bfcab131cce`, the request that arrives later should fail
because it's not updating the latest version of the resource.

To enable this, link_sets have a stale_lock_version column, which is incremented on every update.

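The locking behaviour can be sketched in plain Ruby (this is an illustration of the idea, not the actual Publishing API
implementation - the `LinkSet` struct and `patch_links` method here are hypothetical):

```ruby
# Hypothetical sketch of optimistic locking with a lock version: a writer
# must quote the version it last saw, and fails if another writer has
# incremented the version in the meantime.
LinkSet = Struct.new(:content_id, :stale_lock_version, :links)

def patch_links(link_set, new_links, previous_version:)
  if previous_version != link_set.stale_lock_version
    raise "409 Conflict: stale lock version"
  end
  link_set.links |= new_links
  link_set.stale_lock_version += 1
end

set = LinkSet.new("content-id-of-the-document", 0, ["591436ab-c2ae-416f-a3c5-1901d633fbfb"])

# Both writers saw version 0; only the first succeeds.
patch_links(set, ["492b07ba-1a83-4163-9d1f-37fa8a088b77"], previous_version: 0)
begin
  patch_links(set, ["489743df-25bd-45b3-b799-5bfcab131cce"], previous_version: 0)
rescue RuntimeError => e
  puts e.message
end
```
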
## Problem

The current database schema has link sets as a separate table, with a foreign key relationship to the links table.

```mermaid
erDiagram
    Link {
        serial id PK
        integer link_set_id FK
    }
    Link }o--o| LinkSet: ""
    LinkSet {
        serial id PK
        uuid content_id
        integer stale_lock_version
    }
```

Although this is nicely normalized, the presence of the link_sets table necessitates an additional join when following
the links of a document. Assuming we already know the content_id we're starting from, and the link_type we're following,
we need to do:

```sql
select links.*
from link_sets
join links on link_sets.id = links.link_set_id
where link_sets.content_id = ?
and links.link_type = ?
```

While postgres is usually pretty good at using appropriate indexes to join these two tables together, there are
occasions when a query we would like to write can't be planned efficiently. For example, if we knew the edition id and
content id of a particular edition, and we wanted to find both link set and edition links, we might want to do:

```sql
select links.* from link_sets
join links on link_sets.id = links.link_set_id
where (link_sets.content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d' or links.edition_id = 6488459)
and links.link_type = 'organisations';
```

However, because content_id and edition_id are on different tables, postgres can't combine the two indexes.
Instead, it will come up with a rubbish query plan which takes tens of seconds to
execute. [As described in this article](https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/), this
can be worked around by doing a UNION ALL instead of an OR, but this increases the complexity of the query
significantly.

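For illustration, the UNION ALL version of the query above would look something like this (a sketch only - it assumes a
link row belongs either to a link set or to an edition, so the two branches don't produce duplicate rows):

```sql
-- Link set links for the content_id...
select links.* from link_sets
join links on link_sets.id = links.link_set_id
where link_sets.content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d'
and links.link_type = 'organisations'
union all
-- ...plus edition links for the edition id
select links.* from links
where links.edition_id = 6488459
and links.link_type = 'organisations';
```
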
The link sets table has two columns with unique constraints - the id column (which is a serial) and the content_id
column (which is a uuid). This means there's a one-to-one equivalence between id and content_id.

The problem is that the links table refers to link_sets with a foreign key to its id. If it used a foreign
key to content_id instead, we would be able to avoid the join in link expansion situations, and postgres would be able
to use its [clever index combination techniques](https://www.postgresql.org/docs/current/indexes-bitmap-scans.html)
(BitmapOr specifically) to come up with efficient plans.

## Solution

Glossing over the steps to make this change, which are below, the proposed solution is that we would replace the
link_set_id column on links with a link_set_content_id column, and a corresponding foreign key. So we'd have:

```mermaid
erDiagram
    Link {
        serial id PK
        uuid link_set_content_id FK
    }
    Link }o--o| LinkSet: ""
    LinkSet {
        uuid content_id PK
        integer stale_lock_version
    }
```

Link sets would still exist as their own model and table, as we need the lock version column for optimistic locking.

Because the links table would now have a column with the content_id in it, we could avoid joining link_sets when doing
link expansion, instead doing something like:

```sql
select * from links
where (links.link_set_content_id = '77fb899a-a86c-4bf4-858f-229ea665f02d' or links.edition_id = 6488459)
and links.link_type = 'organisations';
```

Even though this looks very similar to the query above, it's much more performant. This is for two reasons:

- We avoid the unnecessary join to link_sets
- Both the link_set_content_id and edition_id columns are on the same table, so if they're both indexed postgres can do
  a BitmapOr on the indexes to efficiently find links where one or the other is true.

It would also simplify the code in publishing-api somewhat, by removing most of the places where we need
to call [joins(:link_set)](https://github.com/alphagov/publishing-api/blob/e4b7b185346a1242cb85dda4379ba22318ff0ab5/app/queries/links.rb#L107).

## Steps to implement

We'll need to coordinate changes to the database schema and changes to the code to avoid causing downtime.

We also need to make sure that the [gov-graph](https://docs.publishing.service.gov.uk/repos/govuk-knowledge-graph-gcp.html#content)
code is updated so that it doesn't break during the migration, and so that it eventually uses the new column.

### 1 - Add the new column

Firstly, we'll need to add the new column:

```ruby
class AddLinkSetContentIdToLinks < ActiveRecord::Migration[8.0]
  def change
    add_column :links, :link_set_content_id, :uuid
    add_foreign_key :links, :link_sets, column: :link_set_content_id, primary_key: :content_id
  end
end
```

We can add the foreign key constraint immediately, because we never want to allow values in this column which don't
exist in the link_sets table.

At this point the old link_set_id column is still there, and the code will still be using that. The new column will
initially be all nulls.

### 2 - Update the code to write to both columns

Whenever new links are added, we'll need to write both the link_set_id column and the link_set_content_id column. This
is the first step towards fully populating the new column.

For example,
in [the patch link sets code](https://github.com/alphagov/publishing-api/blob/9447ded9df0678167735f433ecc2acba2c1be6ef/app/commands/v2/patch_link_set.rb#L21)
we would need to write link_set.content_id to the link_set_content_id column on each link we create. There might be a
few other places in the code where we create links - or if we're lucky, this might be the only one.

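As a rough sketch (the attribute names and call site here are assumptions for illustration, not the actual
patch_link_set code), creating a link during the transition period would populate both columns:

```ruby
# Hypothetical sketch: write the parent link set's content_id onto each new
# link, alongside the existing link_set_id association.
link_set.links.create!(
  target_content_id: target_content_id,
  link_type: link_type,
  position: position,
  link_set_content_id: link_set.content_id # the new column
)
```
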
### 3 - Populate the new column

We should be able to do something like this:

```sql
update links
set link_set_content_id = link_sets.content_id
from link_sets
where links.link_set_id is not null and links.link_set_id = link_sets.id;
```

On my local database, this did `5,349,185 rows affected in 15 m 3 s 918 ms`. I'm not sure how much locking happened during
this window - we might want to populate the links in batches if we're worried about blocking other queries.

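If we do batch it, something like this sketch could work (the batch size is arbitrary, and the statement is repeated
until it reports zero rows updated):

```sql
-- Hypothetical batched backfill: run repeatedly until no rows are updated
update links
set link_set_content_id = link_sets.content_id
from link_sets
where links.link_set_id = link_sets.id
and links.id in (
  select id from links
  where link_set_id is not null and link_set_content_id is null
  limit 10000
);
```
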
### 4 - Add indexes

There are three indexes on link_set_id in the current schema:

* index_links_on_link_set_id
* index_links_on_link_set_id_and_link_type
* index_links_on_link_set_id_and_target_content_id

The statistics on index scan usage suggest these are all used:

```shell
kubectl exec deploy/publishing-api -- rake pg_extras:index_scans | awk 'NR<7 || /index_links_on_link_set_id/ {print} END {print}'
```
```
+-------------------------------------------------------------------------------------------------------------------------+
| Number of scans performed on indexes                                                                                     |
+------------+----------------------+---------------------------------------------------------+------------+--------------+
| schemaname | table                | index                                                   | index_size | index_scans  |
+------------+----------------------+---------------------------------------------------------+------------+--------------+
| public     | links                | index_links_on_link_set_id_and_target_content_id       | 909 MB     | 10190392530  |
| public     | links                | index_links_on_link_set_id_and_link_type               | 683 MB     | 1233485761   |
| public     | links                | index_links_on_link_set_id                             | 567 MB     | 6082986136   |
+------------+----------------------+---------------------------------------------------------+------------+--------------+
```

So it makes sense to create the same indexes on link_set_content_id:

```ruby
add_index :links, [:link_set_content_id], algorithm: :concurrently
add_index :links, [:link_set_content_id, :link_type], algorithm: :concurrently
add_index :links, [:link_set_content_id, :target_content_id], algorithm: :concurrently
```

Adding these indexes concurrently will take around twice as long, but will avoid locking the table against writes.

Note that these indexes will be larger than the indexes on link_set_id, because uuids are bigger than serial ids.

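Because `algorithm: :concurrently` can't run inside a transaction, the migration would need
`disable_ddl_transaction!` - something like this sketch (the class name is assumed):

```ruby
# Hypothetical migration: concurrent index creation has to run outside the
# usual migration transaction.
class AddLinkSetContentIdIndexesToLinks < ActiveRecord::Migration[8.0]
  disable_ddl_transaction!

  def change
    add_index :links, [:link_set_content_id], algorithm: :concurrently
    add_index :links, [:link_set_content_id, :link_type], algorithm: :concurrently
    add_index :links, [:link_set_content_id, :target_content_id], algorithm: :concurrently
  end
end
```
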
### 5 - Check that the column is completely populated

As a sanity check, we can check that all the links that have a link_set_id also have the correct value for
link_set_content_id:

```sql
select count(*) from links
join link_sets on link_sets.id = links.link_set_id
where link_sets.content_id is distinct from links.link_set_content_id;
```

This should return zero. Using `is distinct from` rather than `!=` means that rows where link_set_content_id is still
null count as mismatches too. Note this query may take several minutes, as it needs to scan the whole links table.

### 6 - Update the code to use the new link_set_content_id column

This will be the biggest change to the publishing-api codebase. Everywhere we join links and link_sets we'll need
to decide whether to remove the join (because we only needed content_id from link_sets, and we can now use
link_set_content_id), or to update the join to use the new column.

We should also update the LinkSet model so that it treats content_id as its primary key instead of id, and make sure
that we update any situations where we're doing something like:

```ruby
LinkSet.find(id)
```

to

```ruby
LinkSet.find(content_id)
```

At this point, the code is using the new column, and we should get all the performance benefits it brings. The rest of
the process is really just cleanup, but it's worth doing to avoid leaving things in a confusing state.

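The model changes might look something like this sketch (the association options are inferred from the schema above,
not taken from the actual codebase):

```ruby
# Hypothetical sketch of the models once content_id is the link_sets primary key.
class LinkSet < ApplicationRecord
  self.primary_key = "content_id"

  has_many :links, foreign_key: :link_set_content_id, primary_key: :content_id
end

class Link < ApplicationRecord
  belongs_to :link_set, optional: true,
             foreign_key: :link_set_content_id, primary_key: :content_id
end
```
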
### 7 - Drop indexes and constraints on link_set_id

The indexes shouldn't be used anymore, so dropping them should be safe. We can also drop the foreign key constraint
from links to link_sets, as we don't care about enforcing it anymore.

### 8 - Stop writing to the link_set_id column

At this point, we can update the code not to populate the link_set_id column on the links table, so only the
link_set_content_id column is used. We need to be sure that we've done step 6 properly at this point, and that the code
isn't using link_set_id anywhere - otherwise, stopping writing to it could cause confusing bugs.

### 9 - Remove the link_set_id column from links

🔪

### 10 - Remove the id column from link_sets

Assuming we've already removed all the constraints and indexes that use id, it should now be safe enough to do this:

```sql
-- 1. Drop the original primary key
ALTER TABLE link_sets DROP CONSTRAINT link_sets_pkey;

-- 2. Rename the existing unique index on content_id
ALTER INDEX index_link_sets_on_content_id RENAME TO link_sets_pkey;

-- 3. Create the new primary key using the existing index
ALTER TABLE link_sets ADD PRIMARY KEY USING INDEX link_sets_pkey;

-- 4. Drop the old column
ALTER TABLE link_sets DROP COLUMN id;
```

We could do these in a rails migration using the execute method. Might be better to just
do them in a SQL console though.
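
If we did wrap them in a migration, a sketch (the class name is assumed; `ADD PRIMARY KEY USING INDEX` requires the
index to be unique, which index_link_sets_on_content_id is):

```ruby
# Hypothetical migration wrapping the raw SQL above with execute.
class MakeContentIdThePrimaryKeyOfLinkSets < ActiveRecord::Migration[8.0]
  def up
    execute "ALTER TABLE link_sets DROP CONSTRAINT link_sets_pkey;"
    execute "ALTER INDEX index_link_sets_on_content_id RENAME TO link_sets_pkey;"
    execute "ALTER TABLE link_sets ADD PRIMARY KEY USING INDEX link_sets_pkey;"
    execute "ALTER TABLE link_sets DROP COLUMN id;"
  end
end
```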

Probably best to update the schema.rb file at this point too.

### 11 - Update documentation

We should update the documentation to reflect the new column names, and to remove any references to the old column.
For example [this diagram](https://docs.publishing.service.gov.uk/repos/publishing-api/model.html#diagram) should be
updated to make it clear that the relationship between Link and content_id is direct, rather than through LinkSet.

### 12 - Celebrate 🎉

Link expansion is now easier to do! We need fewer joins, and we can get more efficient query plans! Hooray!

## Consequences

### Positive

- Link expansion SQL queries will be simpler (requiring fewer joins) and faster (making better use of indexes)
- The publishing-api codebase will be simpler, as we'll be able to remove a lot of the join logic

### Negative

- Index size will increase, as the new indexes will be larger than the old ones
- We'll have to make the changes in a number of steps, some of which may be somewhat risky. This should be manageable
  with careful planning and testing.