improved TUF artifact replication robustness #7519
Conversation
// This is the equivalent of applying `#[serde(transparent)]`, but that has a
// side effect of changing the JsonSchema derive to no longer emit a schema.
impl Serialize for Generation {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        self.0.serialize(serializer)
    }
}
I want to call out this change -- I believe there is a bug in progenitor 0.9.x where newtype structs that do not have #[serde(transparent)] cannot be serialized in a query string, and I need to go file an issue for it. In the meantime I think it is more accurate to manually implement Serialize in Omicron. This change does not affect existing JSON serialization because serde_json already treats newtype structs as their inner value.
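For illustration, here is a minimal standalone sketch (using hypothetical Generation stand-ins, not the Omicron type) showing that serde_json produces the same output for a derived newtype struct and for the manual impl:

```rust
use serde::{Serialize, Serializer};

// Hypothetical stand-ins for the real Omicron type, for illustration only.
#[derive(Serialize)]
struct DerivedGeneration(u64);

struct ManualGeneration(u64);

impl Serialize for ManualGeneration {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        // Delegate to the inner value, like `#[serde(transparent)]` would.
        self.0.serialize(serializer)
    }
}

fn main() {
    // Both forms produce the same JSON: `7`.
    assert_eq!(serde_json::to_string(&DerivedGeneration(7)).unwrap(), "7");
    assert_eq!(serde_json::to_string(&ManualGeneration(7)).unwrap(), "7");
}
```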
I've only taken a look at the docs so far, but looks solid! Thanks for writing that up. I'll finish the review later or tomorrow.
dataset on each M.2 device: the file name of each stored artifact is its
SHA-256 hash.

It also stores an _artifact configuration_ in memory: a list of all
I think this configuration should be stored on disk. Otherwise, if there is a crash or restart of sled-agent or the sled, we have the same problem that drove the use of a generation number in the first place: a Nexus running at a prior generation may attempt to remove packages from a later generation.
In practice I think this means writing the expected artifact configuration first and then using it to check for artifacts that need pruning or that haven't been delivered yet.
When @davepacheco and I were discussing this pattern we weren't convinced that storing the generation number or the entire configuration on disk is worth the work to get it right. The most damning part is that in order to keep the files consistent you would need to use something like an RwLock anyway, which means holding a lock that prevents anything else from happening while performing synchronous I/O operations. By keeping it in memory, the lock is only held for the shortest possible time within code that does not perform I/O.
On the other hand we do use this pattern already for the ledgers; I haven't read that code to see whether we've thought about what to do when either ledger is corrupted or how we maintain consistency with the filesystem.
This is a correctness issue though, right? Without this work the scenario above can cause the same problems.
FWIW, I would still keep the generation in memory. I would just sync it to the ledger when it changes. Then if there is a crash, reload it upon startup.
On the other hand we do use this pattern already for the ledgers; I haven't read that code to see whether we've thought about what to do when either ledger is corrupted or how we maintain consistency with the filesystem.
There is an issue here if an M.2 dies and comes back to life, but other than that I think the ledger abstraction is sound. I wrote up the issue a long time ago when ledgers were initially introduced: #2972 (review)
I've added a ledger that includes only the generation number; this comment explains why I chose not to write the entire config:
omicron/sled-agent/src/artifact_store.rs
Lines 909 to 922 in 41c5d11
/// The format for the `ledger.json` file in the artifact dataset.
///
/// We store only the generation number. This prevents us from using possibly
/// out-of-date information on which artifacts to delete in the delete
/// reconciler; no actions will take place until Nexus explicitly tells us what
/// the current state of the world is.
///
/// We must store the most recent generation number to disk to prevent the case
/// where we start and a Nexus instance with out-of-date information tells us
/// about a configuration.
#[derive(Debug, Deserialize, Serialize)]
struct LedgerFormat {
    generation: Generation,
}
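For context, here is a rough sketch of how a generation-only ledger might be loaded at startup. It uses plain serde_json and tokio::fs rather than the actual Ledger abstraction in omicron-common; the file name, types, and error handling are simplifications:

```rust
use serde::{Deserialize, Serialize};
use std::path::Path;

// Simplified stand-ins; the real types live in sled-agent and omicron-common.
#[derive(Debug, Deserialize, Serialize)]
struct LedgerFormat {
    generation: u64,
}

/// Load the last-committed generation, if any. A missing file just means no
/// configuration has ever been accepted, so we start with `None` and wait for
/// Nexus to tell us the current state of the world.
async fn load_generation(dataset: &Path) -> anyhow::Result<Option<u64>> {
    let path = dataset.join("ledger.json");
    match tokio::fs::read(&path).await {
        Ok(bytes) => {
            let ledger: LedgerFormat = serde_json::from_slice(&bytes)?;
            Ok(Some(ledger.generation))
        }
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e.into()),
    }
}
```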
) -> Result<Inventory> {
) -> Result<(ArtifactConfig, Inventory)> {
    let generation =
        self.datastore.update_tuf_generation_get(opctx).await?;
    let mut inventory = Inventory::default();
    let mut paginator = Paginator::new(SQL_BATCH_SIZE);
    while let Some(p) = paginator.next() {
When we call this, the generation can change out from underneath us, or new artifacts can be added after we have already read the old generation. This stems from the fact that the generation is not coupled to any set of artifacts, so the database doesn't tell you which artifacts a generation is tied to. They are updated independently and read independently. There needs to be some kind of logical mapping exposed in the database.
How do you logically map a generation number to deleted artifacts?
It might also be possible to get the artifact list and the generation number simultaneously via a JOIN or a transaction (I'm not sure, I don't dabble in CRDB consistency much...).
Sorry, this was based on my faulty understanding above where I didn't see that writes were in a transaction. I think you can slap the generation read and pagination in a transaction and this should solve the issue here.
I decided that putting pagination within a transaction seemed arduous. So, I have added a logical mapping with a generation_added column. This ensures that reading the list of artifacts in a generation is consistent even if another generation is added in between pages. The plan is still to delete artifact rows when all the repos referencing them are deleted, but we could change that plan and add a generation_deleted column later instead.
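To illustrate the consistency argument, here is a self-contained sketch (an in-memory stand-in for the database, with made-up field names): the reader pins a generation up front, and artifacts added under a later generation never leak into the middle of a paginated scan:

```rust
#[derive(Clone, Debug)]
struct ArtifactRow {
    id: u64,
    generation_added: u64,
}

/// One "page" of a paginated read: rows with id > cursor, limited to
/// `page_size`, restricted to the generation pinned at the start of the scan.
fn read_page(rows: &[ArtifactRow], pinned_gen: u64, cursor: u64, page_size: usize) -> Vec<ArtifactRow> {
    rows.iter()
        .filter(|r| r.generation_added <= pinned_gen && r.id > cursor)
        .take(page_size)
        .cloned()
        .collect()
}

fn main() {
    let mut rows = vec![
        ArtifactRow { id: 1, generation_added: 3 },
        ArtifactRow { id: 2, generation_added: 3 },
    ];
    let pinned_gen = 3;

    // First page.
    let page1 = read_page(&rows, pinned_gen, 0, 1);

    // A new repo lands between pages, adding an artifact at generation 4.
    rows.push(ArtifactRow { id: 3, generation_added: 4 });

    // Second page: the generation-4 artifact is filtered out, so the scan
    // still describes exactly the artifacts known at generation 3.
    let page2 = read_page(&rows, pinned_gen, page1.last().unwrap().id, 10);
    assert_eq!(page2.iter().map(|r| r.id).collect::<Vec<_>>(), vec![2]);
}
```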
I could use an extra (few) sets of eyes on the modified implementation of DataStore::update_tuf_repo_insert (more specifically the insert_impl function). In particular, we start the transaction by fetching the current generation, computing the new generation, and filling in the generation_added field on all the artifacts:
omicron/nexus/db-queries/src/db/datastore/update.rs
Lines 181 to 187 in 41c5d11
// Load the current generation from the database and increment it, then
// use that when creating the `TufRepoDescription`. If we determine there
// are any artifacts to be inserted, we update the generation to this value
// later.
let old_generation = get_generation(&conn).await?;
let new_generation = old_generation.next();
let desc = TufRepoDescription::from_external(desc.clone(), new_generation);
Then, if we determine new artifacts are to be inserted, we write the new generation number:
omicron/nexus/db-queries/src/db/datastore/update.rs
Lines 311 to 325 in 41c5d11
if !new_artifacts.is_empty() {
    // Since we are inserting new artifacts, we need to bump the
    // generation number.
    debug!(log, "setting new TUF repo generation";
        "generation" => new_generation,
    );
    put_generation(&conn, old_generation.into(), new_generation.into())
        .await?;

    // Insert new artifacts into the database.
    diesel::insert_into(dsl::tuf_artifact)
        .values(new_artifacts)
        .execute_async(&conn)
        .await?;
}
This only updates the generation if it currently equals the old generation, and returns an error if no rows were updated:
omicron/nexus/db-queries/src/db/datastore/update.rs
Lines 373 to 389 in 41c5d11
async fn put_generation(
    conn: &async_bb8_diesel::Connection<crate::db::DbConnection>,
    old_generation: nexus_db_model::Generation,
    new_generation: nexus_db_model::Generation,
) -> Result<nexus_db_model::Generation, DieselError> {
    use db::schema::tuf_generation::dsl;

    // We use `get_result_async` instead of `execute_async` to check that we
    // updated exactly one row.
    diesel::update(dsl::tuf_generation.filter(
        dsl::singleton.eq(true).and(dsl::generation.eq(old_generation)),
    ))
    .set(dsl::generation.eq(new_generation))
    .returning(dsl::generation)
    .get_result_async(conn)
    .await
}
I'm not 100% sure this is the right way to do this; if the generation number is incremented by another transaction first, I would prefer to retry this transaction than return an unretryable error. I don't understand enough about whether CockroachDB would detect this as a transaction conflict and tell us to retry it.
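To sketch the retry behavior I would prefer (purely hypothetical, with made-up names; not how the datastore currently behaves), the zero-rows-updated case could be surfaced as a distinct conflict and the whole transaction re-run a bounded number of times:

```rust
/// Hypothetical caller-side retry for the "another writer bumped the
/// generation first" case; not the current datastore behavior.
#[derive(Debug)]
enum InsertError {
    /// `put_generation` matched zero rows: the compare-and-swap on the
    /// generation lost a race. Safe to retry the whole transaction.
    GenerationConflict,
    /// Anything else is surfaced to the caller.
    Other(String),
}

fn insert_with_retry(
    mut attempt: impl FnMut() -> Result<(), InsertError>,
    max_attempts: usize,
) -> Result<(), InsertError> {
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => return Ok(()),
            // Re-run the transaction; it re-reads the now-current generation
            // and picks the next one.
            Err(InsertError::GenerationConflict) => continue,
            Err(e) => return Err(e),
        }
    }
    Err(InsertError::GenerationConflict)
}
```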
I like this strategy a lot and think it should work with the serializable constraints of the DB. Thanks for adding this support!
I could use an extra (few) sets of eyes looking at the modified implementation of DataStore::update_tuf_repo_insert (more specifically the insert_impl function). In particular, we start the transaction by fetching the current generation and selecting the new generation, and filling in the generation_added field in all the artifacts:
I took a pretty close look and it looks great AFAICT.
I'm not 100% sure this is the right way to do this; if the generation number is incremented by another transaction first, I would prefer to retry this transaction than return an unretryable error. I don't understand enough about whether CockroachDB would detect this as a transaction conflict and tell us to retry it.
I don't think CRDB will retry for us, but I'm not really sure. Would it be worth adding a test for this?
This most recent push only includes a merge from main and fixes for some of the docs nits; I'm going to be working on writing the generation number out to a ledger on the filesystem and making the artifact list query more consistent.
sled-agent/src/artifact_store.rs
Outdated
// These paths are defined under the artifact storage dataset. They
// cannot conflict with any artifact paths because all artifact paths are
// hexadecimal-encoded SHA-256 checksums.
const LEDGER_PATH: &str = "ledger.json";
Even though this lives in the update dataset, it may be helpful to give this a more specific name like artifacts_ledger.json
sled-agent/src/artifact_store.rs
Outdated
@@ -697,8 +906,35 @@ impl From<Error> for HttpError {
}
}

/// The format for the `ledger.json` file in the artifact dataset.
///
/// We store only the generation number. This prevents us from using possibly
What's the concern around out-of-date artifacts during delete reconciliation?
The key safety property for reconciliation should be that updates cannot go backwards. If you always write the new configuration before you use it, you can probably avoid an issue of staleness.
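As a sketch of that "never go backwards" property (hypothetical, simplified types; not the sled-agent implementation): a new configuration is rejected unless its generation is strictly newer than the one already committed:

```rust
// Hypothetical, simplified acceptance check for illustration only.
#[derive(Clone)]
struct ArtifactConfig {
    generation: u64,
    artifacts: Vec<String>, // SHA-256 hex digests
}

enum PutError {
    GenerationNotNewer { current: u64, offered: u64 },
}

fn check_monotonic(current: Option<&ArtifactConfig>, new: &ArtifactConfig) -> Result<(), PutError> {
    match current {
        // Reject anything at or below the generation we already committed, so
        // a Nexus holding stale state cannot roll the configuration back.
        // (An equal generation with an identical artifact list could instead
        // be treated as an idempotent no-op.)
        Some(cur) if new.generation <= cur.generation => Err(PutError::GenerationNotNewer {
            current: cur.generation,
            offered: new.generation,
        }),
        _ => Ok(()),
    }
}
```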
sled-agent/src/artifact_store.rs
Outdated
    // the guard is dropped.
    config: watch::Sender<Option<ArtifactConfig>>,
    // This mutex must only be unlocked in `put_config`.
    ledger: Mutex<Ledger<LedgerFormat>>,
I don't think it's a good idea to use a tokio::Mutex here. We've been trying to get away from them for the reasons described in RFD 397 and RFD 400.
#7347 is doing something very similar to this code, and after lots of review discussion we ended up using a tokio task to serialize requests instead, as suggested in RFD 400. That would probably work well here as well.
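For reference, here is a rough sketch of that pattern (all names made up): a dedicated tokio task owns the ledger state and serializes configuration updates arriving over an mpsc channel, so no async mutex is held across I/O:

```rust
use tokio::sync::{mpsc, oneshot, watch};

// Hypothetical message type; the real code would carry an ArtifactConfig.
struct PutConfig {
    generation: u64,
    reply: oneshot::Sender<Result<(), String>>,
}

/// The task owns the persisted generation outright; requests are processed
/// one at a time in arrival order, so no lock is needed.
async fn config_task(
    mut rx: mpsc::Receiver<PutConfig>,
    current: watch::Sender<Option<u64>>,
) {
    let mut persisted: Option<u64> = None;
    while let Some(req) = rx.recv().await {
        let result = if persisted.is_some_and(|g| req.generation <= g) {
            Err(format!("generation {} is not newer", req.generation))
        } else {
            // Persist to the ledger here (omitted), *then* publish in memory.
            persisted = Some(req.generation);
            current.send_replace(Some(req.generation));
            Ok(())
        };
        let _ = req.reply.send(result);
    }
}
```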
The mutex is held over an await, so I ended up opting for it. But I can swap it out for another task.
sled-agent/src/artifact_store.rs
Outdated
) -> Result<(), Error> {
    // A lock on the ledger Mutex is held until the end of this function.
    // Only one request to put a new configuration is processed at a time.
    let mut ledger = self.ledger.lock().await;
If this lock goes away in favor of a task, you can modify the ledger and then update the watch inside the task. Since the only place the generation can be updated is inside the task, you also likely don't need to use send_if_modified and can instead write the ledger if the generation changes, then unconditionally set the watch channel value.
This last part seems pretty important. You don't want to set the value in the watch channel before committing the ledger, as that leads to a race condition where the sled-agent can report the current configuration to Nexus and then restart with the old configuration. While this may be safe with the current Nexus code and retries, in my experience it's almost always better to persist the consistent state before changing the in-memory value. We discussed this in the context of omicron zones in #5086. In fact, now that John has merged #7762, he'll be fixing that issue soon.
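A tiny sketch of the ordering being recommended here (hypothetical helper, simplified error handling): commit the ledger to disk first, and only then make the new value visible through the watch channel:

```rust
use std::path::Path;
use tokio::sync::watch;

/// Persist the new generation, then publish it. If the write fails, the
/// in-memory value is left untouched, so we never report state that we
/// could lose across a restart.
async fn commit_then_publish(
    ledger_path: &Path,
    new_generation: u64,
    current: &watch::Sender<Option<u64>>,
) -> std::io::Result<()> {
    // 1. Durably record the new generation (a real implementation would
    //    write both M.2 copies).
    tokio::fs::write(ledger_path, new_generation.to_string()).await?;
    // 2. Only now update what we report to Nexus and to other tasks.
    current.send_replace(Some(new_generation));
    Ok(())
}
```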
    }
    return ControlFlow::Continue(BTreeMap::new());
}
It's possible that the sled-agent crashes here. If we don't persist the artifact list with the generation, then I believe we'll get back an Error::NoConfig upon restart, assuming everything on the sled-agent side comes back in time. In that case we'll wait for the next call to the background task. This seems safe, but also a bit unnecessary if this race did occur. Instead, the ledger could be read and the artifact list loaded directly from the file, if the ledger was always written before the watch config was updated.
Closes #7399.
Nexus now owns and maintains a generation number for the set of artifacts the system wants to be fully replicated, which is used by Sled Agent to prevent conflicts. The generation number is stored in a new singleton table based on the existing db_metadata singleton. I wrote up docs/tuf-artifact-replication.adoc to provide a top-level overview of the system and some of the conflicts that this refactor seeks to prevent.

The Sled Agent artifact store APIs are modified. Two new APIs exist for getting and putting an "artifact configuration", which is the list of wanted artifacts and its associated generation number. The list request returns the current generation number as well, and the PUT and "copy from depot" requests require an up-to-date generation number in the query string. The delete API is removed in favor of Sled Agent managing deletions on its own whenever the configuration is updated.
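As a rough illustration of the new API surface (field and parameter names are guesses, not necessarily the actual Omicron definitions): the artifact configuration pairs the generation with the set of wanted artifact hashes, and write operations name the generation they expect:

```rust
use std::collections::BTreeSet;

use serde::{Deserialize, Serialize};

// Rough sketch only; the real types live in the sled-agent API crates.
#[derive(Clone, Debug, Deserialize, Serialize)]
struct ArtifactConfig {
    /// Generation owned and incremented by Nexus.
    generation: u64,
    /// SHA-256 digests (hex) of every artifact that should be replicated.
    artifacts: BTreeSet<String>,
}

/// Query parameters for artifact PUT and "copy from depot" requests: the
/// caller names the generation it believes is current, and the request is
/// rejected if the sled's configuration has moved on.
#[derive(Debug, Deserialize)]
struct ArtifactPutQuery {
    generation: u64,
}
```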