Skip to content

Commit 1a19204

Browse files
committed
Address comments
1 parent 13c5ded commit 1a19204

File tree

6 files changed

+94
-29
lines changed

6 files changed

+94
-29
lines changed

docs/source/usage.rst

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -56,8 +56,8 @@ Additional optional arguments:
5656
than MAXSIZE except when individual input files exceed MAXSIZE (as
5757
individual files are never split up between different tar files).
5858
* ``--non-blocking`` Zstash will submit a Globus transfer and immediately create a subsequent tarball. That is, Zstash will not wait until the transfer completes to start creating a subsequent tarball. On machines where it takes more time to create a tarball than transfer it, each Globus transfer will have one file. On machines where it takes less time to create a tarball than transfer it, the first transfer will have one file, but the number of tarballs in subsequent transfers will grow finding dynamically the most optimal number of tarballs per transfer. NOTE: zstash is currently always non-blocking.
59-
* ``--error-on-duplicate-tar`` Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.
60-
* ``--overwrite-duplicate-tars`` If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.
59+
* ``--error-on-duplicate-tar`` FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.
60+
* ``--overwrite-duplicate-tars`` FOR ADVANCED USERS ONLY: If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.
6161
* ``-v`` increases output verbosity.
6262

6363
Local tar files as well as the sqlite3 index database (index.db) will be stored
@@ -155,7 +155,7 @@ where
155155
an incomplete tar file, then the archive you're checking
156156
must have been created using ``zstash >= v1.1.0``.
157157
* ``--tars`` to specify specific tars to check. See below for example usage.
158-
* ``--error-on-duplicate-tar`` Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the sizes and md5sums match *at least one* of the tars.
158+
* ``--error-on-duplicate-tar`` FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the size matches the *most recent* entry.
159159
* ``-v`` increases output verbosity.
160160
* ``[files]`` is a list of files to check (standard wildcards supported).
161161

@@ -243,8 +243,8 @@ where
243243
they have been extracted from the archive. Normally, they are deleted after
244244
successful transfer.
245245
* ``--non-blocking`` Zstash will submit a Globus transfer and immediately create a subsequent tarball. That is, Zstash will not wait until the transfer completes to start creating a subsequent tarball. On machines where it takes more time to create a tarball than transfer it, each Globus transfer will have one file. On machines where it takes less time to create a tarball than transfer it, the first transfer will have one file, but the number of tarballs in subsequent transfers will grow finding dynamically the most optimal number of tarballs per transfer. NOTE: zstash is currently always non-blocking.
246-
* ``--error-on-duplicate-tar`` Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.
247-
* ``--overwrite-duplicate-tars`` If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.
246+
* ``--error-on-duplicate-tar`` FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.
247+
* ``--overwrite-duplicate-tars`` FOR ADVANCED USERS ONLY: If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.
248248
* ``-v`` increases output verbosity.
249249

250250
Note: in the event that an update includes revisions to files previously archived, ``zstash update``
@@ -324,7 +324,7 @@ where
324324
an incomplete tar file, then the archive you're extracting from
325325
must have been created using ``zstash >= v1.1.0``.
326326
* ``--tars`` to specify specific tars to extract. See "Check" above for example usage.
327-
* ``--error-on-duplicate-tar`` Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the sizes and md5sums match *at least one* of the tars.
327+
* ``--error-on-duplicate-tar`` FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the size matches the *most recent* entry.
328328
* ``-v`` increases output verbosity.
329329
* ``[files]`` is a list of files to be extracted (standard wildcards supported).
330330

tests/scripts/database_corruption.bash

Lines changed: 48 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,7 @@ run_test_cases()
6060
# 5. `zstash create` with `--for-developers-force-database-corruption="simulate_row_existing"`. We simply add a duplicate tar, but `zstash check` with `--error-on-duplicate-tar` errors out because it finds two entries for the same tar.
6161
# 6. `zstash create` with `--for-developers-force-database-corruption="simulate_no_correct_size"` to construct a very bad database: two entries for the same tar, both with incorrect sizes. `zstash check` confirms that no entries match the actual file size.
6262
# 7. `zstash create` with `--for-developers-force-database-corruption="simulate_row_existing_bad_size"`. We add a duplicate tar, but with the wrong size. `zstash check` confirms that the other entry matches the actual file size, so it succeeds.
63+
# 8. `zstash create` with `--for-developers-force-database-corruption="simulate_bad_size_for_most_recent"` to construct two entries for the same tar, the most recent of which has an incorrect size. `zstash check` fails because the most recent size does not match, but it does log that one of the entries matches the actual file size.
6364

6465
# Standard cases ##########################################################
6566

@@ -201,7 +202,7 @@ run_test_cases()
201202
fi
202203
cd ${src_prefix}_check
203204
zstash check --hpss=${DST_DIR}/${case_name} --error-on-duplicate-tar 2>&1 | tee check.log
204-
grep "ERROR: Database corruption detected! Found 2 database entries for 000000.tar with sizes" check.log
205+
grep "ERROR: Database corruption detected! Found 2 database entries for 000000.tar, with sizes" check.log
205206
if [ $? != 0 ]; then
206207
((fail_count++))
207208
review_str+="${case_name}_check/check.log,"
@@ -225,17 +226,23 @@ run_test_cases()
225226
fi
226227
cd ${src_prefix}_check
227228
zstash check --hpss=${DST_DIR}/${case_name} 2>&1 | tee check.log
228-
grep "INFO: 000000.tar: No database entries match the actual file size:" check.log
229+
grep "WARNING: Database corruption detected! Found 2 database entries for 000000.tar, with sizes" check.log
229230
if [ $? != 0 ]; then
230231
((fail_count++))
231232
review_str+="${case_name}_check/check.log,"
232233
else
233234
((success_count++))
234235
fi
236+
grep "INFO: 000000.tar: No database entry matches the actual file size:" check.log
237+
if [ $? != 0 ]; then
238+
((fail_count++))
239+
else
240+
((success_count++))
241+
fi
235242

236243

237-
# Case 7: Duplicates detected! Allow them. Pass check because at least one of the sizes match.
238-
case_name="check_finds_a_matching_size"
244+
# Case 7: Duplicates detected! Allow them. Pass check because the most recent size matches.
245+
case_name="check_finds_most_recent_size_matches"
239246
src_prefix=${SRC_DIR}/${case_name}
240247
setup ${case_name} ${src_prefix}
241248
cd ${src_prefix}_create
@@ -261,13 +268,48 @@ run_test_cases()
261268
fi
262269
cd ${src_prefix}_check
263270
zstash check --hpss=${DST_DIR}/${case_name} 2>&1 | tee check.log
264-
grep "INFO: 000000.tar: Found a database entry with the same size as the actual file size:" check.log
271+
grep "WARNING: Database corruption detected! Found 2 database entries for 000000.tar, with sizes" check.log
265272
if [ $? != 0 ]; then
266273
((fail_count++))
267274
review_str+="${case_name}_check/check.log,"
268275
else
269276
((success_count++))
270277
fi
278+
grep "INFO: 000000.tar: The most recent database entry has the same size as the actual file size:" check.log
279+
if [ $? != 0 ]; then
280+
((fail_count++))
281+
else
282+
((success_count++))
283+
fi
284+
285+
# Case 8: Duplicates detected! Allow them. Error out on check because the most recent size doesn't match.
286+
case_name="check_finds_most_recent_size_does_not_match"
287+
src_prefix=${SRC_DIR}/${case_name}
288+
setup ${case_name} ${src_prefix}
289+
cd ${src_prefix}_create
290+
zstash create --hpss=${DST_DIR}/${case_name} --for-developers-force-database-corruption="simulate_bad_size_for_most_recent" zstash_demo 2>&1 | tee create.log
291+
grep "INFO: TESTING/DEBUGGING ONLY: Simulating bad size for most recent entry for 000000.tar." create.log
292+
if [ $? != 0 ]; then
293+
((fail_count++))
294+
review_str+="${case_name}_create/create.log,"
295+
else
296+
((success_count++))
297+
fi
298+
cd ${src_prefix}_check
299+
zstash check --hpss=${DST_DIR}/${case_name} 2>&1 | tee check.log
300+
grep "WARNING: Database corruption detected! Found 2 database entries for 000000.tar, with sizes" check.log
301+
if [ $? != 0 ]; then
302+
((fail_count++))
303+
review_str+="${case_name}_check/check.log,"
304+
else
305+
((success_count++))
306+
fi
307+
grep "INFO: 000000.tar: A database entry matches the actual file size," check.log
308+
if [ $? != 0 ]; then
309+
((fail_count++))
310+
else
311+
((success_count++))
312+
fi
271313

272314

273315
# Summary
@@ -278,6 +320,6 @@ run_test_cases()
278320

279321
run_test_cases
280322

281-
# Success count: 20
323+
# Success count: 25
282324
# Fail count: 0
283325
# Review:

zstash/create.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -169,17 +169,17 @@ def setup_create() -> Tuple[str, argparse.Namespace]:
169169
optional.add_argument(
170170
"--error-on-duplicate-tar",
171171
action="store_true",
172-
help="Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.",
172+
help="FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.",
173173
)
174174
optional.add_argument(
175175
"--overwrite-duplicate-tars",
176176
action="store_true",
177-
help="If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.",
177+
help="FOR ADVANCED USERS ONLY: If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.",
178178
)
179179
optional.add_argument(
180180
"--for-developers-force-database-corruption",
181181
type=str,
182-
help="For developers only! Forces database corruption by inserting a duplicate tar entry into the tars table. This is useful for testing. 3 options: simulate_row_existing -- add the tar before the main logic. simulate_row_existing_bad_size -- add the tar before the main logic, but with the wrong size. simulate_no_correct_size -- add the tar twice, both times with incorrect size. If this option is not set, no corruption will be forced.",
182+
help="FOR DEVELOPERS ONLY! Forces database corruption by inserting a duplicate tar entry into the tars table. This is useful for testing. 3 options: simulate_row_existing -- add the tar before the main logic. simulate_row_existing_bad_size -- add the tar before the main logic, but with the wrong size. simulate_no_correct_size -- add the tar twice, both times with incorrect size. If this option is not set, no corruption will be forced. simulate_bad_size_for_most_recent -- add the tar twice, first with good size, then with bad size.",
183183
default="",
184184
)
185185
# Now that we're inside a subcommand, ignore the first two argvs

zstash/extract.py

Lines changed: 20 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -104,7 +104,7 @@ def setup_extract() -> Tuple[argparse.Namespace, str]:
104104
optional.add_argument(
105105
"--error-on-duplicate-tar",
106106
action="store_true",
107-
help="Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the sizes and md5sums match *at least one* of the tars.",
107+
help="FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash will check if the size matches the *most recent* entry.",
108108
)
109109
parser.add_argument("files", nargs="*", default=["*"])
110110
args: argparse.Namespace = parser.parse_args(sys.argv[2:])
@@ -401,7 +401,9 @@ def check_sizes_match(cur, tfname, error_on_duplicate_tar):
401401
name_only: str = os.path.split(tfname)[1]
402402

403403
# Get ALL entries for this tar name
404-
cur.execute("SELECT size FROM tars WHERE name = ?", (name_only,))
404+
cur.execute(
405+
"SELECT size FROM tars WHERE name = ? ORDER by id DESC", (name_only,)
406+
)
405407
results = cur.fetchall()
406408

407409
if not results:
@@ -414,32 +416,38 @@ def check_sizes_match(cur, tfname, error_on_duplicate_tar):
414416
# Extract just the size values
415417
sizes: List[int] = [row[0] for row in results]
416418
error_str: str = (
417-
f"Database corruption detected! Found {len(results)} database entries for {name_only} with sizes {sizes}"
419+
f"Database corruption detected! Found {len(results)} database entries for {name_only}, with sizes {sizes}"
418420
)
419421

420422
if error_on_duplicate_tar:
421423
# Tested by database_corruption.bash Case 5
422424
logger.error(error_str)
423425
raise RuntimeError(error_str)
424-
425426
logger.warning(error_str)
426-
unique_sizes: Set[int] = set(sizes)
427-
if actual_size in unique_sizes:
427+
428+
# We ordered the results by id DESC,
429+
# so the first entry is the most recent.
430+
most_recent_size: int = sizes[0]
431+
if actual_size == most_recent_size:
428432
# Tested by database_corruption.bash Case 7
429-
# If the actual size matches at least one of the sizes in the database,
433+
# If the actual size matches the most recent size,
430434
# then we can assume that the tar is valid.
431435
logger.info(
432-
f"{name_only}: Found a database entry with the same size as the actual file size: {actual_size}."
436+
f"{name_only}: The most recent database entry has the same size as the actual file size: {actual_size}."
433437
)
434438
return True
439+
unique_sizes: Set[int] = set(sizes)
440+
if actual_size in unique_sizes:
441+
# Tested by database_corruption.bash Case 8
442+
logger.info(
443+
f"{name_only}: A database entry matches the actual file size, {actual_size}, but it is not the most recent entry."
444+
)
435445
else:
436446
# Tested by database_corruption.bash Case 6
437-
# If the actual size does not match any of the sizes in the database,
438-
# then we cannot assume that the tar is valid.
439447
logger.info(
440-
f"{name_only}: No database entries match the actual file size: {actual_size}."
448+
f"{name_only}: No database entry matches the actual file size: {actual_size}."
441449
)
442-
return False
450+
return False
443451
else:
444452
# Tested by database_corruption.bash Cases 1,2,4
445453
# Single entry - normal case

zstash/hpss_utils.py

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -234,6 +234,21 @@ def add_files(
234234
"INSERT INTO tars VALUES (NULL,?,?,?)",
235235
(tfname, tarsize + 2000, tar_md5),
236236
)
237+
elif force_database_corruption == "simulate_bad_size_for_most_recent":
238+
# Tested by database_corruption.bash Case 8
239+
# For developers only! For debugging purposes only!
240+
# Add this tar twice, second time with bad size.
241+
logger.info(
242+
f"TESTING/DEBUGGING ONLY: Simulating bad size for most recent entry for {tfname}."
243+
)
244+
cur.execute(
245+
"INSERT INTO tars VALUES (NULL,?,?,?)",
246+
(tfname, tarsize, tar_md5),
247+
)
248+
cur.execute(
249+
"INSERT INTO tars VALUES (NULL,?,?,?)",
250+
(tfname, tarsize + 2000, tar_md5),
251+
)
237252
else:
238253
# Tested by database_corruption.bash Cases 1,2
239254
# Typical case

zstash/update.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -108,12 +108,12 @@ def setup_update() -> Tuple[argparse.Namespace, str]:
108108
optional.add_argument(
109109
"--error-on-duplicate-tar",
110110
action="store_true",
111-
help="Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.",
111+
help="FOR ADVANCED USERS ONLY: Raise an error if a tar file with the same name already exists in the database. If this flag is set, zstash will exit if it sees a duplicate tar. If it is not set, zstash's behavior will depend on whether or not the --overwrite-duplicate-tar flag is set.",
112112
)
113113
optional.add_argument(
114114
"--overwrite-duplicate-tars",
115115
action="store_true",
116-
help="If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.",
116+
help="FOR ADVANCED USERS ONLY: If a duplicate tar is encountered, overwrite the existing tar file with the new one (i.e., it will assume the latest tar is the correct one). If this flag is not set, zstash will permit multiple entries for the same tar in its database.",
117117
)
118118
optional.add_argument(
119119
"-v", "--verbose", action="store_true", help="increase output verbosity"

0 commit comments

Comments
 (0)