-
Notifications
You must be signed in to change notification settings - Fork 4
feat: adding script to read from lmdb #task #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @raj-prince, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces new tooling designed to explore and analyze data access patterns within LMDB databases. It provides a Python script that can create an LMDB instance and then execute both simulated random-access reads, typical for machine learning batch processing, and efficient sequential reads. The inclusion of a Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a script to demonstrate read patterns from an LMDB database. The script is a good starting point, but I have identified several areas for improvement. There is a critical issue with a hardcoded, user-specific path that hinders portability. I've also provided feedback to enhance code clarity, correctness, and robustness, including removing unused code, correcting misleading logs, ensuring proper resource management with context managers, and handling potential null values. Lastly, an unused dependency in requirements.txt should be removed.
| import os | ||
| import random | ||
|
|
||
| DB_DIR = '/home/princer_google_com/gcs/lmdb_test_env' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The database directory DB_DIR is hardcoded to a user-specific absolute path. This makes the script not portable and will cause it to fail on any other machine or in a different environment. It is highly recommended to use a relative path or make the path configurable, for example, through command-line arguments.
| DB_DIR = '/home/princer_google_com/gcs/lmdb_test_env' | |
| DB_DIR = 'lmdb_test_env' |
|
|
||
| end_time = time.time() | ||
| read_duration = end_time - start_time | ||
| print(f"Total read time (10,000 random samples): {read_duration:.4f} seconds") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log message here, and the comment on line 43, state that 10,000 random samples are being read. However, with NUM_BATCHES = 10 and BATCH_SIZE = 5, only 10 * 5 = 50 samples are actually read. This is misleading and should be corrected to reflect the actual number of operations.
| print(f"Total read time (10,000 random samples): {read_duration:.4f} seconds") | |
| print(f"Total read time ({NUM_BATCHES * BATCH_SIZE} random samples): {read_duration:.4f} seconds") |
| @@ -0,0 +1,111 @@ | |||
| import lmdb | |||
| import numpy as np | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| import random | ||
|
|
||
| DB_DIR = '/home/princer_google_com/gcs/lmdb_test_env' | ||
| full_path = os.path.abspath(DB_DIR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| txn.put(key, value) | ||
|
|
||
| env.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This env.close() call is not guaranteed to be executed if an error occurs earlier in the function. To ensure the LMDB environment is always closed properly, it's recommended to use lmdb.open as a context manager.
For example, you could change line 21 to with lmdb.open(DB_DIR, map_size=NUM_SAMPLES * SAMPLE_SIZE * 2) as env:, indent the following block (lines 23-33), and then remove this env.close() call.
| label, data = value.split(b'_', 1) | ||
| batch_data.append((label, data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
txn.get(key) will return None if the key is not found in the database. The subsequent call to value.split() would then raise an AttributeError. It is safer to add a check to ensure value is not None before attempting to process it.
| label, data = value.split(b'_', 1) | |
| batch_data.append((label, data)) | |
| if value is not None: | |
| label, data = value.split(b'_', 1) | |
| batch_data.append((label, data)) |
| read_duration = end_time - start_time | ||
| print(f"Total read time (10,000 random samples): {read_duration:.4f} seconds") | ||
|
|
||
| env.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This env.close() call is not guaranteed to be executed if an error occurs earlier in the function. To ensure the LMDB environment is always closed properly, it's recommended to use lmdb.open as a context manager.
For example, you could change line 41 to with lmdb.open(DB_DIR, readonly=True, lock=False) as env:, indent the following block, and then remove this env.close() call.
| if count >= NUM_SAMPLES: | ||
| break # Ensure we don't go past the expected number of samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The break condition if count >= NUM_SAMPLES: is redundant. The cursor iterator (for key, value in cursor:) will naturally stop after iterating over all key-value pairs in the database. Since the database is created by create_lmdb_db with NUM_SAMPLES items, this check is unnecessary and can be removed for better clarity.
|
|
||
| print(f"Total read time ({count} samples sequentially): {read_duration:.4f} seconds") | ||
|
|
||
| env.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This env.close() call is not guaranteed to be executed if an error occurs earlier in the function. To ensure the LMDB environment is always closed properly, it's recommended to use lmdb.open as a context manager.
For example, you could change line 79 to with lmdb.open(DB_DIR, readonly=True, lock=False) as env:, indent the following block, and then remove this env.close() call.
| @@ -0,0 +1,2 @@ | |||
| numpy 2.2.6 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh!
There was an error while loading. Please reload this page.