Asv v2 s3 tests (Refactored) #2249

Conversation
MODIFIABLE = "MODIFIABLE"

class StorageSetup:
The StorageSetup class can easily be refactored to be more readable like this:
def aws_default_factory() -> BaseS3StorageFixtureFactory:
    return real_s3_from_environment_variables(shared_path=True)

def get_machine_id() -> str:
    """
    Returns the machine id, or the id specified through an environment variable (for GitHub).
    """
    return os.getenv("ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX", socket.gethostname())

def create_prefix(storage_space: StorageSpace, add_to_prefix: str) -> str:
    def is_valid_string(s: str) -> bool:
        return bool(s and s.strip())

    mandatory_part = storage_space.value
    optional = add_to_prefix if is_valid_string(add_to_prefix) else ''
    return f"{mandatory_part}/{optional}" if optional else mandatory_part

def check_persistence_access(storage_space: StorageSpace, confirm_persistent_storage_need: bool = False):
    assert aws_default_factory(), "Environment variables not initialized (ARCTICDB_REAL_S3_ACCESS_KEY, ARCTICDB_REAL_S3_SECRET_KEY)"
    if storage_space == StorageSpace.PERSISTENT:
        assert confirm_persistent_storage_need, "Use of persistent store not confirmed!"

def get_arctic_uri(storage: Storage, storage_space: StorageSpace, add_to_prefix: str = None, confirm_persistent_storage_need: bool = False) -> str:
    check_persistence_access(storage_space, confirm_persistent_storage_need)
    prefix = create_prefix(storage_space, add_to_prefix)
    if storage == Storage.AMAZON:
        factory = aws_default_factory()
        factory.default_prefix = prefix
        return factory.create_fixture().arctic_uri
    elif storage == Storage.LMDB:
        return f"lmdb://{tempfile.gettempdir()}/benchmarks_{prefix}"
    else:
        raise Exception("Unsupported storage type:", storage)
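For illustration, a hypothetical call site for this refactor could look like the following (enum members are assumed from the snippet above, the benchmark name is made up, and Arctic is the usual arcticdb client):

from arcticdb import Arctic

# modifiable space: no confirmation of persistent-storage use is needed
uri = get_arctic_uri(Storage.LMDB, StorageSpace.MODIFIABLE, add_to_prefix="basic_functions")
ac = Arctic(uri)

# persistent AWS space requires an explicit opt-in
uri = get_arctic_uri(Storage.AMAZON, StorageSpace.PERSISTENT, confirm_persistent_storage_need=True)
ac = Arctic(uri)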
I do not agree. There is value in separating responsibility, and the TestLibraryManager works.
It provides better isolation and management, so I disagree with making those changes.
I don't have a strong opinion on whether these live inside a class or as separate functions. I think in this case the class only provides namespacing, which can also be done with a module, but I don't think it matters hugely.
I think @G-D-Petrov 's suggestion provides a few simplifications which I find valuable:
- The main purpose of this being a class seems to be the caching of _aws_default_factory. I don't think creating it is expensive, so why bother caching it? It also means we need to worry about forgetting to initialize it the first time. I.e. we can just replace the __new__ with an aws_default_factory function.
- The create_prefix is rewritten in a much more concise way.
To sum up, I don't mind leaving this a class for the namespacing, but it would be nice to still simplify the logic where we can.
I have made the suggested simplifications in the class (part of them as Google support was added and other things changed in the meantime). So the code is simpler, but still in a class, where it should be. When a module contains lots of mixed functions, classes can and should be used to provide namespacing and encapsulation; that is good style, as it helps when writing code (IDEs help through IntelliSense, etc.). Long files with lots of functions, one scattered here, another there, mixed with other functions, are actually not a great example of code, and that is what we will end up with if we do not use classes. Classes provide clarity of purpose.
class StorageInfo:
class LibraryManager:
As we spoke yesterday, this LibraryManager is unnecessary because it duplicates a lot of the logic of Arctic's internal LibraryManager.
The only needed functionality is:
- a function that creates the persistent/modifiable Arctic client with the correct URIs
- a function that constructs the correct names for the libraries
- a helper function for cleaning up modifiable libraries, which can just iterate over the libraries in the modifiable Arctic client
Everything else here can easily be handled through the Arctic clients directly; a minimal sketch of that reduced surface follows below.
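Assuming the enum and naming conventions used elsewhere in this PR (none of these helper names come from the actual code), it could be roughly:

import os
from arcticdb import Arctic

def get_arctic_client(storage: Storage, storage_space: StorageSpace) -> Arctic:
    # one client per storage space, reusing the URI logic discussed above
    return Arctic(get_arctic_uri(storage, storage_space,
                                 confirm_persistent_storage_need=(storage_space == StorageSpace.PERSISTENT)))

def get_library_name(library_type: LibraryType, benchmark_name: str, suffix: str = "") -> str:
    # encode type, benchmark and pid so parallel runs do not collide
    return f"{library_type.value}_{benchmark_name}_{os.getpid()}_{suffix}".rstrip("_")

def clear_modifiable_libraries(ac: Arctic, benchmark_name: str) -> None:
    # iterate the modifiable Arctic client and drop everything this process created
    prefix = f"{LibraryType.MODIFIABLE.value}_{benchmark_name}_{os.getpid()}_"
    for lib_name in ac.list_libraries():
        if lib_name.startswith(prefix):
            ac.delete_library(lib_name)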
Again, all tests work and pass, and the abstraction isolates the user from ASV very well. All the points you make are perhaps valid, but they constitute a totally different approach. As I disagree with that approach on real grounds, I cannot make those changes.
TestLibraryManager isolates within itself everything the user needs to know about the specifics of the structure, and lets the person who writes the test write any type of test using the requested libraries.
It gives the creator full freedom to do what is needed to achieve the best result.
It provides a way of working that eliminates the need for the test author to know ASV internals, and thus protects from problems that would arise during test execution.
It also gives the ability to change the structure of the storage spaces without changing test code.
All that is well tested, and the framework itself is covered with tests that can be extended.
There is no point in making any changes; that would just waste more resources.
BTW, it is TestLibraryManager now.
TestLibraryManager does not duplicate any work of the Arctic library manager. It provides much-needed isolation between how tests create libraries and where and how Arctic will create them. More info can be found in the new documentation of the class:
    This class is a thin wrapper around the Arctic class. Its goal is to provide natural user
In conclusion, I do not argue that a version without a class, using only functions, could not eventually be created; I argue about its value and necessity. There are many arguments in the description there for why not to take this approach. One additional and very compelling one is the fact that Arctic is a class, and in order to override a class you need a class, not functions. (Arctic is not a base class, for good reason.)
I do find enough reasons that the current implementation is better. One I can name is the fact that no one uses Arctic in man db to create libraries directly; there is a UI library manager. In any real implementation, when you have certain specifics for managing the infrastructure, you override what is available. The current implementation is exactly that. That is why it is OK: any change is rather a waste of resources, and the eventual result would most likely force implementation of a similar thing at some point.
WIDE_DATAFRAME_NUM_COLS = 30_000

class LargeAppendDataModifyCache:
This name is not very descriptive and the comment seems a bit misleading.
AFAICS this is a cache for the expected results throughout the run.
The name/comment should reflect that.
Will make necessary changes
def get_population_policy(self):
    pass

def get_index_info(self):
Should be renamed to something like index_start, then it doesn't need the comment.
I think the naming is OK. It returns both the start and the index frequency, hence the name get_index_info(); the comment there also says that:

def get_index_info(self):
    """
    Returns initial timestamp and index frequency
    """
    return (pd.Timestamp("2-2-1986"), 's')
def initialize_cache(self, warmup_time, params, num_cols, num_sequential_dataframes):
    # warmup will execute tests additional time and we do not want that at all for write
    # update and append tests. We want exact specified `number` of times to be executed between
    assert warmup_time == 0, "warm up must be 0"
If it can be only 1 value, why do we even have it as a parameter?
The parameters of the tests are defined as such in each test case. Thus test case A's parameters are members of class A, not of the instance, and test case B's parameters belong to test case B. The function checks parameters of different classes:
class AWSLargeAppendTests(AsvBase)
class AWS30kColsWideDFLargeAppendTests(AWSLargeAppendTests)
The test code in some tests does contain assertions about certain ASV parameter values. Those assertions are needed because the tests either will not work if they do not have that specific value, or they might work but produce false results.
That is an additional thing added to the new implementation of the tests. They check two things:
- validity of the preconditions (usually ASV parameters, or setup)
- validity of test operations (see the tests for batches, for instance the asserts for batch operations, which are silent by default, i.e. they do not fail loudly on error)
    # update and append tests. We want exact specified `number` of times to be executed between
    assert warmup_time == 0, "warm up must be 0"

    num_sequential_dataframes = num_sequential_dataframes + 1
num_sequential_dataframes += 1
But it is also correct in the current form, right?
fixed
    return cache

def initialize_update_dataframes(self, num_rows: int, num_cols: int, cached_results: LargeAppendDataModifyCache,
This function is a bit hard to follow, consider refactoring it to something like:
def initialize_update_dataframes(self, num_rows: int, num_cols: int, cached_results: LargeAppendDataModifyCache,
                                 generator: SequentialDataframesGenerator):
    logger = self.get_logger()
    initial_timestamp, freq = self.get_index_info()
    timestamp_number = TimestampNumber.from_timestamp(initial_timestamp, freq)

    def log_time_range(update_type: str, df_key: int):
        time_range = generator.get_first_and_last_timestamp([cached_results[update_type][df_key]])
        logger.info(f"Time range {update_type.upper()} update {time_range}")

    def generate_and_log(update_type: str, num_rows: int, start_ts: pd.Timestamp):
        df = generator.df_generator.get_dataframe(number_rows=num_rows, number_columns=num_cols, start_timestamp=start_ts, freq=freq)
        cached_results[update_type][num_rows] = df
        log_time_range(update_type, num_rows)

    logger.info(f"Frame START-LAST Timestamps {timestamp_number} == {timestamp_number + num_rows}")

    # Full update
    generate_and_log('update_full_dict', num_rows, initial_timestamp)

    # Half update
    half = num_rows // 2
    timestamp_number.inc(half - 3)
    generate_and_log('update_half_dict', half, timestamp_number.to_timestamp())

    # Upsert update
    generate_and_log('update_upsert_dict', num_rows, timestamp_number.to_timestamp())

    # Single update
    timestamp_number.inc(half)
    generate_and_log('update_single_dict', 1, timestamp_number.to_timestamp())

    # Single append
    next_timestamp = generator.get_next_timestamp_number(cached_results.write_and_append_dict[num_rows], freq)
    generate_and_log('append_single_dict', 1, next_timestamp.to_timestamp())
Makes sense!
def get_modifiable_library(self, library_suffix: Union[str, int] = None) -> Library:

class LibraryPopulationPolicy:
As we have discussed, this can be greatly simplified by decoupling the configuration for the population from the logic that executes it.
This can be done with a refactor like:
@dataclass
class LibraryPopulationConfig:
    """Immutable configuration for library population."""
    parameters: List[int]
    parameters_are_rows: bool = True
    fixed_rows: int = 1
    fixed_columns: int = 1
    symbol_prefix: str = ""
    use_auto_increment: bool = False
    with_metadata: bool = False
    versions_count: int = 1
    versions_mean: float = 1.0
    with_snapshots: bool = False

    def symbol_name(self, index: int) -> str:
        """Get the symbol name based on configuration."""
        prefix = f"symbol_{self.symbol_prefix}_" if self.symbol_prefix else "symbol_"
        return f"{prefix}{index}"

    def create_metadata(self) -> Dict[str, Any]:
        """Create metadata for symbols and snapshots."""
        if not self.with_metadata:
            return {}
        return DFGenerator.generate_random_dataframe(rows=3, cols=10).to_dict()


class LibraryPopulator:
    """
    Handles the actual population of a library based on a configuration.
    Separates the configuration from the execution.
    """

    def __init__(self, config: LibraryPopulationConfig, logger: logging.Logger,
                 df_generator: DataFrameGenerator = None):
        self.config = config
        self.logger = logger
        self.df_generator = df_generator or VariableSizeDataframe()

    def populate(self, library):
        """Populate the library according to the configuration."""
        start_time = time.time()
        for i, param in enumerate(self.config.parameters):
            # Determine symbol index
            symbol_index = i if self.config.use_auto_increment else param
            symbol_name = self.config.symbol_name(symbol_index)

            # Determine rows and columns
            rows = param if self.config.parameters_are_rows else self.config.fixed_rows
            columns = self.config.fixed_columns if self.config.parameters_are_rows else param

            # Generate dataframe
            df = self.df_generator.generate_dataframe(rows, columns)

            # Create symbol
            symbol = library.create_symbol(symbol_name, df)

            # Add metadata if configured
            if self.config.with_metadata:
                symbol.set_metadata(self.config.create_metadata())

            # Create versions if configured
            if self.config.versions_count > 1:
                versions_list = self._generate_versions_list(len(self.config.parameters))
                for v in range(1, min(versions_list[i], self.config.versions_count) + 1):
                    version_df = self.df_generator.generate_dataframe(rows, columns)
                    version = symbol.create_version(version_df)

                    # Add metadata if configured
                    if self.config.with_metadata:
                        version.set_metadata(self.config.create_metadata())

                    # Create snapshot if configured
                    if self.config.with_snapshots:
                        snapshot = library.create_snapshot(f"snapshot_{symbol_name}_{v}")
                        if self.config.with_metadata:
                            snapshot.set_metadata(self.config.create_metadata())

        self.logger.info(f"Population completed in: {time.time() - start_time:.2f}s")

    def _generate_versions_list(self, number_symbols: int) -> List[np.int64]:
        """Generate a list of version counts for each symbol."""
        # Implementation would depend on your specific requirements
        # This is a placeholder based on the original code
        versions_list = np.random.poisson(self.config.versions_mean, number_symbols)
        versions_list = np.clip(versions_list, 1, self.config.versions_count)
        return versions_list.astype(np.int64)
The code is just an example that I got from a pass through Claude and can be simplified further, e.g.:
- there are parameters in the Policy that can be removed
- we probably don't need a LibraryPopulator class, as some helper functions that take a policy should suffice
As we discussed, we also agreed that the current code does not conflict with any recommendations. Additionally, there is no real benefit to making that rewrite other than to pass on Claude's suggestions. The code is OK in the form it is now; no need to change it.
The current implementation is OK; it goes for the fluent pattern, which is widely used in Python and even in ArcticDB (queries). The requested change would not add any additional value, nor would it fix any error. Therefore it is not needed at this moment.
class StorageSetup:
    '''
    Defines special one-time setup for real storages.
    Place here what is needed for proper initialization
    of each storage.

    Abstracts storage space allocation from how the user accesses it.
    '''
    _instance = None
    _aws_default_factory: BaseS3StorageFixtureFactory = None
    _fixture_cache = {}
fixture_cache is now unused?
done
    return result

def set_test_mode(self):
def __init__(self, storage: Storage, name_benchmark: str, library_options: LibraryOptions = None):
Should all libraries have the same LibraryOptions?
Doesn't it make more sense to pass LibraryOptions to get_library? This way we can create two libraries with different options, which might be useful for some tests.
get_library will return an existing library in most cases; only if it does not exist will it create a new library. Thus passing library options is possible, but it is not quite determined what to do when the library already exists: check whether the lib has the same options, or skip and disregard them. Still confusing.
The whole TestLibraryManager is a thin wrapper over Arctic, whose goal is to "hide" the details of internal management associated with how exactly the placement of libraries happens, and where and how to locate and remove them. The argument that we can live without classes is not a good one; if we could, we would not have had fixtures as classes in the first place. And this class is exactly that: it provides an extension of the functionality of the Arctic class, suited to doing the job better in a very specific environment.
Thus the possible, and better, options in this case are:
- define a create_library method, now or when we actually need it;
- work around it with what we have currently:
    tlm.library_options = LibraryOptions(...)
    lib = tlm.get_library(...)
The first proposal is clean, the second is a workaround. But we do not yet have a scenario that requires this, hence create_library was not implemented.
That is something a specialized class provides over free-floating functions: encapsulation and emergent design (extend when you need to, change something easily when new information arises, without breaking tests, etc.).
Classes are the natural way to grow in complexity and need, as they require use of the whole contract, not only parts of it function by function, which over time would erode the overall specialization and lead to conflicting requirements.
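If the explicit path is ever needed, a create_library on TestLibraryManager could look roughly like the following; this is only a sketch, and the helper names (get_library_name, the two client getters, library_options) are assumed from the surrounding discussion rather than copied from the PR:

def create_library(self, library_type: LibraryType, library_options: LibraryOptions = None,
                   library_suffix: Union[str, int] = None) -> Library:
    # Explicit creation path: fail fast if the library already exists, so there is
    # no ambiguity about which options an existing library was created with.
    ac = (self._get_arctic_client_persistent() if library_type == LibraryType.PERSISTENT
          else self._get_arctic_client_modifiable())
    name = self.get_library_name(library_type, library_suffix)
    assert name not in ac.list_libraries(), f"Library {name} already exists"
    return ac.create_library(name, library_options=library_options or self.library_options)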
# Currently we're using the same arctic client for both persistent and modifiable libraries.
# We might decide that we want different arctic clients (e.g. different buckets) but probably not needed for now.
def _get_arctic_client_persistent(self) -> Arctic:
    lib_type = StorageSpace.PERSISTENT
Isn't this storage_type, not lib_type?
fixed
The four methods serve very different purposes. Their usage scenarios are now documented, as their use might not be that obvious.
ac = self._get_arctic_client_modifiable()
lib_names = set(ac.list_libraries())
to_deletes = [lib_name for lib_name in lib_names
              if (f"_{os.getpid()}_" in lib_name) and (f"_{self.name_benchmark}_" in lib_name)]
Why not just the simpler lib_name.startswith(f"{LibraryType.Modifiable}_{self.name_benchmark}_{os.getpid()}_")?
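That is, assuming the modifiable library names are built as type_benchmark_pid_suffix, something along these lines (illustrative only):

prefix = f"{LibraryType.MODIFIABLE.value}_{self.name_benchmark}_{os.getpid()}_"
to_deletes = [lib_name for lib_name in lib_names if lib_name.startswith(prefix)]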
def delete_modifiable_library(self, library_suffix: Union[str, int] = None):
def clear_all_benchmark_libs(self):
All these clear functions can use a common clear_modifiable_with_prefix(self, prefix).
Then clear_all_modifiable_from_this_process can pass the prefix with the pid, whereas clear_all_benchmark_libs can pass the prefix without the pid.
Also, remove_all_modifiable and remove_all_test_libs can use the private prefix method.
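A rough sketch of the shared helper these two comments describe, with the method names assumed rather than taken from the PR:

def _clear_modifiable_with_prefix(self, prefix: str) -> None:
    # single deletion path; callers only decide how specific the prefix is
    ac = self._get_arctic_client_modifiable()
    for lib_name in ac.list_libraries():
        if lib_name.startswith(prefix):
            ac.delete_library(lib_name)

def clear_all_modifiable_libs_from_this_process(self) -> None:
    self._clear_modifiable_with_prefix(f"{LibraryType.MODIFIABLE.value}_{self.name_benchmark}_{os.getpid()}_")

def clear_all_benchmark_libs(self) -> None:
    self._clear_modifiable_with_prefix(f"{LibraryType.MODIFIABLE.value}_{self.name_benchmark}_")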
    return all(isinstance(i, list) for i in self._params)
# It is quite clear what this is responsible for: only dataframe generation
# Using such an abstraction can help us deduplicate the dataframe generation code between the different `EnvironmentSetup`s
# Note: We use a class instead of a generator function to allow caching of dataframes in the state
This doc is a leftover from the refactor proposal. We can probably just remove it; from the name it is clear enough what this does.
I added quite a few small change suggestions.
A few more important things to discuss, maybe offline:
- random seeding
- list symbols tests with respect to symbol list compaction
    self.symbol_fixed_str = symbol_fixed_str
    return self

def use_parameters_are_columns(self) -> 'LibraryPopulationPolicy':
I find the notion of parameters which can be either rows or columns a little confusing.
Why not just use num_rows: Union[int, List[int]] and num_cols: Union[int, List[int]]?
This way it's a bit clearer: if one of them is just an int it applies to all symbols, and if it is a list it is used similarly to the parameters. If they are both lists we should assert they have the same size.
This will also need to include a num_symbols. In general, if the constructor receives num_symbols, num_rows, and num_cols, and asserts that if rows or cols is a list it has the same length as num_symbols, I think that would be quite easy to use.
Anyway, then the symbol naming becomes a bit less clear. Up to you.
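A minimal sketch of the constructor shape being suggested here (the class name is from the PR, everything else is assumed for illustration):

from typing import List, Union

class LibraryPopulationPolicy:
    def __init__(self, num_symbols: int,
                 num_rows: Union[int, List[int]],
                 num_cols: Union[int, List[int]]):
        # a scalar applies to every symbol; a list must line up with num_symbols
        if isinstance(num_rows, list):
            assert len(num_rows) == num_symbols, "num_rows list must match num_symbols"
        if isinstance(num_cols, list):
            assert len(num_cols) == num_symbols, "num_cols list must match num_symbols"
        self.num_rows = num_rows if isinstance(num_rows, list) else [num_rows] * num_symbols
        self.num_cols = num_cols if isinstance(num_cols, list) else [num_cols] * num_symbols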
def set_max_number_versions(self, versions_max) -> 'GeneralSetupSymbolsVersionsSnapshots':
    self.versions_max = versions_max

def generate_snapshots(self) -> 'LibraryPopulationPolicy':
Nit: I find the names starting with generate_ a little confusing. It sounds like they themselves will generate something. Why not just call them set_with_snapshots etc.?
Will think of a better name.
meta = None if not self.with_metadata else self._generate_metadata()
versions_list = self._get_versions_list(len(self.parameters))
index = 0
for param_value in self.parameters:
Better to use for index, param_value in enumerate(self.parameters).
else:
    symbol = self.get_symbol_name(param_value)

if self.parameters_is_number_rows_list:
Maybe it makes more sense to generate a new dataframe for each version, instead of each version having the same dataframe? This would also be useful if we decide to allow (via a LibraryPopulationPolicy flag) the ability to update instead of write.
For the scenarios that are currently implemented we do not need a new dataframe on each write; that is the reason this was not implemented. And yes, there is value in having new dataframes on each write, but that is an expensive operation and should be approached carefully.
Still, there is already a class that covers the scenario with new dataframes, SequentialDataframesGenerator; it is used in the append scenarios.
Once a real case with the Library Populator arises, an option for a new DF on each write can easily be added.
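If and when such a case arises, the option could be as small as one more fluent setter in the style the policy already uses (the flag name below is purely hypothetical):

def set_new_dataframe_per_version(self, enabled: bool = True) -> 'LibraryPopulationPolicy':
    # hypothetical flag: when set, the populator would generate a fresh dataframe
    # for every version instead of writing the same one repeatedly
    self.new_dataframe_per_version = enabled
    return self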
         .set_with_metadata_for_each_version()
         .set_with_snapshot_for_each_version()
         .set_params([25, 50]))  # for test purposes: .set_params([5, 6]))
library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="LIST_SYMBOLS")
The name should be LIST_VERSIONS, to not collide with the above.
correct!
    return last_snapshot_names_dict

def setup(self, last_snapshot_names_dict, num_syms):
    self.population_policy = self.get_population_policy()
Unused
self.lib = manager.get_library(LibraryType.MODIFIABLE)

self.symbol = f"symbol-{os.getpid()}"
The pid is already in the library name, so it's fine to use a more generic name.
Yep, we may choose to use it or not :-) In our case it really does not matter what the name is.
    self.lib.update(self.symbol, self.cache.update_half_dict[num_rows])

def time_update_full(self, cache, num_rows):
    #self.lib.update(self.symbol, self.cache.update_full)
Remove comment?
    # We could clear the modifiable libraries we used
    self.get_library_manager().clear_all_modifiable_libs_from_this_process()

def get_last_x_percent_date_range(self, num_rows, percents):
That's why I wanted to move this function into utils. A nearly identical function was used in read_batch_functions.py, iirc.
will do that
python/arcticdb/util/utils.py
return [ListGenerators.random_string(length=str_size,
        include_unicode=include_unicode, seed=None) for _ in range(length)]
        include_unicode=include_unicode, seed=seed) for _ in range(length)]
Here we are passing the same seed to all random_string calls? That's the behaviour I specifically wanted to avoid.
Indeed, a quick replacement without thinking that this is a special case. Thanks for catching it!
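One possible way to keep runs reproducible without making every generated string identical is to derive a distinct seed per element from the list-level seed; a sketch only, not necessarily the committed fix:

return [ListGenerators.random_string(length=str_size,
                                     include_unicode=include_unicode,
                                     seed=None if seed is None else seed + i)
        for i in range(length)]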
python/arcticdb/util/utils.py
@@ -458,13 +458,16 @@ class DFGenerator:
    Easy generation of DataFrames, via fluent interface
    """

    def __init__(self, size: int, seed = 1):
    def __init__(self, size: int, seed = 5555):
        self.__seed = seed
Maybe remove the self.__seed since we no longer need to pass it around. We can just do the if below.
@@ -48,7 +50,12 @@ class RealComparisonBenchmarks:
    # Therefore if you plan changes to those numbers make sure to delete old library manually
    NUMBER_ROWS = 2_000_000  # 100_000

    params = [NO_OPERATION, CREATE_DATAFRAME, PANDAS_PARQUET, ARCTICDB_LMDB, ARCTICDB_AMAZON_S3]
    # NO_OPERATION measures class memory allocation. This is the actual memory that
Doc is not up to date
python/benchmarks/real_read_write.py
param_names = LMDBReadWrite.param_names
param_names = ["num_cols"]
# NOTE: Change of parameters will trigger failure as original library must also be deleted manually.
# Therefore if you plan changes to those numbers make sure to delete old library manually
Why do we need to delete the old library? What would happen if we just left it lying around? If these steps are necessary, we should have some quick docs telling people how to do this; I would have to reverse engineer how to do this cleanup.
Working with shared persistent storage does indeed require a handbook of what to do and what not to do, and I will create a wiki page for that.
Here is a quick explanation of why this is needed for that test.
When a test creates a library in the shared persistent space, that library is meant to stay. Subsequent test executions check that the library is there and continue the test, or create it once again. The library structure is not checked (a previous version contained structural checks of the library rather than just checking that its name exists, but that was considered overly complicated).
Now, if the parameters of the test change, that automatically triggers different code paths in the test, but since the library was created once with the old parameters, the check will pass and it will be considered created. Tests done on that library with the new params will therefore either produce misleading results or simply not work; both outcomes are possible.
Therefore there are two ways to battle this problem:
- having comments like this, to ensure that whoever changes the parameters cleans up the previous library first
- having a more complicated name, where the library has the parameters encoded as a suffix (a sketch of this option follows below). Then a change in parameters simply means a new library is created on the persistent store, to stay. The end effect could be many libraries created and persistent storage bloated, which actually defeats its purpose (for that there is already the test storage space, which should be used).
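For illustration, the second option would amount to something like the following naming helper (purely hypothetical):

def persistent_library_name(self, params: dict) -> str:
    # Encode the test parameters in the name, e.g. ..._num_cols30000_num_rows2000000,
    # so changed parameters land in a fresh library instead of silently reusing a stale one.
    suffix = "_".join(f"{key}{value}" for key, value in sorted(params.items()))
    return f"{LibraryType.PERSISTENT.value}_{self.name_benchmark}_{suffix}"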
In other words, persistent storage comes with logistics complications, and therefore management must be applied. That is counterintuitive to most developers, as tests are usually considered isolated from each other, and thus errors are possible. But in fact tests can and should be created with such a persistence factor in mind; they just require some additional management.
Docs will be prepared; for now the comment is changed.
    This will remove all persistent libraries for this test from the persistent storage
    Therefore use wisely only when needed (like change of parameters for tests)
    """
    name_prefix = f"{LibraryType.value}_{self.name_benchmark}"
Should be LibraryType.PERSISTENT.value
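That is, presumably:

name_prefix = f"{LibraryType.PERSISTENT.value}_{self.name_benchmark}"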
python/benchmarks/real_read_write.py
# 2. Delete library from persistent storage once new parameters are ready to be committed
#    and remove "set_test_mode()". To delete all libs for the current test use:
#    library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="READ_WRITE")
#    library_manager.remove_all_persistent_libs_for_this_test()
I think this comment is confusing. Are you suggesting committing a change to delete the library inside the GitHub runners? That seems like an odd process, because then we would have to revert it.
I think it would be better if the instructions described how to do it locally instead of on a GitHub runner.
What do you think about the following:
    If you plan to make changes to the parameters, consider that the library may already exist, created with a different number
    of rows. Therefore, you need to either:
    - Rename the library by changing `name_benchmark` or by adding a `lib_suffix`. Note that this will keep the old library around.
    - Clean up the old library with the following (which needs to be run with the same environment variables as the github runner):
        library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="READ_WRITE")
        library_manager.remove_all_persistent_libs_for_this_test()
Actually the process is much simpler, and I will leave it to the wiki I am currently working on: https://github.com/man-group/ArcticDB/wiki/ASV-Benchmarks:-Real-storage-tests
As we are talking about persistent tests, which are accessible to all, no specific vars are needed, just a simple 2-3 step process that will be explained there, and the comment will have a link to it.
fix omission lmdb test added and also some more logging new theory small fixes fix fix turn off execution of a test fixed delete tests small error fix last attempt remove setup bug fix bug addressed comments fixed notes fix date range get_library_name now remains only method addressed comments updated last comment for setyp multiple libs with symbols fix ommission tone down logging fixes for comments added support for unicode strings initial work updates small change comments applied new version changes requested implemented new test added one test remaining final version new test silence noisy batch test tunnings after first run of tests fix regression fixed regression final version more important notes updated comments from GP added comments add sanitization support for GCP review comments updated doc string mutliprocessing test StorageSetup class optimizations fix omission better documentation for TestManagementLibrary benchmark.json better naming fixes on comments fix error from comments comments addressed comment addressed forgotten things new version enhanced check for asv
Please update the commit message to something meaningful when you merge this
commit facc33bead487490322ba9cc973ed86dc9b5c4c6 Merge: bc68ed467 85d51e3b7 Author: Vasil Danielov Pashov <[email protected]> Date: Tue May 27 20:15:59 2025 +0300 Merge branch 'master' into vasil.pashov/coverity-test-existing-code-with-errors commit bc68ed467842b510bbd7001175cc8eecefc29e1c Merge: e68ec0146 91a076cc2 Author: Vasil Pashov <[email protected]> Date: Tue May 27 20:12:57 2025 +0300 Merge branch 'master' into vasil.pashov/coverity-test-existing-file commit 85d51e3b748982dc9121026a4dfcbd9f5a1dc2fb Author: Alex Owens <[email protected]> Date: Tue May 27 10:54:08 2025 +0100 Bugfix 9209057536: Allow concatenation of uint64 columns with int* columns (#2365) #### Reference Issues/PRs Fixes [9209057536](https://man312219.monday.com/boards/7852509418/pulses/9209057536) #### What does this implement or fix? Allows concatenating columns of type uint64 with columns of type int* commit 91a076cc267caf549ff38cb532dd76c5e4e168ba Author: Alex Owens <[email protected]> Date: Fri May 23 17:46:47 2025 +0100 Enhancement 7992967434: filters and projections ternary operator (#2103) #### Reference Issues/PRs Implements [7992967434](https://man312219.monday.com/boards/7852509418/pulses/7992967434) #### What does this implement or fix? Implements a ternary operator equivalent to `numpy.where`, primarily for projecting new columns based on some condition, although it can also be used for filtering. Semantically the same as `left if condition else right`, although this Pythonic syntax cannot be made to work due to limitations of the language. #### Any other comments? See `test_ternary.py` for a plethora of examples and the expected behaviour in each case. Example benchmark output with annotations below. The first parameter to all benchmarks is the number of rows (100k for all of them right now), so the single-threaded per-row time can be calculated by dividing by 100,000. e.g. projecting a new column of 100k rows by choosing from 2 dense columns (likely a common use case) takes 424us, or just over 4ns per row. Other parameters are explained for each individual benchmark. 
``` Run on (20 X 2918.4 MHz CPU s) CPU Caches: L1 Data 48 KiB (x10) L1 Instruction 32 KiB (x10) L2 Unified 1280 KiB (x10) L3 Unified 24576 KiB (x1) Load Average: 4.23, 6.56, 6.73 -------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------------------------------- BM_ternary_bitset_bitset/100000 13.1 us 13.1 us 58099 # Second arg is whether the boolean argument is true or false, third is whether the arguments are swapped BM_ternary_bitset_bool/100000/1/1 2.00 us 2.00 us 363634 BM_ternary_bitset_bool/100000/1/0 7.43 us 7.43 us 101700 BM_ternary_bitset_bool/100000/0/1 7.28 us 7.28 us 88907 BM_ternary_bitset_bool/100000/0/0 2.45 us 2.45 us 307832 BM_ternary_numeric_dense_col_dense_col/100000 424 us 424 us 1276 BM_ternary_numeric_sparse_col_sparse_col/100000 3548 us 3548 us 185 # Second arg is whether the arguments are swapped BM_ternary_numeric_dense_col_sparse_col/100000/1 2555 us 2555 us 258 BM_ternary_numeric_dense_col_sparse_col/100000/0 2800 us 2800 us 262 # Second arg is the number of unique strings in each string column, third is whether the columns have the same string pool or not BM_ternary_string_dense_col_dense_col/100000/100000/1 438 us 438 us 1534 BM_ternary_string_dense_col_dense_col/100000/100000/0 16257 us 16258 us 43 BM_ternary_string_dense_col_dense_col/100000/2/1 441 us 441 us 1603 BM_ternary_string_dense_col_dense_col/100000/2/0 4219 us 4219 us 186 BM_ternary_string_sparse_col_sparse_col/100000/100000/1 3854 us 3854 us 191 BM_ternary_string_sparse_col_sparse_col/100000/100000/0 10753 us 10754 us 67 BM_ternary_string_sparse_col_sparse_col/100000/2/1 3655 us 3655 us 183 BM_ternary_string_sparse_col_sparse_col/100000/2/0 4592 us 4592 us 123 BM_ternary_string_dense_col_sparse_col/100000/100000/1 2957 us 2957 us 236 BM_ternary_string_dense_col_sparse_col/100000/100000/0 13980 us 13980 us 50 BM_ternary_string_dense_col_sparse_col/100000/2/1 2967 us 2966 us 237 BM_ternary_string_dense_col_sparse_col/100000/2/0 5179 us 5179 us 160 # Second arg is whether the arguments are swapped BM_ternary_numeric_dense_col_val/100000/1 360 us 359 us 1871 BM_ternary_numeric_dense_col_val/100000/0 388 us 388 us 1692 BM_ternary_numeric_sparse_col_val/100000/1 2244 us 2244 us 292 BM_ternary_numeric_sparse_col_val/100000/0 2385 us 2385 us 283 # Second arg is whether the arguments are swapped, third is the number of unique strings in the column BM_ternary_string_dense_col_val/100000/1/100000 8259 us 8258 us 82 BM_ternary_string_dense_col_val/100000/0/100000 7683 us 7683 us 93 BM_ternary_string_dense_col_val/100000/1/2 2578 us 2578 us 261 BM_ternary_string_dense_col_val/100000/0/2 2385 us 2385 us 297 BM_ternary_string_sparse_col_val/100000/1/100000 6302 us 6302 us 129 BM_ternary_string_sparse_col_val/100000/0/100000 5792 us 5792 us 115 BM_ternary_string_sparse_col_val/100000/1/2 2903 us 2903 us 249 BM_ternary_string_sparse_col_val/100000/0/2 3095 us 3095 us 232 # Second arg is whether the arguments are swapped BM_ternary_numeric_dense_col_empty/100000/1 1269 us 1269 us 584 BM_ternary_numeric_dense_col_empty/100000/0 1354 us 1354 us 512 BM_ternary_numeric_sparse_col_empty/100000/1 1363 us 1363 us 572 BM_ternary_numeric_sparse_col_empty/100000/0 1374 us 1374 us 484 # Second arg is whether the arguments are swapped, third is the number of unique strings in the column BM_ternary_string_dense_col_empty/100000/1/100000 1217 us 1217 us 587 
BM_ternary_string_dense_col_empty/100000/0/100000 1343 us 1343 us 577 BM_ternary_string_dense_col_empty/100000/1/2 1287 us 1287 us 574 BM_ternary_string_dense_col_empty/100000/0/2 1363 us 1363 us 518 BM_ternary_string_sparse_col_empty/100000/1/100000 1413 us 1413 us 524 BM_ternary_string_sparse_col_empty/100000/0/100000 1343 us 1343 us 517 BM_ternary_string_sparse_col_empty/100000/1/2 1293 us 1293 us 540 BM_ternary_string_sparse_col_empty/100000/0/2 1235 us 1235 us 480 BM_ternary_numeric_val_val/100000 368 us 368 us 2039 BM_ternary_string_val_val/100000 376 us 376 us 1862 # Second arg is whether the arguments are swapped BM_ternary_numeric_val_empty/100000/1 40.7 us 40.7 us 16491 BM_ternary_numeric_val_empty/100000/0 36.7 us 36.7 us 17836 BM_ternary_string_val_empty/100000/1 40.8 us 40.8 us 17892 BM_ternary_string_val_empty/100000/0 58.2 us 58.2 us 13825 # Second arg is whether the left argument is true or false, third is whether the right argument is true or false BM_ternary_bool_bool/100000/1/1 1.43 us 1.43 us 518204 BM_ternary_bool_bool/100000/1/0 1.99 us 1.99 us 378598 BM_ternary_bool_bool/100000/0/1 4.52 us 4.52 us 157505 BM_ternary_bool_bool/100000/0/0 0.020 us 0.020 us 37060921 ``` commit 3c059f4d4030dc73594f277d8754918c698a2969 Author: Phoebus Mak <[email protected]> Date: Thu May 22 09:50:29 2025 +0100 Fix gcp lib unreachable after making it read only (#2349) #### Reference Issues/PRs <!--Example: Fixes #1234. See also #3456.--> https://man312219.monday.com/boards/7852509418/pulses/8985074856 #### What does this implement or fix? `create_store_from_lib_config` took protobuf setting only. GCP setting is stored natively only, unlike other storages setting. So when new store is created with the above function, gcp settings have not been passed to the new store. Therefore the SDK will fallback to default but incorrect setting and cause errors. S3 and GCPXML native settings are given default value to avoid uninitiailzied value being used in the test #### Any other comments? Test in the CI: https://github.com/man-group/ArcticDB/actions/runs/15164054821/job/42638155043 ``` test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-True] [gw0] [ 95%] PASSED tests/integration/arcticdb/version_store/test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-True] test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-False] [gw0] [ 95%] PASSED tests/integration/arcticdb/version_store/test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-False] ``` (Other unrelated tests failed in the flaky real storage CI) #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings, documentation and copyright notice? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details> <!-- Thanks for contributing a Pull Request to ArcticDB! 
Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> commit 9d98a4436e376fa1623af92f23153cde5b68a68b Author: Alex Owens <[email protected]> Date: Wed May 21 18:03:28 2025 +0100 Fix multiindex series (#2363) #### What does this implement or fix? Fixes roundtripping of multiindexed Series with timestamps as the first level and strings as the second level. Broken by #2142 --------- Co-authored-by: Alex Owens <[email protected]> commit c3c7c2ac5d7d98d16305e6914713f03454d30a57 Author: Alex Owens <[email protected]> Date: Wed May 21 16:42:34 2025 +0100 Docs 8975554293: Add concat demo notebook (#2361) #### Reference Issues/PRs Completes [8975554293](https://man312219.monday.com/boards/7852509418/pulses/8975554293) #### What does this implement or fix? Adds a notebook demonstrating the new `concat` functionality added in https://github.com/man-group/ArcticDB/pull/2142 --------- Co-authored-by: Alex Owens <[email protected]> commit 17ea0e49deba0a3a1b8e6267e9516b14ea34b3ef Author: grusev <[email protected]> Date: Wed May 21 18:31:23 2025 +0300 Update installation_tests.yml with 5.3 and 5.4 final versions (#2362) #### Reference Issues/PRs <!--Example: Fixes #1234. See also #3456.--> #### What does this implement or fix? #### Any other comments? Moved 5.2.6 to different timeslot to eliminate the possibility about failures being because timeslot. Although a manual execution shows this problem with 5.2.6. is most probably persisting https://github.com/man-group/ArcticDB/actions/runs/15139549472/job/42559651096) Added: 5.3.4 https://github.com/man-group/ArcticDB/actions/runs/15133764164/ 5.4.1 https://github.com/man-group/ArcticDB/actions/runs/15133923361 #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings, documentation and copyright notice? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details> <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> commit e68ec014683d00f095e4efbe5d72b81b7509299d Author: Vasil Pashov <[email protected]> Date: Wed May 21 11:38:45 2025 +0300 Temporary disable tests commit e3afff2115d4f0038d13a5327a8c7b7779552a99 Merge: bdbc17028 424cd56e2 Author: Vasil Pashov <[email protected]> Date: Wed May 21 11:17:22 2025 +0300 Merge branch 'master' into vasil.pashov/coverity-test-existing-file commit 424cd56e295afafd64444420b92fcf89a82dd1ea Author: grusev <[email protected]> Date: Tue May 20 11:09:42 2025 +0300 Schedule S3 tests and fix STS to run only against AWS S3 (#2356) #### Reference Issues/PRs <!--Example: Fixes #1234. See also #3456.--> #### What does this implement or fix? 
Shedule for now to run twice a week Contains also couple of other fixes of the workflow: - seeding tests were not executed previously due to change in workflow parameter from boolean to choice for GCP tests. Now seeding tests are executed. - STS role creation was executed for GCP tests which was unnecessary. Now it gets executed only with AWS S3 - persistent tests cleaning had a problem with the context and resulted in crash not being able to load storage_tests.py. This test is fixed now to allow proper loading of mark.py in defferent contexts Results: https://github.com/man-group/ArcticDB/actions/runs/15061574677/job/42337724260 (NOTE: the failures in the above run are because this PR: https://github.com/man-group/ArcticDB/pull/2353 is not part of current one. Once it gets merge S3 tests will run without problems) #### Any other comments? #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings, documentation and copyright notice? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details> <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> --------- Co-authored-by: Georgi Rusev <Georgi Rusev> commit a158b0c2e684c9389691744c001192ce94ddc79d Author: Alex Owens <[email protected]> Date: Mon May 19 13:28:51 2025 +0100 Bugfix 9123099670: fix resampling of old updated data (#2351) #### Reference Issues/PRs Fixes [9123099670](https://man312219.monday.com/boards/7852509418/views/168855452/pulses/9123099670) #### What does this implement or fix? Fixes three separate resampling bugs: 1. Old versions of `update` (changed sometime between `4.1.0` and `4.4.0`, I haven't pinned down exactly where) had a behaviour in which the `end_index` value in the data key of the segment overlapping with the start of the date range provided to the `update` call was set to the first value of the date range in the `update` call. For all other modification methods, this is set to 1 nanosecond larger than the last index value in the contained segment. Resampling assumed this to be the case, and had an assertion verifying it. Relaxing this assertion is sufficient to fix the issue. 2. Providing a `date_range` argument with a resample where the provided date range did not overlap with the timerange covered by the index of the symbol led to trying to reserve a vector with a negative size. This now correctly returns an empty result. 3. Previously, checks that a symbol being resampled had a timestamp index occurred after some operations which also require this to be true, which could lead to the same vector reserve issue above. It is now checked in advance, and a suitable exception raised. commit 9edc74a89102b4ab66fbd7911a31322425dfcacc Author: grusev <[email protected]> Date: Mon May 19 12:54:07 2025 +0300 nfs backed tests for v1 API (#2350) #### Reference Issues/PRs <!--Example: Fixes #1234. 
See also #3456.--> #### What does this implement or fix? arctic_* fixtures or v2 API is already covered with nfs backed s3 tests. What is needed now is to add also tests for v1 API fixtures. New Fixtures: nfs_backed_s3_store_factory nfs_backed_s3_version_store_v1 nfs_backed_s3_version_store_v2 nfs_backed_s3_version_store_dynamic_schema_v1 nfs_backed_s3_version_store_dynamic_schema_v2 nfs_backed_s3_version_store Added to: object_store_factory s3_store_factory -> nfs_backed_s3_store_factory object_and_mem_and_lmdb_version_store s3_version_store_v1 -> nfs_backed_s3_version_store_v1 s3_version_store_v2 -> nfs_backed_s3_version_store_v2 object_and_mem_and_lmdb_version_store_dynamic_schema s3_version_store_dynamic_schema_v1 -> nfs_backed_s3_version_store_dynamic_schema_v1 s3_version_store_dynamic_schema_v2 -> nfs_backed_s3_version_store_dynamic_schema_v2 #### Any other comments? #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings, documentation and copyright notice? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details> <!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing --> --------- Co-authored-by: Georgi Rusev <Georgi Rusev> commit 67d2bbe530f96a0aa5412f479e123da480ba2d99 Author: Alex Owens <[email protected]> Date: Fri May 16 15:20:37 2025 +0100 Enhancement 8277989680: symbol concatenation poc (#2142) #### Reference Issues/PRs 8277989680 #### What does this implement or fix? Implements symbol concatenation. Inner and outer joins over columns both supported. Expected usage: ``` # Read requests can contain usual as_of, date_range, columns, etc arguments lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...]) # Potentially apply some processing to all or individual constituent lazy dataframes here, that will be applied before the join lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()] # Join here lazy_df = adb.concat(lazy_dfs) # Perform more processing if desired lazy_df = lazy_df.resample("15min").agg({"col": "mean"}) # Collect result res = lazy_df.collect() # res contains a list of VersionedItems from the consituent symbols that went into the join with data=None, and a data member with the joined Series/DataFrame ``` See `test_symbol_concatenation.py` for thorough examples of how the API works. For outer joins, if a column is not present in one of the input symbols, then the same type-specific behaviour as used for dynamic schema is used to backfill the missing values. Not all symbols can be concatenated together. The following will throw exceptions if attempted to be concatenated: - a Series with a DataFrame - Different index types, including multiindexes with different numbers of levels - Incompatible column types. e.g. if `col` has type `INT64` in one symbol, and is a string column in another symbol. 
commit 67d2bbe530f96a0aa5412f479e123da480ba2d99
Author: Alex Owens <[email protected]>
Date: Fri May 16 15:20:37 2025 +0100

Enhancement 8277989680: symbol concatenation poc (#2142)

#### Reference Issues/PRs
8277989680

#### What does this implement or fix?
Implements symbol concatenation. Inner and outer joins over columns are both supported. Expected usage:
```
# Read requests can contain the usual as_of, date_range, columns, etc arguments
lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...])
# Potentially apply some processing to all or individual constituent lazy dataframes here, that will be applied before the join
lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()]
# Join here
lazy_df = adb.concat(lazy_dfs)
# Perform more processing if desired
lazy_df = lazy_df.resample("15min").agg({"col": "mean"})
# Collect result
res = lazy_df.collect()
# res contains a list of VersionedItems from the constituent symbols that went into the join with data=None, and a data member with the joined Series/DataFrame
```
See `test_symbol_concatenation.py` for thorough examples of how the API works.
For outer joins, if a column is not present in one of the input symbols, then the same type-specific behaviour as used for dynamic schema is used to backfill the missing values.
Not all symbols can be concatenated together. The following will throw exceptions if concatenation is attempted:
- a Series with a DataFrame
- Different index types, including multiindexes with different numbers of levels
- Incompatible column types, e.g. if `col` has type `INT64` in one symbol and is a string column in another symbol. This only applies if the column would be in the result, which is always the case for all columns with an outer join, but may not always be for inner joins.

Where possible, the implementation is permissive in what can be joined, with an output that is as sensible as possible:
- Joining two or more Series with different names that are otherwise compatible will produce a Series with no name
- Joining two or more timeseries where the indexes have different names will produce a timeseries with an unnamed index
- Joining two or more timeseries where the indexes have different timezones will produce a timeseries with a UTC index
- Joining two or more multiindexed Series/DataFrames where the levels have compatible types but different names will produce a multiindexed Series/DataFrame with unnamed levels where they differed between some of the inputs.
- Joining two or more Series/DataFrames that all have `RangeIndex`. If the index `step` does not match between all of the inputs, then the output will have a `RangeIndex` with `start=0` and `step=1`. **This is different behaviour to Pandas, which converts to an Int64 index in this case. For this reason, a warning is logged when this happens.**

The only known major limitation is that all of the symbols being joined together (after any pre-join processing) must fit into memory. Relaxing this constraint would require much more sophisticated query planning than we currently support, in which all of the clauses, both for individual symbols pre-join, the join itself, and any post-join clauses, are taken into account when scheduling both IO and individual processing tasks.
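A small, hedged sketch of the outer-join backfill described above, continuing with the `lib` handle from the usage snippet; the symbol names are made up, and the `join="outer"` keyword is an assumption based on "inner and outer joins both supported" (see `test_symbol_concatenation.py` for the exact spelling):

```
# Hypothetical symbols; join="outer" and read_batch(..., lazy=True) are assumptions.
import pandas as pd
import arcticdb as adb

idx_a = pd.date_range("2024-01-01", periods=3, freq="D")
idx_b = pd.date_range("2024-01-04", periods=3, freq="D")
lib.write("sym_a", pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]}, index=idx_a))
lib.write("sym_b", pd.DataFrame({"x": [4, 5, 6]}, index=idx_b))

lazy_dfs = lib.read_batch(["sym_a", "sym_b"], lazy=True)
joined = adb.concat(lazy_dfs, join="outer").collect().data
# "y" is missing from sym_b, so its rows are backfilled with the same
# type-specific defaults used by dynamic schema (NaN for a float column here).
```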
commit c1c7a8cff3193dcf4aefee268cd3feea01c68bd9
Author: grusev <[email protected]>
Date: Fri May 16 13:55:12 2025 +0300

Patch for Real S3 library names (#2353)

#### What does this implement or fix?
Currently we create library names which are too long for real S3; this is a patch for the tests until the real bug is addressed.
Manually triggered run: https://github.com/man-group/ArcticDB/actions/runs/15013824867

---------
Co-authored-by: Georgi Rusev <Georgi Rusev>

commit bb65a85ab82dd7fec5297b258956545f8b4adea7
Author: Alex Owens <[email protected]>
Date: Fri May 16 11:41:18 2025 +0100

Add resolve_defaults back in as a static method of NativeVersionStore (#2358)

#### Reference Issues/PRs
Was removed in #2345, but is needed at least by some internal tests, and technically constitutes an API break (although we don't expect anybody to be using it).

commit e78758a7fe5fbb02085dcfae01218903d6dad6d9
Author: grusev <[email protected]>
Date: Fri May 16 13:25:24 2025 +0300

Installation Tests Workflow Fixes (#2354)

#### What does this implement or fix?
A failure when the job is triggered on schedule is fixed - the string contained extra single quotes. The order of 2 steps is also changed for the scheduling-specific use case.
Changes to the workflow dispatch are implemented to simplify execution and leave some parts for later enhancements - i.e. the selection of an exact os-python-repo combination, which actually needs a single flow of steps and not a matrix.
S3 tests are also enabled to run along with the LMDB tests by default.

---------
Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 9e544da9d823c3a4e76b256b741925af52a20742
Author: grusev <[email protected]>
Date: Tue May 13 13:45:53 2025 +0300

Installation tests v4 (#2339)

#### What does this implement or fix?
Successful execution
5.2.6: https://github.com/man-group/ArcticDB/actions/runs/14641126753/job/41083591802
5.1.2: https://github.com/man-group/ArcticDB/actions/runs/14637571996
4.5.1: https://github.com/man-group/ArcticDB/actions/runs/14639124835/job/41077126258
1.6.2: https://github.com/man-group/ArcticDB/actions/runs/14701046721/job/41250511273

The PR contains a workflow definition to execute tests on installed arcticdb. It is a combination of the approaches from:
https://github.com/man-group/ArcticDB/pull/2330
https://github.com/man-group/ArcticDB/pull/2316

Installation tests are now in a separate folder (python/installation_tests), not part of tests. They have their own fixtures, making them independent from the rest of the code base. The tests are a direct copy of the originals, with one modified to use the v2 API.
This way, if there are changes in the API, each test in the installation set can be adapted independently.
As the tests run very fast there is no need to use simulators; real S3 storage is used directly. The tests are executed by a workflow. Currently each test is executed against LMDB and real S3. The moto-simulated version is not available at this moment due to tight coupling with protobufs, which differ for each version, as well as tight coupling with the whole of the existing test code.
The workflow has 2 triggers:
- manual trigger - allowing tests to be executed manually on demand
- on schedule - the scheduled execution is overnight. The tests for each arcticdb version are executed within a 1hr offset from the others, because executing all of them at once is likely to generate errors with the real storages.

---------
Co-authored-by: Georgi Rusev <Georgi Rusev>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
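Since each installation test targets either LMDB or real S3 purely through the Arctic connection string, a minimal sketch of the switch might look like the following; the environment variable, bucket, region and credential placeholders are all hypothetical, not the workflow's real names:

```
# Placeholders throughout; the real workflow passes its own variable names and secrets.
import os
import arcticdb as adb

if os.getenv("ARCTICDB_STORAGE", "LMDB") == "REAL_S3":
    uri = (
        "s3s://s3.eu-west-1.amazonaws.com:my-test-bucket"
        "?region=eu-west-1&access=<ACCESS_KEY>&secret=<SECRET_KEY>"
    )
else:
    uri = "lmdb:///tmp/installation_tests"

ac = adb.Arctic(uri)
lib = ac.get_library("installation_tests", create_if_missing=True)
```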
commit 2612fb45f15350dc483ddde1c8d43c2d6a02731b
Author: grusev <[email protected]>
Date: Mon May 12 15:39:20 2025 +0300

Asv v2 s3 tests (Refactored) (#2249)

Contains a refactored framework for setting up shared storages + tests for AWS S3 storage.
Merged 3 PRs into one:
- https://github.com/man-group/ArcticDB/pull/2185
- https://github.com/man-group/ArcticDB/pull/2227
- https://github.com/man-group/ArcticDB/pull/2204

Important: the benchmark tests in this PR cannot run successfully here, so do not take them as a criterion. All tests need to be run manually. Here are runs from 27 March:
LMDB set: https://github.com/man-group/ArcticDB/actions/runs/14100376040/job/39495398374
Real set: https://github.com/man-group/ArcticDB/actions/runs/14100497273/job/39495728734

Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 3c2fe145cad45797356a4ec5fbd42e4dac57681a
Author: William Dealtry <[email protected]>
Date: Mon May 12 09:57:15 2025 +0100

size_t size in MacOS

commit bb54de8879ab57c37093a62c5282e405fc9a834b
Author: William Dealtry <[email protected]>
Date: Mon May 12 09:03:04 2025 +0100

resolve defaults is a free function

commit e973f8dbd898aedc747bc232e022c9a1137d882c
Author: willdealtry <[email protected]>
Date: Wed Apr 16 14:49:46 2025 +0100

Fix up file operations

commit af1a171eab284902db4333946b732de7d9ec2b18
Author: Phoebus Mak <[email protected]>
Date: Mon May 12 10:00:32 2025 +0100

Disable s3 checksumming (#2337)

#### Reference Issues/PRs
https://github.com/man-group/ArcticDB/issues/2251

#### What does this implement or fix?
Disable s3 checksumming by setting an environment variable in the wheel.

#### Any other comments?
This will also unblock the upgrade of `aws-sdk-cpp` on vcpkg. The upgrade will not be made in this PR.
One of the newly added tests needs to be skipped, as the `conda` CI has `aws-sdk-cpp` pinned at a non-s3-checksumming version due to the `libarrow` pin. `environment-dev.yml` doesn't align with its counterpart in the feedstock. Therefore the new version of `aws-sdk-cpp` is only used in the feedstock, and thus the release wheel, but not in local and CI builds here. This will be addressed in a separate ticket.
[Commit](https://github.com/man-group/ArcticDB/pull/2337/commits/245a02cd455e39fb8f976301ccd5409e6ae88b13) to remove the `libarrow` pin so that a more up-to-date `aws-sdk-cpp`, which supports s3 checksumming, is used in conda. It's for verifying the change with the newly added test. The [test](https://github.com/man-group/ArcticDB/actions/runs/14732394443/job/41349695905) is successful.

commit b808afac25bed84595b874f28b6b3ce2407fbd0c
Author: grusev <[email protected]>
Date: Fri May 9 15:46:17 2025 +0300

Delete STS roles regularly (#2344)

#### What does this implement or fix?
Due to the limit on the number of STS roles, we should regularly clean up roles that failed to be deleted. The PR contains a scheduled job that does that every Saturday.
The python script can also be executed at any time; it deletes only roles created prior to the current day, leaving all currently running jobs unaffected.
As roles cannot be guaranteed to be cleaned up after test execution, due to many factors, we should remove them on a regular basis, and this is perhaps the quickest and most reliable approach.

---------
Co-authored-by: Georgi Rusev <Georgi Rusev>
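The cleanup script itself is not reproduced here, but a minimal sketch of the idea, assuming boto3 and a hypothetical role-name prefix, could look like this (real roles also need their inline/attached policies removed before deletion):

```
# Sketch only: the prefix and error handling are placeholders, not the real script.
import datetime
import boto3

iam = boto3.client("iam")
today = datetime.datetime.now(datetime.timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

paginator = iam.get_paginator("list_roles")
for page in paginator.paginate():
    for role in page["Roles"]:
        if role["RoleName"].startswith("arcticdb_test_") and role["CreateDate"] < today:
            # delete_role fails if policies are still attached; the real script
            # would detach/delete those first.
            iam.delete_role(RoleName=role["RoleName"])
```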
commit 0136f4ca52559e0640dc1b7518d6a8b0773ed3a8
Author: Ognyan Stoimenov <[email protected]>
Date: Fri May 9 14:36:54 2025 +0300

Fix permissions for the automatic docs building (#2347)

#### What does this implement or fix?
Fixes failures when building the docs automatically on release, like: https://github.com/man-group/ArcticDB/actions/runs/14832306883

commit 652d968561d473599e90508078005c4fd00a1ba4
Author: Phoebus Mak <[email protected]>
Date: Sat May 3 02:03:44 2025 +0100

Query Stat framework v3 (#2304)

#### What does this implement or fix?
New query stat implementation whose schema is static.
The feature of linking arcticdb API calls to storage operations has been dropped; now only storage operation stats are logged. Therefore the schema of the stats is hardcoded, and to allow summation of the logged stats, one static object with numerous atomic ints is enough to do the job. No fancy map nor modification of the folly executor.

#### Any other comments?
Sample output:
```
{
    // Stats
    "SYMBOL_LIST": // std::array<std::array<OpStats, NUMBER_OF_TASK_TYPES>, NUMBER_OF_KEYS>
    {
        "storage_ops": {
            "S3_ListObjectsV2": { // OpStats
                "result_count": 1,
                "total_time_ms": 34
            }
        }
    }
}
```

commit 9b93303adf8d5c436ae267be4d950fc5e55139de
Author: Vasil Danielov Pashov <[email protected]>
Date: Fri May 2 17:29:18 2025 +0300

Hold the GIL when incrementing None's refcount to prevent race conditions when there are multiple Python threads (#2334)

None is a global static object in Python which is also refcounted. When ArcticDB creates `None` objects it must increase their refcount, and it must acquire the GIL when the refcount is increased. Currently we don't acquire the GIL when we do this; we only hold a SpinLock protecting other ArcticDB threads from racing on the refcount.
With this change we add an atomic variable in the PythonHandler data which accumulates the refcount. Then, at the end of the operation, when we reacquire the GIL, we increase the refcount. The same is done for the NaN refcount; note that we don't really need the GIL to increase NaN's refcount, as we create it internally and don't hand it to Python until the read operation is done.
Currently only read operations need to work with the `None` object. `apply_global_refcounts` must be called at the very end, before passing the dataframe to python, to prevent something raising an exception after the refcount is applied but before python receives the data. Increasing None's refcount but never decreasing it doesn't seem to be fatal, but we're trying to be good citizens. The best place for that is `adapt_read_df` or `adapt_read_dfs`, as they are called at the end of all read functions.
The code is changed so that the type handler data is always created in the python bindings file, as it's easier to track.
---------
Co-authored-by: Vasil Pashov <[email protected]>

commit d4b40e287863960d608d52131471a88a435bf844
Author: Phoebus Mak <[email protected]>
Date: Fri May 2 11:13:30 2025 +0100

Update docs for sts ca issue (#2265)

#### What does this implement or fix?
Clarify when the workaround for the STS CA issue is needed.

commit a9d0e41e47c40a34e2e146a4297b5c638375fe85
Author: Phoebus Mak <[email protected]>
Date: Tue Apr 29 17:44:08 2025 +0100

Skip azurite api check (#2288)

#### What does this implement or fix?
The API check in Azurite has brought pain to local tests, as the azurite version needs to keep up with the SDK version. We are only using a very simple API, so it is safe to skip the check.

commit 550d3e7c29a5f9d67a0e993bbabc1cbf88295ef1
Author: grusev <[email protected]>
Date: Thu Apr 24 17:45:21 2025 +0300

initial version fix for GCP (#2326)
---------
Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 41a2086963e018ffe0ac90e6fea72d3577d463f3
Author: Alex Owens <[email protected]>
Date: Wed Apr 23 12:31:26 2025 +0100

Timeseries defrag function (#2319)

#### What does this implement or fix?
Adds a (private) function to defragment timeseries data. See the big list of caveats in the code comments for limitations.

commit 61b00e99ce7861a0fd767572be0d58600c065b53
Author: Vasil Danielov Pashov <[email protected]>
Date: Thu Apr 17 16:04:41 2025 +0300

Fix race conditions on the None object refcount during a multithreaded read (#2320)

#### What does this implement or fix?
**Bugfix**
Columns are handled in multiple threads during read calls. String columns can contain `None` values. `None` is a global static ref counted object and the refcount is not atomic. When ArcticDB places `None` objects in columns it must increment the refcount. Currently None objects are allocated only via type handlers. ArcticDB has a global spin-lock that is shared by all type-handlers. The bug is caused by [this line](https://github.com/man-group/ArcticDB/blob/300e121e1be47ecfbabba78f077851a9c3b0772c/cpp/arcticdb/python/python_utils.hpp#L117): the spin-lock is wrapped in a `std::lock_guard`, but there is a call to `unlock`. When `unlock` is called, another thread will take the lock and start calling `Py_INCREF(Py_None)`, but when the function exits the `std::lock_guard` will call unlock again, allowing another thread to start calling `Py_INCREF(Py_None)` in parallel.
**Refactoring**
- Remove GIL safe py none. It was created because pybind11 wraps `Py_None` in an object and calls `Py_INCREF(Py_None)`, and we must hold the GIL when incrementing the refcount. The wrapper we have was used only to get the pointer to the `Py_None` object. We don't need pybind11 to do that. Using the C API we can directly get `Py_None`, which is a global object.
- Add a function to check if a python object is `None`
- Remove uses of py::none{} in places where we don't hold the GIL (most of those were just to get the `Py_None` object that's inside `py::none`)
---------
Co-authored-by: Vasil Pashov <[email protected]>

commit 396757028afbd460fd6325fd2403636ed8482d56
Author: Julien Jerphanion <[email protected]>
Date: Thu Apr 17 11:39:55 2025 +0200

Support MSVC 19.29 (#2332)

Signed-off-by: Julien Jerphanion <[email protected]>

commit b89fc53dbd7cd1eee783fed1fba7b401d69b6ffd
Author: Georgi Petrov <[email protected]>
Date: Wed Apr 16 15:35:56 2025 +0300

Increase tolerance to arithmetic mismatches with Pandas with floats (#2333)

#### Reference Issues/PRs
https://github.com/man-group/ArcticDB/actions/runs/14487537861/job/40636907727?pr=2331

#### What does this implement or fix?
To resolve this type of flakiness:
``` python
FAILED tests/hypothesis/arcticdb/test_resample.py::test_resample - AssertionError: Series are different
Series values are different (100.0 %)
[index]: [1969-12-31T23:59:01.000000000]
[left]: [-1706666.6666666667]
[right]: [-1706325.3333333333]
At positional index 0, first diff: -1706666.6666666667 != -1706325.3333333333
Falsifying example: test_resample(
    df=
                                  col_float              col_int  col_uint
    1970-01-01 00:00:00.000000000       0.0  9223372036849590785         0
    1970-01-01 00:00:00.000000001       0.0                  512         0
    1970-01-01 00:00:00.000000002       0.0 -9223372036854710785         0
    ,
    rule='1min',
    origin='start',
    offset='1s',
)
You can reproduce this example by temporarily adding @reproduce_failure('6.72.4', b'AXicY2RgYGQAYxCCUEwMyAAkzVD/Hwg2PGIEq2ACqgASjBDR/0yMMFUwAAB9FAui') as a decorator on your test case
```

#### Any other comments?
A similar fix was done here: https://github.com/man-group/ArcticDB/commit/fe9de294580526e921102fbdedda736f20596fc7

commit 30f4c48db0d742898f629d129b5d1caa83091662
Author: Alex Seaton <[email protected]>
Date: Wed Apr 16 13:08:30 2025 +0100

Symbol sizes API (#2266)

Add Python APIs to get sizes of symbols, in a new `AdminTools` class. Add documentation for this feature to our website.
You can access the new tools with:
```
lib: Library
lib.admin_tools(): AdminTools
```
Refactor the existing symbol scanning APIs to a visitor pattern so they can all share as much of the implementation as possible.
Monday: 8560764974
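As a hedged illustration of the new surface, a minimal sketch follows; the method names below are assumptions about what `AdminTools` exposes, and the documentation added in this PR is the authority:

```
# Sketch only: get_sizes/get_sizes_by_symbol are assumed method names.
import arcticdb as adb

ac = adb.Arctic("lmdb:///tmp/admin_tools_demo")  # any backend works the same way
lib = ac.get_library("demo", create_if_missing=True)

admin = lib.admin_tools()
by_key_type = admin.get_sizes()           # assumed: totals broken down by key type
by_symbol = admin.get_sizes_by_symbol()   # assumed: totals broken down per symbol
```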
commit 6b3c593924808d33a39e275f921f613f77139d06
Author: Georgi Petrov <[email protected]>
Date: Wed Apr 16 14:32:57 2025 +0300

Prevent exceptions in ReliableStorageLockGuard destructor (#2331)

#### What does this implement or fix?
Sometimes when trying to release the lock, exceptions can occur (either storage-related or others). This PR tries to catch all exceptions, mainly to prevent unnecessary seg faults in enterprise.

commit aa585fc0a5ae60f61f1752d78614e0951047d21e
Author: Julien Jerphanion <[email protected]>
Date: Wed Apr 16 10:10:11 2025 +0200

conda-build: Extend development environment for Windows (#2328)

#### Reference Issues/PRs
Extracted from https://github.com/man-group/ArcticDB/pull/2252.

Signed-off-by: Julien Jerphanion <[email protected]>

commit 42091dbe1ea4b7b827cad4f53b2ef099eb43b4fb
Author: Ognyan Stoimenov <[email protected]>
Date: Tue Apr 15 18:13:47 2025 +0300

Fix pr getting action (#2323)

#### What does this implement or fix?
https://github.com/VanOns/get-merged-pull-requests-action was updated to fix some issues but changed its API.
* Accommodate the new API
* Remove the previous workaround (now fixed)
* Pin the action to 1.3.0 so no such breaks happen in the future
* The changelog generator was not skipping release candidates when comparing versions. Fixed now
* Fix docs building permission
commit 311c1bf8099a491bf1dd85c09e83d640f9d6ce74
Author: Julien Jerphanion <[email protected]>
Date: Tue Apr 15 17:13:05 2025 +0200

ci: Benchmark workflow adaptations (#2327)

#### What does this implement or fix?
Fixes the import error, working around https://github.com/airspeed-velocity/asv/issues/1465.

Signed-off-by: Julien Jerphanion <[email protected]>

commit 7b37536b67b8410d2d890b8ee8bf38b05181aa61
Author: Vasil Danielov Pashov <[email protected]>
Date: Tue Apr 15 11:25:03 2025 +0300

Refactor to_atom and to_ref to properly use forwarding references (#2321)

#### What does this implement or fix?
This solves two problems:
- Code duplication. to_atom had 3 overloads (for value/ref/rvalue ref) of the same thing. Forwarding references were invented to solve this problem.
- There were unnecessary copies. `to_atom` had an overload taking `VariantKey` by value. At some point some APIs changed and started returning `AtomKey` instead of `VariantKey`; due to the excessive use of `auto`, nobody noticed the difference. Thus we ended up calling `to_atom` on an atom key. That worked because `VariantKey` can be constructed implicitly from an `AtomKey`, so we ended up constructing a `VariantKey` from an `AtomKey` only to extract the `AtomKey` from it again. Forwarding references do not allow implicit conversions, so the compiler pointed out all places in the code where the above happens.
commit 300e121e1be47ecfbabba78f077851a9c3b0772c
Author: grusev <[email protected]>
Date: Fri Apr 11 14:07:36 2025 +0300

Update s3.py moto*.create_fixture - add retry attempts (#2311)

#### What does this implement or fix?
Addresses a couple of flaky tests opened due to NFS or S3…
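The commit message is truncated here, but based on its title the retry idea is roughly the following sketch; the function and parameter names are hypothetical, not the actual fixture code:

```
# Illustrative retry wrapper around a storage fixture factory; names are placeholders.
import time

def create_fixture_with_retries(factory, attempts=3, delay_s=1.0):
    """Retry transient failures when standing up a moto-backed storage fixture."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return factory.create_fixture()
        except Exception as exc:  # the real code would narrow this down
            last_exc = exc
            time.sleep(delay_s * (attempt + 1))
    raise last_exc
```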
Reference Issues/PRs
Contains refactored framework for setting up shared storages + tests for AWS S3 storage
Merged 3 PRs into one:
Important: the benchmark tests in this PR cannot run successfully here, so do not take them as a criterion. All tests need to be run manually. Here are runs from 27 March:
LMDB set: https://github.com/man-group/ArcticDB/actions/runs/14100376040/job/39495398374
Real set: https://github.com/man-group/ArcticDB/actions/runs/14100497273/job/39495728734
What does this implement or fix?
Any other comments?
Checklist
Checklist for code changes...