
Asv v2 s3 tests (Refactored) #2249


Merged
merged 1 commit into master from asv_v2_s3_tests on May 12, 2025

Conversation

grusev
Collaborator

@grusev grusev commented Mar 17, 2025

Reference Issues/PRs

Contains refactored framework for setting up shared storages + tests for AWS S3 storage

Merged 3 PRs into one:

Important: the benchmark test runs attached to this PR cannot run successfully, so do not use them as a criterion. All tests need to be run manually. Here are runs from 27 March:
LMDB set: https://github.com/man-group/ArcticDB/actions/runs/14100376040/job/39495398374
Real set: https://github.com/man-group/ArcticDB/actions/runs/14100497273/job/39495728734

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev grusev added the patch Small change, should increase patch version label Mar 17, 2025
MODIFIABLE = "MODIFIABLE"


class StorageSetup:
Collaborator

The StorageSetup class can easily be refactored to be more readable like this:

def aws_default_factory() -> BaseS3StorageFixtureFactory:
    return real_s3_from_environment_variables(shared_path=True)

def get_machine_id() -> str:
    """
    Returns machine id, or id specified through environments variable (for github)
    """
    return os.getenv("ARCTICDB_PERSISTENT_STORAGE_SHARED_PATH_PREFIX", socket.gethostname())


def create_prefix(storage_space: StorageSpace, add_to_prefix: str) -> str:
    def is_valid_string(s: str) -> bool:
        return bool(s and s.strip())

    mandatory_part = storage_space.value
    optional = add_to_prefix if is_valid_string(add_to_prefix) else ''
    return f"{mandatory_part}/{optional}" if optional else mandatory_part


def check_persistence_access(storage_space: StorageSpace, confirm_persistent_storage_need: bool = False):
    assert aws_default_factory(), "Environment variables not initialized (ARCTICDB_REAL_S3_ACCESS_KEY,ARCTICDB_REAL_S3_SECRET_KEY)"
    if storage_space == StorageSpace.PERSISTENT:
        assert confirm_persistent_storage_need, "Use of persistent store not confirmed!"


def get_arctic_uri(storage: Storage, storage_space: StorageSpace, add_to_prefix: str = None, confirm_persistent_storage_need: bool = False) -> str:
    check_persistence_access(storage_space, confirm_persistent_storage_need)
    prefix = create_prefix(storage_space, add_to_prefix)
    if storage == Storage.AMAZON:
        factory = aws_default_factory()
        factory.default_prefix = prefix
        return factory.create_fixture().arctic_uri
    elif storage == Storage.LMDB:
        return f"lmdb://{tempfile.gettempdir()}/benchmarks_{prefix}"
    else:
        raise Exception("Unsupported storage type:", storage)
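
For context, a minimal usage sketch of the proposed free functions (the call site, the prefix value and the Arctic import are illustrative assumptions, not code from this PR):

from arcticdb import Arctic

# Hypothetical call in a benchmark's setup; names mirror the snippet above.
uri = get_arctic_uri(
    storage=Storage.AMAZON,
    storage_space=StorageSpace.PERSISTENT,
    add_to_prefix="list_symbols",
    confirm_persistent_storage_need=True,  # the persistent space must be confirmed explicitly
)
ac = Arctic(uri)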

Collaborator Author

I do not agree. There is value in separating responsibility, and TestLibraryManager works.

It provides better isolation and management, so I disagree with making those changes.

Collaborator

I don't have a strong opinion whether these live inside a class or as separate functions. I think in this case the class only provides namespacing, which can also be done with a module but I don't think it matters hugely.

I think @G-D-Petrov's suggestion provides a few simplifications which I find valuable:

  • The main purpose of this being a class seems to be the caching of _aws_default_factory. I don't think creating it is expensive? Why bother caching it then? With caching we need to worry about forgetting to initialize it the first time, i.e. we can just replace the __new__ with an aws_default_factory function.
  • The create_prefix is rewritten in a much more concise way.

To sum up, I don't mind leaving this as a class for the namespacing, but it would be nice to still simplify the logic where we can.

Collaborator Author

I have made the suggested simplifications in the class (part of them happened anyway, as Google Cloud support was added and other things changed in the meantime). So the code is simpler, but still in a class, where it should be. When a module contains lots of mixed functions, classes can and should be used to provide namespacing and encapsulation; that is good style, as it helps with writing code (IDEs help through IntelliSense etc.). Long files with lots of functions, one scattered here and another there, mixed with other functions, are not a great example of code, and that is what we will end up with if we do not use classes. Classes provide clarity of purpose.



class StorageInfo:
class LibraryManager:
Collaborator

As we spoke yesterday, this LibraryManager is unnecessary because it duplicates a lot of the logic of Arctic's internal LibraryManager.
The only needed functionality is:

  • having a function to create the persistent/modifiable Arctic client with the correct URIs
  • having a function that constructs the correct names for the libraries
  • some helper function for cleaning up modifiable libraries, which can just iterate over the libraries in the modifiable Arctic client

Everything else here can easily be handled through the Arctic clients directly.
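
For illustration, a minimal sketch of that reduced surface, assuming the get_arctic_uri helper from the earlier suggestion; the function names and prefix layout are placeholders, not the PR's actual API:

import os
from arcticdb import Arctic

def get_arctic_client(storage_space: StorageSpace) -> Arctic:
    # One client per storage space (persistent or modifiable), built from the shared URI helper.
    return Arctic(get_arctic_uri(Storage.AMAZON, storage_space,
                                 confirm_persistent_storage_need=True))

def get_library_name(library_type: str, benchmark_name: str, suffix: str = "") -> str:
    # Deterministic names; modifiable libraries embed the pid so parallel runs do not clash.
    pid_part = f"_{os.getpid()}" if library_type == "MODIFIABLE" else ""
    return f"{library_type}_{benchmark_name}{pid_part}_{suffix}".rstrip("_")

def clear_modifiable_libraries(ac: Arctic, benchmark_name: str) -> None:
    # Iterate over the modifiable client and drop everything this benchmark created.
    prefix = get_library_name("MODIFIABLE", benchmark_name)
    for lib_name in ac.list_libraries():
        if lib_name.startswith(prefix):
            ac.delete_library(lib_name)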

Collaborator Author

Again, all tests work and pass, and the abstraction isolates the user from ASV etc. very well. All the points you make are perhaps valid but constitute a totally different approach. As I disagree with that approach on real grounds, I cannot make those changes.

TestLibraryManager isolates within itself everything the user would otherwise need to understand about the structure, and lets the person who writes the test write any type of test using the requested libraries.

It gives the creator full freedom to do what is needed to achieve the best result.

It provides a way of working that eliminates the need for the test author to know ASV internals, and thus protects from problems that would otherwise arise during test execution.

It also gives the ability to change the structure of the storage spaces without changing test code.

All of that is well tested, and the framework itself is covered with tests that can be extended.

There is no point in making any changes; it would just waste more resources.

Collaborator Author

BTW, it is TestLibraryManager now.

Collaborator Author

@grusev grusev Mar 27, 2025

TestLibraryManager does not duplicate any work of Arctic's library manager. It provides much needed isolation of test creation from where and how Arctic will create the libraries. More info can be found in the new documentation of the class:

"This class is a thin wrapper around the Arctic class. Its goal is to provide natural user …"

As a conclusion, I do not argue that a version without a class, using only functions, could not eventually be created. I argue about its value and need. There are many arguments in the description there for why not to take this approach. One additional and very compelling one is the fact that Arctic is a class, and in order to override a class you need a class, not functions. (Arctic is not a base class for good reason.)

I find enough reasons that the current implementation is better. One I can name is the fact that no one at Man uses Arctic directly to create libraries; there is a UI library manager. In all real implementations, when you have specific needs for managing the infrastructure, you override what is available. The current implementation is exactly that. That is why it is OK, and any change is rather a waste of resources; the eventual result is more likely to force implementation of something similar at some point.


WIDE_DATAFRAME_NUM_COLS = 30_000

class LargeAppendDataModifyCache:
Collaborator

This name is not very descriptive and the comment seems a bit misleading.
AFAICS this is a cache for the expected results throughout the run.
The name/comment should reflect that.

Collaborator Author

Will make necessary changes

def get_population_policy(self):
pass

def get_index_info(self):
Collaborator

should be renamed to something like index_start, then it doesn't need the comment

Collaborator Author

I think the naming is OK. It returns both the start and the index frequency, hence the name get_index_info(); the comment there also says so:

def get_index_info(self):
    """
    Returns initial timestamp and index frequency
    """
    return (pd.Timestamp("2-2-1986"), 's')

def initialize_cache(self, warmup_time, params, num_cols, num_sequential_dataframes):
# warmup will execute tests additional time and we do not want that at all for write
# update and append tests. We want exact specified `number` of times to be executed between
assert warmup_time == 0, "warm up must be 0"
Collaborator

If it can be only 1 value, why do we even have it as a parameter?

Collaborator Author

The parameters of the tests are defined as such in each test case. Thus test case A's parameters are members of class A, not of the instance; test case B's parameters belong to test case B. The function checks the parameters of different classes:

class AWSLargeAppendTests(AsvBase)
class AWS30kColsWideDFLargeAppendTests(AWSLargeAppendTests)

Collaborator Author

The test code in some tests does contain assertions about certain ASV parameter values. Those assertions are needed because the tests either cannot work if they do not have that specific value, or they might work but produce false results.

That is an additional thing added to the new implementation of the tests. They check two things:

  • validity of the preconditions (usually ASV parameters, or the setup)
  • validity of the test operations (see, for instance, the asserts for batch operations in the batch tests, which are silent by default, i.e. they do not fail loudly on error)

# update and append tests. We want exact specified `number` of times to be executed between
assert warmup_time == 0, "warm up must be 0"

num_sequential_dataframes = num_sequential_dataframes + 1
Collaborator

num_sequential_dataframes += 1

Collaborator Author

but it is also correct in its current form, right?

Collaborator Author

fixed


return cache

def initialize_update_dataframes(self, num_rows: int, num_cols: int, cached_results: LargeAppendDataModifyCache,
Collaborator

This function is a bit hard to follow, consider refactoring it to something like:

def initialize_update_dataframes(self, num_rows: int, num_cols: int, cached_results: LargeAppendDataModifyCache, 
                                 generator: SequentialDataframesGenerator):
    logger = self.get_logger()
    initial_timestamp, freq = self.get_index_info()
    timestamp_number = TimestampNumber.from_timestamp(initial_timestamp, freq)
    
    def log_time_range(update_type: str, df_key: int):
        time_range = generator.get_first_and_last_timestamp([cached_results[update_type][df_key]])
        logger.info(f"Time range {update_type.upper()} update {time_range}")

    def generate_and_log(update_type: str, num_rows: int, start_ts: pd.Timestamp):
        df = generator.df_generator.get_dataframe(number_rows=num_rows, number_columns=num_cols, start_timestamp=start_ts, freq=freq)
        cached_results[update_type][num_rows] = df
        log_time_range(update_type, num_rows)

    logger.info(f"Frame START-LAST Timestamps {timestamp_number} == {timestamp_number + num_rows}")

    # Full update
    generate_and_log('update_full_dict', num_rows, initial_timestamp)

    # Half update
    half = num_rows // 2
    timestamp_number.inc(half - 3)
    generate_and_log('update_half_dict', half, timestamp_number.to_timestamp())

    # Upsert update
    generate_and_log('update_upsert_dict', num_rows, timestamp_number.to_timestamp())

    # Single update
    timestamp_number.inc(half)
    generate_and_log('update_single_dict', 1, timestamp_number.to_timestamp())

    # Single append
    next_timestamp = generator.get_next_timestamp_number(cached_results.write_and_append_dict[num_rows], freq)
    generate_and_log('append_single_dict', 1, next_timestamp.to_timestamp())

Collaborator Author

makes sense!


def get_modifiable_library(self, library_suffix: Union[str, int] = None) -> Library:

class LibraryPopulationPolicy:
Collaborator

As we have discussed, this can be greatly simplified by decoupling the configuration for the population from the logic that executes it.
This can be done with a refactor like:

@dataclass
class LibraryPopulationConfig:
    """Immutable configuration for library population."""
    parameters: List[int]
    parameters_are_rows: bool = True
    fixed_rows: int = 1
    fixed_columns: int = 1
    symbol_prefix: str = ""
    use_auto_increment: bool = False
    with_metadata: bool = False
    versions_count: int = 1
    versions_mean: float = 1.0
    with_snapshots: bool = False

    def symbol_name(self, index: int) -> str:
        """Get the symbol name based on configuration."""
        prefix = f"symbol_{self.symbol_prefix}_" if self.symbol_prefix else "symbol_"
        return f"{prefix}{index}"
    
    def create_metadata(self) -> Dict[str, Any]:
        """Create metadata for symbols and snapshots."""
        if not self.with_metadata:
            return {}
        return DFGenerator.generate_random_dataframe(rows=3, cols=10).to_dict()


class LibraryPopulator:
    """
    Handles the actual population of a library based on a configuration.
    Separates the configuration from the execution.
    """
    def __init__(self, config: LibraryPopulationConfig, logger: logging.Logger, 
                 df_generator: DataFrameGenerator = None):
        self.config = config
        self.logger = logger
        self.df_generator = df_generator or VariableSizeDataframe()
    
    def populate(self, library):
        """Populate the library according to the configuration."""
        start_time = time.time()
        
        for i, param in enumerate(self.config.parameters):
            # Determine symbol index
            symbol_index = i if self.config.use_auto_increment else param
            symbol_name = self.config.symbol_name(symbol_index)
            
            # Determine rows and columns
            rows = param if self.config.parameters_are_rows else self.config.fixed_rows
            columns = self.config.fixed_columns if self.config.parameters_are_rows else param
            
            # Generate dataframe
            df = self.df_generator.generate_dataframe(rows, columns)
            
            # Create symbol
            symbol = library.create_symbol(symbol_name, df)
            
            # Add metadata if configured
            if self.config.with_metadata:
                symbol.set_metadata(self.config.create_metadata())
            
            # Create versions if configured
            if self.config.versions_count > 1:
                versions_list = self._generate_versions_list(len(self.config.parameters))
                for v in range(1, min(versions_list[i], self.config.versions_count) + 1):
                    version_df = self.df_generator.generate_dataframe(rows, columns)
                    version = symbol.create_version(version_df)
                    
                    # Add metadata if configured
                    if self.config.with_metadata:
                        version.set_metadata(self.config.create_metadata())
                    
                    # Create snapshot if configured
                    if self.config.with_snapshots:
                        snapshot = library.create_snapshot(f"snapshot_{symbol_name}_{v}")
                        if self.config.with_metadata:
                            snapshot.set_metadata(self.config.create_metadata())
        
        self.logger.info(f"Population completed in: {time.time() - start_time:.2f}s")
    
    def _generate_versions_list(self, number_symbols: int) -> List[np.int64]:
        """Generate a list of version counts for each symbol."""
        # Implementation would depend on your specific requirements
        # This is a placeholder based on the original code
        versions_list = np.random.poisson(self.config.versions_mean, number_symbols)
        versions_list = np.clip(versions_list, 1, self.config.versions_count)
        return versions_list.astype(np.int64)

The code is just an example that I got from a pass through Claude and can be simplified further, e.g.:

  • there are parameters in the Policy that can be removed
  • we probably don't need a LibraryPopulator class, as some helper functions that take a policy should suffice

Collaborator Author

As we discussed, we also agreed that the current code does not conflict with any recommendations. Additionally, there is no real benefit in making that rewrite other than passing along Claude's suggestions. The code is OK in its current form; no need to change it.

Collaborator Author

The current implementation is OK; it uses the fluent pattern, which is widely used in Python and even in ArcticDB (queries). The requested change would not add any additional value, nor would it fix any error, so it is not needed at this moment.

MODIFIABLE = "MODIFIABLE"


class StorageSetup:

'''
Defines special one-time setup for real storages.
Place here what is needed for proper initialization
of each storage.

Abstracts storage space allocation from how the user accesses it.
'''
_instance = None
_aws_default_factory: BaseS3StorageFixtureFactory = None
_fixture_cache = {}
Collaborator

fixture_cache is now unused?

Collaborator Author

done

return result

def set_test_mode(self):
def __init__(self, storage: Storage, name_benchmark: str, library_options: LibraryOptions = None) :
Collaborator

Should all libraries have the same LibraryOptions?

Doesn't it make more sense to pass LibraryOptions to get_library? This way we can create two libraries with different options which might be useful for some tests.
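
A sketch of how that signature could look (client selection, LibraryType handling and get_library_name are assumptions; in arcticdb, library_options are used together with create_if_missing when the library does not exist yet):

from typing import Union
from arcticdb import LibraryOptions

def get_library(self, library_type: "LibraryType", library_suffix: Union[str, int] = None,
                library_options: LibraryOptions = None) -> "Library":
    name = self.get_library_name(library_type, library_suffix)
    ac = (self._get_arctic_client_persistent()
          if library_type == LibraryType.PERSISTENT
          else self._get_arctic_client_modifiable())
    # library_options only take effect on the creation path.
    return ac.get_library(name, create_if_missing=True, library_options=library_options)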

Collaborator Author

get_library will return an existing library in most cases; only if it does not exist will it create a new one. Passing library options is therefore possible, but it is not quite determined what to do when the library already exists: check whether the lib has the same options, or skip and disregard them. Still confusing.

The whole TestLibraryManager is a thin wrapper over Arctic whose goal is to "hide" the details of internal management, i.e. how exactly the placement of libraries happens and where and how to locate and remove them. The argument that we can live without classes is not a good one; if we could, we would not have fixtures as classes in the first place. And that is exactly what this class is: an extension of the functionality of the Arctic class, suited to doing the job better in a very specific environment.

Thus the possible and better options in this case are:

  • define a create_library method ... now or when we actually need it.

  • work around it with what we have currently:
    tlm.library_options = LibraryOptions(...)
    lib = tlm.get_library(...)

The first proposal is the clean way; the second is a workaround. But we do not yet have a scenario that requires it, hence create_library was not implemented.

That is something a specialized class provides over free-floating functions: encapsulation and emergent design (extend when you need to, change something easily when new information arises, without breaking tests, etc.).

Classes are a natural way to grow in complexity and need, as they require use of the whole contract, not only parts of it function by function, which in time would erode the overall specialization and lead to conflicting requirements.

# Currently we're using the same arctic client for both persistent and modifiable libraries.
# We might decide that we want different arctic clients (e.g. different buckets) but probably not needed for now.
def _get_arctic_client_persistent(self) -> Arctic:
lib_type = StorageSpace.PERSISTENT
Collaborator

Isn't this storage_type not lib_type?

Collaborator Author

fixed

Collaborator Author

The four methods serve very different purposes. Their usage scenarios are now documented, as their use might not be that obvious.

ac = self._get_arctic_client_modifiable()
lib_names = set(ac.list_libraries())
to_deletes = [lib_name for lib_name in lib_names
if (f"_{os.getpid()}_" in lib_name) and (f"_{self.name_benchmark}_" in lib_name)]
Collaborator

Why not just the simpler lib_name.startswith(f"{LibraryType.Modifiable}_{self.name_benchmark}_{os.getpid()}_")?


def delete_modifiable_library(self, library_suffix: Union[str, int] = None):
def clear_all_benchmark_libs(self):
Collaborator

All these clear functions can use a common clear_modifiable_with_prefix(self, prefix)
And clear_all_modifiable_from_this_process can pass the prefix with the pid, whereas clear_all_benchmark_libs can pass the prefix without the pid.
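
A rough sketch of the shared helper being suggested (methods on TestLibraryManager; the prefix layout and use of LibraryType.MODIFIABLE.value are assumptions mirroring the naming discussed in this thread):

import os

def _clear_modifiable_with_prefix(self, prefix: str) -> None:
    # Common cleanup path: delete every modifiable library whose name starts with `prefix`.
    ac = self._get_arctic_client_modifiable()
    for lib_name in ac.list_libraries():
        if lib_name.startswith(prefix):
            ac.delete_library(lib_name)

def clear_all_modifiable_libs_from_this_process(self):
    self._clear_modifiable_with_prefix(
        f"{LibraryType.MODIFIABLE.value}_{self.name_benchmark}_{os.getpid()}_")

def clear_all_benchmark_libs(self):
    self._clear_modifiable_with_prefix(
        f"{LibraryType.MODIFIABLE.value}_{self.name_benchmark}_")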

Collaborator

Also remove_all_modifiable and remove_all_test_libs can also use the private prefix method.

return all(isinstance(i, list) for i in self._params)
# It is quite clear what is this responsible for: only dataframe generation
# Using such an abstraction can help us deduplicate the dataframe generation code between the different `EnvironmentSetup`s
# Note: We use a class instead of a generator function to allow caching of dataframes in the state
Collaborator

This doc is a leftover from the refactor proposal. We can probably just remove it, from the name it is clear enough what this does.

Collaborator

@IvoDD IvoDD left a comment

I added quite a few small change suggestions.

A few more important things to discuss, maybe offline:

  • random seeding
  • list symbol tests with respect to symbol list compaction



self.symbol_fixed_str = symbol_fixed_str
return self

def use_parameters_are_columns(self) -> 'LibraryPopulationPolicy':
Collaborator

I find the notion of parameters which can be either rows or columns a little confusing.
Why not just use num_rows: Union[int, List[int]] and num_cols: Union[int, List[int]]?
This way it's a bit clearer: if one of them is just an int it applies to all symbols; if it is a list it is used similarly to the parameters. If they are both lists we should assert they have the same size.

Collaborator

This will also need to include a num_symbols. In general, if the constructor receives num_symbols, num_rows, and num_cols, and asserts that if rows or cols is a list it has the same length as num_symbols, I think that would be quite easy to use.

Collaborator

Anyway, then the symbol naming becomes a bit less clear. Up to you.
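
For illustration, a sketch of the constructor shape being discussed (the class name exists in the PR, but this particular signature and the helper accessors are assumptions):

from typing import List, Union

class LibraryPopulationPolicy:
    def __init__(self, num_symbols: int,
                 num_rows: Union[int, List[int]] = 1,
                 num_cols: Union[int, List[int]] = 1):
        # Lists must provide one value per symbol; plain ints apply to every symbol.
        for name, value in (("num_rows", num_rows), ("num_cols", num_cols)):
            if isinstance(value, list):
                assert len(value) == num_symbols, f"{name} must have {num_symbols} entries"
        self.num_symbols = num_symbols
        self.num_rows = num_rows
        self.num_cols = num_cols

    def rows_for(self, i: int) -> int:
        return self.num_rows[i] if isinstance(self.num_rows, list) else self.num_rows

    def cols_for(self, i: int) -> int:
        return self.num_cols[i] if isinstance(self.num_cols, list) else self.num_cols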

def set_max_number_versions(self, versions_max) -> 'GeneralSetupSymbolsVersionsSnapshots':
self.versions_max = versions_max

def generate_snapshots(self) -> 'LibraryPopulationPolicy':
Collaborator

Nit: I find the names starting with generate_ a little confusing. It sounds like they themselves will generate something. Why not just call them set_with_snapshots etc.?

Collaborator Author

will think of better name

meta = None if not self.with_metadata else self._generate_metadata()
versions_list = self._get_versions_list(len(self.parameters))
index = 0
for param_value in self.parameters:
Collaborator

better use for index, param_value in enumerate(self.parameters)

else:
symbol = self.get_symbol_name(param_value)

if self.parameters_is_number_rows_list:
Collaborator

Maybe it makes more sense to generate a new dataframe for each version, instead of each version having the same dataframe? This would also be useful if we decide to allow (via a LibraryPopulationPolicy flag) updating instead of writing.

Collaborator Author

For the scenarios that are currently implemented we do not need a new dataframe on each write; that is the reason this was not implemented. And yes, there is value in having new dataframes on each write, but that is an expensive operation and should be approached carefully.

Still, there is already a class that covers the scenario with new dataframes, SequentialDataframesGenerator; it is used in the append scenarios.

Once a real case for the Library Populator arises, an option can easily be added for a new DF on each write.

.set_with_metadata_for_each_version()
.set_with_snapshot_for_each_version()
.set_params([25, 50])) # for test purposes: .set_params([5, 6]))
library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="LIST_SYMBOLS")
Collaborator

Name should be LIST_VERSIONS to not collide with the above

Collaborator Author

correct!

return last_snapshot_names_dict

def setup(self, last_snapshot_names_dict, num_syms):
self.population_policy = self.get_population_policy()
Collaborator

Unused


self.lib = manager.get_library(LibraryType.MODIFIABLE)

self.symbol = f"symbol-{os.getpid()}"
Collaborator

pid is already in the library, fine to use a more generic name

Collaborator Author

@grusev grusev Mar 25, 2025

Yep, we may choose to use it or not :-) In our case it really does not matter what the name is.

self.lib.update(self.symbol, self.cache.update_half_dict[num_rows])

def time_update_full(self, cache, num_rows):
#self.lib.update(self.symbol, self.cache.update_full)
Collaborator

Remove comment?

# We could clear the modifiable libraries we used
self.get_library_manager().clear_all_modifiable_libs_from_this_process()

def get_last_x_percent_date_range(self, num_rows, percents):
Collaborator

That's why I wanted to move this function into utils. A nearly identical function was used in read_batch_functions.py IIRC.
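
Something along these lines could live in the shared utils (a sketch only; the real function in read_batch_functions.py may differ, and `percents` is assumed to be a fraction of the rows):

import pandas as pd

def get_last_x_percent_date_range(num_rows: int, percents: float,
                                  start: pd.Timestamp = pd.Timestamp("2-2-1986"),
                                  freq: str = "s"):
    # Build the full index the symbol was written with and return the range
    # covering only the last `percents` fraction of rows.
    index = pd.date_range(start=start, periods=num_rows, freq=freq)
    first_row = int(num_rows * (1 - percents))
    return index[first_row], index[-1]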

Collaborator Author

will do that

return [ListGenerators.random_string(length=str_size,
include_unicode=include_unicode, seed=None) for _ in range(length)]
include_unicode=include_unicode, seed=seed) for _ in range(length)]
Collaborator

Here we are passing the same seed to all random_string calls? That's the behaviour I specifically wanted to avoid
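
One way to keep reproducibility without reusing the same seed for every string is to derive a per-call sub-seed (a sketch; the wrapper name is hypothetical, and ListGenerators.random_string is assumed as used above):

import random
from typing import List

def random_strings(length: int, str_size: int, include_unicode: bool, seed: int) -> List[str]:
    rng = random.Random(seed)
    # Each call gets its own derived seed, so the strings differ but the list is reproducible.
    return [ListGenerators.random_string(length=str_size,
                                         include_unicode=include_unicode,
                                         seed=rng.randint(0, 2**31 - 1))
            for _ in range(length)]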

Collaborator Author

Indeed, a quick replacement without thinking that this is a special case. Thanks for catching it!

@@ -458,13 +458,16 @@ class DFGenerator:
Easy generation of DataFrames, via fluent interface
"""

def __init__(self, size: int, seed = 1):
def __init__(self, size: int, seed = 5555):
self.__seed = seed
Collaborator

Maybe remove the self.__seed since we no longer need to pass it around. We can just do the if below.

@@ -48,7 +50,12 @@ class RealComparisonBenchmarks:
# Therefore if you plan changes to those numbers make sure to delete old library manually
NUMBER_ROWS = 2_000_000 #100_000

params = [NO_OPERATION, CREATE_DATAFRAME, PANDAS_PARQUET, ARCTICDB_LMDB, ARCTICDB_AMAZON_S3]
# NO_OPERATION measures class memory allocation. This is the actual memory that
Collaborator

Doc is not up to date

param_names = LMDBReadWrite.param_names
param_names = ["num_cols"]
# NOTE: Change of parameters will trigger failure as original library must also be deleted manually.
# Therefore if you plan changes to those numbers make sure to delete old library manually
Collaborator

Why do we need to delete the old library? What would happen if we just left it lying around? If these steps are necessary we should have some quick docs for people telling them how to do this - I would have to reverse engineer how to do this cleanup

Collaborator Author

Working with shared persistent storage indeed requires a handbook of what to do and what not to do, and I will create a wiki page for that.

Here is a quick explanation of why this is needed for that test.

When a test creates a library on the shared persistent space, that library is intended to stay. Subsequent test executions check that this library is there and either continue the test or create it once again. The library structure is not checked (a previous version contained structural checks of the library, not just a check of its name, which was considered overly complicated).

Now, if the parameters of the test change, that automatically triggers different code paths in the test, but since the library was created once with the old parameters, the check will pass and it will be considered created. Thus tests done on that library with the new parameters will either produce misleading results or simply not work. Both outcomes are possible.

Therefore there are two ways to battle this problem:

  • having comments like this one, to ensure that whoever changes the parameters cleans up the previous library first.
  • having a more complicated name where the library has the parameters encoded as a suffix (see the sketch below). A change in parameters would then simply mean a new library created on the persistent store, there to stay. The end effect could be many libraries created and the persistent storage bloated, which actually defeats its purpose. (For that there is already the test storage space, which should be used.)

In other words, persistent storage comes with logistical complications, and therefore management should be applied. That is counterintuitive to most developers, as tests are usually considered isolated from each other, and thus errors are possible. But in fact tests can and should be written with such persistence factors in mind. Still, they require some additional management.
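
A tiny sketch of option 2, encoding the parameters into the persistent library name (the names here are purely illustrative):

def persistent_library_name(benchmark: str, num_rows: int, num_cols: int) -> str:
    # A parameter change produces a different name, i.e. a fresh persistent library.
    return f"PERSISTENT_{benchmark}_rows{num_rows}_cols{num_cols}"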

Collaborator Author

Docs will be prepared; for now the comment is changed.

This will remove all persistent libraries for this test from the persistent storage
Therefore use wisely only when needed (like change of parameters for tests)
"""
name_prefix = f"{LibraryType.value}_{self.name_benchmark}"
Collaborator

Should be LibraryType.PERSISTENT.value

# 2. Delete library from persistent storage once new parameters are ready to be committed
# and remove "set_test_mode()". To delete all libs for current test use:
# library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="READ_WRITE")
# library_manager.remove_all_persistent_libs_for_this_test()
Collaborator

I think this comment is confusing. Are you suggesting committing a change to delete the library inside the GitHub runners? That seems like an odd process, because then we would have to revert it.

I think it would be better if the instructions described how to do it locally instead of on the GitHub runner.
What do you think about the following:

If you plan to make changes to parameters, consider that the library may already exist, created with a different number
of rows. Therefore, you need to either:
- Rename the library by changing `name_benchmark` or by adding a `lib_suffix`. Note that this will keep the old library around.
- Clean up the old library with (which needs to be run with the same environment variables as the github runner):
       library_manager = TestLibraryManager(storage=Storage.AMAZON, name_benchmark="READ_WRITE")           
       library_manager.remove_all_persistent_libs_for_this_test()

Collaborator Author

Actually the process is much simpler, and I will leave it to the wiki, which I am currently working on: https://github.com/man-group/ArcticDB/wiki/ASV-Benchmarks:-Real-storage-tests

As we are talking about persistent tests, which are accessed by everyone, no specific environment variables are needed; it is just a simple 2-3 step process which will be explained there, and the comment will have a link to it.

fix omission

lmdb test added and also some more logging

new theory

small fixes

fix

fix

turn off execution of a test

fixed delete tests

small error fix

last attempt

remove setup bug

fix bug

addressed comments

fixed notes

fix date range

get_library_name now remains only method

addressed comments

updated last comment for setyp multiple libs with symbols

fix ommission

tone down logging

fixes for comments

added support for unicode strings

initial work

updates

small change

comments applied

new version

changes requested implemented

new test added

one test remaining

final version

new test

silence noisy batch test

tunnings after first run of tests

fix regression

fixed regression

final version

more important notes

updated comments from GP

added comments

add sanitization

support for GCP

review comments

updated doc string

mutliprocessing test

StorageSetup class optimizations

fix omission

better documentation for TestManagementLibrary

benchmark.json

better naming

fixes on comments

fix error

from comments

comments addressed

comment addressed

forgotten things

new version

enhanced check for asv
@grusev grusev force-pushed the asv_v2_s3_tests branch from 0a9cb08 to bf75182 Compare May 12, 2025 07:45
Collaborator

@poodlewars poodlewars left a comment

Please update the commit message to something meaningful when you merge this

@grusev grusev merged commit 2612fb4 into master May 12, 2025
145 of 147 checks passed
@grusev grusev deleted the asv_v2_s3_tests branch May 12, 2025 12:39
vasil-pashov added a commit that referenced this pull request May 27, 2025
commit facc33bead487490322ba9cc973ed86dc9b5c4c6
Merge: bc68ed467 85d51e3b7
Author: Vasil Danielov Pashov <[email protected]>
Date:   Tue May 27 20:15:59 2025 +0300

    Merge branch 'master' into vasil.pashov/coverity-test-existing-code-with-errors

commit bc68ed467842b510bbd7001175cc8eecefc29e1c
Merge: e68ec0146 91a076cc2
Author: Vasil Pashov <[email protected]>
Date:   Tue May 27 20:12:57 2025 +0300

    Merge branch 'master' into vasil.pashov/coverity-test-existing-file

commit 85d51e3b748982dc9121026a4dfcbd9f5a1dc2fb
Author: Alex Owens <[email protected]>
Date:   Tue May 27 10:54:08 2025 +0100

    Bugfix 9209057536: Allow concatenation of uint64 columns with int* columns (#2365)

    #### Reference Issues/PRs
    Fixes
    [9209057536](https://man312219.monday.com/boards/7852509418/pulses/9209057536)

    #### What does this implement or fix?
    Allows concatenating columns of type uint64 with columns of type int*

commit 91a076cc267caf549ff38cb532dd76c5e4e168ba
Author: Alex Owens <[email protected]>
Date:   Fri May 23 17:46:47 2025 +0100

    Enhancement 7992967434: filters and projections ternary operator (#2103)

    #### Reference Issues/PRs
    Implements
    [7992967434](https://man312219.monday.com/boards/7852509418/pulses/7992967434)

    #### What does this implement or fix?
    Implements a ternary operator equivalent to `numpy.where`, primarily for
    projecting new columns based on some condition, although it can also be
    used for filtering. Semantically the same as `left if condition else
    right`, although this Pythonic syntax cannot be made to work due to
    limitations of the language.

    #### Any other comments?
    See `test_ternary.py` for a plethora of examples and the expected
    behaviour in each case.
    Example benchmark output with annotations below.
    The first parameter to all benchmarks is the number of rows (100k for
    all of them right now), so the single-threaded per-row time can be
    calculated by dividing by 100,000.
    e.g. projecting a new column of 100k rows by choosing from 2 dense
    columns (likely a common use case) takes 424us, or just over 4ns per
    row.
    Other parameters are explained for each individual benchmark.
    ```
    Run on (20 X 2918.4 MHz CPU s)
    CPU Caches:
      L1 Data 48 KiB (x10)
      L1 Instruction 32 KiB (x10)
      L2 Unified 1280 KiB (x10)
      L3 Unified 24576 KiB (x1)
    Load Average: 4.23, 6.56, 6.73
    --------------------------------------------------------------------------------------------------
    Benchmark                                                        Time             CPU   Iterations
    --------------------------------------------------------------------------------------------------
    BM_ternary_bitset_bitset/100000                               13.1 us         13.1 us        58099
    # Second arg is whether the boolean argument is true or false, third is whether the arguments are swapped
    BM_ternary_bitset_bool/100000/1/1                             2.00 us         2.00 us       363634
    BM_ternary_bitset_bool/100000/1/0                             7.43 us         7.43 us       101700
    BM_ternary_bitset_bool/100000/0/1                             7.28 us         7.28 us        88907
    BM_ternary_bitset_bool/100000/0/0                             2.45 us         2.45 us       307832
    BM_ternary_numeric_dense_col_dense_col/100000                  424 us          424 us         1276
    BM_ternary_numeric_sparse_col_sparse_col/100000               3548 us         3548 us          185
    # Second arg is whether the arguments are swapped
    BM_ternary_numeric_dense_col_sparse_col/100000/1              2555 us         2555 us          258
    BM_ternary_numeric_dense_col_sparse_col/100000/0              2800 us         2800 us          262
    # Second arg is the number of unique strings in each string column, third is whether the columns have the same string pool or not
    BM_ternary_string_dense_col_dense_col/100000/100000/1          438 us          438 us         1534
    BM_ternary_string_dense_col_dense_col/100000/100000/0        16257 us        16258 us           43
    BM_ternary_string_dense_col_dense_col/100000/2/1               441 us          441 us         1603
    BM_ternary_string_dense_col_dense_col/100000/2/0              4219 us         4219 us          186
    BM_ternary_string_sparse_col_sparse_col/100000/100000/1       3854 us         3854 us          191
    BM_ternary_string_sparse_col_sparse_col/100000/100000/0      10753 us        10754 us           67
    BM_ternary_string_sparse_col_sparse_col/100000/2/1            3655 us         3655 us          183
    BM_ternary_string_sparse_col_sparse_col/100000/2/0            4592 us         4592 us          123
    BM_ternary_string_dense_col_sparse_col/100000/100000/1        2957 us         2957 us          236
    BM_ternary_string_dense_col_sparse_col/100000/100000/0       13980 us        13980 us           50
    BM_ternary_string_dense_col_sparse_col/100000/2/1             2967 us         2966 us          237
    BM_ternary_string_dense_col_sparse_col/100000/2/0             5179 us         5179 us          160
    # Second arg  is whether the arguments are swapped
    BM_ternary_numeric_dense_col_val/100000/1                      360 us          359 us         1871
    BM_ternary_numeric_dense_col_val/100000/0                      388 us          388 us         1692
    BM_ternary_numeric_sparse_col_val/100000/1                    2244 us         2244 us          292
    BM_ternary_numeric_sparse_col_val/100000/0                    2385 us         2385 us          283
    # Second arg  is whether the arguments are swapped, third is the number of unique strings in the column
    BM_ternary_string_dense_col_val/100000/1/100000               8259 us         8258 us           82
    BM_ternary_string_dense_col_val/100000/0/100000               7683 us         7683 us           93
    BM_ternary_string_dense_col_val/100000/1/2                    2578 us         2578 us          261
    BM_ternary_string_dense_col_val/100000/0/2                    2385 us         2385 us          297
    BM_ternary_string_sparse_col_val/100000/1/100000              6302 us         6302 us          129
    BM_ternary_string_sparse_col_val/100000/0/100000              5792 us         5792 us          115
    BM_ternary_string_sparse_col_val/100000/1/2                   2903 us         2903 us          249
    BM_ternary_string_sparse_col_val/100000/0/2                   3095 us         3095 us          232
    # Second arg  is whether the arguments are swapped
    BM_ternary_numeric_dense_col_empty/100000/1                   1269 us         1269 us          584
    BM_ternary_numeric_dense_col_empty/100000/0                   1354 us         1354 us          512
    BM_ternary_numeric_sparse_col_empty/100000/1                  1363 us         1363 us          572
    BM_ternary_numeric_sparse_col_empty/100000/0                  1374 us         1374 us          484
    # Second arg  is whether the arguments are swapped, third is the number of unique strings in the column
    BM_ternary_string_dense_col_empty/100000/1/100000             1217 us         1217 us          587
    BM_ternary_string_dense_col_empty/100000/0/100000             1343 us         1343 us          577
    BM_ternary_string_dense_col_empty/100000/1/2                  1287 us         1287 us          574
    BM_ternary_string_dense_col_empty/100000/0/2                  1363 us         1363 us          518
    BM_ternary_string_sparse_col_empty/100000/1/100000            1413 us         1413 us          524
    BM_ternary_string_sparse_col_empty/100000/0/100000            1343 us         1343 us          517
    BM_ternary_string_sparse_col_empty/100000/1/2                 1293 us         1293 us          540
    BM_ternary_string_sparse_col_empty/100000/0/2                 1235 us         1235 us          480
    BM_ternary_numeric_val_val/100000                              368 us          368 us         2039
    BM_ternary_string_val_val/100000                               376 us          376 us         1862
    # Second arg  is whether the arguments are swapped
    BM_ternary_numeric_val_empty/100000/1                         40.7 us         40.7 us        16491
    BM_ternary_numeric_val_empty/100000/0                         36.7 us         36.7 us        17836
    BM_ternary_string_val_empty/100000/1                          40.8 us         40.8 us        17892
    BM_ternary_string_val_empty/100000/0                          58.2 us         58.2 us        13825
    # Second arg is whether the left argument is true or false, third is whether the right argument is true or false
    BM_ternary_bool_bool/100000/1/1                               1.43 us         1.43 us       518204
    BM_ternary_bool_bool/100000/1/0                               1.99 us         1.99 us       378598
    BM_ternary_bool_bool/100000/0/1                               4.52 us         4.52 us       157505
    BM_ternary_bool_bool/100000/0/0                              0.020 us        0.020 us     37060921
    ```

commit 3c059f4d4030dc73594f277d8754918c698a2969
Author: Phoebus Mak <[email protected]>
Date:   Thu May 22 09:50:29 2025 +0100

    Fix gcp lib unreachable after making it read only (#2349)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->
    https://man312219.monday.com/boards/7852509418/pulses/8985074856

    #### What does this implement or fix?
    `create_store_from_lib_config` took protobuf setting only.
    GCP setting is stored natively only, unlike other storages setting.
    So when new store is created with the above function, gcp settings have
    not been passed to the new store. Therefore the SDK will fallback to
    default but incorrect setting and cause errors.

    S3 and GCPXML native settings are given default value to avoid
    uninitiailzied value being used in the test

    #### Any other comments?
    Test in the CI:
    https://github.com/man-group/ArcticDB/actions/runs/15164054821/job/42638155043
    ```
    test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-True]
    [gw0] [ 95%] PASSED tests/integration/arcticdb/version_store/test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-True]
    test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-False]
    [gw0] [ 95%] PASSED tests/integration/arcticdb/version_store/test_symbol_list.py::test_symbol_list_read_only_compaction_needed[real_gcp_store_factory-False]
    ```
    (Other unrelated tests failed in the flaky real storage CI)
    #### Checklist

    <details>
      <summary>
       Checklist for code changes...
      </summary>

    - [ ] Have you updated the relevant docstrings, documentation and
    copyright notice?
    - [ ] Is this contribution tested against [all ArcticDB's
    features](../docs/mkdocs/docs/technical/contributing.md)?
    - [ ] Do all exceptions introduced raise appropriate [error
    messages](https://docs.arcticdb.io/error_messages/)?
     - [ ] Are API changes highlighted in the PR description?
    - [ ] Is the PR labelled as enhancement or bug so it appears in
    autogenerated release notes?
    </details>

    <!--
    Thanks for contributing a Pull Request to ArcticDB! Please ensure you
    have taken a look at:
    - ArcticDB's Code of Conduct:
    https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md
    - ArcticDB's Contribution Licensing:
    https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing
    -->

commit 9d98a4436e376fa1623af92f23153cde5b68a68b
Author: Alex Owens <[email protected]>
Date:   Wed May 21 18:03:28 2025 +0100

    Fix multiindex series (#2363)

    #### What does this implement or fix?
    Fixes roundtripping of multiindexed Series with timestamps as the first
    level and strings as the second level.
    Broken by #2142

    ---------

    Co-authored-by: Alex Owens <[email protected]>

commit c3c7c2ac5d7d98d16305e6914713f03454d30a57
Author: Alex Owens <[email protected]>
Date:   Wed May 21 16:42:34 2025 +0100

    Docs 8975554293: Add concat demo notebook (#2361)

    #### Reference Issues/PRs
    Completes
    [8975554293](https://man312219.monday.com/boards/7852509418/pulses/8975554293)

    #### What does this implement or fix?
    Adds a notebook demonstrating the new `concat` functionality added in
    https://github.com/man-group/ArcticDB/pull/2142

    ---------

    Co-authored-by: Alex Owens <[email protected]>

commit 17ea0e49deba0a3a1b8e6267e9516b14ea34b3ef
Author: grusev <[email protected]>
Date:   Wed May 21 18:31:23 2025 +0300

    Update installation_tests.yml with 5.3 and 5.4 final versions (#2362)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    #### Any other comments?

    Moved 5.2.6 to different timeslot to eliminate the possibility about
    failures being because timeslot. Although a manual execution shows this
    problem with 5.2.6. is most probably persisting
    https://github.com/man-group/ArcticDB/actions/runs/15139549472/job/42559651096)

    Added:
     5.3.4 https://github.com/man-group/ArcticDB/actions/runs/15133764164/
     5.4.1 https://github.com/man-group/ArcticDB/actions/runs/15133923361

    #### Checklist

    <details>
      <summary>
       Checklist for code changes...
      </summary>

    - [ ] Have you updated the relevant docstrings, documentation and
    copyright notice?
    - [ ] Is this contribution tested against [all ArcticDB's
    features](../docs/mkdocs/docs/technical/contributing.md)?
    - [ ] Do all exceptions introduced raise appropriate [error
    messages](https://docs.arcticdb.io/error_messages/)?
     - [ ] Are API changes highlighted in the PR description?
    - [ ] Is the PR labelled as enhancement or bug so it appears in
    autogenerated release notes?
    </details>

    <!--
    Thanks for contributing a Pull Request to ArcticDB! Please ensure you
    have taken a look at:
    - ArcticDB's Code of Conduct:
    https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md
    - ArcticDB's Contribution Licensing:
    https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing
    -->

commit e68ec014683d00f095e4efbe5d72b81b7509299d
Author: Vasil Pashov <[email protected]>
Date:   Wed May 21 11:38:45 2025 +0300

    Temporarily disable tests

commit e3afff2115d4f0038d13a5327a8c7b7779552a99
Merge: bdbc17028 424cd56e2
Author: Vasil Pashov <[email protected]>
Date:   Wed May 21 11:17:22 2025 +0300

    Merge branch 'master' into vasil.pashov/coverity-test-existing-file

commit 424cd56e295afafd64444420b92fcf89a82dd1ea
Author: grusev <[email protected]>
Date:   Tue May 20 11:09:42 2025 +0300

    Schedule S3 tests and fix STS to run only against AWS S3 (#2356)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    Scheduled for now to run twice a week.

    Also contains a couple of other workflow fixes:
    - Seeding tests were not executed previously because the workflow
    parameter for GCP tests changed from boolean to choice. Seeding tests are
    now executed.
    - STS role creation was executed for GCP tests, which was unnecessary.
    It now runs only for AWS S3.
    - Cleaning of persistent tests had a problem with the context, resulting
    in a crash because storage_tests.py could not be loaded. This is fixed
    now so that mark.py loads properly in different contexts.

    Results:
    https://github.com/man-group/ArcticDB/actions/runs/15061574677/job/42337724260
    (NOTE: the failures in the above run are because this PR:
    https://github.com/man-group/ArcticDB/pull/2353 is not part of the current
    one. Once it gets merged, the S3 tests will run without problems.)

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit a158b0c2e684c9389691744c001192ce94ddc79d
Author: Alex Owens <[email protected]>
Date:   Mon May 19 13:28:51 2025 +0100

    Bugfix 9123099670: fix resampling of old updated data (#2351)

    #### Reference Issues/PRs
    Fixes
    [9123099670](https://man312219.monday.com/boards/7852509418/views/168855452/pulses/9123099670)

    #### What does this implement or fix?
    Fixes three separate resampling bugs:

    1. Old versions of `update` (changed sometime between `4.1.0` and
    `4.4.0`, I haven't pinned down exactly where) had a behaviour in which
    the `end_index` value in the data key of the segment overlapping with
    the start of the date range provided to the `update` call was set to the
    first value of the date range in the `update` call. For all other
    modification methods, this is set to 1 nanosecond larger than the last
    index value in the contained segment. Resampling assumed this to be the
    case, and had an assertion verifying it. Relaxing this assertion is
    sufficient to fix the issue.
    2. Providing a `date_range` argument to a resample where the provided
    date range did not overlap with the time range covered by the index of
    the symbol led to trying to reserve a vector with a negative size. This
    now correctly returns an empty result (see the sketch after this list).
    3. Previously, checks that a symbol being resampled had a timestamp
    index occurred after some operations which also require this to be true,
    which could lead to the same vector reserve issue above. It is now
    checked in advance, and a suitable exception raised.
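
    A minimal sketch of the scenario in point 2, using the lazy read +
    resample API; the LMDB URI, symbol name, and the `lazy=True` read flag
    are illustrative assumptions, not taken from this PR.

    ```python
    import pandas as pd
    from arcticdb import Arctic

    lib = Arctic("lmdb:///tmp/adb_resample_demo").get_library("demo", create_if_missing=True)
    lib.write("sym", pd.DataFrame({"col": [1.0, 2.0]}, index=pd.date_range("2025-01-01", periods=2)))

    # The date_range lies entirely after the data; previously this path tried
    # to reserve a negative-sized vector, with the fix it returns an empty result.
    lazy_df = lib.read("sym", date_range=(pd.Timestamp("2026-01-01"), pd.Timestamp("2026-02-01")), lazy=True)
    res = lazy_df.resample("1min").agg({"col": "mean"}).collect()
    assert len(res.data) == 0
    ```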

commit 9edc74a89102b4ab66fbd7911a31322425dfcacc
Author: grusev <[email protected]>
Date:   Mon May 19 12:54:07 2025 +0300

    nfs backed tests for v1 API (#2350)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    The arctic_* fixtures for the v2 API are already covered by NFS-backed S3
    tests. What is needed now is to also add tests for the v1 API fixtures (a
    usage sketch follows the fixture lists below).

    New Fixtures:

    nfs_backed_s3_store_factory
    nfs_backed_s3_version_store_v1
    nfs_backed_s3_version_store_v2
    nfs_backed_s3_version_store_dynamic_schema_v1
    nfs_backed_s3_version_store_dynamic_schema_v2
    nfs_backed_s3_version_store

    Added to:

    object_store_factory
      s3_store_factory -> nfs_backed_s3_store_factory
    object_and_mem_and_lmdb_version_store
      s3_version_store_v1 -> nfs_backed_s3_version_store_v1
      s3_version_store_v2 -> nfs_backed_s3_version_store_v2
    object_and_mem_and_lmdb_version_store_dynamic_schema
      s3_version_store_dynamic_schema_v1 -> nfs_backed_s3_version_store_dynamic_schema_v1
      s3_version_store_dynamic_schema_v2 -> nfs_backed_s3_version_store_dynamic_schema_v2
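
    A minimal sketch of how one of the new fixtures might be used from a
    v1-API test; the test body and dataframe are illustrative, only the
    fixture name comes from the list above.

    ```python
    import pandas as pd


    def test_nfs_backed_roundtrip(nfs_backed_s3_version_store_v1):
        # The fixture yields a v1 (NativeVersionStore-style) library backed by
        # the NFS-backed S3 storage; write/read roundtrip as in the existing s3 tests.
        lib = nfs_backed_s3_version_store_v1
        df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.date_range("2025-01-01", periods=3))
        lib.write("sym", df)
        pd.testing.assert_frame_equal(lib.read("sym").data, df)
    ```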

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 67d2bbe530f96a0aa5412f479e123da480ba2d99
Author: Alex Owens <[email protected]>
Date:   Fri May 16 15:20:37 2025 +0100

    Enhancement 8277989680: symbol concatenation poc (#2142)

    #### Reference Issues/PRs
    8277989680

    #### What does this implement or fix?
    Implements symbol concatenation. Inner and outer joins over columns both
    supported. Expected usage:
    ```
    # Read requests can contain the usual as_of, date_range, columns, etc. arguments
    lazy_dfs = lib.read_batch([read_request_1, read_request_2, ...])
    # Potentially apply some processing to all or individual constituent lazy dataframes here, that will be applied before the join
    lazy_dfs = lazy_dfs[lazy_dfs["col"].notnull()]
    # Join here
    lazy_df = adb.concat(lazy_dfs)
    # Perform more processing if desired
    lazy_df = lazy_df.resample("15min").agg({"col": "mean"})
    # Collect result
    res = lazy_df.collect()
    # res contains a list of VersionedItems from the constituent symbols that went into the join with data=None, and a data member with the joined Series/DataFrame
    ```
    See `test_symbol_concatenation.py` for thorough examples of how the API
    works.
    For outer joins, if a column is not present in one of the input symbols,
    then the same type-specific behaviour as used for dynamic schema is used
    to backfill the missing values (a small illustration of this follows the
    list below).
    Not all symbols can be concatenated together. The following will throw
    exceptions if concatenation is attempted:

    - a Series with a DataFrame
    - Different index types, including multiindexes with different numbers
    of levels
    - Incompatible column types, e.g. if `col` has type `INT64` in one
    symbol and is a string column in another symbol. This only applies if
    the column would be in the result, which is always the case for all
    columns with an outer join, but may not always be for inner joins.
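
    A small, pandas-only illustration of the outer-join backfill described
    above; the inputs and the backfill value shown (NaN for a float column)
    are illustrative of the dynamic-schema rules, not output captured from
    this PR.

    ```python
    import numpy as np
    import pandas as pd

    a = pd.DataFrame({"x": [1, 2]}, index=pd.date_range("2025-01-01", periods=2))
    b = pd.DataFrame({"x": [3], "y": [1.5]}, index=pd.date_range("2025-01-03", periods=1))

    # Outer-joining a and b over columns: "y" is absent from a, so its first
    # two rows are backfilled with the type-specific default (NaN for floats here).
    expected = pd.DataFrame(
        {"x": [1, 2, 3], "y": [np.nan, np.nan, 1.5]},
        index=pd.DatetimeIndex(["2025-01-01", "2025-01-02", "2025-01-03"]),
    )
    ```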

    Where possible, the implementation is permissive about what can be joined,
    producing an output that is as sensible as possible:

    - Joining two or more Series with different names that are otherwise
    compatible will produce a Series with no name
    - Joining two or more timeseries where the indexes have different names
    will produce a timeseries with an unnamed index
    - Joining two or more timeseries where the indexes have different
    timezones will produce a timeseries with a UTC index
    - Joining two or more multiindexed Series/DataFrames where the levels
    have compatible types but different names will produce a multiindexed
    Series/DataFrame with unnamed levels where they differed between some of
    the inputs.
    - Joining two or more Series/DataFrames that all have a `RangeIndex` is
    supported. If the index `step` does not match between all of the inputs,
    then the output will have a `RangeIndex` with `start=0` and `step=1`.
    **This is different behaviour to Pandas, which converts to an Int64 index
    in this case. For this reason, a warning is logged when this happens.**

    The only known major limitation is that all of the symbols being joined
    together (after any pre-join processing) must fit into memory. Relaxing
    this constraint would require much more sophisticated query planning
    than we currently support, in which all of the clauses both for
    individual symbols pre-join, the join, and any post-join clauses, are
    all taken into account when scheduling both IO and individual processing
    tasks.

commit c1c7a8cff3193dcf4aefee268cd3feea01c68bd9
Author: grusev <[email protected]>
Date:   Fri May 16 13:55:12 2025 +0300

    Patch for Real S3 library names (#2353)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    Currently we create library names which are too long for real S3. This
    is a patch for the tests until the real bug is addressed.

    Manually triggered run:
    https://github.com/man-group/ArcticDB/actions/runs/15013824867

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit bb65a85ab82dd7fec5297b258956545f8b4adea7
Author: Alex Owens <[email protected]>
Date:   Fri May 16 11:41:18 2025 +0100

    Add resolve_defaults back in as a static method of NativeVersionStore (#2358)

    #### Reference Issues/PRs
    Was removed in #2345 , but is needed at least by some internal tests,
    and technically constitutes an API break (although we don't expect
    anybody to be using it)

commit e78758a7fe5fbb02085dcfae01218903d6dad6d9
Author: grusev <[email protected]>
Date:   Fri May 16 13:25:24 2025 +0300

    Installation Tests Workflow Fixes (#2354)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    A failure when the job is triggered on schedule is fixed - the string
    contained extra single quotes. The order of 2 steps is also changed for
    the scheduling-specific use case.

    Changes to workflow dispatch are implemented to simplify execution and
    leave some parts for later enhancement - i.e. the selection of an exact
    os-python-repo combination, which actually needs a single flow of steps
    rather than a matrix.

    S3 tests are also enabled to run along with the LMDB tests by default.

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 9e544da9d823c3a4e76b256b741925af52a20742
Author: grusev <[email protected]>
Date:   Tue May 13 13:45:53 2025 +0300

    Installation tests v4 (#2339)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    Successful execution 5.2.6:
    https://github.com/man-group/ArcticDB/actions/runs/14641126753/job/41083591802
    5.1.2: https://github.com/man-group/ArcticDB/actions/runs/14637571996
    4.5.1:
    https://github.com/man-group/ArcticDB/actions/runs/14639124835/job/41077126258
    1.6.2:
    https://github.com/man-group/ArcticDB/actions/runs/14701046721/job/41250511273

    The PR contains a workflow definition to execute tests against installed
    arcticdb. It is a combination of the approaches in:

    https://github.com/man-group/ArcticDB/pull/2330
    https://github.com/man-group/ArcticDB/pull/2316

    Installation tests now live in a separate folder
    (python/installation_tests), not as part of tests. They have their own
    fixtures, making them independent from the rest of the code base.

    The tests are a direct copy of the originals, with one modified to use
    the v2 API. If the API changes, each test in the installation set can be
    adapted. As the tests run very fast there is no need to use simulators,
    so they use real S3 storage directly.

    The tests are executed by a workflow.

    Currently each test is executed against LMDB and real S3. A moto-simulated
    version is not available at the moment due to tight coupling with
    protobufs, which differ for each version, as well as tight coupling with
    the whole existing test code.

    The workflow has 2 triggers:

    - manual trigger - allowing tests to be executed manually on demand
    - on schedule - the scheduled execution is overnight. Each arcticdb
    version's tests are executed within a 1hr offset from the others, because
    executing them all at once is likely to generate errors with the real
    storages.

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>
    Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

commit 2612fb45f15350dc483ddde1c8d43c2d6a02731b
Author: grusev <[email protected]>
Date:   Mon May 12 15:39:20 2025 +0300

    Asv v2 s3 tests (Refactored) (#2249)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    Contains refactored framework for setting up shared storages + tests for
    AWS S3 storage

    Merged 3 Prs into one:
      - https://github.com/man-group/ArcticDB/pull/2185
      - https://github.com/man-group/ArcticDB/pull/2227
      - https://github.com/man-group/ArcticDB/pull/2204

    Important: the benchmark tests further down in this PR cannot run
    successfully, so do not take them as a criterion. All tests need to be run
    manually. Here are runs from 27 March:
    LMDB set:
    https://github.com/man-group/ArcticDB/actions/runs/14100376040/job/39495398374
    Real set:
    https://github.com/man-group/ArcticDB/actions/runs/14100497273/job/39495728734

    #### What does this implement or fix?

    #### Any other comments?

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 3c2fe145cad45797356a4ec5fbd42e4dac57681a
Author: William Dealtry <[email protected]>
Date:   Mon May 12 09:57:15 2025 +0100

    size_t size in MacOS

commit bb54de8879ab57c37093a62c5282e405fc9a834b
Author: William Dealtry <[email protected]>
Date:   Mon May 12 09:03:04 2025 +0100

    resolve defaults is a free function

commit e973f8dbd898aedc747bc232e022c9a1137d882c
Author: willdealtry <[email protected]>
Date:   Wed Apr 16 14:49:46 2025 +0100

    Fix up file operations

commit af1a171eab284902db4333946b732de7d9ec2b18
Author: Phoebus Mak <[email protected]>
Date:   Mon May 12 10:00:32 2025 +0100

    Disable s3 checksumming (#2337)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->
    https://github.com/man-group/ArcticDB/issues/2251
    #### What does this implement or fix?
    Disable s3 checksumming by setting an environment variable in the wheel.

    #### Any other comments?
    This will also unblock the upgrade of `aws-sdk-cpp` on vcpkg.
    The upgrade will not be made in this PR.

    One of the newly added tests needs to be skipped, as the `conda` CI has
    `aws-sdk-cpp` pinned at a non-s3-checksumming version due to the
    `libarrow` pin.
    `environment-dev.yml` doesn't align with its counterpart in the
    feedstock. Therefore the new version of `aws-sdk-cpp` is only used in the
    feedstock, and thus in the release wheel, but not in the local and CI
    builds here. This will be addressed in a separate ticket.

    A
    [commit](https://github.com/man-group/ArcticDB/pull/2337/commits/245a02cd455e39fb8f976301ccd5409e6ae88b13)
    removes the `libarrow` pin so that a more up-to-date `aws-sdk-cpp`, which
    supports s3 checksumming, is used in conda. It is for verifying the change
    with the newly added test. The
    [test](https://github.com/man-group/ArcticDB/actions/runs/14732394443/job/41349695905)
    is successful.

commit b808afac25bed84595b874f28b6b3ce2407fbd0c
Author: grusev <[email protected]>
Date:   Fri May 9 15:46:17 2025 +0300

    Delete STS roles regularly  (#2344)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    Due to the limit on the number of STS roles, we should regularly clean up
    roles that failed to be deleted. The PR contains a scheduled job that does
    this every Saturday. The python script can also be executed at any time
    and will delete only roles created prior to today, leaving all currently
    running jobs unaffected.

    As roles cannot be guaranteed to be cleaned up after test execution due to
    many factors, we should remove them on a regular basis, and perhaps this
    is the quickest and most reliable approach.
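
    A minimal sketch of the cleanup idea (not the actual script in this PR);
    the role-name prefix is a hypothetical filter, and it assumes the stale
    roles have no attached policies left to detach.

    ```python
    from datetime import datetime, timezone

    import boto3


    def delete_stale_roles(prefix: str = "arcticdb_test_") -> None:
        iam = boto3.client("iam")
        today = datetime.now(timezone.utc).date()
        for page in iam.get_paginator("list_roles").paginate():
            for role in page["Roles"]:
                # Only roles created before today are removed, so roles used by
                # currently running jobs are left untouched.
                if role["RoleName"].startswith(prefix) and role["CreateDate"].date() < today:
                    iam.delete_role(RoleName=role["RoleName"])
    ```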

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 0136f4ca52559e0640dc1b7518d6a8b0773ed3a8
Author: Ognyan Stoimenov <[email protected]>
Date:   Fri May 9 14:36:54 2025 +0300

    Fix permissions for the automatic docs building (#2347)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    Fixes failures when building the docs automatically on release like:
    https://github.com/man-group/ArcticDB/actions/runs/14832306883
    #### Any other comments?

commit 652d968561d473599e90508078005c4fd00a1ba4
Author: Phoebus Mak <[email protected]>
Date:   Sat May 3 02:03:44 2025 +0100

    Query Stat framework v3 (#2304)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    New query stat implementation whose schema is static.
    The feature of linking arcticdb API calls to storage operations has been
    dropped. Now only storage operation stats will be logged. Since the
    schema of the stats is hardcoded and only the summation of stats is
    logged, one static object with numerous atomic ints is enough to do
    the job.
    No fancy map nor modification of the folly executor.

    #### Any other comments?
    Sample output:
    ```
    { // Stats
            "SYMBOL_LIST":  // std::array<std::array<OpStats, NUMBER_OF_TASK_TYPES>, NUMBER_OF_KEYS>
             {
                "storage_ops": {
                    "S3_ListObjectsV2":
                    { // OpStats
                        "result_count": 1,
                        "total_time_ms": 34
                    }
                }
            }
        }
    ```
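
    A small sketch of consuming the sample payload above from Python; only the
    nesting shown in the sample ("storage_ops" -> operation -> counters) is
    assumed, nothing beyond it.

    ```python
    def total_time_ms(stats: dict) -> int:
        # Sum total_time_ms over every key type and every storage operation.
        return sum(
            counters.get("total_time_ms", 0)
            for key_type_stats in stats.values()
            for counters in key_type_stats.get("storage_ops", {}).values()
        )


    sample = {
        "SYMBOL_LIST": {
            "storage_ops": {
                "S3_ListObjectsV2": {"result_count": 1, "total_time_ms": 34},
            }
        }
    }
    assert total_time_ms(sample) == 34
    ```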

commit 9b93303adf8d5c436ae267be4d950fc5e55139de
Author: Vasil Danielov Pashov <[email protected]>
Date:   Fri May 2 17:29:18 2025 +0300

    Hold the GIL when incrementing None's refcount to prevent race conditions when there are multiple Python threads (#2334)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->
    None is a global static object in Python which is also refcounted. When
    ArcticDB creates `None` objects it must increase their refcount, and it
    must hold the GIL while the refcount is increased. Currently we don't
    acquire the GIL when we do this; we only hold a SpinLock protecting
    other ArcticDB threads from racing on the refcount. With this change
    we add an atomic variable to the PythonHandler data which will
    accumulate the refcount. Then at the end of the operation, when we
    reacquire the GIL, we increase the refcount. The same is done for
    the NaN refcount; note that we don't really need the GIL to increase
    NaN's refcount as we create it internally and don't hand it to Python
    until the read operation is done. Currently only read operations need to
    work with the `None` object.
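
    A conceptual Python analogue of the accumulate-then-apply approach
    described above (not ArcticDB code): workers record the increments they
    owe in their own slots, and the total is applied once at the end under the
    lock standing in for the GIL.

    ```python
    import threading

    owed_increfs = [0] * 4                      # one slot per worker thread
    gil_stand_in = threading.Lock()


    def worker(slot: int, n_nones: int) -> None:
        # Runs without holding the "GIL": only touches its own slot.
        owed_increfs[slot] += n_nones


    threads = [threading.Thread(target=worker, args=(i, 1000)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with gil_stand_in:                          # reacquired once, at the very end
        total_increfs = sum(owed_increfs)       # the 4000 increments applied in one step
    ```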

    `apply_global_refcounts` must be called at the very end, before passing
    the dataframe to python, to prevent something raising an exception
    after the refcount is applied but before python receives the data.
    Increasing None's refcount but never decreasing it doesn't seem to be
    fatal, but we're trying to be good citizens. The best place for that is
    `adapt_read_df` or `adapt_read_dfs` as they are called at the end of all
    read functions. The code is changed so that the type handler data is
    always created in the python bindings file, as it's easier to track.
    #### What does this implement or fix?

    #### Any other comments?

    ---------

    Co-authored-by: Vasil Pashov <[email protected]>

commit d4b40e287863960d608d52131471a88a435bf844
Author: Phoebus Mak <[email protected]>
Date:   Fri May 2 11:13:30 2025 +0100

    Update docs for sts ca issue (#2265)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    Clarify when the workaround for the STS CA issue is needed.

    #### Any other comments?

commit a9d0e41e47c40a34e2e146a4297b5c638375fe85
Author: Phoebus Mak <[email protected]>
Date:   Tue Apr 29 17:44:08 2025 +0100

    Skip azurite api check (#2288)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    The api check in Azurite has brought pain to local tests, as the azurite
    version needs to keep up with the SDK version. We are only using a very
    simple API, so it is safe to skip the check.

    #### Any other comments?

commit 550d3e7c29a5f9d67a0e993bbabc1cbf88295ef1
Author: grusev <[email protected]>
Date:   Thu Apr 24 17:45:21 2025 +0300

    initial version fix for GCP (#2326)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    #### Any other comments?

    ---------

    Co-authored-by: Georgi Rusev <Georgi Rusev>

commit 41a2086963e018ffe0ac90e6fea72d3577d463f3
Author: Alex Owens <[email protected]>
Date:   Wed Apr 23 12:31:26 2025 +0100

    Timeseries defrag function (#2319)

    #### What does this implement or fix?
    Adds a (private) function to defragment timeseries data. See the big list
    of caveats in the code comments for limitations.

commit 61b00e99ce7861a0fd767572be0d58600c065b53
Author: Vasil Danielov Pashov <[email protected]>
Date:   Thu Apr 17 16:04:41 2025 +0300

    Fix race conditions on the None object refcount during a multithreaded read (#2320)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    **Bugfix**
    Columns are handled in multiple threads during read calls. String
    columns can contain `None` values. `None` is a global static ref counted
    object and the refcount is not atomic. When ArcticDB places `None`
    objects in columns it must increment the refcount. Currently None
    objects are allocated only via type handlers. ArcticDB has a global
    spin-lock that is shared by all type-handlers. The bug is caused by
    [this
    line](https://github.com/man-group/ArcticDB/blob/300e121e1be47ecfbabba78f077851a9c3b0772c/cpp/arcticdb/python/python_utils.hpp#L117):
    the spin-lock is wrapped in a `std::lock_guard`, but there is also a
    manual call to `unlock`. When `unlock` is called another thread will take
    the lock and start calling `Py_INCREF(Py_None)`, but when the function
    exits the `std::lock_guard` will call unlock again, allowing yet another
    thread to start calling `Py_INCREF(Py_None)` in parallel.
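
    A Python sketch of the shape of the bug (not the C++ code): the lock is
    held by a context manager, but a stray manual release inside the block
    means the exit path releases a second time. Python fails loudly; the C++
    spin-lock instead silently let two threads call `Py_INCREF(Py_None)`
    concurrently.

    ```python
    import threading

    lock = threading.Lock()                     # plays the role of the shared spin-lock


    def increment_refcount_section() -> None:
        with lock:                              # plays the role of std::lock_guard
            pass                                # ... the refcount bump would happen here ...
            lock.release()                      # the stray manual unlock
        # Leaving the with-block releases again -> "release unlocked lock".


    try:
        increment_refcount_section()
    except RuntimeError as exc:
        print(exc)
    ```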

    **Refactoring**
    - Remove GIL-safe py none. It was created because pybind11 wraps
    `Py_None` in an object and calls `Py_INCREF(Py_None)`, and we must hold
    the GIL when incrementing the refcount. The wrapper we had was used
    only to get the pointer to the `Py_None` object. We don't need pybind11
    to do that: using the C API we can directly get `Py_None`, which is a
    global object.
    - Add a function to check if a python object is `None`.
    - Remove uses of py::none{} in places where we don't hold the GIL (most
    of those were just to get the `Py_None` object that's inside `py::none`).

    #### Any other comments?

    ---------

    Co-authored-by: Vasil Pashov <[email protected]>

commit 396757028afbd460fd6325fd2403636ed8482d56
Author: Julien Jerphanion <[email protected]>
Date:   Thu Apr 17 11:39:55 2025 +0200

    Support MSVC 19.29 (#2332)

    Signed-off-by: Julien Jerphanion <[email protected]>

commit b89fc53dbd7cd1eee783fed1fba7b401d69b6ffd
Author: Georgi Petrov <[email protected]>
Date:   Wed Apr 16 15:35:56 2025 +0300

    Increase tolerance to arithmetic mismatches with Pandas with floats (#2333)

    #### Reference Issues/PRs

    https://github.com/man-group/ArcticDB/actions/runs/14487537861/job/40636907727?pr=2331

    #### What does this implement or fix?
    To resolve this type of flakiness:

    ``` python
    FAILED tests/hypothesis/arcticdb/test_resample.py::test_resample - AssertionError: Series are different

    Series values are different (100.0 %)
    [index]: [1969-12-31T23:59:01.000000000]
    [left]:  [-1706666.6666666667]
    [right]: [-1706325.3333333333]
    At positional index 0, first diff: -1706666.6666666667 != -1706325.3333333333
    Falsifying example: test_resample(
        df=
                                           col_float              col_int  col_uint
            1970-01-01 00:00:00.000000000        0.0  9223372036849590785         0
            1970-01-01 00:00:00.000000001        0.0                  512         0
            1970-01-01 00:00:00.000000002        0.0 -9223372036854710785         0
        ,
        rule='1min',
        origin='start',
        offset='1s',
    )

    You can reproduce this example by temporarily adding @reproduce_failure('6.72.4', b'AXicY2RgYGQAYxCCUEwMyAAkzVD/Hwg2PGIEq2ACqgASjBDR/0yMMFUwAAB9FAui') as a decorator on your test case
    ```
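
    A sketch of the tolerance-based comparison that removes this kind of float
    flakiness; the rtol value is illustrative, the actual tolerance chosen in
    this PR is not assumed.

    ```python
    import pandas as pd

    left = pd.Series([-1706666.6666666667])
    right = pd.Series([-1706325.3333333333])

    # Exact comparison fails as in the hypothesis run above; a relative
    # tolerance absorbs float accumulation-order differences between Pandas
    # and ArcticDB.
    pd.testing.assert_series_equal(left, right, check_exact=False, rtol=1e-3)
    ```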

    #### Any other comments?
    A similar fix was done here:
    https://github.com/man-group/ArcticDB/commit/fe9de294580526e921102fbdedda736f20596fc7

commit 30f4c48db0d742898f629d129b5d1caa83091662
Author: Alex Seaton <[email protected]>
Date:   Wed Apr 16 13:08:30 2025 +0100

    Symbol sizes API (#2266)

    Add Python APIs to get sizes of symbols, in a new `AdminTools` class.
    Add documentation for this feature to our website.

    You can access the new tools with:

    ```
    lib: Library
    lib.admin_tools(): AdminTools
    ```
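
    A hedged usage sketch expanding the snippet above; `get_sizes` and the
    shape of its result are assumptions about the new `AdminTools` surface
    rather than confirmed signatures from this PR.

    ```python
    from arcticdb import Arctic

    lib = Arctic("lmdb:///tmp/adb_admin_demo").get_library("demo", create_if_missing=True)
    admin = lib.admin_tools()                   # entry point added by this PR
    sizes = admin.get_sizes()                   # hypothetical: sizes broken down per key type
    for key_type, size in sizes.items():
        print(key_type, size)
    ```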

    Refactor the existing symbol scanning APIs to a visitor pattern so they
    can all share as much of the implementation as possible.

    Monday: 8560764974

commit 6b3c593924808d33a39e275f921f613f77139d06
Author: Georgi Petrov <[email protected]>
Date:   Wed Apr 16 14:32:57 2025 +0300

    Prevent exceptions in ReliableStorageLockGuard destructor (#2331)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    Sometimes when trying to release the lock, exceptions can occur (either
    storage related or others).
    This PR catches all exceptions, mainly to prevent unnecessary seg faults
    in enterprise.

    #### Any other comments?

commit aa585fc0a5ae60f61f1752d78614e0951047d21e
Author: Julien Jerphanion <[email protected]>
Date:   Wed Apr 16 10:10:11 2025 +0200

    conda-build: Extend development environment for Windows (#2328)

    #### Reference Issues/PRs

    Extracted from https://github.com/man-group/ArcticDB/pull/2252.

    #### What does this implement or fix?

    #### Any other comments?

    Signed-off-by: Julien Jerphanion <[email protected]>

commit 42091dbe1ea4b7b827cad4f53b2ef099eb43b4fb
Author: Ognyan Stoimenov <[email protected]>
Date:   Tue Apr 15 18:13:47 2025 +0300

    Fix pr getting action (#2323)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    https://github.com/VanOns/get-merged-pull-requests-action was updated to
    fix some issues but changed its API.
    * Accommodate the new API
    * Remove previous workaround (now fixed)
    * Pin the action to 1.3.0 so no such breaks happen in the future
    * The changelog generator was not skipping release candidates when
    comparing versions. Fixed now
    * Fix docs building permission

    #### Any other comments?

commit 311c1bf8099a491bf1dd85c09e83d640f9d6ce74
Author: Julien Jerphanion <[email protected]>
Date:   Tue Apr 15 17:13:05 2025 +0200

    ci: Benchmark workflow adaptations (#2327)

    #### Reference Issues/PRs

    #### What does this implement or fix?

    Fixes the import error, working around
    https://github.com/airspeed-velocity/asv/issues/1465.

    #### Any other comments?

    Signed-off-by: Julien Jerphanion <[email protected]>

commit 7b37536b67b8410d2d890b8ee8bf38b05181aa61
Author: Vasil Danielov Pashov <[email protected]>
Date:   Tue Apr 15 11:25:03 2025 +0300

    Refactor to_atom and to_ref to properly use forwarding references (#2321)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?
    This solves two problems:
    - Code duplication. to_atom had 3 overloads (value/ref/rvalue ref) for
    the same thing. Forwarding references were invented to solve this
    problem.
    - There were unnecessary copies. `to_atom` had an overload taking
    `VariantKey` by value. At some point some APIs changed and started
    returning `AtomKey` instead of `VariantKey`, and due to the excessive
    use of `auto` nobody noticed the difference. Thus we ended up calling
    `to_atom` on an atom key; that worked because `VariantKey` can be
    constructed from an `AtomKey` implicitly, so we ended up constructing a
    `VariantKey` from an `AtomKey` only to extract the `AtomKey` from it.
    Forwarding references do not allow implicit conversions, so the
    compiler pointed out all the places in the code where the above happens.
    #### Any other comments?

commit 300e121e1be47ecfbabba78f077851a9c3b0772c
Author: grusev <[email protected]>
Date:   Fri Apr 11 14:07:36 2025 +0300

    Update s3.py moto*.create_fixture - add retry attempts (#2311)

    #### Reference Issues/PRs
    <!--Example: Fixes #1234. See also #3456.-->

    #### What does this implement or fix?

    Addresses a couple of flaky tests opened due to NFS or S3…