Refactor CI-Lib Implementation to Support Litmus 3.0 #198

Open: SkySingh04 wants to merge 41 commits into master from refactor_to_3.0

@SkySingh04 SkySingh04 commented Apr 13, 2025

This pull request refactors the Chaos CI Library to work with the Litmus 3.0 approach, shifting from direct Kubernetes API calls to utilizing the Litmus Go SDK.

Changes

Code Modernization

  • Updated Go version from 1.14 to 1.24.0
  • Upgraded and cleaned up dependencies including k8s.io/api, k8s.io/apimachinery, and ginkgo
  • Added github.com/litmuschaos/litmus-go-sdk as a core dependency

Architecture Changes

  • Replaced direct Kubernetes and Litmus client initialization with the Litmus SDK client
  • Updated ClientSet generation to work with the SDK using environment variables
  • Added proper error handling for missing configuration and initialization failures
  • Added NetworkPacketCorruptionPercentage field to the ExperimentDetails struct
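
The environment-driven client setup with explicit errors for missing configuration could look roughly like the sketch below. The `SDKConfig` type, the `LoadSDKConfig` function, and the `LITMUS_*` variable names are assumptions made for illustration; they are not taken from the PR or from the Litmus Go SDK, and the real client constructor is deliberately left out.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// SDKConfig holds the connection settings an SDK client would need.
// The field and variable names are illustrative assumptions.
type SDKConfig struct {
	Endpoint string
	Username string
	Password string
}

// LoadSDKConfig reads the settings through getenv and reports every
// missing variable in a single error instead of failing on the first one.
func LoadSDKConfig(getenv func(string) string) (SDKConfig, error) {
	cfg := SDKConfig{
		Endpoint: getenv("LITMUS_ENDPOINT"),
		Username: getenv("LITMUS_USERNAME"),
		Password: getenv("LITMUS_PASSWORD"),
	}
	required := []struct{ name, value string }{
		{"LITMUS_ENDPOINT", cfg.Endpoint},
		{"LITMUS_USERNAME", cfg.Username},
		{"LITMUS_PASSWORD", cfg.Password},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return SDKConfig{}, fmt.Errorf("missing required configuration: %s", strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadSDKConfig(os.Getenv)
	if err != nil {
		fmt.Println("client initialization failed:", err)
		return
	}
	// The actual SDK client would be constructed here from cfg.
	fmt.Println("would initialize SDK client for", cfg.Endpoint)
}
```

Passing `getenv` as a parameter keeps the function testable without touching the process environment.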

Test Suite Refactoring

  • Completely refactored all experiment test files to use an SDK-based approach:
    • node-io-stress_test.go
    • node-memory-hog_test.go
    • pod-autoscaler_test.go
    • pod-network-corruption_test.go
    • pod-network-duplication_test.go
    • pod-network-loss_test.go
    • pod-memory-hog_test.go
    • pod-network-latency_test.go
  • Updated experiment template retrieval to use external hub sources
  • Implemented consistent structure across all test files with connection, execution, and cleanup phases

Functionality Improvements

  • Increased default timeout from 180 to 300 seconds for better reliability
  • Refactored status checking functions to use the SDK's GetRunPhase method
  • Improved experiment state monitoring with enhanced logging
  • Introduced helper functions for retrieving experiment and application statuses
  • Updated all status-checking functions to align with the new status retrieval methods

The refactoring maintains the same functionality while leveraging the more robust and maintainable SDK approach, reducing direct dependency on the Kubernetes API.

@SkySingh04 changed the title from "feat : Refactor CI-Lib implementation to 3.0 approach" to "WIP : Refactor CI-Lib Implementation to Support Litmus 3.0" on Apr 13, 2025
@SkySingh04 force-pushed the refactor_to_3.0 branch 3 times, most recently from dcff924 to 0a398d4, on April 16, 2025 21:57
@SkySingh04 changed the title from "WIP : Refactor CI-Lib Implementation to Support Litmus 3.0" to "Refactor CI-Lib Implementation to Support Litmus 3.0" on Apr 18, 2025
….mod and go.sum

- Bump Go version to 1.24.0 and toolchain to 1.24.1
- Update dependencies for litmuschaos components and testing libraries
- Add new indirect dependencies for improved functionality and compatibility
- Remove outdated versions and clean up go.sum

Signed-off-by: [Your Name] [Your Email]
Signed-off-by: Sky Singh <[email protected]>
…ate go.mod/go.sum

- Introduce github.com/imdario/mergo v0.3.9 and github.com/spf13/pflag v1.0.5 as indirect dependencies
- Enhance ClientSets structure by adding SDKClient for Litmus SDK integration
- Implement GenerateClientSetFromSDK method to initialize Litmus SDK client with environment variables

Signed-off-by: [Your Name] [Your Email]
Signed-off-by: Sky Singh <[email protected]>
…ture connection

- Add support for connecting to ChaosCenter infrastructure via SDK in the pod-delete experiment.
- Introduce new environment variables for infrastructure configuration.
- Refactor test setup to include infrastructure connection checks and cleanup.
- Update go.mod to include the necessary Litmus SDK dependency.

Signed-off-by: [Your Name] [Your Email]
Signed-off-by: Sky Singh <[email protected]>
…K for experiment execution

- Integrate the use of the Litmus SDK for creating and running the pod-delete experiment.
- Refactor test setup to include infrastructure connection and experiment request construction.
- Add new helper function to construct experiment requests with proper manifest YAML.
- Update go.mod and go.sum to include the latest version of github.com/google/uuid.

Signed-off-by: Sky Singh <[email protected]>
…n to fetch template from external source

- Introduce HTTP fetching of the engine template for the pod-delete experiment.
- Update the manifest construction logic to modify the fetched YAML template with experiment details.
- Replace the placeholder manifest with a dynamic template that adapts based on input parameters.
- Enhance error handling for template fetching and parsing.

Signed-off-by: Sky Singh <[email protected]>
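
A minimal sketch of fetching a template over HTTP with the kind of error handling this commit describes. `FetchTemplate` and the placeholder URL are hypothetical; the real hub location and function names are not shown in this conversation.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// httpClient uses an explicit timeout so a slow hub cannot hang the test run.
var httpClient = &http.Client{Timeout: 10 * time.Second}

// FetchTemplate downloads a manifest template and returns its raw bytes.
// It fails on transport errors, non-200 responses, and empty bodies.
func FetchTemplate(url string) ([]byte, error) {
	resp, err := httpClient.Get(url)
	if err != nil {
		return nil, fmt.Errorf("fetching template %q: %w", url, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching template %q: unexpected status %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("reading template %q: %w", url, err)
	}
	if len(body) == 0 {
		return nil, fmt.Errorf("template %q is empty", url)
	}
	return body, nil
}

func main() {
	// Placeholder URL; substitute the real location of the engine template.
	body, err := FetchTemplate("https://example.invalid/pod-delete/engine.yaml")
	if err != nil {
		fmt.Println("template fetch failed:", err)
		return
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```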
…ion and dependency updates

- Add support for the `github.com/google/uuid` package to generate unique identifiers in experiments.
- Update experiment test files to utilize the Litmus SDK for constructing and executing chaos experiments.
- Refactor YAML manifest construction logic to dynamically set experiment parameters based on input details.
- Introduce new environment variables and enhance error handling for manifest generation.
- Update `go.mod` to include the latest version of `github.com/google/uuid` and `sigs.k8s.io/yaml`.

Signed-off-by: Sky Singh <[email protected]>
…ment with new module

- Introduce a new `infrastructure` package to handle setup and disconnection of infrastructure for chaos experiments.
- Update experiment test files to utilize the new infrastructure management functions, improving code clarity and maintainability.
- Add new environment variables for controlling infrastructure installation and usage of existing infrastructure.
- Update `go.mod` and `go.sum` to reflect changes in dependencies.

Signed-off-by: Sky Singh <[email protected]>
…rastructure management module

- Update all experiment test files to replace direct SDK infrastructure connection logic with the new `infrastructure` module for setup and cleanup.
- Enhance code clarity and maintainability by centralizing infrastructure management.
- Validate infrastructure ID after setup to ensure proper connection.
- Ensure consistent error handling across all experiments during infrastructure operations.

Signed-off-by: Sky Singh <[email protected]>
… final phase string

- Replace usage of final status object with final phase string for polling experiment run status.
- Enhance validation checks to ensure final phase is not empty and equals "Completed" after polling.
- Improve code clarity and maintainability by simplifying the status handling in both experiment tests.

Signed-off-by: Sky Singh <[email protected]>
…figurations

- Refactor experiment test files to utilize dynamic timeout and polling interval values from environment variables.
- Enhance error messages to reflect the actual timeout duration for better debugging.
- Extend final phase checks to include "Completed_With_Error" and "Terminated" for improved status validation.

Signed-off-by: Sky Singh <[email protected]>
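
The polling behaviour described here (timeout and interval taken from the environment, several accepted final phases) might be sketched as follows. The variable names `EXPERIMENT_TIMEOUT` and `EXPERIMENT_POLLING_INTERVAL`, the helper names, and the `getPhase` callback that stands in for the SDK status call are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// terminalPhases lists the final phases named in the commit message.
var terminalPhases = []string{"Completed", "Completed_With_Error", "Terminated"}

// secondsFromEnv reads a positive number of seconds via getenv and
// falls back to def when the variable is unset or invalid.
func secondsFromEnv(getenv func(string) string, key string, def int) time.Duration {
	n, err := strconv.Atoi(getenv(key))
	if err != nil || n <= 0 {
		n = def
	}
	return time.Duration(n) * time.Second
}

// PollPhase calls getPhase every interval until it returns a terminal
// phase, or returns an error once the next poll would exceed the timeout.
func PollPhase(getPhase func() (string, error), timeout, interval time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		phase, err := getPhase()
		if err != nil {
			return "", fmt.Errorf("fetching run phase: %w", err)
		}
		for _, p := range terminalPhases {
			if phase == p {
				return phase, nil
			}
		}
		if time.Now().Add(interval).After(deadline) {
			return phase, fmt.Errorf("experiment still in phase %q after %s", phase, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	timeout := secondsFromEnv(os.Getenv, "EXPERIMENT_TIMEOUT", 300)
	interval := secondsFromEnv(os.Getenv, "EXPERIMENT_POLLING_INTERVAL", 1)

	// Stub that reports "Running" once and then "Completed".
	calls := 0
	stub := func() (string, error) {
		calls++
		if calls < 2 {
			return "Running", nil
		}
		return "Completed", nil
	}

	phase, err := PollPhase(stub, timeout, interval)
	if err != nil {
		fmt.Println("polling failed:", err)
		return
	}
	fmt.Println("final phase:", phase)
}
```

Including the actual timeout in the error message is what makes a failed run easier to debug, as the commit notes.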
… Workflow manifest

- Replace the previous method of fetching the engine template from an external source with a direct construction of an Argo Workflow manifest for Litmus 3.0.
- Update the manifest to include dynamic parameters for experiment execution, enhancing clarity and maintainability.
- Remove unnecessary HTTP requests and YAML parsing, streamlining the experiment request construction process.

Signed-off-by: Sky Singh <[email protected]>
…de-io-stress experiment

- Introduce a new environment variable, NODES_AFFECTED_PERC, to capture the percentage of affected pods during the node IO stress experiment.
- Update the setNodeIOStressExperimentENV function to include this new variable alongside existing filesystem utilization parameters for enhanced experiment configuration.
…nction

- Introduce a new utility function, ContainsString, in the pkg module to streamline the checking of final phases in various experiment tests.
- Replace the previous manual phase checking logic with the new utility function across multiple experiment test files, enhancing code clarity and maintainability.
- Ensure consistent handling of final phase validation for improved status checks in experiments.
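
A helper of this kind is typically a few lines. The signature below is an assumption, since the actual `ContainsString` in the pkg module is not shown here.

```go
package main

import "fmt"

// ContainsString reports whether target is present in list.
// The signature is assumed; the PR's version may differ.
func ContainsString(list []string, target string) bool {
	for _, s := range list {
		if s == target {
			return true
		}
	}
	return false
}

func main() {
	finalPhases := []string{"Completed", "Completed_With_Error", "Terminated"}
	fmt.Println(ContainsString(finalPhases, "Completed"))
	fmt.Println(ContainsString(finalPhases, "Running"))
}
```

Since the module now targets Go 1.24, the standard library's `slices.Contains` (available from Go 1.21) would be an alternative to a hand-written helper.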
"name": "run-chaos",
"path": "/tmp/chaosengine-run-chaos.yaml",
"raw": {
"data": "apiVersion: litmuschaos.io/v1alpha1\\nkind: ChaosEngine\\nmetadata:\\n namespace: \\\"{{workflow.parameters.adminModeNamespace}}\\\"\\n labels:\\n context: \\\"{{workflow.parameters.appNamespace}}_pod-delete\\\"\\n workflow_run_id: \\\"{{ workflow.uid }}\\\"\\n workflow_name: %s\\n annotations:\\n probeRef: '[{\\\"name\\\":\\\"ping-google\\\",\\\"mode\\\":\\\"SOT\\\"}]'\\n generateName: run-chaos\\nspec:\\n appinfo:\\n appns: %s\\n applabel: %s\\n appkind: deployment\\n jobCleanUpPolicy: retain\\n engineState: active\\n chaosServiceAccount: litmus-admin\\n experiments:\\n - name: pod-delete\\n spec:\\n components:\\n env:\\n - name: TOTAL_CHAOS_DURATION\\n value: \\\"%d\\\"\\n - name: CHAOS_INTERVAL\\n value: \\\"%d\\\"\\n - name: FORCE\\n value: \\\"false\\\"\\n"
Author

The probe needs to be created first, and its details then need to be added to this part of the template:
annotations:
  probeRef: '[{"name":"ping-google","mode":"SOT"}]'

Author

Also consider the case where multiple fault names are provided: that means creating two ChaosEngines. The template should therefore always have a loop option, i.e. accept the ChaosEngines as an array.
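
One way to read this suggestion is to render one ChaosEngine per fault from a shared template. The sketch below uses Go's `text/template`; the `Fault` type, `RenderEngines`, and the trimmed-down manifest are illustrative assumptions rather than the PR's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Fault describes one chaos fault to run. Hypothetical type for this sketch.
type Fault struct {
	Name     string
	Duration int
}

// engineTmpl is a reduced ChaosEngine manifest; the real one has more fields.
var engineTmpl = template.Must(template.New("engine").Parse(`apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  generateName: {{.Name}}
spec:
  engineState: active
  experiments:
    - name: {{.Name}}
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "{{.Duration}}"
`))

// RenderEngines returns one rendered ChaosEngine manifest per fault.
func RenderEngines(faults []Fault) ([]string, error) {
	engines := make([]string, 0, len(faults))
	for _, f := range faults {
		var sb strings.Builder
		if err := engineTmpl.Execute(&sb, f); err != nil {
			return nil, fmt.Errorf("rendering engine for %q: %w", f.Name, err)
		}
		engines = append(engines, sb.String())
	}
	return engines, nil
}

func main() {
	engines, err := RenderEngines([]Fault{
		{Name: "pod-delete", Duration: 30},
		{Name: "pod-cpu-hog", Duration: 60},
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	// Print the manifests as a multi-document YAML stream.
	fmt.Print(strings.Join(engines, "---\n"))
}
```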

SkySingh04 added 2 commits May 4, 2025 14:22
…cy versions

- Added replacement for github.com/go-openapi/spec to v0.19.5 in go.mod.
- Added replacement for k8s.io/kube-openapi to v0.0.0-20200805222855-6aeccd4b50c6 in go.mod.
- Updated go.sum to reflect changes in dependencies and their versions.
…t logic

- Commented out kubeconfig retrieval and clientset generation in the pod-delete experiment test for clarity.
- Enhanced infrastructure setup logic to manually set ConnectedInfraID when using existing infrastructure.
- Refactored clientset generation to retrieve the token using the Auth() method, improving error handling and logging for token retrieval.
SkySingh04 added 13 commits May 5, 2025 00:51
…tion

- Introduced a unique experiment name generation using UUID for better identification.
- Updated the Argo Workflow manifest to include proper JSON escaping and dynamic parameters.
- Refactored the manifest parsing logic to modify metadata and workflow names dynamically.
- Improved error handling during manifest parsing and JSON conversion.

This refactor enhances clarity and maintainability of the pod-delete experiment request construction.
- Added a local replacement for github.com/litmuschaos/litmus-go-sdk in go.mod, to be removed after the SDK is published.
- Updated go.sum to reflect the changes in dependencies and their versions.
- Replaced UUID generation with a unique experiment name generation function for better clarity.
- Updated the Argo Workflow manifest to dynamically set the experiment name and improved metadata handling.
- Enhanced logging to provide clearer insights during experiment request construction.
- Streamlined the manifest parsing logic to ensure consistent naming conventions across the experiment lifecycle.
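
The commit does not show how the unique name is produced. As one possible approach, a short random suffix can be appended to a readable prefix; `GenerateExperimentName` and its format are assumptions.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// GenerateExperimentName returns prefix plus "-" plus 8 random hex
// characters, for example "pod-delete-1a2b3c4d". Hypothetical helper.
func GenerateExperimentName(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generating random suffix: %w", err)
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(buf)), nil
}

func main() {
	name, err := GenerateExperimentName("pod-delete")
	if err != nil {
		fmt.Println("name generation failed:", err)
		return
	}
	fmt.Println("experiment name:", name)
}
```

Reusing the same generated name for the experiment, the workflow metadata, and the log lines is what keeps naming consistent across the lifecycle.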
… SDK integration

- Replaced the experiment saving logic with a direct creation method for better clarity and error handling.
- Enhanced the experiment run status polling by fetching the latest run ID dynamically.
- Improved logging to provide clearer insights during experiment execution and status checks.
- Updated the experiment request construction to ensure proper handling of experiment run IDs and statuses.
…tion with templating

- Replaced static Argo Workflow manifest with a templated approach for dynamic parameter handling.
- Improved metadata handling by utilizing template data structure for experiment details.
- Streamlined the construction of the experiment request, enhancing clarity and maintainability.
- Updated logging to provide better insights during the experiment request creation process.
…ruction

- Moved the pod-delete experiment request construction logic to a new workflow package for better organization and reusability.
- Introduced a templated approach for generating Argo Workflow manifests, allowing for dynamic parameter handling.
- Enhanced the ExperimentConfig structure to encapsulate experiment-specific configurations.
- Updated the test to utilize the new workflow functions, improving clarity and maintainability of the experiment execution process.
…eneration

- Introduced new experiment types (PodCPUHog, PodMemoryHog) with specific parameters in the ExperimentConfig structure.
- Enhanced GetDefaultExperimentConfig to apply experiment-specific defaults based on the type.
- Modularized the construction of Argo Workflow manifests to support dynamic templates for various experiment types.
- Improved error handling and logging during manifest generation and experiment request construction, enhancing clarity and maintainability.
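
Type-specific defaults of this sort are commonly expressed as a shared base plus a switch on the experiment type. The field names and default values below are assumptions chosen for illustration; only the `ExperimentConfig`, `GetDefaultExperimentConfig`, `PodCPUHog`, and `PodMemoryHog` names come from the commit message.

```go
package main

import "fmt"

// ExperimentType identifies a chaos experiment.
type ExperimentType string

const (
	PodDelete    ExperimentType = "pod-delete"
	PodCPUHog    ExperimentType = "pod-cpu-hog"
	PodMemoryHog ExperimentType = "pod-memory-hog"
)

// ExperimentConfig carries common and experiment-specific settings.
// The fields here are illustrative, not the PR's actual structure.
type ExperimentConfig struct {
	Type          ExperimentType
	ChaosDuration int // seconds
	CPUCores      int // used by pod-cpu-hog only
	MemoryMB      int // used by pod-memory-hog only
}

// GetDefaultExperimentConfig returns shared defaults and then applies
// the defaults that only make sense for the given experiment type.
func GetDefaultExperimentConfig(t ExperimentType) ExperimentConfig {
	cfg := ExperimentConfig{Type: t, ChaosDuration: 60}
	switch t {
	case PodCPUHog:
		cfg.CPUCores = 1
	case PodMemoryHog:
		cfg.MemoryMB = 500
	}
	return cfg
}

func main() {
	for _, t := range []ExperimentType{PodDelete, PodCPUHog, PodMemoryHog} {
		fmt.Printf("%+v\n", GetDefaultExperimentConfig(t))
	}
}
```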
…ed SDK integration

- Commented out kubeconfig retrieval and clientset generation in both pod-cpu-hog and pod-memory-hog tests for clarity.
- Enhanced logging by replacing log with klog for better consistency across experiments.
- Modularized experiment request construction by utilizing the workflow package for generating unique experiment names and IDs.
- Improved error handling during experiment creation and run status polling, ensuring dynamic fetching of the latest experiment run ID.
- Updated the test descriptions for better readability and consistency.
…pport

- Added new fields to ExperimentConfig for probe configuration, including UseExistingProbe, ProbeName, and ProbeMode.
- Updated GetDefaultExperimentConfig to set default probe values.
- Modified experiment manifest generation to include probe annotations directly in the YAML.
- Implemented applyProbeConfigFromEnv function to read probe settings from environment variables, improving flexibility for experiment configurations.
- Updated specific experiment request constructors to apply probe configurations, enhancing reusability and maintainability.
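
A sketch of how probe settings might be overlaid from the environment and turned into the `probeRef` annotation. The `UseExistingProbe`, `ProbeName`, and `ProbeMode` fields are named in the commit; the `LITMUS_*` variable names, the `ProbeConfig` wrapper, and the defaults (taken from the `ping-google`/`SOT` annotation quoted in the review thread) are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// ProbeConfig groups the probe fields; in the PR they live on ExperimentConfig.
type ProbeConfig struct {
	UseExistingProbe bool
	ProbeName        string
	ProbeMode        string
}

// DefaultProbeConfig returns assumed defaults for this sketch.
func DefaultProbeConfig() ProbeConfig {
	return ProbeConfig{ProbeName: "ping-google", ProbeMode: "SOT"}
}

// applyProbeConfigFromEnv overrides cfg with any variables that are set.
// The variable names are hypothetical.
func applyProbeConfigFromEnv(cfg *ProbeConfig, getenv func(string) string) error {
	if v := getenv("LITMUS_USE_EXISTING_PROBE"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return fmt.Errorf("parsing LITMUS_USE_EXISTING_PROBE: %w", err)
		}
		cfg.UseExistingProbe = b
	}
	if v := getenv("LITMUS_PROBE_NAME"); v != "" {
		cfg.ProbeName = v
	}
	if v := getenv("LITMUS_PROBE_MODE"); v != "" {
		cfg.ProbeMode = v
	}
	return nil
}

// probeRef mirrors one entry of the probeRef annotation.
type probeRef struct {
	Name string `json:"name"`
	Mode string `json:"mode"`
}

// probeRefAnnotation builds the JSON value for the probeRef annotation.
func probeRefAnnotation(cfg ProbeConfig) (string, error) {
	b, err := json.Marshal([]probeRef{{Name: cfg.ProbeName, Mode: cfg.ProbeMode}})
	if err != nil {
		return "", fmt.Errorf("encoding probeRef: %w", err)
	}
	return string(b), nil
}

func main() {
	cfg := DefaultProbeConfig()
	if err := applyProbeConfigFromEnv(&cfg, os.Getenv); err != nil {
		fmt.Println("invalid probe configuration:", err)
		return
	}
	ref, err := probeRefAnnotation(cfg)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("probeRef:", ref)
}
```

Marshalling a struct instead of concatenating strings avoids the manual JSON escaping that the earlier hand-written manifest needed.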
… request construction

- Updated pod network experiments (corruption, latency, loss, duplication) to utilize the workflow package for constructing experiment requests.
- Improved logging consistency by replacing log statements with klog across all network chaos tests.
- Modularized experiment request construction to dynamically generate unique experiment names and IDs, enhancing maintainability.
- Enhanced error handling during experiment creation and run status polling, ensuring accurate fetching of the latest experiment run ID.
- Updated test descriptions for better clarity and consistency across network chaos experiments.
…on and enhance logging

- Updated container-kill, disk-fill, and pod-autoscaler experiments to utilize the workflow package for constructing experiment requests, improving maintainability.
- Replaced log statements with klog for consistent logging across all chaos experiments.
- Enhanced error handling during experiment creation and run status polling, ensuring accurate fetching of the latest experiment run ID.
- Updated test descriptions for better clarity and consistency across chaos experiments.
…ity in chaos tests

- Updated various chaos experiment tests (container-kill, disk-fill, pod-autoscaler, pod-cpu-hog, pod-memory-hog, pod-network-corruption, pod-network-duplication, pod-network-latency, pod-network-loss) to include a polling mechanism for checking the availability of experiment runs.
- Enhanced logging to provide detailed insights during the polling process, including retry attempts and found experiment run IDs.
- Removed commented-out kubeconfig retrieval and clientset generation for improved clarity and maintainability.
- Ensured consistent error handling and expectations during experiment run status checks across all updated tests.
…os experiment tests

- Updated multiple chaos experiment tests (container-kill, disk-fill, node-cpu-hog, node-io-stress, node-memory-hog, pod-cpu-hog, pod-memory-hog) to replace the deprecated ioutil.ReadAll with io.ReadAll for consistency with current Go conventions.
- Modified file handling in pkg/file.go and pkg/install.go to utilize os.ReadFile and os.WriteFile, enhancing clarity and maintainability.
- Ensured consistent error handling and logging across all updated tests.
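
For reference, the replacements named in this commit are drop-in: `io.ReadAll`, `os.ReadFile`, and `os.WriteFile` have the same signatures as their `io/ioutil` counterparts, which have been deprecated since Go 1.16. The sketch below exercises all three; `roundTripDemo` and the file name are made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTripDemo reads from a reader, writes the bytes to a temporary
// file, reads them back, and returns both results for comparison.
func roundTripDemo() (string, string, error) {
	// io.ReadAll replaces ioutil.ReadAll.
	b, err := io.ReadAll(strings.NewReader("manifest"))
	if err != nil {
		return "", "", err
	}

	// os.MkdirTemp replaces ioutil.TempDir.
	dir, err := os.MkdirTemp("", "ci-lib-example")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "manifest.yaml")
	// os.WriteFile replaces ioutil.WriteFile.
	if err := os.WriteFile(path, b, 0o644); err != nil {
		return "", "", err
	}
	// os.ReadFile replaces ioutil.ReadFile.
	data, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	return string(b), string(data), nil
}

func main() {
	fromReader, fromFile, err := roundTripDemo()
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(fromReader == fromFile)
}
```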
go.mod Outdated
github.com/go-openapi/spec => github.com/go-openapi/spec v0.19.5
github.com/googleapis/gnostic => github.com/googleapis/gnostic v0.3.1
github.com/litmuschaos/chaos-operator => github.com/litmuschaos/chaos-operator v0.0.0-20210610071657-a58dbd939e73
github.com/litmuschaos/litmus-go-sdk => ../litmus-go-sdk //TO remove after litmus-go-sdk is published
Adding a comment here as a reminder: switch this to the commit ID of the other PR once that PR is merged.

@@ -27,7 +27,8 @@ func setNodeIOStressExperimentENV(experimentsDetails *types.ExperimentDetails) *
 	}
 	// Add Experiment ENV's
 	envDetails.SetEnv("FILESYSTEM_UTILIZATION_PERCENTAGE", strconv.Itoa(experimentsDetails.NodeCPUCore)).
-		SetEnv("FILESYSTEM_UTILIZATION_BYTES", strconv.Itoa(experimentsDetails.FilesystemUtilizationBytes))
+		SetEnv("FILESYSTEM_UTILIZATION_BYTES", strconv.Itoa(experimentsDetails.FilesystemUtilizationBytes)).
+		SetEnv("NODES_AFFECTED_PERC", strconv.Itoa(experimentsDetails.PodsAffectedPerc))
Good!

SkySingh04 added 11 commits May 10, 2025 20:41
…experiment

- Introduced probe configuration options in ExperimentDetails, allowing for dynamic probe creation during the pod-delete experiment.
- Updated the GetENV function to read new environment variables related to probe settings.
- Enhanced the pod-delete experiment test to set up and clean up probes based on the new configuration.
- Implemented CreateProbe and CleanupProbe functions to manage probe lifecycle, improving experiment flexibility and maintainability.
…eriment tests

- Added support for probe configuration in various chaos experiment tests (container-kill, disk-fill, node-cpu-hog, node-io-stress, node-memory-hog, pod-autoscaler, pod-cpu-hog, pod-memory-hog, pod-network-corruption, pod-network-duplication, pod-network-latency, pod-network-loss).
- Implemented logic to create and validate probes before running experiments, enhancing flexibility and reliability.
- Included cleanup procedures for probes after experiment execution to ensure proper resource management.
- Enhanced logging for probe setup and cleanup processes to improve traceability during test execution.
…nment variables

- Eliminated NODE_LABEL from the environment variables in multiple chaos experiment tests (disk-fill, node-cpu-hog, node-io-stress, node-memory-hog) to streamline configuration.
- Implemented logic to filter out NODE_LABEL if it is empty, enhancing the flexibility of experiment configurations.
- Updated the GetExperimentManifest function to handle NODE_LABEL removal in the manifest generation process, ensuring cleaner manifests.
- Improved overall maintainability and clarity of chaos experiment tests by reducing unnecessary environment variables.
…truction

- Removed uuid dependency from chaos experiment tests (node-cpu-hog, node-io-stress, node-memory-hog) to streamline imports.
- Commented out kubeconfig retrieval and clientset generation in tests for improved clarity.
- Enhanced experiment request construction by utilizing pkg functions to generate unique experiment names and IDs, improving maintainability.
- Added logic to set ConnectedInfraID from ExistingInfraID if not already set, enhancing infrastructure setup reliability.
- Updated GetDefaultExperimentConfig to include new chaos experiment types and their specific configurations, ensuring comprehensive support for node chaos experiments.
- Updated GetExperimentManifest to remove NODE_LABEL environment variables when empty, improving manifest cleanliness.
- Added logic to replace NODE_LABEL with its value when provided, ensuring proper configuration for chaos experiments.
- Adjusted CPU_LOAD environment variable to a fixed value of '100' for consistency across experiments.
…t tests

- Replaced environment.ClientSets with Litmus SDK client in pod-delete experiment tests for improved clarity and maintainability.
- Updated SetupInfrastructure and DisconnectInfrastructure functions to utilize the new SDK client, enhancing infrastructure management.
- Streamlined experiment run polling and status checking by leveraging the SDK client, ensuring consistent error handling and logging.
- Enhanced CreateProbe function to utilize the SDK client for probe creation, improving flexibility in chaos experiment configurations.
…iment tests

- Replaced environment.ClientSets with Litmus SDK client in various chaos experiment tests (container-kill, disk-fill, node-cpu-hog, node-io-stress, node-memory-hog, pod-autoscaler, pod-cpu-hog, pod-memory-hog, pod-network-corruption, pod-network-duplication, pod-network-latency, pod-network-loss) for improved clarity and maintainability.
- Updated SetupInfrastructure and DisconnectInfrastructure functions to utilize the new SDK client, enhancing infrastructure management.
- Streamlined experiment run polling and status checking by leveraging the SDK client, ensuring consistent error handling and logging.
- Enhanced CreateProbe function to utilize the SDK client for probe creation, improving flexibility in chaos experiment configurations.
- Enhanced GetExperimentManifest to remove NODE_LABEL environment variables when empty, ensuring cleaner manifests.
- Updated regex patterns to match and remove NODE_LABEL in various formats, improving flexibility in manifest generation.
- Implemented logic to directly replace NODE_LABEL with an empty string or its value based on configuration, streamlining experiment setup.
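
The regex patterns used in `GetExperimentManifest` are not shown in this conversation. As a sketch of the idea, the function below drops the whole `NODE_LABEL` entry when no label is configured and otherwise rewrites its value; `ApplyNodeLabel`, the pattern, and the sample manifest are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// nodeLabelEntry matches a "- name: NODE_LABEL" line and the "value:"
// line that follows it. Group 1 is the name line, group 2 the
// indentation and "value:" key.
var nodeLabelEntry = regexp.MustCompile(`(?m)^([ \t]*- name: NODE_LABEL[ \t]*\n)([ \t]*value:).*\n?`)

// ApplyNodeLabel removes the NODE_LABEL entry when nodeLabel is empty
// and otherwise sets its value to the quoted label.
func ApplyNodeLabel(manifest, nodeLabel string) string {
	if nodeLabel == "" {
		return nodeLabelEntry.ReplaceAllString(manifest, "")
	}
	// Escape "$" so the label is not read as a capture-group reference.
	value := strings.ReplaceAll(strconv.Quote(nodeLabel), "$", "$$")
	return nodeLabelEntry.ReplaceAllString(manifest, "${1}${2} "+value+"\n")
}

// sampleManifest is a reduced env section used only for this example.
const sampleManifest = `env:
  - name: NODE_LABEL
    value: ""
  - name: CPU_LOAD
    value: "100"
`

func main() {
	fmt.Print(ApplyNodeLabel(sampleManifest, ""))
	fmt.Println("---")
	fmt.Print(ApplyNodeLabel(sampleManifest, "zone=a"))
}
```

A text-level rewrite like this is sensitive to formatting; unmarshalling the manifest and filtering the env list would be more robust if the manifest layout varies.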
- Introduced a comprehensive list of environment variables for configuring environments, infrastructure, and probes in the README.
- Implemented `GenerateEnvironmentID` function to create unique environment IDs.
- Enhanced `SetupEnvironment` function to manage the creation or usage of existing environments based on provided environment variables.
- Updated `ConnectInfrastructure` to utilize the new environment setup logic, improving infrastructure management.
…nce experiment manifests

- Set default values for MemoryConsumptionPercentage and MemoryConsumptionMebibytes to "0" in the NodeMemoryHog configuration, ensuring clarity in resource allocation.
- Removed unnecessary NODE_LABEL entries from the node-memory-hog and node-cpu-hog manifests to streamline configuration.
- Added comprehensive permissions for various Kubernetes resources in the experiment manifests, enhancing operational capabilities.
- Improved overall manifest structure by ensuring consistent labeling and resource definitions across chaos experiments.
- Updated the litmus-go-sdk dependency to version v0.0.0-20250513045254-3a81cc911979 in both go.mod and go.sum files, ensuring compatibility with the latest features and fixes.
// Check if we should use existing infrastructure
useExistingInfra, _ := strconv.ParseBool(os.Getenv("USE_EXISTING_INFRA"))
if useExistingInfra {
infraID := os.Getenv("EXISTING_INFRA_ID")
Member

If INSTALL_INFRA is true, it means you need to create a new infrastructure and use it.

Author

That is the approach we have currently implemented:
https://github.com/litmuschaos/chaos-ci-lib/pull/198/files#diff-70a81cc81e4424c3405f9c4048d010a079e692ccce2159739626bf889182db9a

installInfra, _ := strconv.ParseBool(os.Getenv("INSTALL_INFRA"))
if !installInfra {
	klog.Info("INSTALL_INFRA is set to false, skipping infrastructure setup")
	return nil
}

Comment on lines 57 to 59
if experimentsDetails.ConnectedInfraID == "" && experimentsDetails.UseExistingInfra && experimentsDetails.ExistingInfraID != "" {
experimentsDetails.ConnectedInfraID = experimentsDetails.ExistingInfraID
klog.Infof("Manually set ConnectedInfraID to %s from ExistingInfraID", experimentsDetails.ConnectedInfraID)
Member

This code block can be moved inside the SetupInfrastructure function, to handle the case where INSTALL_INFRA is set to false.

Author

This has been refactored!
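
The refactored flow discussed in this thread might look roughly like the sketch below: a new infrastructure is connected when installation is requested, and otherwise an existing ID is reused inside the same function. The reduced `ExperimentDetails` struct and the `connect` callback, which stands in for the SDK call, are assumptions; the actual `SetupInfrastructure` signature is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// ExperimentDetails is reduced to the fields relevant to this sketch.
type ExperimentDetails struct {
	InstallInfra     bool   // mirrors INSTALL_INFRA
	UseExistingInfra bool   // mirrors USE_EXISTING_INFRA
	ExistingInfraID  string // mirrors EXISTING_INFRA_ID
	ConnectedInfraID string
}

// SetupInfrastructure sets ConnectedInfraID. connect is a stand-in for
// the SDK call that registers a new infrastructure and returns its ID.
func SetupInfrastructure(d *ExperimentDetails, connect func() (string, error)) error {
	if d.InstallInfra {
		id, err := connect()
		if err != nil {
			return fmt.Errorf("connecting infrastructure: %w", err)
		}
		if id == "" {
			return errors.New("infrastructure connected but returned an empty ID")
		}
		d.ConnectedInfraID = id
		return nil
	}
	// INSTALL_INFRA is false: skip setup, reusing an existing ID if given.
	if d.UseExistingInfra && d.ExistingInfraID != "" {
		d.ConnectedInfraID = d.ExistingInfraID
	}
	return nil
}

func main() {
	d := &ExperimentDetails{UseExistingInfra: true, ExistingInfraID: "infra-123"}
	if err := SetupInfrastructure(d, func() (string, error) { return "new-infra", nil }); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("connected infra:", d.ConnectedInfraID)
}
```

Keeping the fallback inside `SetupInfrastructure` means the individual test files no longer need to set `ConnectedInfraID` themselves.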

})
})
})

// ConstructContainerKillExperimentRequest constructs the experiment request by fetching template from external source
func ConstructContainerKillExperimentRequest(details *types.ExperimentDetails, experimentID string) (*models.SaveChaosExperimentRequest, error) {
Member

We can remove these unused functions; they have been moved to pkg/workflows.

Author

These have been cleaned up!

…os experiment requests

- Eliminated HTTP fetching and YAML parsing logic from the chaos experiment request construction functions in multiple tests (container-kill, node-cpu-hog, node-io-stress, node-memory-hog, pod-cpu-hog, pod-memory-hog).
- Streamlined the experiment request construction process by removing unnecessary dependencies and simplifying the code structure.
- Improved overall maintainability and clarity of chaos experiment tests by focusing on essential functionalities.
…haos experiment tests

- Eliminated the logic for manually setting ConnectedInfraID from ExistingInfraID in multiple chaos experiment tests (container-kill, disk-fill, node-cpu-hog, node-io-stress, node-memory-hog, pod-autoscaler, pod-cpu-hog, pod-memory-hog, pod-network-corruption, pod-network-duplication, pod-network-latency, pod-network-loss, pod-delete).
- This change simplifies the infrastructure setup process by relying on the updated logic in the SetupInfrastructure function, enhancing code clarity and maintainability.