Artifacts
Artifacts are a core part of mloda, enabling the storage and retrieval of intermediate results from feature engineering processes. Example use cases include embeddings, feature matrices, and model checkpoints.
Overview
Artifacts are managed through a set of abstract and concrete classes that define how artifacts are created, saved, and loaded. The primary classes involved in artifact management include:
BaseArtifact: The base class for all artifacts.
AbstractFeatureGroup: An abstract class that defines the structure for feature groups, including methods for artifact management. This class may contain a BaseArtifact.
Key Components
BaseArtifact
The BaseArtifact class serves as the foundation for all artifacts. It provides the basic interface and functionality required for artifact management.
AbstractFeatureGroup
The AbstractFeatureGroup class defines the structure for feature groups, including methods for creating data, calculating features, and managing artifacts. It includes methods such as artifact and load_artifact to handle artifact operations.
Example
Feature Group with Artifact implementation
The following example demonstrates how to implement and test an artifact.
Here, we create a FeatureGroup with a configured BaseArtifact.
from typing import Type, Any, Optional

from mloda_core.abstract_plugins.components.feature_set import FeatureSet
from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
from mloda_core.abstract_plugins.components.base_artifact import BaseArtifact
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
from mloda_core.abstract_plugins.components.input_data.base_input_data import BaseInputData


class BaseExampleArtifactFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({cls.get_class_name()})

    @staticmethod
    def artifact() -> Type[BaseArtifact] | None:
        # Declare which artifact class this feature group uses.
        return BaseArtifact

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # Save run: no artifact was passed in, so store one.
        if features.artifact_to_save:
            features.save_artifact = "BasicArtifact"

        # Load run: an artifact was passed in, so retrieve it.
        if features.artifact_to_load:
            result = cls.load_artifact(features)
            print(f"{result} is the loaded artifact.")

        return {cls.get_class_name(): [1, 2, 3]}
Next, we run a query against the feature group to save the artifact. This example is deliberately basic; a real artifact could be far more complex.
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pyarrow.table import PyarrowTable
api = mlodaAPI(["BaseExampleArtifactFeature"], {PyarrowTable})
api._batch_run()
artifacts = api.get_artifacts()
print(artifacts)
Result:
{'BaseExampleArtifactFeature': 'BasicArtifact'}
Now, let us use this artifact.
from mloda_core.abstract_plugins.components.feature import Feature
feat = Feature(name="BaseExampleArtifactFeature", options=artifacts)
api = mlodaAPI([feat], {PyarrowTable})
api._batch_run()
Result:
"BasicArtifact is the loaded artifact."
Testing Artifacts
Testing artifact features involves creating test cases that ensure artifacts are correctly saved and loaded. The following example shows how to test the BaseTestArtifactFeature class.
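A minimal sketch of such a round-trip test is shown here. It does not import mloda: `StubFeatureSet` and `stub_calculate_feature` are hypothetical stand-ins that only mimic the save and load branches of the example feature group above, so the test logic can be read in isolation. A real test would instead run the feature group through mlodaAPI twice, as in the example above, and assert on `get_artifacts()` and on the loaded value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StubFeatureSet:
    """Hypothetical stand-in for a feature set; not mloda's FeatureSet."""

    artifact_to_save: bool = False
    artifact_to_load: Optional[str] = None  # previously saved artifact, if any
    save_artifact: Optional[str] = None  # set by the feature during the save pass


def stub_calculate_feature(features: StubFeatureSet, log: List[str]) -> Dict[str, Any]:
    """Mimic the save/load branches of the example feature group."""
    if features.artifact_to_save:
        features.save_artifact = "BasicArtifact"
    if features.artifact_to_load:
        log.append(f"{features.artifact_to_load} is the loaded artifact.")
    return {"BaseExampleArtifactFeature": [1, 2, 3]}


def test_artifact_round_trip() -> None:
    log: List[str] = []

    # Pass 1: nothing to load yet, so the feature saves its artifact.
    first = StubFeatureSet(artifact_to_save=True)
    result = stub_calculate_feature(first, log)
    assert first.save_artifact == "BasicArtifact"
    assert log == []

    # Pass 2: hand the saved artifact back and check that it is loaded.
    second = StubFeatureSet(artifact_to_load=first.save_artifact)
    assert stub_calculate_feature(second, log) == result
    assert second.save_artifact is None
    assert log == ["BasicArtifact is the loaded artifact."]


test_artifact_round_trip()
```

The two assertions worth keeping in a real test are the same: the save pass must produce the artifact, and the load pass must see exactly what was saved without saving again.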
Complex Artifact Example: SklearnArtifact
For more advanced use cases, artifacts can handle complex data structures and multiple objects. The SklearnArtifact demonstrates this with fitted scikit-learn transformers:
from mloda_plugins.feature_group.experimental.sklearn.sklearn_artifact import SklearnArtifact


class MySklearnFeatureGroup(AbstractFeatureGroup):
    @staticmethod
    def artifact() -> Type[BaseArtifact] | None:
        return SklearnArtifact

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # Save multiple artifacts with unique keys.
        # fitted_model stands for a transformer fitted earlier in this method (fitting step not shown).
        if features.artifact_to_save:
            SklearnArtifact.save_sklearn_artifact(
                features,
                "my_transformer",
                {"fitted_transformer": fitted_model, "feature_names": ["col1", "col2"]},
            )

        # Load a specific artifact by key.
        if features.artifact_to_load:
            artifact_data = SklearnArtifact.load_sklearn_artifact(features, "my_transformer")
            fitted_model = artifact_data["fitted_transformer"]
This pattern supports file-based storage, multiple artifact management, and complex serialization.
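To make keyed, file-based storage concrete, here is a minimal sketch using only the Python standard library. The helpers `save_keyed_artifact` and `load_keyed_artifact` and the one-file-per-key layout are assumptions made for illustration; they are not the SklearnArtifact implementation, which may serialize and locate its files differently.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_keyed_artifact(directory: Path, key: str, payload: Dict[str, Any]) -> Path:
    """Pickle one payload to <directory>/<key>.pkl and return the file path."""
    path = directory / f"{key}.pkl"
    with path.open("wb") as handle:
        pickle.dump(payload, handle)
    return path


def load_keyed_artifact(directory: Path, key: str) -> Dict[str, Any]:
    """Load the payload previously stored under the given key."""
    with (directory / f"{key}.pkl").open("rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)

    # Two independent artifacts live side by side under different keys.
    save_keyed_artifact(storage, "my_transformer", {"mean": 2.0, "feature_names": ["col1", "col2"]})
    save_keyed_artifact(storage, "my_encoder", {"categories": ["a", "b"]})

    # Only the requested key is read back.
    restored = load_keyed_artifact(storage, "my_transformer")
    stored_files = sorted(p.name for p in storage.iterdir())
```

Keying each payload separately means one feature group can persist several fitted objects without them overwriting each other. Note that pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.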
Conclusion
Artifacts let mloda store intermediate results of a machine learning pipeline and reuse them in later runs. A feature group declares its artifact class through artifact(), saves the artifact during calculate_feature, and retrieves it with load_artifact.