(Feature) data

Access (feature) data

This framework provides several structured mechanisms for features to access and manage data, catering to diverse needs. Features can retrieve data through:

DataAccessCollection for global data loading management,
Feature Scope Data Access for feature-specific data loading management,
Api Data for using data directly with the mlodaAPI request.
Data Creator for generating data instead of loading data,
Input Features facilitates data sharing between features.

These methods ensure efficient data management while maintaining flexibility and scalability. A typical scenario involves a complex feature relying on three input features. These input features, in turn, may depend on other input features or load data using ApiData.

Advanced: For a detailed explanation of the underlying data access patterns (BaseInputData vs MatchData), see Data Access Patterns.

DataAccessCollection - global data access

The DataAccessCollection is designed to control the access to data of any kind. The main purpose of this class is to organize and simplify interactions with these different data elements, making it easier to work to ingest data of various form into the framework. It provides as an interface for accessing and storing of data on a global level.

The DataAccessCollection can only be added via mlodaAPI.

List options:

Files: Specifies the exact location of files: path/folder/text.txt
Folders: Points to directories where files are located: path/folder/
Credential dicts: Contains the necessary credentials to access data:
{host: example.com, password: example}
Initialized connection object: Stores connection objects that are already initialized: (DBConnectionObject)
Unitialized connection object: Stores not initialized connection objects: (UninitializedDBConnection)

You can apply these options like so:


data_access = DataAccessCollection()

# Add file paths, folder paths, credentials, and connection objects
data_access.add_file('path/to/folder/text.txt')
data_access.add_folder('path/to/folder/')
data_access.add_credential_dict({'host': 'example.com', 'password': 'example'})
data_access.add_initialized_connection_object('InitializedDBConnection')
data_access.add_uninitialized_connection_object('UninitializedDBConnection')

mlodaAPI.run_all(
    feature_list,
    data_access_collection=data_access)

Global Scope Data Access

A concrete, simplified global scope data access is shown in this example:

import os

from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
from mloda_core.api.request import mlodaAPI


file_path = os.getcwd()
file_path += "/docs/docs/in_depth"

data_access_collection = DataAccessCollection(folders={str(file_path)})

result = mlodaAPI.run_all(
            ["AExample", "BExample"], 
            compute_frameworks=["PandasDataframe"], 
            # Define data access on a global level
            data_access_collection=data_access_collection
        )
print(result)

Output

[  AExample  BExample
0   Value1         2
1   Value2         3]

Feature Scope Data Access

The Feature Scope Data access is instead designed to control the access to data of any kind on a local level.

If data needs to be added specifically for a single feature (or features from the same feature group), you can use the feature_scope_data_access_name functionality.

We show the ReadFileFeature as example. It uses the input_data ReadFile. In this case, we need to provide the specific reader class: CsvReader.


# This feature is already implemented as plug-in, so do not run it again. This will raise intentional errors.
class ReadFileFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return ReadFile()

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        reader = cls.input_data()
        if reader is not None:
            data = reader.load(features)
            return data
        raise ValueError(f"Reading file failed for feature {features.get_name_of_one_feature()}.")

As a side note, the ReadFileFeature was also used for the global scope automatism.

To use it, we can simply:



from typing import Optional, Any, List
from pathlib import Path

from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
from mloda_core.abstract_plugins.components.input_data.base_input_data import BaseInputData
from mloda_core.abstract_plugins.components.feature_set import FeatureSet
from mloda_plugins.feature_group.input_data.read_file import ReadFile
from mloda_core.abstract_plugins.components.feature import Feature
from mloda_plugins.feature_group.input_data.read_files.csv import CsvReader


file_path = os.getcwd()
file_path += "/docs/docs/in_depth"

feature_list: List[Feature | str] = []
feature_list.append(
    Feature(
        name="AExample",
        # Define data access on a feature level
        options={CsvReader.get_class_name(): file_path}),
)


result = mlodaAPI.run_all(feature_list, compute_frameworks=["PandasDataframe"])
print(result)

Output

[  AExample
0   Value1
1   Value2]

Of course, we do not always want to load data during run. We might want to give data to the framework. For this purpose, we have the ApiData.

ApiData

The ApiData can read data given to mlodaAPI at request time.

Use cases:

web requests
real-time prediction
features as parameters

The following example shows a simple ApiData setup.


from typing import List

from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
from mloda_core.abstract_plugins.components.input_data.api.api_input_data_collection import ApiInputDataCollection

# Setup the ApiInputDataCollection and api data.
# These 2 objects are needed to relate given data to the correct features.
api_input_data_collection = ApiInputDataCollection()
api_data = api_input_data_collection.setup_key_api_data(
    key_name="ExampleApiData", api_input_data={"FeatureInputAPITest": ["TestValue3", "TestValue4"]}
)

result = mlodaAPI.run_all(
        ["FeatureInputAPITest"],
        compute_frameworks={PandasDataframe},
        api_input_data_collection=api_input_data_collection,
        api_data=api_data,
)
for res in result:
    print(res)

Output:

  FeatureInputAPITest
0          TestValue3
1          TestValue4

Further, we do not want to always load data from outside, be it before or during the framework run, but we want to be able to create Data. For this purpose, we have the Data Creator.

Data Creator

The data creator can create data independent of any other dependency. It is essentially a base feature that does not need a DataAccessCollection or Feature Scope Data Access.

Usage:

test data,
sample data,
dummy data,
parameter data

One could imagine that for experimenting one wants to see data. Then one could use this feature as input feature to another feature instead of e.g. the true data.



from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator

# Create a Creator FeatureGroup class, which delivers the data needed
class AFeatureInputCreator(AbstractFeatureGroup):

    # Define input_data with using DataCreator
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"AFeatureInputCreator"})

    # Define the data this feature creates
    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return {"AFeatureInputCreator": ["TestValue5", "TestValue6"]}

result = mlodaAPI.run_all(
        ["AFeatureInputCreator"],
        compute_frameworks={PandasDataframe},
)
for res in result:
     print(res)

Output

  AFeatureInputCreator
0          TestValue5
1          TestValue6

Finally, as also the most important way to get data, is actually to depent on data inside the framework already. For this purpose, a feature can load data depending on other features.

Input features

The input_features method allows a feature to access data from other features. This enables data sharing and collaboration between different components of your system.

This is one of the key aspects in how we achieve to split data from processes.

In the following example, we will use data from another feature.


from typing import Set

from mloda_core.abstract_plugins.components.options import Options
from mloda_core.abstract_plugins.components.feature_name import FeatureName
from mloda_core.abstract_plugins.components.plugin_option.plugin_collector import PlugInCollector

# Set this variable as convention
_mloda_source = "mloda_source"


# First, we create a class, which uses input features from another class
class AInputFeatureGroup(AbstractFeatureGroup):

    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:

        # We use the source to make this feature flexible.
        # One could give here different feature names via the configuration.
        mloda_source = options.get(_mloda_source)
        if mloda_source is None:
            raise ValueError(f"Option '{_mloda_source}' is required.")

        features = set()
        for source in mloda_source:
            features.add(Feature(name=source,                # source in this example is <AFeatureInputCreator>
                                 initial_requested_data=True # To see this feature also in the output, we can set this var to true.
                                 )
            )
        return features

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        data["AInputFeatureGroup"] = len(data)
        return data


feature_list = []
feature_list.append(
    Feature(name="AInputFeatureGroup", options={_mloda_source: frozenset(["AFeatureInputCreator"])})
)

result = mlodaAPI.run_all(
    feature_list,
    compute_frameworks={PandasDataframe},
    plugin_collector=PlugInCollector.enabled_feature_groups({AInputFeatureGroup, AFeatureInputCreator})

)
print(result)

Output:

[AFeatureInputCreator
0           TestValue5
1           TestValue6,    
AInputFeatureGroup
0                   2
1                   2]
.......

As the input features can be fulfilled by multiple other features, we can have the same processes running in different environments, migrations and processes.