Data Producer, User, Owner in mloda

Roles in mloda

In this notebook, we describe the three roles that exist in the mloda framework.

  • Data Producers: Provide access to raw data, business-layer data, or aggregated data.

    • A data scientist or analyst might create simpler datasets or analytical outputs
    • A data engineer might design access to complex data infrastructures, such as data lakes or warehouses
    • Shares plug-ins and, with them, access to data
  • Data User: Interacts with mloda by applying plug-ins while making requests via the mlodaAPI.

    • A data scientist or analyst who needs data and data transformations (features)
    • Consumes features from other parts of the organization
  • Data Owner: Ensures the lifetime value, availability, and governance of data and features

Data Producer

Let us take a closer look at the role of a data producer.

What could a data producer create?

In [1]:
# We reuse the data from the first example and rerun it here, condensed into a single cell.

import os
from typing import List
from mloda_core.abstract_plugins.components.feature import Feature
from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
from mloda_plugins.feature_group.input_data.read_dbs.sqlite import SQLITEReader
from mloda_core.abstract_plugins.plugin_loader.plugin_loader import PluginLoader
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pyarrow.table import PyarrowTable

plugin_loader = PluginLoader.all()

# Initialize a DataAccessCollection object
data_access_collection = DataAccessCollection()

# Define the folders containing the data
# Note: We check two paths to accommodate different root locations, depending on where the code is executed.
base_data_path = os.path.join(os.getcwd(), "docs", "docs", "examples", "mloda_basics", "base_data")
if not os.path.exists(base_data_path):
    base_data_path = os.path.join(os.getcwd(), "base_data")

# Add the folder to the DataAccessCollection
data_access_collection.add_folder(base_data_path)

# A database cannot be accessed via a folder alone, so we add a connection for it.
data_access_collection.add_credential_dict(
    credential_dict={SQLITEReader.db_path(): os.path.join(base_data_path, "example.sqlite")}
)

order_features: List[str | Feature] = ["order_id", "product_id", "quantity", "item_price"]
payment_features: List[str | Feature] = ["payment_id", "payment_type", "payment_status", "valid_datetime"]
location_features: List[str | Feature] = ["user_location", "merchant_location", "update_date"]
categorical_features: List[str | Feature] = ["user_age_group", "product_category", "transaction_type"]
all_features = order_features + payment_features + location_features + categorical_features

mlodaAPI.run_all(all_features, data_access_collection=data_access_collection, compute_frameworks={PyarrowTable})
Out[1]:
[pyarrow.Table
 user_location: string
 update_date: int64
 merchant_location: string
 ----
 user_location: [["East","West","North","West","West",...,"West","North","North","North","East"]]
 update_date: [[1640995200000,1641632290909,1642269381818,1642906472727,1643543563636,...,1701518836363,1702155927272,1702793018181,1703430109090,1704067200000]]
 merchant_location: [["North","East","East","East","North",...,"North","South","East","South","North"]],
 pyarrow.Table
 product_id: int64
 item_price: double
 quantity: int64
 order_id: int64
 ----
 product_id: [[282,355,395,319,275,...,170,328,361,192,271]]
 item_price: [[74.86,154.56,67.7,186.21,118.69,...,80.51,98.91,162.98,194.98,107.73]]
 quantity: [[6,2,4,9,5,...,4,3,5,5,6]]
 order_id: [[1,2,3,4,5,...,96,97,98,99,100]],
 pyarrow.Table
 payment_type: string
 valid_datetime: timestamp[ns, tz=UTC]
 payment_status: string
 payment_id: int64
 ----
 payment_type: [["debit card","debit card","store credit","credit card","store credit",...,"paypal","debit card","credit card","store credit","paypal"]]
 valid_datetime: [[2024-01-11 23:01:49.090909090Z,2024-01-15 09:41:49.090909090Z,2024-01-29 19:23:38.181818181Z,2024-01-04 18:10:54.545454545Z,2024-01-06 15:16:21.818181818Z,...,2024-01-13 12:36:21.818181818Z,2024-01-10 01:56:21.818181818Z,2024-01-15 02:10:54.545454545Z,2024-01-28 20:50:54.545454545Z,2024-01-06 00:14:32.727272727Z]]
 payment_status: [["failed","pending","completed","pending","failed",...,"completed","pending","completed","failed","completed"]]
 payment_id: [[1,2,3,4,5,...,96,97,98,99,100]],
 pyarrow.Table
 product_category: string
 transaction_type: string
 user_age_group: string
 ----
 product_category: [["clothing","home","clothing","clothing","clothing",...,"clothing","home","books","beauty","clothing"]]
 transaction_type: [["online","online","telephone","telephone","mail-order",...,"mail-order","in-store","telephone","mail-order","mail-order"]]
 user_age_group: [["26-35","26-35","36-45","55+","26-35",...,"26-35","46-55","36-45","26-35","36-45"]]]

FeatureGroup API (plug-in) in short

In mloda, a data producer defines access by creating feature groups. Here is the skeleton of such an implementation:

class FeatureGroupClass(AbstractFeatureGroup):

    # Root feature definition
    @classmethod
    def input_data(...):
        ...

    # Features derived from other features
    def input_features(...):
        ...

    # Transformation logic that produces the feature values
    @classmethod
    def calculate_feature(...):
        ...

Simple example implementation of the FeatureGroup API

In the background, mloda loads previously created plug-ins such as this one.

class ReadFileFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return ReadFile()

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        reader = cls.input_data()
        if reader is not None:
            data = reader.load(features)
            return data
        raise ValueError(f"Reading file failed for feature {features.get_name_of_one_feature()}.")

We use composition to read different data sources. A ReadFile object looks like this:

class CsvReader(ReadFile):
    @classmethod
    def suffix(cls) -> Tuple[str, ...]:
        return (
            ".csv",
            ".CSV",
        )

    @classmethod
    def load_data(cls, data_access: Any, features: FeatureSet) -> Any:
        result = pyarrow_csv.read_csv(data_access)
        return result.select(list(features.get_all_names()))

    @classmethod
    def get_column_names(cls, file_name: str) -> Any:
        read_options = pyarrow_csv.ReadOptions(skip_rows_after_names=1)
        table = pyarrow_csv.read_csv(file_name, read_options=read_options)
        return table.schema.names

As you can see, the implementation is flexible: if you need something different, you can adjust it easily. Readers for other file types such as .json and .parquet, as well as the SQLite access, are implemented in a similar fashion.
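
To illustrate the pattern, here is a minimal, hypothetical sketch of a JSON reader along the same lines as CsvReader. This is not the actual mloda implementation: the ReadFile base class is assumed to be importable as in the example above, and pyarrow's JSON reader expects newline-delimited JSON.

from typing import Any, Tuple

from pyarrow import json as pyarrow_json

from mloda_core.abstract_plugins.components.feature_set import FeatureSet

# NOTE: ReadFile is the base class that CsvReader extends; its import path is
# not shown in this notebook and is assumed to be available here.


class JsonReaderSketch(ReadFile):
    @classmethod
    def suffix(cls) -> Tuple[str, ...]:
        return (".json",)

    @classmethod
    def load_data(cls, data_access: Any, features: FeatureSet) -> Any:
        # pyarrow reads newline-delimited JSON files into a Table
        result = pyarrow_json.read_json(data_access)
        return result.select(list(features.get_all_names()))

    @classmethod
    def get_column_names(cls, file_name: str) -> Any:
        table = pyarrow_json.read_json(file_name)
        return table.schema.names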

In [2]:
# In the following, we subclass CsvReader so that different parse options (e.g. another delimiter) can be handled.

from typing import Any, Optional

from pyarrow import csv as pyarrow_csv

from mloda_core.abstract_plugins.components.feature_set import FeatureSet
from mloda_core.abstract_plugins.components.input_data.base_input_data import BaseInputData
from mloda_plugins.feature_group.input_data.read_file_feature import ReadFileFeature
from mloda_plugins.feature_group.input_data.read_files.csv import CsvReader


class CsvReader2(CsvReader):
    # CsvReader2 carries its own parse options; adjust the delimiter here as needed
    _parse_options = pyarrow_csv.ParseOptions(
        delimiter=",",  # default delimiter; replace to handle a different format
        quote_char='"',  # Handles quoted strings
        ignore_empty_lines=True,  # Skips empty lines
    )

    @classmethod
    def load_data(cls, data_access: Any, features: FeatureSet) -> Any:
        result = pyarrow_csv.read_csv(data_access, parse_options=cls._parse_options)
        print("We used CsvReader2 to load the data.")
        return result.select(list(features.get_all_names()))


class ReadFileFeature2(ReadFileFeature):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return CsvReader2()

    @classmethod
    def validate_output_features(cls, data: Any, features: FeatureSet) -> Optional[bool]:
        # Fail validation if any requested column consists entirely of nulls
        for column_name in features.get_all_names():
            if column_name in data.column_names:
                column = data[column_name]
                if column.null_count == len(column):
                    return False
        return True
In [3]:
from mloda_core.abstract_plugins.components.plugin_option.plugin_collector import PlugInCollector

result = mlodaAPI.run_all(
    order_features,
    data_access_collection=data_access_collection,
    compute_frameworks={PyarrowTable},
    plugin_collector=PlugInCollector.enabled_feature_groups({ReadFileFeature2}),
)

# We can see that the data was loaded using the new CsvReader2.
# However, this is a rather simple use case. In a real-world scenario, we would have more complex data and more complex operations.
We used CsvReader2 to load the data.

Complex plug-ins

These can be quite varied, for example (a sketch of an aggregate feature group follows below):

  • aggregate features
  • entity features
  • historical features

Additionally, one can also write feature groups for:

  • using feature stores
  • using orchestrator steps
  • lazily evaluated functions
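
To make the aggregate case concrete, here is a minimal, hypothetical sketch of an aggregate feature group. The method signatures, the import path of AbstractFeatureGroup, and the Feature construction are assumptions based on the skeleton shown earlier; the real mloda interfaces may differ in detail.

from typing import Any, Optional, Set

import pyarrow as pa
import pyarrow.compute as pc

from mloda_core.abstract_plugins.components.feature import Feature
from mloda_core.abstract_plugins.components.feature_set import FeatureSet

# NOTE: AbstractFeatureGroup's import path is not shown in this notebook and
# is assumed to be available here.


class SumItemPriceFeature(AbstractFeatureGroup):
    """Hypothetical aggregate feature: total item_price over the loaded data."""

    def input_features(self, options: Any, feature_name: Any) -> Optional[Set[Feature]]:
        # Signature sketched from the skeleton above; this derived feature
        # depends on the raw item_price feature. Constructing a Feature from a
        # plain name string is an assumption.
        return {Feature("item_price")}

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # With PyarrowTable as the compute framework, data arrives as a pyarrow.Table.
        total = pc.sum(data["item_price"]).as_py()
        return pa.table({"sum_item_price": [total]})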

Quality

The producer must also ensure quality, which includes:

  • defining input and output validators
  • managing the storage and retrieval of artifacts
  • implementing software tests

An integration test can be written with mlodaAPI.run_all and custom data.
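
As a hedged sketch, reusing only objects defined earlier in this notebook (DataAccessCollection, mlodaAPI, PyarrowTable, PlugInCollector, and ReadFileFeature2), such an integration test might look like this:

import os
import tempfile


def test_csv_reader_2_integration() -> None:
    # Write a tiny CSV file into a temporary folder and register that folder.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with open(os.path.join(tmp_dir, "orders.csv"), "w") as f:
            f.write("order_id,quantity\n1,6\n2,2\n")

        dac = DataAccessCollection()
        dac.add_folder(tmp_dir)

        # Resolve the features with only ReadFileFeature2 enabled.
        result = mlodaAPI.run_all(
            ["order_id", "quantity"],
            data_access_collection=dac,
            compute_frameworks={PyarrowTable},
            plugin_collector=PlugInCollector.enabled_feature_groups({ReadFileFeature2}),
        )

        # As seen in Out[1], run_all returns a list of pyarrow Tables.
        assert result[0].column("order_id").to_pylist() == [1, 2]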

An example of a unit test could look like:

import unittest
from unittest.mock import Mock, patch


class TestCsvReader2(unittest.TestCase):
    @patch.object(pyarrow_csv, "read_csv")
    def test_parse_options_are_customized(self, mock_read_csv: Mock) -> None:
        # Ensure the parse options are as expected
        expected_parse_options = pyarrow_csv.ParseOptions(
            delimiter=",",
            quote_char='"',
            ignore_empty_lines=True,
        )

        # Call load_data with mocks to trigger parse-options usage
        features = Mock(spec=FeatureSet)
        features.get_all_names.return_value = []
        CsvReader2.load_data(Mock(), features)

        # Verify that the _parse_options in CsvReader2 are customized
        self.assertTrue(CsvReader2._parse_options.equals(expected_parse_options))
        mock_read_csv.assert_called_once()

This allows us to apply software engineering practices consistently throughout the entire data workflow.

Consequences

Within mloda, the data producer is empowered as the primary driver, owing to the extensive and customizable range of available plugins.

Unlike traditional data toolchains, mloda provides data producers with the flexibility to define their specific start and end points. This enables the versatile application of mloda across different parts of the machine learning lifecycle, such as prototyping, training data preparation, or real-time result monitoring.

This includes the autonomy to define the boundaries of the data producer's domain and to govern the outflow of data. Both aspects are managed through feature groups, which remain under the direct control of the data producer.

Data User Role

The Data User plays a pivotal role in configuring and utilizing the mloda API for machine learning and data workflows. The mloda API offers flexible configurations to cater to diverse use cases across the ML lifecycle. Below is an outline of the configurations and features that define the Data User's role:

class mlodaAPI:
    def __init__(
        self,
        requested_features: Union[Features, list[Union[Feature, str]]],
        compute_frameworks: Union[Set[Type[ComputeFrameWork]], Optional[list[str]]] = None,
        links: Optional[Set[Link]] = None,
        data_access_collection: Optional[DataAccessCollection] = None,
        global_filter: Optional[GlobalFilter] = None,
        api_input_data_collection: Optional[ApiInputDataCollection] = None,
        plugin_collector: Optional[PlugInCollector] = None,
    ) -> None:

data = mlodaAPI.run_all(requested_features, ...)

Let's use the API to further explain the Data User role. As shown above, there are several configurations to consider; a combined example follows the list below. The key questions are:

  • Which features should be requested, and should the compute_frameworks be limited?
  • How is data linked?
  • What specific access rights and permissions does the user have?
  • How is data refined to meet the requirements of the use case?
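
As a hedged illustration, reusing only objects defined earlier in this notebook, a data user request combining these configurations might look like this:

result = mlodaAPI.run_all(
    ["order_id", "quantity", "item_price"],         # which features to request
    compute_frameworks={PyarrowTable},              # limit the compute frameworks
    data_access_collection=data_access_collection,  # where data may be accessed
    plugin_collector=PlugInCollector.enabled_feature_groups({ReadFileFeature2}),  # refine plug-in resolution
)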

With all the given configurations, the mloda core is designed, whenever feasible, to follow this process:

  • First, formulate an optimized execution plan
  • Second, execute the plan accordingly

What the user mostly gains is that the process is repeatable and can be run in most environments, as long as the plug-ins are available and the required access (firewalls, credentials) exists.

The data user could run the mloda API in the following scenarios:

  • POC notebooks
  • Production code (model training or real-time prediction)
  • Microservice endpoints
  • KPI or QA test data ingestion

With this, the whole ML lifecycle is covered, and plug-ins can be reused in a testable and repeatable way along it.

Data Owner Role

Data owners typically operate at various levels within an organization.

The data owner can be the person who produces the data or the business stakeholder responsible for the service; in some cases, the role is not explicitly defined at all.

In mloda, the data owner is the one in control of governance. However, to date, this system is not included in the open-source offering, as it is reserved for development until the plugin ecosystem reaches a higher degree of maturity.

The plug-in functionality needed to integrate governance and operations systems is already in place. Two simple examples:

Using organization-wide logging

class OtelExtender(WrapperFunctionExtender):
    def __init__(self) -> None:
        # trace refers to an optional OpenTelemetry import; skip setup if unavailable
        if trace is None:
            return

        # Functions to be wrapped by the Extender
        self.wrapped = {WrapperFunctionEnum.FEATURE_GROUP_CALCULATE_FEATURE}

    def wraps(self) -> Set[WrapperFunctionEnum]:
        return self.wrapped

    def __call__(self, func: Any, *args: Any, **kwargs: Any) -> Any:
        logger.warning("OtelExtender")
        result = func(*args, **kwargs)
        return result

Logging data size

class LogSizeOfData(WrapperFunctionExtender):

    def wraps(self) -> Set[WrapperFunctionEnum]:
        # Function to be wrapped by the Extender
        return {WrapperFunctionEnum.VALIDATE_INPUT_FEATURE}

    def __call__(self, func: Any, *args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        # sys.getsizeof reports only the shallow size of the result object
        size = sys.getsizeof(result)
        print(f"Size: {size}")
        return result
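
Note that sys.getsizeof reports only the shallow size of the Python object. For columnar data such as pyarrow Tables, a more meaningful measure is the total buffer size. A hypothetical variant of the extender above could fall back accordingly:

class LogSizeOfPyarrowData(WrapperFunctionExtender):
    """Hypothetical variant that prefers pyarrow buffer sizes when available."""

    def wraps(self) -> Set[WrapperFunctionEnum]:
        return {WrapperFunctionEnum.VALIDATE_INPUT_FEATURE}

    def __call__(self, func: Any, *args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        # pyarrow Tables expose nbytes, the total size of their data buffers;
        # fall back to sys.getsizeof for other result types.
        size = getattr(result, "nbytes", sys.getsizeof(result))
        print(f"Size: {size}")
        return result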

Conclusion

In this notebook, we explored the roles of Data Producers, Data Users, and Data Owners within mloda. We delved into the responsibilities and functionalities associated with each role, highlighting how they contribute to the overall data lifecycle.

  • Data Producers are responsible for implementing the plugins, ensuring the accuracy of data access processes, and defining relevant configuration options

  • Data Users create usage configurations and apply the mlodaAPI to receive data

  • Data Owners, while not fully covered in this open-source offering, are critical for governance and for ensuring the ongoing availability and maintenance of essential plugins

Understanding these roles and their interactions shows how mloda's modular and extensible design helps bring efficient data management and processing practices to the entire machine learning lifecycle.

Next steps

While mloda holds great potential to become a unified portal, we face a key challenge: its plugin coverage is not yet comprehensive enough to integrate seamlessly with all available tools and technologies. Active community contributions are therefore essential to accelerate both its adoption and its ability to transform data management practices.
