Intro to the core interfaces of mloda

mloda is a robust and flexible data framework tailored for professionals to efficiently manage data and feature engineering. It enables users to abstract processes away from data, in contrast to the current industry setup where processes are usually bound to specific data sets.

This introductory notebook gives a practical demonstration of how mloda supports machine learning data workflows by emphasizing data processes over raw data manipulation.

It begins by loading data from various sources, such as order, payment, location, and categorical datasets.
Next, we showcase mloda's versatility in handling diverse compute frameworks, including PyArrow tables and Pandas DataFrames.
Then we leverage mloda's advanced capabilities to integrate data from various sources into cohesive and unified feature sets (details on feature sets are covered in chapter 3).

Finally, we will conclude by discussing the broader implications of what was done.

In [1]:

# Load all available plugins into the python environment
from mloda_core.abstract_plugins.plugin_loader.plugin_loader import PluginLoader

plugin_loader = PluginLoader.all()

# Since there are potentially many plugins loaded, we'll focus on specific categories for clarity.
# Here, we demonstrate by listing the available 'read' and 'sql' plugins.
print(plugin_loader.list_loaded_modules("read"))
print(plugin_loader.list_loaded_modules("sql"))

['mloda_plugins.feature_group.input_data.read_file', 'mloda_plugins.feature_group.input_data.read_db_feature', 'mloda_plugins.feature_group.input_data.read_db', 'mloda_plugins.feature_group.input_data.read_file_feature', 'mloda_plugins.feature_group.input_data.read_files.json', 'mloda_plugins.feature_group.input_data.read_files.csv', 'mloda_plugins.feature_group.input_data.read_files.parquet', 'mloda_plugins.feature_group.input_data.read_files.feather', 'mloda_plugins.feature_group.input_data.read_files.orc', 'mloda_plugins.feature_group.input_data.read_dbs.sqlite']
['mloda_plugins.feature_group.input_data.read_dbs.sqlite']
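
Judging from the output above, the argument passed to list_loaded_modules seems to act as a keyword matched against module names. The sketch below reproduces that behaviour in plain Python. The helper filter_module_names is hypothetical and not part of mloda, and substring matching is an assumption based only on the printed lists.

```python
from typing import Iterable, List


def filter_module_names(module_names: Iterable[str], keyword: str) -> List[str]:
    """Return the module names that contain `keyword`.

    Hypothetical helper for illustration; it assumes simple substring matching.
    """
    return [name for name in module_names if keyword in name]


# A few module names copied from the output above.
loaded = [
    "mloda_plugins.feature_group.input_data.read_file",
    "mloda_plugins.feature_group.input_data.read_files.csv",
    "mloda_plugins.feature_group.input_data.read_dbs.sqlite",
]

print(filter_module_names(loaded, "read"))
print(filter_module_names(loaded, "sql"))
```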

In [2]:

# Optional!
# We use synthetic dummy data to demonstrate the basic usage.
# You can run this cell in your own jupyter notebook.
# It is, however, not required for understanding the rest of the notebook.
#
# from examples.mloda_basics import create_synthetic_data

# create_synthetic_data.create_ml_lifecylce_data()

We should now see four files in the base_data folder: one SQLite database and three files in different formats.
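
A quick way to check the folder content is to group the files by extension. The following is a minimal sketch using only the standard library. The helper files_by_suffix is our own, and the demo runs against a temporary folder with placeholder file names (only example.sqlite is a name used in this notebook), so it does not depend on the synthetic data being present.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def files_by_suffix(folder: Path) -> Dict[str, List[str]]:
    """Group the regular files in `folder` by their file extension."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in sorted(folder.iterdir()):
        if entry.is_file():
            grouped[entry.suffix].append(entry.name)
    return dict(grouped)


# Demo on a temporary folder; the placeholder names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    for name in ("example.sqlite", "placeholder.csv", "placeholder.json"):
        (demo_folder / name).touch()
    demo_listing = files_by_suffix(demo_folder)

print(demo_listing)
```

To inspect the real data, call files_by_suffix with the path of the base_data folder instead of the temporary one.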

Next, we load the data so that we can inspect its content.

In [3]:

# Step 1: We want to load typical order information like order_id, product_id, quantity, and item_price.
from typing import List
from mloda_core.abstract_plugins.components.feature import Feature

order_features: List[str | Feature] = ["order_id", "product_id", "quantity", "item_price"]

payment_features: List[str | Feature] = ["payment_id", "payment_type", "payment_status", "valid_datetime"]

location_features: List[str | Feature] = ["user_location", "merchant_location", "update_date"]

categorical_features: List[str | Feature] = ["user_age_group", "product_category", "transaction_type"]

In [4]:

# Step 2: We specify the data sources to load
import os
from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
from mloda_plugins.feature_group.input_data.read_dbs.sqlite import SQLITEReader

# Initialize a DataAccessCollection object
data_access_collection = DataAccessCollection()

# Define the folders containing the data
# Note: We use two paths to accommodate different possible root locations as it depends where the code is executed.
base_data_path = os.path.join(os.getcwd(), "docs", "docs", "examples", "mloda_basics", "base_data")
if not os.path.exists(base_data_path):
    base_data_path = os.path.join(os.getcwd(), "base_data")

# Add the folder to the DataAccessCollection
data_access_collection.add_folder(base_data_path)

# As a db cannot work with a folder, we need to add a connection for the db.
data_access_collection.add_credential_dict(
    credential_dict={SQLITEReader.db_path(): os.path.join(base_data_path, "example.sqlite")}
)
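
The two-path fallback in the cell above can be generalized to any number of candidate locations. The sketch below is one possible way to do that with the standard library. The helper first_existing_path is hypothetical. Unlike the cell above, it returns None when no candidate exists, which makes a missing data folder visible before any data is requested.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_existing_path(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate path that exists, or None if none does."""
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None


# Demo: only the second candidate is created, so it is the one returned.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "docs", "base_data")
    expected = os.path.join(tmp, "base_data")
    os.makedirs(expected)
    resolved = first_existing_path([missing, expected])

# The temporary folder is gone now, so no candidate exists any more.
resolved_after_cleanup = first_existing_path([missing, expected])

print(resolved == expected, resolved_after_cleanup)
```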

In [5]:

# Step 3: Request Data Using the Defined Access Collection and Desired Features
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pyarrow.table import PyarrowTable


all_features = order_features + payment_features + location_features + categorical_features

# Retrieve data based on the specified feature list and access collection
result = mlodaAPI.run_all(
    all_features, data_access_collection=data_access_collection, compute_frameworks={PyarrowTable}
)

# Display the first two entries of each result table and its type
for data in result:
    print(data[:2], type(data))

pyarrow.Table
user_location: string
update_date: int64
merchant_location: string
----
user_location: [["East","West"]]
update_date: [[1640995200000,1641632290909]]
merchant_location: [["North","East"]] <class 'pyarrow.lib.Table'>
pyarrow.Table
payment_status: string
payment_type: string
valid_datetime: timestamp[ns, tz=UTC]
payment_id: int64
----
payment_status: [["failed","pending"]]
payment_type: [["debit card","debit card"]]
valid_datetime: [[2024-01-11 23:01:49.090909090Z,2024-01-15 09:41:49.090909090Z]]
payment_id: [[1,2]] <class 'pyarrow.lib.Table'>
pyarrow.Table
item_price: double
order_id: int64
product_id: int64
quantity: int64
----
item_price: [[74.86,154.56]]
order_id: [[1,2]]
product_id: [[282,355]]
quantity: [[6,2]] <class 'pyarrow.lib.Table'>
pyarrow.Table
transaction_type: string
product_category: string
user_age_group: string
----
transaction_type: [["online","online"]]
product_category: [["clothing","home"]]
user_age_group: [["26-35","26-35"]] <class 'pyarrow.lib.Table'>
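
Note that update_date is returned as int64, while valid_datetime is already a timestamp. The values look like Unix epoch milliseconds. That is an assumption on our side, as the notebook does not state the unit. Under that assumption, a value can be made readable with the standard library as sketched below. The helper epoch_ms_to_datetime is our own.

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix timestamps, expressed in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_datetime(value_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime.

    Assumes `value_ms` counts milliseconds since 1970-01-01 UTC.
    """
    return EPOCH + timedelta(milliseconds=value_ms)


# The two update_date values shown in the output above.
for raw in (1640995200000, 1641632290909):
    print(raw, "->", epoch_ms_to_datetime(raw).isoformat())
```

Adding a timedelta to the epoch avoids the floating-point rounding that dividing by 1000 could introduce.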

In [6]:

# Above, the data was loaded as PyArrow tables. We can just as easily load it as Pandas DataFrames.
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe

# Request data using the Pandas compute framework
result = mlodaAPI.run_all(
    all_features, data_access_collection=data_access_collection, compute_frameworks={PandasDataframe}
)

# Display the first two entries of each result table and its type
for data in result:
    print(data[:2], type(data))

  user_location    update_date merchant_location
0          East  1640995200000             North
1          West  1641632290909              East <class 'pandas.core.frame.DataFrame'>
  payment_status payment_type                      valid_datetime  payment_id
0         failed   debit card 2024-01-11 23:01:49.090909090+00:00           1
1        pending   debit card 2024-01-15 09:41:49.090909090+00:00           2 <class 'pandas.core.frame.DataFrame'>
   item_price  order_id  product_id  quantity
0       74.86         1         282         6
1      154.56         2         355         2 <class 'pandas.core.frame.DataFrame'>
  transaction_type product_category user_age_group
0           online         clothing          26-35
1           online             home          26-35 <class 'pandas.core.frame.DataFrame'>

In [7]:

# Define features with specific compute frameworks
order_id = Feature(name="order_id", compute_framework="PandasDataframe")
product_id = Feature(name="product_id", compute_framework="PyarrowTable")
specific_framework_feature_list: List[Feature | str] = [order_id, product_id]

# Request data for the defined features
result = mlodaAPI.run_all(specific_framework_feature_list, data_access_collection=data_access_collection)

# Display the first few rows and data types of the results
for res in result:
    print("The resulting data structure differs based on the compute framework:")
    print("\n", res[:3], type(res))

The resulting data structure differs based on the compute framework:

    order_id
0         1
1         2
2         3 <class 'pandas.core.frame.DataFrame'>
The resulting data structure differs based on the compute framework:

 pyarrow.Table
product_id: int64
----
product_id: [[282,355,395]] <class 'pyarrow.lib.Table'>

In [8]:

# Demonstrating mloda's Flexibility with Different Data Technologies

# Import required modules
from typing import Any, List, Optional, Set

from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
from mloda_core.abstract_plugins.components.feature_name import FeatureName
from mloda_core.abstract_plugins.components.feature_set import FeatureSet
from mloda_core.abstract_plugins.components.index.index import Index
from mloda_core.abstract_plugins.components.link import JoinType, Link
from mloda_core.abstract_plugins.components.options import Options
from mloda_plugins.feature_group.input_data.read_file_feature import ReadFileFeature


# Define the index for the join
index = Index(("order_id",))


# Extend ReadFileFeature to provide index columns
class ReadFileFeatureJoin(ReadFileFeature):
    @classmethod
    def index_columns(cls) -> Optional[List[Index]]:
        return [index]


# Define the link between the features
link = Link(jointype=JoinType.INNER, left=(ReadFileFeatureJoin, index), right=(ReadFileFeatureJoin, index))


# Create an example feature group to demonstrate joining
class ExampleMlLifeCycleJoin(AbstractFeatureGroup):
    # Define input features with different compute frameworks
    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        quantity = Feature(name="quantity", compute_framework="PandasDataframe")
        product_id = Feature(name="product_id", compute_framework="PyarrowTable")
        return {product_id, quantity}

    # Perform calculations on the joined data
    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        print(
            "Data from two different sources is now combined into one feature within one data technology: \n",
            data,
            type(data),
            "\n",
        )
        return {"ExampleMlLifeCycleJoin": [1, 2, 3]}


# Run the pipeline
result = mlodaAPI.run_all(["ExampleMlLifeCycleJoin"], data_access_collection=data_access_collection, links={link})


# Display the final result
print(
    "Final result: ",
    result[0],
    "\nNote: As no specific compute framework was defined for the result, the output could be in either format.",
)

# Summary: mloda's abstraction layer enables complex process pipelines that handle different data technologies.
# This decouples processes from the underlying data structure, ensuring flexibility and scalability.

Data from two different sources is now combined into one feature within one data technology: 
 pyarrow.Table
order_id: int64
product_id: int64
quantity: int64
----
order_id: [[1,2,3,4,5,...,96,97,98,99,100]]
product_id: [[282,355,395,319,275,...,170,328,361,192,271]]
quantity: [[6,2,4,9,5,...,4,3,5,5,6]] <class 'pyarrow.lib.Table'> 

Final result:  pyarrow.Table
ExampleMlLifeCycleJoin: int64
----
ExampleMlLifeCycleJoin: [[1,2,3]] 
Note: As no specific compute framework was defined for the result, the output could be in either format.
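
To make the effect of the INNER link on order_id more concrete, here is a framework-free sketch of an inner join over plain Python dictionaries. It is not how mloda implements joins. The function inner_join is our own and assumes that the key is unique on the right side. The sample rows reuse values from the output above, except that the last product row is given order_id 4 so that one row on each side has no partner.

```python
from typing import Dict, List

Row = Dict[str, int]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep only rows whose `key` value occurs on both sides and merge their columns.

    Assumes `key` is unique within `right`.
    """
    right_by_key = {row[key]: row for row in right}
    joined: List[Row] = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


quantities = [
    {"order_id": 1, "quantity": 6},
    {"order_id": 2, "quantity": 2},
    {"order_id": 3, "quantity": 4},
]
product_ids = [
    {"order_id": 1, "product_id": 282},
    {"order_id": 2, "product_id": 355},
    {"order_id": 4, "product_id": 319},
]

joined = inner_join(quantities, product_ids, key="order_id")
for row in joined:
    print(row)
```

Orders 3 and 4 are dropped because each appears on one side only, which is the defining property of an inner join.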

What Have We Observed So Far?

mloda unifies the interfaces to data across sources, formats, and technologies, both for defining processes and for applying them to the data. We used the FeatureGroup, the ComputeFramework, and mlodaAPI as interfaces.
It integrates with different compute technologies, e.g. PyArrow and Pandas, enabling flexible tool choices for data processing.
mloda combines data access and computation, reducing complexity and providing a reusable approach to ML workflows. Data access can be controlled centrally for different data sources; here, we showed access to folders and to a database.

We will explore the advantages of this approach further in the next notebook.