mloda docs

Data, Feature, FeatureSets and FeatureGroups in mloda

mloda focuses on the processes around data. This means we need to abstract different parts of what is usually summed up in the term "data" into distinct objects.

These key objects are:

  • Data
  • Feature
  • FeatureGroup
  • FeatureSet

This notebook will explain the relations shown in the graph below.

%%mermaidjs
graph LR
    
    User[mloda User] --> | requests | Feature

    Feature --> | matches | FeatureGroup

    FeatureSet --> | uses | CalculateFunction

    subgraph FeatureGroup
        FeatureSet
        CalculateFunction
    end

    CalculateFunction --> | accesses | Data

Data

In mloda, data is considered an object that describes how to access the data. It could be:

  • a dataframe (pandas or polars)
  • an unstructured object (JSON)
  • a URL
  • an object containing a lazily evaluated function

As a hard requirement, there must be a way to relate data to a feature. Often this is done with a name-based approach, but other methods could be used as well.
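A name-based relation can be pictured with a small sketch. Everything in it is hypothetical and not part of mloda: the `sources` mapping, the `find_source` helper, and the example URL are stand-ins that only illustrate how differently shaped data descriptions can all be looked up by feature name.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical data descriptions, keyed by a source label.
# None of these names come from mloda.
sources: Dict[str, Any] = {
    "table": {"order_amount": [10.0, 20.0], "id": [1, 2]},      # dataframe-like columns
    "document": json.loads('{"customer_name": "Ada"}'),          # unstructured JSON
    "remote": "https://example.com/orders.csv",                  # URL, never fetched here
    "lazy": lambda: {"datetime": ["2024-01-01", "2024-01-02"]},  # evaluated only on demand
}


def find_source(feature_name: str) -> Optional[str]:
    """Return the label of the first source that exposes feature_name by name."""
    for label, description in sources.items():
        if callable(description):
            description = description()  # resolve the lazy function
        if isinstance(description, dict) and feature_name in description:
            return label
    return None


print(find_source("order_amount"))  # table
print(find_source("datetime"))      # lazy
print(find_source("unknown"))       # None
```

The URL entry is skipped because this sketch has no way to inspect it by name; a real data description would need its own matching rule.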

Feature

A feature is an object that configures the procedural representation of data; it is neither the process nor the data itself. A feature typically includes these configuration options:

  • Name
  • Options (Configurations)
  • Domains
  • Compute Framework
  • Data Type
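To make the "configuration only" idea concrete, here is a minimal sketch of such an object. `FeatureSketch` and its field names are assumptions chosen to mirror the list above; they are not mloda's actual `Feature` class or constructor.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FeatureSketch:
    """Hypothetical stand-in: holds configuration, no data and no processing."""

    name: str
    options: Dict[str, Any] = field(default_factory=dict)
    domain: Optional[str] = None
    compute_framework: Optional[str] = None
    data_type: Optional[str] = None


feature = FeatureSketch(
    name="order_amount_sum",
    options={"window_days": 10},
    compute_framework="pandas",
)
print(feature.name, feature.options["window_days"], feature.domain)
# order_amount_sum 10 None
```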

How do we relate Data and Feature?

We cannot do this directly. We need to relate the Feature with a FeatureGroup first.

Feature Group

A FeatureGroup describes a group of Features that share the same data processes and the same way of applying their configuration to the data. If needed, a FeatureGroup also contains configuration created by the Data Producer, e.g. when a FeatureGroup is only valid for a specific technology.

Why do we relate a Feature to a FeatureGroup?

We need to match the Feature with a FeatureGroup, as the FeatureGroup "knows" how to use the access description of the data. Additionally, the FeatureGroup "knows" which other required input Features the Feature needs. mloda will add these required input Features to be resolved as well.

# This could be (OrderAmount, Datetime and ID are placeholders for Feature objects):
def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
    return {OrderAmount, Datetime, ID}
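The effect of adding required input Features can be sketched as a transitive lookup. This is a simplified illustration, not mloda's resolution logic: the `INPUTS` table, the `resolve` helper, and the feature name `AverageOrderAmount` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical dependency table: feature name -> names of its input features.
INPUTS: Dict[str, Set[str]] = {
    "AverageOrderAmount": {"OrderAmount", "Datetime", "ID"},
    "OrderAmount": set(),
    "Datetime": set(),
    "ID": set(),
}


def resolve(requested: str) -> List[str]:
    """Return the requested feature plus all transitive inputs, inputs first."""
    ordered: List[str] = []
    seen: Set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        # Sorted only to make the output deterministic.
        for dependency in sorted(INPUTS.get(name, set())):
            visit(dependency)
        ordered.append(name)

    visit(requested)
    return ordered


print(resolve("AverageOrderAmount"))
# ['Datetime', 'ID', 'OrderAmount', 'AverageOrderAmount']
```

The user only requests one feature; the inputs appear in the plan because the feature group declared them.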

How do we relate a Feature to a FeatureGroup?

To match a Feature with a FeatureGroup, we use the IdentifyFeatureGroupClass functionality of the Engine, which checks the following properties of the FeatureGroup Plugins:

  • Is _filter_feature_group_by_criteria matching?
  • Is domain matching?
  • Is compute framework matching?
  • Are links matching?
  • When multiple feature groups match: are they related only by inheritance? If so, the child is used.

As a result, exactly one feature group should remain as the match.
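The checks above can be sketched as a filter followed by a "prefer the child" step. This is a toy model under stated assumptions: `BaseGroup`, `ChildGroup`, `OtherGroup`, `matches_name`, and `identify` are hypothetical names, the name check is a plain prefix test, and the link check is left out entirely.

```python
from typing import List, Optional, Set, Type


class BaseGroup:
    """Hypothetical feature group; all attributes are illustrative only."""

    prefix = "order_"
    domain: Optional[str] = None  # None means: no domain restriction
    compute_frameworks: Set[str] = {"pandas", "polars"}

    @classmethod
    def matches_name(cls, feature_name: str) -> bool:
        return feature_name.startswith(cls.prefix)


class ChildGroup(BaseGroup):
    """Inherits the matching rules, so it matches whenever BaseGroup does."""


class OtherGroup(BaseGroup):
    prefix = "customer_"


def identify(
    feature_name: str,
    domain: Optional[str],
    framework: str,
    candidates: List[Type[BaseGroup]],
) -> Type[BaseGroup]:
    matched = [
        group
        for group in candidates
        if group.matches_name(feature_name)
        and (group.domain is None or group.domain == domain)
        and framework in group.compute_frameworks
    ]
    # Drop every group that is a parent of another matched group: use the child.
    leaves = [
        group
        for group in matched
        if not any(other is not group and issubclass(other, group) for other in matched)
    ]
    if len(leaves) != 1:
        raise ValueError(f"Expected exactly one feature group, found {len(leaves)}")
    return leaves[0]


groups = [BaseGroup, ChildGroup, OtherGroup]
print(identify("order_amount", None, "pandas", groups).__name__)   # ChildGroup
print(identify("customer_name", None, "polars", groups).__name__)  # OtherGroup
```

Raising an error when zero or several unrelated groups remain is a design choice of this sketch; it mirrors the expectation that only one feature group is usable.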

Let us look a bit more into: _filter_feature_group_by_criteria

@classmethod
def match_feature_group_criteria(
    cls,
    feature_name: Union[FeatureName, str],
    options: Options,
    data_access_collection: Optional[DataAccessCollection] = None,
) -> bool:
    ...

Every feature group must implement this function. However, most will inherit the default behaviour from the AbstractFeatureGroup.

The default behaviour mostly covers name-based approaches to identifying a feature (an exact match or a prefix of the feature name). However, the match could also be a call to a web service that knows which data it supports, or any other algorithmic solution.

An example is this SQLite database check, where we inspect the tables' meta-information.

@classmethod
def check_feature_in_data_access(cls, feature_name: str, data_access: Any) -> bool:
    # get tables in the database
    result, _ = cls.read_db(data_access, query="SELECT name FROM sqlite_master WHERE type='table';")
    table_names = [table[0] for table in result]

    # check if the feature_name is in the tables
    for table in table_names:
        result, _ = cls.read_db(data_access, query=f"PRAGMA table_info({table});")
        column_names = [column[1] for column in result]
        if feature_name in column_names:
            cls.set_table_name(data_access, table)
            return True
    return False

This means you are free to customize the logic that matches a Feature to a FeatureGroup. But please do not query the database for every single Feature-to-FeatureGroup lookup. :)
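One way to avoid a query per lookup is to scan the schema once and answer later lookups from memory. The sketch below uses only Python's standard `sqlite3` module; `build_column_index` and the two example tables are assumptions made for illustration and are not part of mloda.

```python
import sqlite3
from typing import Dict


def build_column_index(connection: sqlite3.Connection) -> Dict[str, str]:
    """Read the schema once and map each column name to the first table that has it."""
    index: Dict[str, str] = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; position 1 is the column name.
        for column in connection.execute(f"PRAGMA table_info({table});"):
            index.setdefault(column[1], table)
    return index


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (id INTEGER, order_amount REAL)")
connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

column_index = build_column_index(connection)  # one schema scan
print(column_index.get("order_amount"))  # orders
print(column_index.get("name"))          # customers
print("unknown" in column_index)         # False
```

Every later feature lookup is then a dictionary access. The index goes stale if the schema changes, so a long-running process would need to rebuild it.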

What if we have multiple Features sharing the same FeatureGroup?

To group features under the same FeatureGroup, we use the FeatureSet object.

For features to share a FeatureSet, they must also have the same configuration and compute_framework. If the configurations differ, mloda will automatically create separate FeatureSets.

def has_similarity_properties(self) -> int:
    compute_frameworks_hashable = (
        frozenset(self.compute_frameworks) if self.compute_frameworks is not None else None
    )
    return hash((self.options, compute_frameworks_hashable))

Examples:

  • Testing migrations: Feature(A, Polars), Feature(B, Pandas)
  • Sliding time windows: Feature(A, 10 days), Feature(B, 20 days)

This means that between Data and Feature there is another abstraction: the FeatureSet.
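The grouping rule can be sketched by bucketing features on a key built from options and compute frameworks, in the spirit of `has_similarity_properties`. `ConfiguredFeature`, `similarity_key`, and `group_features` are hypothetical, and options are modelled as a tuple of pairs simply because the key must be hashable.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple


class ConfiguredFeature(NamedTuple):
    """Hypothetical feature with just enough fields to show the grouping."""

    name: str
    options: Tuple[Tuple[str, int], ...]  # hashable stand-in for Options
    compute_frameworks: Optional[FrozenSet[str]]

    def similarity_key(self) -> int:
        # Hash options and frameworks together; the name is deliberately excluded.
        return hash((self.options, self.compute_frameworks))


def group_features(features: List[ConfiguredFeature]) -> List[List[str]]:
    """Put features with equal similarity keys into the same group."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for feature in features:
        groups[feature.similarity_key()].append(feature.name)
    return list(groups.values())


features = [
    ConfiguredFeature("A", (("window_days", 10),), frozenset({"pandas"})),
    ConfiguredFeature("B", (("window_days", 20),), frozenset({"pandas"})),
    ConfiguredFeature("C", (("window_days", 10),), frozenset({"pandas"})),
]
print(group_features(features))  # [['A', 'C'], ['B']]
```

A and C share a group because their configuration is identical; B differs in its window and ends up alone, matching the sliding-window example above.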

FeatureSet

A FeatureSet is a collection of Features that share the same configuration and the same FeatureGroup. The FeatureGroup uses the FeatureSet to apply its operations to the data and produce the requested Features.

FeatureSets are created by mloda, never by a user. Their properties can be accessed inside the calculate_feature function.

@classmethod
def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
    ...

In that sense, the FeatureSet holds information about:

  • the Features themselves
  • filters
  • names
  • artifacts
  • convenience functionality that gives the user easier access to the above.
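How a calculation might read names from the set it receives can be sketched as follows. `FeatureSetSketch`, its `names` method, and the `DoubledValues` group are hypothetical stand-ins; the real FeatureSet is created by mloda and its interface may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set


@dataclass
class FeatureSetSketch:
    """Hypothetical stand-in: the features that share one configuration."""

    feature_names: Set[str]
    options: Dict[str, Any]

    def names(self) -> Set[str]:
        return set(self.feature_names)


class DoubledValues:
    """Hypothetical feature group whose features are named 'doubled_<column>'."""

    @classmethod
    def calculate_feature(
        cls, data: Dict[str, List[float]], features: FeatureSetSketch
    ) -> Dict[str, List[float]]:
        result: Dict[str, List[float]] = {}
        for name in sorted(features.names()):
            source_column = name[len("doubled_"):]  # strip the prefix to find the input
            result[name] = [value * 2 for value in data[source_column]]
        return result


data = {"order_amount": [1.0, 2.5], "tax": [0.1, 0.2]}
feature_set = FeatureSetSketch({"doubled_order_amount", "doubled_tax"}, options={})
print(DoubledValues.calculate_feature(data, feature_set))
# {'doubled_order_amount': [2.0, 5.0], 'doubled_tax': [0.2, 0.4]}
```

One call computes every requested feature in the set, which is the point of grouping features with the same configuration.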

Conclusion

In this notebook, we explored the key concepts and components in mloda related to data, features, feature sets, and feature groups. We discussed how data is accessed and represented, the definition and role of features, and how features are grouped and managed within feature sets and feature groups. Understanding these components is crucial for building robust and reusable machine learning pipelines. By abstracting and organizing data and features in this manner, mloda ensures consistency, flexibility, and scalability in machine learning workflows.

%%mermaidjs
graph LR
    
    User[mloda User] --> | requests | Feature

    Feature --> | matches | FeatureGroup

    FeatureSet --> | uses | CalculateFunction

    subgraph FeatureGroup
        FeatureSet
        CalculateFunction
    end

    CalculateFunction --> | accesses | Data

In short: we abstracted away processes from data.

%%mermaidjs
graph LR

    All[mloda] --> CalculateFunction
    subgraph Process
        All
        CalculateFunction
    end
    CalculateFunction --> | accesses | Data