Data Access Patterns: BaseInputData vs MatchData

Overview

mloda provides two distinct but complementary patterns for data access: BaseInputData and MatchData. While they may appear similar at first glance, they serve different purposes and are used in different contexts within the framework.

This document clarifies the differences between these concepts and their respective use cases. For practical examples of data access methods, see Data Access Overview.

BaseInputData Pattern

Purpose

BaseInputData is an abstract base class that defines how feature groups load and access data. It's the foundation for data loading mechanisms in mloda.

Key Characteristics

Data Loading Focus: Primarily concerned with how to load data from various sources
Feature Group Integration: Used by feature groups through the input_data() method
Inheritance-Based: Concrete implementations inherit from BaseInputData
Scope Management: Supports both global and feature-specific data access scopes
Universal Usage: Used by all feature groups that need to load data

Use Cases

Reading files (CSV, JSON, Parquet, etc.)
Connecting to databases
Creating synthetic/test data
Loading data from APIs
Managing data dependencies between features

For detailed examples of these use cases, see the data access documentation.

Example Implementation

from mloda_core.abstract_plugins.components.input_data.base_input_data import BaseInputData

class ReadFileFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return ReadFile()  # BaseInputData implementation

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        reader = cls.input_data()
        if reader is not None:
            data = reader.load(features)
            return data
        raise ValueError("Reading file failed.")

Common BaseInputData Implementations

ReadFile: For file-based data loading (see access-feature-data)
DataCreator: For generating synthetic data (see access-feature-data)
ApiInputData: For runtime data injection (see access-feature-data)
DatabaseConnector: For database connections

MatchData Pattern

Purpose

MatchData is a specialized matching mechanism specifically designed for feature groups that require framework connection objects. It determines which data access method should be used when stateful connections are needed.

Key Characteristics

Connection-Specific: Only used for feature groups that need framework connection objects
Matching Logic: Determines which data source matches when connections are involved
Scope Resolution: Resolves conflicts between feature-scope and global-scope data access for stateful frameworks
Limited Usage: Only applies to specific compute frameworks (like DuckDB) that require persistent connections

When MatchData is Used

MatchData is only used in these specific scenarios: - Feature groups that work with stateful compute frameworks (e.g., DuckDB) - When framework connection objects are required - For data sources that need persistent connections (databases, connection pools)

Use Cases

Matching DuckDB features to appropriate DuckDB connections
Routing features to specific database connections based on credentials
Resolving data access when multiple connection objects are available
Enabling flexible connection configuration for stateful frameworks

Example Implementation

from mloda_core.abstract_plugins.components.match_data.match_data import MatchData

class DuckDBFeatureGroup(AbstractFeatureGroup, MatchData):
    @classmethod
    def match_data_access(
        cls,
        feature_name: str,
        options: Options,
        data_access_collection: Optional[DataAccessCollection] = None,
        framework_connection_object: Optional[Any] = None,
    ) -> Any:
        # Logic to determine if this matcher handles DuckDB connections
        if framework_connection_object and isinstance(framework_connection_object, duckdb.DuckDBPyConnection):
            return framework_connection_object
        if data_access_collection and data_access_collection.has_duckdb_connections():
            return data_access_collection.get_duckdb_connection()
        return None

Key Differences

Aspect	BaseInputData	MatchData
Primary Purpose	Data loading and access	Connection object matching for stateful frameworks
When Used	All feature groups that load data	Only feature groups requiring framework connection objects
Scope	Universal data access pattern	Specialized for stateful compute frameworks
Usage Pattern	`input_data()` method in feature groups	Multiple inheritance: `AbstractFeatureGroup, MatchData`
Connection Dependency	Works with or without connections	Specifically designed for connection objects
Framework Support	All compute frameworks	Only stateful frameworks (DuckDB, database connections)

How They Work Together

BaseInputData and MatchData serve different purposes and are used in different scenarios:

BaseInputData Workflow

Feature groups define their data loading strategy via input_data() method
BaseInputData implementations handle the actual data loading
Works with all compute frameworks (stateful and stateless)

MatchData Workflow (Connection-Specific)

Only used when feature groups need framework connection objects
MatchData determines which connection object to use for stateful frameworks
Only applies to specific compute frameworks like DuckDB that require persistent connections

Combined Usage Example

class DuckDBAnalyticsFeature(AbstractFeatureGroup, MatchData):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        # BaseInputData for general data loading
        return ReadFile()

    @classmethod
    def match_data_access(cls, feature_name: str, options: Options, 
                         data_access_collection: Optional[DataAccessCollection] = None,
                         framework_connection_object: Optional[Any] = None) -> Any:
        # MatchData for connection object matching
        if framework_connection_object and isinstance(framework_connection_object, duckdb.DuckDBPyConnection):
            return framework_connection_object
        return None

Practical Examples

Scenario 1: Standard File Processing (BaseInputData Only)

Use Case: Reading CSV files with Pandas Pattern: Only BaseInputData is needed

class CsvProcessingFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return ReadFile()  # BaseInputData handles file reading

Scenario 2: DuckDB Analytics (BaseInputData + MatchData)

Use Case: Analytics with DuckDB requiring connection objects Pattern: Both BaseInputData and MatchData are needed

class DuckDBAnalyticsFeature(AbstractFeatureGroup, MatchData):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return ReadFile()  # BaseInputData for data loading

    @classmethod
    def match_data_access(cls, ...):
        # MatchData for connection matching
        return appropriate_duckdb_connection

For database connection patterns, see Framework Connection Object.

Scenario 3: In-Memory Processing (BaseInputData Only)

Use Case: Creating synthetic data with Pandas Pattern: Only BaseInputData is needed

class SyntheticDataFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"synthetic_data"})  # BaseInputData for data creation

Integration with Other Concepts

Compute Frameworks

BaseInputData: Works with all compute frameworks
MatchData: Only works with stateful frameworks requiring connection objects

For more information, see: - Compute Frameworks - Framework Connection Object - Compute Framework Integration

Feature Groups

These patterns are fundamental to how feature groups access data. For more details, see: - Feature Groups - Feature Group Matching

Best Practices

When to Use BaseInputData Only

Working with stateless compute frameworks (Pandas, PyArrow, Polars)
File-based data loading
API data injection
Synthetic data generation
Most standard data processing scenarios

When to Use BaseInputData + MatchData

Working with stateful compute frameworks (DuckDB)
Database connections requiring persistent state
Connection pooling scenarios
When framework connection objects are required

Design Considerations

BaseInputData: Focus on robust data loading, error handling, and performance
MatchData: Focus on accurate connection matching and state management
Integration: Use MatchData only when framework connection objects are actually needed

(Feature) data - Comprehensive guide to data access in mloda
Framework Connection Object - Managing stateful connections (essential for understanding MatchData)
Feature Groups - Introduction to feature groups
Compute Frameworks - Overview of compute framework system
Feature Group Matching - How features are matched to implementations

Summary

BaseInputData and MatchData serve different and specialized roles in mloda's data access architecture:

BaseInputData is the universal pattern for data loading - used by all feature groups that need to load data
MatchData is a specialized pattern for connection object matching - only used by feature groups that require framework connection objects

Key Understanding: - Most feature groups only use BaseInputData - MatchData is only needed when working with stateful compute frameworks like DuckDB - They are not alternatives - they solve different problems in different contexts

Understanding this distinction is crucial for: - Choosing the right pattern for your use case - Implementing feature groups correctly - Working with stateful vs stateless compute frameworks - Leveraging mloda's connection management capabilities

This separation allows mloda to provide both universal data access (BaseInputData) and specialized connection management (MatchData) while keeping the complexity contained to only those scenarios that actually need it.

Data Access Patterns: BaseInputData vs MatchData

Overview

BaseInputData Pattern

Purpose

Key Characteristics

Use Cases

Example Implementation

Common BaseInputData Implementations

MatchData Pattern

Purpose

Key Characteristics

When MatchData is Used

Use Cases

Example Implementation

Key Differences

How They Work Together

BaseInputData Workflow

MatchData Workflow (Connection-Specific)

Combined Usage Example

Practical Examples

Scenario 1: Standard File Processing (BaseInputData Only)

Scenario 2: DuckDB Analytics (BaseInputData + MatchData)

Scenario 3: In-Memory Processing (BaseInputData Only)

Integration with Other Concepts

Compute Frameworks

Feature Groups

Best Practices

When to Use BaseInputData Only

When to Use BaseInputData + MatchData

Design Considerations

Related Documentation

Summary