Data Quality
Ensuring data quality is crucial for the success of any data-driven project. This document outlines the various aspects of data quality and the measures taken to maintain it.
Feature validation
Feature validation ensures that the input and output features meet the expected standards and requirements.
Validate Input Features
Input features are validated to ensure they conform to the expected formats, ranges, and distributions. This includes the following checks (a minimal sketch follows the list):
- Checking for missing values and handling them appropriately.
- Validating data types and ranges.
- Ensuring data consistency and integrity.
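For illustration, here is a minimal, framework-agnostic sketch of such checks on a PyArrow table. The function name, column name, and bounds are illustrative assumptions, not part of the mloda API:

import pyarrow as pa
import pyarrow.compute as pc

def basic_input_checks(table: pa.Table, column: str, lower: int, upper: int) -> None:
    """Illustrative sketch: missing-value, data-type, and range checks."""
    col = table[column]
    # Missing values: fail fast if any nulls are present.
    if col.null_count > 0:
        raise ValueError(f"{column} contains {col.null_count} missing values")
    # Data type: this sketch expects an integer column.
    if not pa.types.is_integer(col.type):
        raise TypeError(f"{column} should be an integer column, got {col.type}")
    # Range: every value must fall within [lower, upper].
    in_range = pc.and_(pc.greater_equal(col, lower), pc.less_equal(col, upper))
    if not pc.all(in_range).as_py():
        raise ValueError(f"{column} contains values outside [{lower}, {upper}]")

basic_input_checks(pa.table({"feature": [1, 2, 3]}), "feature", lower=1, upper=3)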
Example: Input Feature Validation Using Custom Validators
Simple validator
from typing import Any, Optional, Set

from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
from mloda_core.abstract_plugins.components.feature_set import FeatureSet
from mloda_core.abstract_plugins.components.options import Options
from mloda_core.abstract_plugins.components.feature_name import FeatureName
from mloda_core.abstract_plugins.components.feature import Feature
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pyarrow.table import PyarrowTable


class DocSimpleValidateInputFeatures(AbstractFeatureGroup):
    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return {cls.get_class_name(): [1, 2, 3]}

    def input_features(self, options: Options, feature_name: FeatureName) -> Optional[Set[Feature]]:
        return {Feature(name="BaseValidateInputFeaturesBase", options=options)}

    @classmethod
    def validate_input_features(cls, data: Any, features: FeatureSet) -> Optional[bool]:
        """This function is a naive implementation of a validator."""
        # Deliberately fail when the input has exactly 3 elements,
        # so that running this example demonstrates a validation error.
        if len(data["BaseValidateInputFeaturesBase"]) == 3:
            raise ValueError("Data should have 3 elements")
        return True
When we run it, it raises an error:
results = mlodaAPI.run_all(
    ["DocSimpleValidateInputFeatures"], {PyarrowTable}
)
ValueError: Data should have 3 elements
Loading a validator based on BaseValidator
In the following example, we replace the validate_input_features function. The function demonstrates two approaches:
- Loading a validator from the feature config.
- Instantiating a validator in place.
class DocCustomValidateInputFeatures(DocSimpleValidateInputFeatures):
    @classmethod
    def validate_input_features(cls, data: Any, features: FeatureSet) -> Optional[bool]:
        """This function should be used to validate the input data."""
        validation_rules = {
            "BaseValidateInputFeaturesBase": Column(int, Check.in_range(1, 2)),
        }

        # Loading from feature config
        if features.get_options_key("DocExamplePanderaValidator") is not None:
            validator = features.get_options_key("DocExamplePanderaValidator")
            if not isinstance(validator, DocExamplePanderaValidator):
                raise ValueError("The configured validator must be an instance of DocExamplePanderaValidator")
        else:
            validation_log_level = features.get_options_key("ValidationLevel")
            # Instantiating a validator in place
            validator = DocExamplePanderaValidator(validation_rules, validation_log_level)

        return validator.validate(data)  # type: ignore
The DocExamplePanderaValidator is based on BaseValidator, which provides basic functionality around logging.
Here we show an example using the Pandera library.
from mloda_core.abstract_plugins.components.base_validator import BaseValidator

import pyarrow as pa
from pandera import pandas
from pandera import Column, Check
from pandera.errors import SchemaError


class DocExamplePanderaValidator(BaseValidator):
    """Custom validator to validate input features based on a specific rule."""

    def validate(self, data: pa.Table) -> Optional[bool]:
        """This function should be used to validate the input data."""
        # Convert PyArrow Table to pandas DataFrame if necessary
        if isinstance(data, pa.Table):
            data = data.to_pandas()

        schema = pandas.DataFrameSchema(self.validation_rules)
        try:
            schema.validate(data)
        except SchemaError as e:
            self.handle_log_level("SchemaError:", e)
        return True
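To see the Pandera schema in isolation, the validator can also be called directly on a small table. The values below satisfy the in_range(1, 2) rule, so validation passes; the "warning" log level string is taken from the later example:

table = pa.table({"BaseValidateInputFeaturesBase": [1, 2, 2]})
rules = {"BaseValidateInputFeaturesBase": Column(int, Check.in_range(1, 2))}

validator = DocExamplePanderaValidator(rules, "warning")
assert validator.validate(table)  # all values lie within [1, 2], so no SchemaError is logged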
The validator should raise an error again.
results = mlodaAPI.run_all(
    ["DocCustomValidateInputFeatures"], {PyarrowTable}
)
Log-only validator and extender use
- We throw a warning instead of raising an error.
- We use the extender functionality to print out the runtime as an example.
from mloda_core.abstract_plugins.function_extender import WrapperFunctionEnum, WrapperFunctionExtender
from tests.test_documentation.test_documentation import DokuValidateInputFeatureExtender

example_feature = Feature("DocCustomValidateInputFeatures", {"ValidationLevel": "warning"})

results = mlodaAPI.run_all(
    [example_feature], {PyarrowTable}, function_extender={DokuValidateInputFeatureExtender()}
)
This time it does not raise an error; instead, we should see output like the following:
"Time taken: 0.19909930229187012"
Validate Output Features
Output features are validated to ensure they meet the expected outcomes and performance metrics. This includes:
- Comparing output features against expected results.
- Validating the statistics of the data.
- Ensuring the output data has the right types.
The implementation and use are very similar to validating input features.
Simple validator
from mloda_core.abstract_plugins.components.input_data.base_input_data import BaseInputData
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
from tests.test_plugins.integration_plugins.test_validate_features.example_validator import BaseValidateOutputFeaturesBase


class DocBaseValidateOutputFeaturesBase(AbstractFeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({cls.get_class_name()})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return {cls.get_class_name(): [1, 2, 3]}

    @classmethod
    def validate_output_features(cls, data: Any, config: Options) -> Optional[bool]:
        """This function should be used to validate the output data."""
        if len(data[cls.get_class_name()]) != 3:
            raise ValueError("Data should have 3 elements")
        return True
results = mlodaAPI.run_all(
    ["DocBaseValidateOutputFeaturesBase"], {PyarrowTable}
)
results
Since this case passes, no error is raised. Note, however, how similar the input and output validation functionalities are.
Loading a validator based on BaseValidator
After this simple validation, let's reuse the pandera example from before.
class DocBaseValidateOutputFeaturesBaseNegativePandera(DocBaseValidateOutputFeaturesBase):
    """Pandera example test case. This one is related to the pandera testcase for validate_input_features."""

    @classmethod
    def validate_output_features(cls, data: Any, features: FeatureSet) -> Optional[bool]:
        """This function should be used to validate the output data."""
        validation_rules = {
            cls.get_class_name(): Column(int, Check.in_range(1, 2)),
        }
        validator = DocExamplePanderaValidator(validation_rules, features.get_options_key("ValidationLevel"))
        return validator.validate(data)
This one should fail:
results = mlodaAPI.run_all(
    ["DocBaseValidateOutputFeaturesBaseNegativePandera"], {PyarrowTable}
)
Log-only validator and extender use
We can, of course, also use an extender that was defined elsewhere.
from tests.test_plugins.integration_plugins.test_validate_features.test_validate_output_features import ValidateOutputFeatureExtender

results = mlodaAPI.run_all(
    ["DocBaseValidateOutputFeaturesBase"], {PyarrowTable},
    function_extender={ValidateOutputFeatureExtender()}
)
Output similar to:
"Time taken: 3.409385681152344e-05"
Artifacts
Artifacts can also be used for validation, as the full API is available. A use case could be to store statistics of a feature and then validate against them later on.
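Independently of the artifact API, the idea can be sketched in plain Python: persist summary statistics once, then validate a later dataset against them. The function names, file format, and tolerance below are illustrative assumptions:

import json
import statistics
from typing import List

def store_stats(values: List[float], path: str) -> None:
    # Persist summary statistics, e.g. as an artifact of a feature run.
    stats = {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}
    with open(path, "w") as f:
        json.dump(stats, f)

def validate_against_stats(values: List[float], path: str, tolerance: float = 0.1) -> bool:
    # Later run: compare fresh data against the stored statistics.
    with open(path) as f:
        expected = json.load(f)
    drift = abs(statistics.mean(values) - expected["mean"])
    if drift > tolerance:
        raise ValueError(f"Mean drifted by {drift}, exceeding tolerance {tolerance}")
    return True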
For more details on artifacts, refer to the artifact documentation.
Conclusion
Feature validation is crucial for ensuring data quality at both the input and output stages. By leveraging custom validators and extenders, validation can be tailored to specific needs while maintaining flexibility. This process helps detect inconsistencies early, improving the accuracy and robustness of data and feature pipelines.
Software Testing
Software testing is an integral part of maintaining data quality. It ensures that the software components used in data processing and analysis are functioning correctly.
Unit tests
Unit tests are written to test individual components of the software. These tests ensure that each function and method works as expected in isolation. Unit tests are typically run using a testing framework such as pytest.
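As a sketch, a unit test for the naive validator from above could assert that the deliberately failing rule raises. Constructing an empty FeatureSet here is an assumption, since the validator ignores its second argument:

import pytest

def test_simple_validator_raises() -> None:
    # The naive validator above raises for inputs with exactly 3 elements.
    with pytest.raises(ValueError, match="Data should have 3 elements"):
        DocSimpleValidateInputFeatures.validate_input_features(
            {"BaseValidateInputFeaturesBase": [1, 2, 3]}, FeatureSet()
        )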
Integration Tests
Integration tests are used to test the interaction between different components of the software. These tests ensure that the components work together as expected and that data flows correctly through the system.
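For example, an integration test could run one of the feature groups from this document end to end; the assertion assumes, as in the examples above, that run_all returns a list of PyArrow tables:

def test_output_validation_end_to_end() -> None:
    # validate_output_features runs as part of the API call.
    results = mlodaAPI.run_all(["DocBaseValidateOutputFeaturesBase"], {PyarrowTable})
    assert any("DocBaseValidateOutputFeaturesBase" in r.column_names for r in results)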
Data Comparison
Data comparison involves comparing different datasets to ensure consistency and accuracy. This includes:
- Comparing new datasets with historical data to identify any discrepancies.
- Validating data transformations to ensure they produce the expected results.
- Using statistical methods to compare distributions and identify anomalies.
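As a sketch of the last point, a two-sample Kolmogorov-Smirnov test can flag distribution drift between a historical and a new dataset; the significance threshold is an illustrative assumption:

from typing import Sequence
from scipy.stats import ks_2samp

def distributions_match(historical: Sequence[float], new: Sequence[float], alpha: float = 0.05) -> bool:
    # A small p-value indicates the samples likely come from different distributions.
    statistic, p_value = ks_2samp(historical, new)
    return bool(p_value >= alpha)

assert distributions_match([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 2.1, 2.9, 4.2, 5.0])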