TemporalScope Tutorial: XAI Data Quality Validation
Purpose
This tutorial demonstrates how to validate time series data quality using research-backed thresholds. Data quality is critical for XAI (eXplainable AI) because poor-quality data can lead to misleading explanations and unreliable models.
What You'll Learn
- How to validate time series data using research-backed thresholds
- Why each validation check matters for XAI
- How to integrate validation into production pipelines
Research-Backed Validation
Our validation thresholds come from key research papers:
Grinsztajn et al. (2022):
- Minimum 3,000 samples needed for reliable model training
- Feature-to-sample ratio should be < 1/10 to prevent overfitting
- Categorical features should have ≤ 20 unique values
Shwartz-Ziv et al. (2021):
- Maximum 50,000 samples for medium-sized datasets
- At least 4 features needed for meaningful complexity
Gorishniy et al. (2021):
- Keep features under 500 to avoid dimensionality issues
- Numerical features should have ≥ 10 unique values
These thresholds help ensure your data is suitable for XAI analysis.
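To make the numbers above concrete, the size-related thresholds can be collected in one place and checked with plain pandas. The sketch below is an illustration only: `ResearchThresholds` and `basic_size_checks` are hypothetical names that are not part of TemporalScope, and the logic simply restates the bullet points above.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class ResearchThresholds:
    """Hypothetical container for the thresholds listed above."""

    min_samples: int = 3_000        # Grinsztajn et al. (2022)
    max_samples: int = 50_000       # Shwartz-Ziv et al. (2021)
    min_features: int = 4           # Shwartz-Ziv et al. (2021)
    max_features: int = 500         # Gorishniy et al. (2021)
    max_feature_ratio: float = 0.1  # Grinsztajn et al. (2022)


def basic_size_checks(df, time_col, target_col, t=ResearchThresholds()):
    """Return {check_name: passed} for the size-related thresholds."""
    n_samples = len(df)
    # Assumption: every column except the time index and the target is a feature.
    n_features = len([c for c in df.columns if c not in (time_col, target_col)])
    return {
        "min_samples": n_samples >= t.min_samples,
        "max_samples": n_samples <= t.max_samples,
        "min_features": n_features >= t.min_features,
        "max_features": n_features <= t.max_features,
        # An empty frame cannot satisfy the ratio check.
        "feature_ratio": n_samples > 0 and n_features / n_samples < t.max_feature_ratio,
    }


rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=5000),
        "price": rng.normal(100, 10, 5000),
        "target": rng.integers(0, 2, 5000),
    }
)
size_report = basic_size_checks(demo, "time", "target")
print(size_report)
```

With a single feature column, only the minimum-feature check fails here, which matches the kind of dataset used in the rest of this tutorial.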
import pandas as pd
import numpy as np
import narwhals as nw
from temporalscope.datasets.dataset_validator import DatasetValidator, ValidationResult

# Create sample time series data
data = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=5000),
        "price": np.random.normal(100, 10, 5000),
        "target": np.random.choice([0, 1], 5000),
    }
)

# Initialize validator with research-backed thresholds
validator = DatasetValidator(
    time_col="time", target_col="target", min_samples=3000, max_feature_ratio=0.1, enable_warnings=True
)

# Run validation
results = validator.fit_transform(data)
print("Data Quality Validation Report:")
validator.print_report(results)
Data Quality Validation Report:
Dataset Validation Report
+---------------------+----------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| Check               | Status   | Message                                                                                                 | Details                            |
+=====================+==========+=========================================================================================================+====================================+
| sample_size         | ✓        | Check passed                                                                                            | num_samples: 5000                  |
+---------------------+----------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| feature_count       | ✗        | Dataset has 1 features, fewer than recommended minimum (4). This may result in an oversimplified model. | num_features: 1                    |
+---------------------+----------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| feature_ratio       | ✓        | Check passed                                                                                            | ratio: 0.0                         |
+---------------------+----------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| feature_variability | ✓        | Check passed                                                                                            | numeric_feature: True, price: 5000 |
+---------------------+----------+---------------------------------------------------------------------------------------------------------+------------------------------------+
Note: These are research-backed recommendations and may not apply to all use cases.
/home/docs/checkouts/readthedocs.org/user_builds/temporalscope/envs/latest/lib/python3.10/site-packages/temporalscope/datasets/dataset_validator.py:284: UserWarning: Dataset has 1 features, fewer than recommended minimum (4). This may result in an oversimplified model.
  warnings.warn(msg)
/home/docs/checkouts/readthedocs.org/user_builds/temporalscope/envs/latest/lib/python3.10/site-packages/temporalscope/datasets/dataset_validator.py:425: UserWarning: Some validation checks failed. These are research-backed recommendations and may not apply to all use cases. Adjust thresholds as needed.
  warnings.warn(
Understanding Validation Checks
Each validation check has a specific purpose:
Sample Size (min_samples=3000):
- WHY: Too few samples → unstable models
- WHY: Too many samples → computational issues
Feature Count:
- WHY: Too few features → oversimplified model
- WHY: Too many features → curse of dimensionality
Feature Ratio (max_feature_ratio=0.1):
- WHY: High ratio → risk of overfitting
- WHY: Based on statistical learning theory
Feature Variability:
- WHY: Low variability → uninformative features
- WHY: Affects model's learning capacity
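The feature ratio and feature variability checks come down to simple arithmetic, which the following sketch reproduces with pandas. It assumes the ratio is the number of feature columns divided by the number of rows and that variability is the count of distinct values per feature; the exact formulas and rounding inside `DatasetValidator` are not confirmed here, and `feature_ratio` and `unique_value_counts` are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def feature_ratio(df, time_col, target_col):
    """Feature columns per sample, excluding the time and target columns."""
    feature_cols = [c for c in df.columns if c not in (time_col, target_col)]
    return len(feature_cols) / len(df)


def unique_value_counts(df, time_col, target_col):
    """Number of distinct (non-missing) values for each feature column."""
    feature_cols = [c for c in df.columns if c not in (time_col, target_col)]
    return {c: int(df[c].nunique()) for c in feature_cols}


rng = np.random.default_rng(42)
frame = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=5000),
        "price": rng.normal(100, 10, 5000),
        "flag": rng.integers(1, 3, 5000),  # takes only the values 1 and 2
        "target": rng.integers(0, 2, 5000),
    }
)

ratio = feature_ratio(frame, "time", "target")  # 2 features / 5000 samples
counts = unique_value_counts(frame, "time", "target")
# Assumed cut-off of 10 distinct values, taken from the numerical-feature bullet above.
low_variability = [name for name, n in counts.items() if n < 10]

print(f"feature ratio: {ratio:.4f}")
print(f"unique values: {counts}")
print(f"low-variability features: {low_variability}")
```

Under the same assumption, one feature over 5,000 rows gives a ratio of 0.0002, which would display as 0.0 when rounded to a few decimal places.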
# Create data with quality issues
problematic_data = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=1000),
        "feature1": np.random.choice([1, 2], 1000),
        "feature2": np.random.normal(0, 1, 1000),
        "feature3": [None] * 100 + list(range(900)),
        "target": np.random.choice([0, 1], 1000),
    }
)

try:
    # Initialize validator
    validator = DatasetValidator(
        time_col="time", target_col="target", min_samples=3000, min_unique_values=10, enable_warnings=True
    )
    # Attempt validation
    results = validator.fit_transform(problematic_data)
except ValueError as e:
    print("Validation Failed:")
    print(f"Error: {str(e)}")
    print("\nWhy This Matters:")
    print("1. Too few samples (1000 < 3000) → unstable models")
    print("2. Missing values → unreliable predictions")
    print("3. Low feature variability → poor model learning")
Validation Failed:
Error: Missing values detected in columns: feature3

Why This Matters:
1. Too few samples (1000 < 3000) → unstable models
2. Missing values → unreliable predictions
3. Low feature variability → poor model learning
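The error message names only the missing-value problem, while the printed explanation lists three issues. A small pre-check can surface all of them at once before the validator is called. This is a minimal sketch assuming pandas input; `summarize_issues` is a hypothetical helper, not a TemporalScope function, and its defaults mirror the `min_samples` and `min_unique_values` arguments used above.

```python
import numpy as np
import pandas as pd


def summarize_issues(df, time_col, target_col, min_samples=3000, min_unique_values=10):
    """Collect sample-size, missing-value and variability issues in one pass."""
    feature_cols = [c for c in df.columns if c not in (time_col, target_col)]
    issues = {}

    if len(df) < min_samples:
        issues["sample_size"] = f"{len(df)} < {min_samples}"

    # Count missing entries per column, keeping only columns that have any.
    missing = {c: int(df[c].isna().sum()) for c in df.columns if df[c].isna().any()}
    if missing:
        issues["missing_values"] = missing

    # Distinct non-missing values per feature, keeping only low-variability ones.
    low_var = {
        c: int(df[c].nunique()) for c in feature_cols if df[c].nunique() < min_unique_values
    }
    if low_var:
        issues["low_variability"] = low_var

    return issues


rng = np.random.default_rng(0)
problematic = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=1000),
        "feature1": rng.choice([1, 2], 1000),
        "feature2": rng.normal(0, 1, 1000),
        "feature3": [None] * 100 + list(range(900)),
        "target": rng.choice([0, 1], 1000),
    }
)

issues = summarize_issues(problematic, "time", "target")
for name, detail in issues.items():
    print(f"{name}: {detail}")
```

Whether to drop or impute the affected rows afterwards is a domain decision that this sketch deliberately leaves open.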
Production Pipeline Integration
The DatasetValidator uses Narwhals for backend-agnostic operations, making it suitable for production environments:
Production Environment:
- Uses pandas + narwhals for efficient operations
- Lightweight deployment with minimal dependencies
- Consistent behavior in production
Test Environment:
- Supports multiple backends via hatch
- Validates across different DataFrame implementations
- Ensures reliability across environments
Core Operations:
- Uses @nw.narwhalify for backend conversions
- Pure Narwhals operations throughout
- Consistent behavior across supported types
@nw.narwhalify
def validate_production_data(df, time_col, target_col):
    """Production-ready validation function using Narwhals.

    Key Features:
    1. Backend-agnostic operations
    2. Proper error handling
    3. Detailed logging

    Args:
        df: Input DataFrame to validate
        time_col: Name of time column
        target_col: Name of target column

    Returns:
        Tuple of (results, summary)
    """
    validator = DatasetValidator(
        time_col=time_col, target_col=target_col, min_samples=3000, max_feature_ratio=0.1
    )
    try:
        results = validator.fit_transform(df)
        summary = ValidationResult.get_validation_summary(results)
        failed = ValidationResult.get_failed_checks(results)
        if failed:
            for check_name, result in failed.items():
                log_entry = result.to_log_entry()
                print(f"Failed Check: {check_name}")
                print(f"Details: {log_entry}")
                if result.severity == "ERROR":
                    raise ValueError(f"Critical validation failure: {check_name}")
        return results, summary
    except Exception as e:
        print(f"Validation failed: {str(e)}")
        raise


# Test the production validation
sample_data = pd.DataFrame(
    {
        "time": pd.date_range("2023-01-01", periods=5000),
        "value": np.random.normal(0, 1, 5000),
        "target": np.random.choice([0, 1], 5000),
    }
)
results, summary = validate_production_data(sample_data, "time", "target")
print("\nValidation Summary:")
print(f"Total Checks: {summary['total_checks']}")
print(f"Failed Checks: {summary['failed_checks']}")
Failed Check: feature_count
Details: {'validation_passed': False, 'validation_message': 'Dataset has 1 features, fewer than recommended minimum (4). This may result in an oversimplified model.', 'validation_details': {'num_features': 1}, 'log_level': 'WARNING'}
Validation Summary:
Total Checks: 4
Failed Checks: 1
/home/docs/checkouts/readthedocs.org/user_builds/temporalscope/envs/latest/lib/python3.10/site-packages/temporalscope/datasets/dataset_validator.py:284: UserWarning: Dataset has 1 features, fewer than recommended minimum (4). This may result in an oversimplified model.
  warnings.warn(msg)
/home/docs/checkouts/readthedocs.org/user_builds/temporalscope/envs/latest/lib/python3.10/site-packages/temporalscope/datasets/dataset_validator.py:425: UserWarning: Some validation checks failed. These are research-backed recommendations and may not apply to all use cases. Adjust thresholds as needed.
  warnings.warn(
Best Practices
Pipeline Integration:
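# Airflow DAG Example (Airflow 2.x import paths)
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('data_validation_pipeline') as dag:
    validate_task = PythonOperator(
        task_id='validate_dataframe',
        python_callable=validate_production_data,
        op_kwargs={
            # Note: a templated XCom pull is rendered as a string by default,
            # so the upstream task and the callable must agree on a serialized format.
            'df': '{{ task_instance.xcom_pull(task_ids="load_data") }}',
            'time_col': 'timestamp',
            'target_col': 'target',
        },
    )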
Monitoring Setup:
- Track validation metrics over time
- Set up alerts for critical failures
- Monitor feature drift
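One way to track validation metrics over time is to keep a rolling failure rate built from the `total_checks` and `failed_checks` values in each run's summary. The sketch below uses only the standard library; `ValidationMonitor`, its window size and its alert threshold are hypothetical choices for illustration, not TemporalScope features.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ValidationMonitor:
    """Hypothetical tracker for the share of failed checks over recent runs."""

    window: int = 5
    alert_threshold: float = 0.5
    history: deque = field(default_factory=deque)

    def record(self, total_checks, failed_checks):
        """Store one run's failure rate and report whether an alert is due."""
        rate = failed_checks / total_checks if total_checks else 0.0
        self.history.append(rate)
        # Keep only the most recent `window` runs.
        while len(self.history) > self.window:
            self.history.popleft()
        return self.rolling_failure_rate() > self.alert_threshold

    def rolling_failure_rate(self):
        """Mean failure rate over the runs currently in the window."""
        return sum(self.history) / len(self.history)


monitor = ValidationMonitor(window=3, alert_threshold=0.3)
runs = [(4, 1), (4, 1), (4, 2), (4, 3)]  # (total_checks, failed_checks) per run
alerts = [monitor.record(total, failed) for total, failed in runs]
print(alerts)
print(monitor.rolling_failure_rate())
```

In this example the first two runs stay below the threshold, and the alert fires once the rolling rate rises above 0.3.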
Threshold Configuration:
- Start with research-backed defaults
- Adjust based on domain requirements
- Document threshold decisions
Error Handling:
- Define clear failure policies
- Set up fallback procedures
- Maintain audit trails
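A failure policy and an audit trail can be combined in a few lines. The sketch below assumes each check is available as a dictionary shaped like the `to_log_entry()` output shown earlier (`validation_passed`, `validation_message`, `log_level`); `apply_failure_policy` and `BLOCKING_LEVELS` are hypothetical names, and treating only `ERROR` as blocking is one possible policy rather than a recommendation from the library.

```python
import io
import json
from datetime import datetime, timezone

# Hypothetical policy: failed checks at these log levels stop the pipeline.
BLOCKING_LEVELS = {"ERROR"}


def apply_failure_policy(log_entries, audit_stream):
    """Write one JSON audit line per check and return 'block' or 'continue'."""
    decision = "continue"
    for check_name, entry in log_entries.items():
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "check": check_name,
            "passed": entry["validation_passed"],
            "level": entry["log_level"],
            "message": entry["validation_message"],
        }
        audit_stream.write(json.dumps(record) + "\n")
        if not entry["validation_passed"] and entry["log_level"] in BLOCKING_LEVELS:
            decision = "block"
    return decision


entries = {
    "feature_count": {
        "validation_passed": False,
        "validation_message": "Dataset has 1 features, fewer than recommended minimum (4).",
        "log_level": "WARNING",
    },
    "sample_size": {
        "validation_passed": True,
        "validation_message": "Check passed",
        "log_level": "INFO",
    },
}

audit = io.StringIO()  # stands in for an append-only audit file
decision = apply_failure_policy(entries, audit)
print(decision)
print(audit.getvalue())
```

Writing the audit record before deciding keeps a trace of every check, including those that did not block the run.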
Further Reading
- Grinsztajn et al. (2022) - Data quality thresholds
- Shwartz-Ziv et al. (2021) - Dataset size guidelines
- Gorishniy et al. (2021) - Feature complexity analysis
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 3_xai_data_quality_checks.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum