5. Narwhals Data Validation Patterns
This notebook demonstrates essential patterns for validating data in AI/ML pipelines using Narwhals. The table below contrasts backend-agnostic patterns, which help keep data quality checks consistent across training and inference, with their backend-specific counterparts:
| Task | ✅ Good Pattern (Backend-Agnostic) | ❌ Bad Pattern (Backend-Specific) |
|---|---|---|
| DataFrame Creation | `df = nw.from_native(df_pd)` | `df_pd = pd.DataFrame(data)` (without conversion) |
| Column Access | `nw.col("feature")` | `df["feature"]` or `df.feature` |
| Type Casting | `nw.col("x").cast(nw.Float64())` | `df["x"].astype("float64")` |
| Null Checking | `nw.col("x").is_null().sum()` | `df["x"].isnull().sum()` |
| Mean Imputation | `nw.col("x").fill_null(mean_val)` | `df["x"].fillna(df["x"].mean())` |
| String Operations | `nw.col("x").str.to_uppercase()` | `df["x"].str.upper()` |
We'll explore two fundamental ML workflow patterns:
Feature Validation (Eager)
- Use `eager_only=True` for immediate validation results
- Return Python types for ML pipeline decisions
- Example: Checking numeric features before training
Feature Processing (Lazy)
- Use lazy evaluation for transformation chains
- Let Narwhals optimize the operations
- Example: Converting features to ML-ready format
The examples show how to handle common ML scenarios (missing values, mixed types, inconsistent categories) using proper Narwhals patterns that work across any DataFrame backend.
Backend-Agnostic DataFrame Creation
Narwhals provides a consistent interface across different DataFrame backends (Pandas, Polars, etc.). The key pattern for backend-agnostic code is:
- Create your DataFrame with any supported backend (e.g., Pandas)
- Convert to Narwhals format using `nw.from_native()`
- Use Narwhals operations that work across all backends
This pattern ensures your code works regardless of the underlying DataFrame implementation.
import narwhals as nw
from narwhals.typing import FrameT # Type hint for backend-agnostic DataFrames
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, Any
# Create sample data with common ML data issues
data = {
    # Numeric features with different representations
    "integer_feature": [1, 2, None, 4, 5],  # Has null
    "float_feature": [1.5, 2.5, 3.5, None, 5.5],  # Has null
    "string_number": ["1.0", "2.0", "bad", "4.0", "5.0"],  # Has invalid value
    # Categorical features with inconsistencies
    "category_clean": ["A", "B", "A", "C", "B"],
    "category_messy": ["a", "B", None, "c", "b"],  # Mixed case + null
    # Target variable
    "target": [0, 1, 1, 0, 1],  # Binary classification
}
# Pattern: Backend-Agnostic Conversion
df_pd = pd.DataFrame(data) # Create with any backend
df = nw.from_native(df_pd) # Convert to Narwhals format
df
┌───────────────────────────────────────┐
|          Narwhals DataFrame           |
| Use `.to_native` to see native output |
└───────────────────────────────────────┘
Pattern 1: Simple ML Type Validation Functions
In ML workflows, we commonly need to validate two types of features:
- Numeric features (can be converted to float, handle nulls)
- Categorical features (consistent categories, handle case sensitivity)
Let's create simple validation functions that work across any backend.
@nw.narwhalify(eager_only=True)
def validate_numeric_column(df: FrameT, column: str) -> Dict[str, Any]:
    """Validate if a column can be used as numeric feature.

    Common ML checks:
    - Can convert to float
    - Count of nulls
    - Basic statistics
    """
    try:
        # Try float conversion
        stats = df.select(
            [
                nw.col(column).cast(nw.Float64()).mean().alias("mean"),
                nw.col(column).is_null().sum().alias("nulls"),
            ]
        )
        return {"valid": True, "null_count": stats["nulls"].item(), "mean": stats["mean"].item()}
    except Exception as e:
        return {"valid": False, "error": str(e)}


@nw.narwhalify(eager_only=True)
def validate_categorical_column(df: FrameT, column: str) -> Dict[str, Any]:
    """Validate if a column can be used as categorical feature.

    Common ML checks:
    - Unique categories
    - Null handling
    - Case consistency
    """
    try:
        # Get stats using Narwhals operations only
        stats = df.select(
            [
                nw.col(column).cast(nw.String()).n_unique().alias("unique"),
                nw.col(column).is_null().sum().alias("nulls"),
            ]
        )
        # Get categories using Narwhals operations
        categories = df.select([nw.col(column).cast(nw.String()).alias(column)]).unique()
        return {
            "valid": True,
            "null_count": stats["nulls"].item(),
            "unique_count": stats["unique"].item(),
            "n_categories": categories.select([nw.col(column).count()]).item(),
        }
    except Exception as e:
        return {"valid": False, "error": str(e)}
# Test with Pandas backend
print("Testing with Pandas backend:")
print("\nNumeric validation:")
print("integer_feature:", validate_numeric_column(df, "integer_feature"))
print("string_number:", validate_numeric_column(df, "string_number"))
print("\nCategorical validation:")
print("category_clean:", validate_categorical_column(df, "category_clean"))
print("category_messy:", validate_categorical_column(df, "category_messy"))
Testing with Pandas backend:
Numeric validation:
integer_feature: {'valid': True, 'null_count': np.int64(1), 'mean': np.float64(3.0)}
string_number: {'valid': False, 'error': "could not convert string to float: 'bad'"}
Categorical validation:
category_clean: {'valid': True, 'null_count': np.int64(0), 'unique_count': np.int64(3), 'n_categories': np.int64(3)}
category_messy: {'valid': True, 'null_count': np.int64(1), 'unique_count': np.int64(5), 'n_categories': np.int64(5)}
Next we'll show these same functions working with Polars:
import polars as pl
# Create Polars DataFrame
df_pl = pl.DataFrame(data)
df_pl_nw = nw.from_native(df_pl)
print("Testing with Polars backend:")
print("\nNumeric validation:")
print("integer_feature:", validate_numeric_column(df_pl_nw, "integer_feature"))
print("string_number:", validate_numeric_column(df_pl_nw, "string_number"))
print("\nCategorical validation:")
print("category_clean:", validate_categorical_column(df_pl_nw, "category_clean"))
print("category_messy:", validate_categorical_column(df_pl_nw, "category_messy"))
Testing with Polars backend:
Numeric validation:
integer_feature: {'valid': True, 'null_count': 1, 'mean': 3.0}
string_number: {'valid': False, 'error': 'conversion from `str` to `f64` failed in column \'string_number\' for 1 out of 5 values: ["bad"]'}
Categorical validation:
category_clean: {'valid': True, 'null_count': 0, 'unique_count': 3, 'n_categories': 3}
category_messy: {'valid': True, 'null_count': 1, 'unique_count': 5, 'n_categories': 4}
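Results Analysis
The validation functions run unchanged on both backends, but the results differ in a few ways:
Numeric Validation
- Both backends reject the invalid numeric value ("bad" in string_number)
- Error messages differ in wording but report the same failure
- Null counts and means have the same values, although Pandas returns NumPy scalars (np.int64, np.float64) while Polars returns plain Python numbers
Categorical Validation
- Both backends report the same null counts
- Category counts differ for category_messy:
  - Pandas reports 5 for both unique_count and n_categories; casting the column to string appears to turn None into a literal string, which is then counted as a category
  - Polars reports unique_count 5 but n_categories 4, because n_unique() counts the null as a distinct value while count() skips it
- Account for this backend difference in ML pipelines, for example by dropping or filling nulls before counting categories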
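Pattern 2: Feature Processing
Now that we can validate features, let's look at processing them for ML. The notebook treats this as the lazy pattern: the functions use plain @nw.narwhalify (no eager_only=True) and return transformed DataFrames instead of Python values. Note that process_numeric_feature still calls .item() to obtain the mean, which needs an eager DataFrame, so the examples below pass eager Pandas and Polars frames.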
@nw.narwhalify
def process_numeric_feature(df: FrameT, column: str) -> FrameT:
    """Process a numeric feature for ML.

    Common ML transformations:
    - Convert to float
    - Fill nulls with mean
    - Standardize format
    """
    # Get mean first
    mean_val = df.select([nw.col(column).cast(nw.Float64()).mean()]).item()
    # Then use it for filling nulls
    return df.select([nw.col(column).cast(nw.Float64()).fill_null(mean_val).alias(column)])


@nw.narwhalify
def process_categorical_feature(df: FrameT, column: str) -> FrameT:
    """Process a categorical feature for ML.

    Common ML transformations:
    - Standardize case
    - Fill nulls with UNKNOWN
    - Consistent string format
    """
    return df.select([nw.col(column).cast(nw.String()).str.to_uppercase().fill_null("UNKNOWN").alias(column)])
# Test with both backends
print("Pandas backend:")
print("\nProcessing numeric feature:")
numeric_result = process_numeric_feature(df, "integer_feature")
print(numeric_result)
print("\nProcessing categorical feature:")
categorical_result = process_categorical_feature(df, "category_messy")
print(categorical_result)
print("\nPolars backend:")
print("\nProcessing numeric feature:")
numeric_result_pl = process_numeric_feature(df_pl_nw, "integer_feature")
print(numeric_result_pl)
print("\nProcessing categorical feature:")
categorical_result_pl = process_categorical_feature(df_pl_nw, "category_messy")
print(categorical_result_pl)
Pandas backend:
Processing numeric feature:
   integer_feature
0              1.0
1              2.0
2              3.0
3              4.0
4              5.0
Processing categorical feature:
  category_messy
0              A
1              B
2           NONE
3              C
4              B
Polars backend:
Processing numeric feature:
shape: (5, 1)
┌─────────────────┐
│ integer_feature │
│ ---             │
│ f64             │
╞═════════════════╡
│ 1.0             │
│ 2.0             │
│ 3.0             │
│ 4.0             │
│ 5.0             │
└─────────────────┘
Processing categorical feature:
shape: (5, 1)
┌────────────────┐
│ category_messy │
│ ---            │
│ str            │
╞════════════════╡
│ A              │
│ B              │
│ UNKNOWN        │
│ C              │
│ B              │
└────────────────┘
Summary: Core Narwhals Patterns
| Pattern | When to Use | Why This Pattern | Example Use Cases |
|---|---|---|---|
| Eager Validation (`eager_only=True`) | • Need immediate results • Returning Python types • Checking data quality | • Validation needs results now • Can't defer error checking • Must verify before processing | • Type compatibility checks • Null value detection • Category validation |
| Lazy Transformation (default) | • Chaining operations • Data transformations • Feature engineering | • Let Narwhals optimize • Better memory usage • More efficient pipelines | • Type conversions • Missing value imputation • Feature standardization |
Typical Validation Workflows
Data Quality Validation (Eager)
- Use when immediate validation results are needed
- Return Python types for pipeline decisions
- Example: Checking numeric features before training
Data Transformation (Lazy)
- Use for transformation chains
- Let Narwhals optimize operations
- Example: Converting features to ML-ready format
These patterns ensure your data validation code:
- Works consistently across DataFrame backends
- Uses appropriate evaluation strategies
- Follows best practices for validation
- Maintains code clarity and purpose
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 5_narwhals_ai_data_validation.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum