Narwhals Machine Learning Patterns¶
Narwhals provides a unified interface for DataFrame operations across Pandas, Polars, and Dask. This tutorial outlines patterns for building backend-agnostic functions tailored to typical machine learning workflows, including data validation, feature engineering, and time series processing. These patterns help create scalable and maintainable pipelines suitable for both prototyping and production, benefiting both OSS and enterprise-grade developers.
Overview of Patterns¶
- Backward Compatibility Policy: Narwhals ensures stability for library maintainers with strict backward compatibility. Code written with import narwhals.stable.v1 as nw will remain functional indefinitely, even if breaking changes occur. Breaking changes are isolated in narwhals.stable.v2 or higher, while narwhals.stable.v1 remains unaffected, enabling safe coexistence of multiple versions.
- Data Validation: Use @nw.narwhalify to decorate functions for consistent backend-agnostic validation. Leverage nw.col(...) for type casting, null handling, and basic statistics across backends.
- Feature Engineering: Chain transformations like .cast(nw.Float64()) or .fill_null(...) to preprocess numeric and categorical data. Defer materialization until necessary to optimize memory usage.
- Time Series Handling: Validate temporal data by grouping with .group_by([id_col, time_col]), checking for uniqueness or applying rolling windows. Maintain backend independence without additional logic.
- Workflow Optimization: Begin with lazy mode (e.g., nw.from_native(dask_df)) for scalability and switch to eager mode using .collect() or .to_native() for tasks needing immediate results.
- OSS and Enterprise Notes: Use tools like Hatch to manage lean environments for production and comprehensive setups for testing.
  - Define lean setups in [tool.hatch.envs.default] for minimal dependencies (e.g., Narwhals and Pandas).
  - Use [tool.hatch.envs.test] for broader testing across multiple backends (e.g., Polars and Dask).
  - Handle unsupported objects gracefully by setting pass_through=True or strict=False.
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
import dask.dataframe as dd
import numpy as np
from typing import Dict, List, Optional, Union, Any
Pattern 1: Backward Compatibility Policy¶
This pattern demonstrates how Narwhals guarantees backward compatibility, keeping production-grade workflows stable by isolating breaking changes from code that targets a stable namespace.
- Stable Namespace: Code written with narwhals.stable.v1 remains functional indefinitely, even if breaking changes occur in Narwhals or its backends.
- Version Migration: Developers can adopt new features by explicitly switching to updated namespaces, such as narwhals.stable.v2.
- Integration: Multiple stable API versions can coexist in a single project, ensuring smooth collaboration without dependency conflicts.
# Import the stable API for guaranteed compatibility
import narwhals.stable.v1 as nw
from narwhals.typing import IntoFrameT
# Example dataset
data = {"feature1": [1, 2, 3], "feature2": [4, 5, 6]}
# Workflow using Narwhals stable.v1
def backward_compatible_workflow(df: IntoFrameT) -> IntoFrameT:
    """Use Narwhals stable.v1 API to process data."""
    # Wrap the native frame in a Narwhals frame (eager for Pandas input; lazy only for lazy backends)
    df_nw = nw.from_native(df)
    # Perform transformations
    df_transformed = df_nw.select(
        [
            nw.col("feature1").mean().alias("mean_feature1"),
            nw.col("feature2").sum().alias("sum_feature2"),
        ]
    )
    # Convert back to the native format of the input (e.g., Pandas)
    return df_transformed.to_native()
# Testing the backward-compatible workflow
import pandas as pd
df_pd = pd.DataFrame(data)
result = backward_compatible_workflow(df_pd)
print("Result using Narwhals stable API v1:")
print(result)
Result using Narwhals stable API v1:
mean_feature1 sum_feature2
0 2.0 15
Pattern 1: Lazy-to-Eager Frame Transitions¶
This pattern demonstrates how Narwhals provides a unified interface for transitioning between lazy and eager evaluation, regardless of the backend. There is no need to call backend-specific materialization methods such as Dask's compute(); collect() covers them. The example shows:
- Unified Collection: Using collect() provides a consistent way to materialize results; for the Dask-backed frame used here, the result is an eager Pandas DataFrame
- Lazy Operations: Start with any lazy backend for memory-efficient processing of large datasets, with operations optimized and deferred until needed
- Backend Transitions: Narwhals automatically manages the transition from lazy (e.g., Dask) to eager (Pandas) evaluation, simplifying ML workflows
- ML Integration: Final to_native() call provides the DataFrame in the format needed by ML libraries
# Create sample ML dataset
data = {
    "numeric_feature": [1.5, 2.0, None, 4.0, 5.5],  # Has missing value
    "categorical_feature": ["A", "B", "A", "C", "B"],  # Needs encoding
}
df_pd = pd.DataFrame(data)
# Start with any lazy backend (e.g., Dask)
df_dask = dd.from_pandas(df_pd, npartitions=2)
# Unified lazy-to-eager pattern
df_nw = nw.from_native(df_dask) # Works with any lazy frame
df_processed = df_nw.select(
    [  # Lazy operations
        nw.col("numeric_feature").fill_null(0).cast(nw.Float64()).alias("numeric_feature"),
        nw.col("categorical_feature").cast(nw.String()).alias("categorical_feature"),
    ]
)
df_collected = df_processed.collect() # Unified collection - no compute() needed
df_pandas = df_collected.to_native() # Ready for ML libraries
print("Original Dask DataFrame (lazy):")
print(df_dask)
print("\nProcessed Pandas DataFrame (eager):")
print(df_pandas)
Original Dask DataFrame (lazy):
Dask DataFrame Structure:
numeric_feature categorical_feature
npartitions=2
0 float64 string
3 ... ...
4 ... ...
Dask Name: frompandas, 1 expression
Expr=df
Processed Pandas DataFrame (eager):
numeric_feature categorical_feature
0 1.5 A
1 2.0 B
2 0.0 A
3 4.0 C
4 5.5 B
Pattern 2: Data Validation - Handling Unsupported Types¶
This pattern demonstrates how to build robust data validation pipelines that can handle unexpected or unsupported data types. In data engineering workflows, you often need to validate and process data from various sources that may contain custom objects, mixed types, or other non-standard formats. The pass_through parameter provides explicit control over validation behavior. Note that pass_through applies to the object handed to nw.from_native: an object that is not a supported DataFrame or Series is returned unchanged when pass_through=True and raises an error when pass_through=False. Column-level problems, such as the custom objects in the example, surface later when an operation like cast runs.
- Development Mode: Using pass_through=True enables initial data exploration and debugging by allowing inspection of problematic data types. This is crucial when investigating data quality issues or understanding new data sources.
- Production Mode: Using pass_through=False (default) enforces strict validation rules in production pipelines, preventing unexpected data types from silently propagating through the system and causing downstream issues.
- Error Recovery: Both modes provide clear error messages that help identify and handle data quality issues at the appropriate stage of your pipeline.
- Pipeline Integration: Choose development mode during data exploration and pipeline development, then switch to production mode for robust, production-grade data validation.
class CustomObject:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return f"Custom({self.value})"
# Create DataFrame with unsupported objects
data = {"feature1": [1, 2, 3], "unsupported": [CustomObject(1), CustomObject(2), CustomObject(3)]}
df_pd = pd.DataFrame(data)
# Development mode - allows inspection
print("Development Mode (pass_through=True):")
print("Expected: Can load and view all data")
df_nw1 = nw.from_native(df_pd, pass_through=True)
print(df_nw1.to_native())
print("\nTrying operations on normal column:")
print("Expected: Should work normally")
result = df_nw1.select([nw.col("feature1").cast(nw.Float64())])
print(result.to_native())
print("\nTrying operations on unsupported column:")
print("Expected: Should fail gracefully")
try:
    result = df_nw1.select([nw.col("unsupported").cast(nw.Float64())])
except Exception:
    print("Error: Cannot cast custom objects to Float64")
# Production mode - strict type checking
print("\nProduction Mode (pass_through=False):")
print("Expected: Should fail on unsupported types")
try:
    df_nw2 = nw.from_native(df_pd, pass_through=False)
    result = df_nw2.select([nw.col("unsupported").cast(nw.Float64())])
except Exception:
    print("Error: Unsupported types not allowed in strict mode")
Development Mode (pass_through=True):
Expected: Can load and view all data
feature1 unsupported
0 1 Custom(1)
1 2 Custom(2)
2 3 Custom(3)
Trying operations on normal column:
Expected: Should work normally
feature1
0 1.0
1 2.0
2 3.0
Trying operations on unsupported column:
Expected: Should fail gracefully
Error: Cannot cast custom objects to Float64
Production Mode (pass_through=False):
Expected: Should fail on unsupported types
Error: Unsupported types not allowed in strict mode
Pattern 3: Data Validation¶
This pattern shows how to validate data types and quality across DataFrame operations in a backend-agnostic way. By using Narwhals native types and operations, you can ensure consistent validation behavior regardless of the underlying DataFrame implementation. The example shows:
- Type Safety: Using Narwhals native types (Float64, String) ensures consistent type handling across backends, preventing type-related errors in ML pipelines
- Validation Workflow: Backend-agnostic operations for checking nulls, type compatibility, and data quality enable robust validation pipelines
- Error Handling: Graceful error recovery and clear error messages help identify data quality issues early in the pipeline
- ML Integration: Consistent validation behavior across training and inference ensures reliable model deployment
Note: The Dask backend requires different handling because it is evaluated lazily. See Pattern 1: Lazy-to-Eager Frame Transitions for the transition pattern.
# Create sample data with validation issues
data = {
    "numeric": [1, 2, None, 4],  # Has null
    "mixed": ["1", "2", "bad", "4"],  # Has invalid value
}
# Test across backends
backends = {"Pandas": pd.DataFrame(data), "Polars": pl.DataFrame(data)}
for name, df in backends.items():
    print(f"\nTesting {name} backend:")
    print("=" * 50)
    df_nw = nw.from_native(df)
    # Validate numeric column (should succeed with nulls)
    print("\nValidating numeric column with nulls:")
    print("Expected: Should compute mean and null count")
    try:
        result = df_nw.select(
            [
                nw.col("numeric").cast(nw.Float64()).mean().alias("mean"),
                nw.col("numeric").is_null().sum().alias("nulls"),
            ]
        )
        print(f"Success - Mean: {result['mean'].item()}")
        print(f"Success - Null count: {result['nulls'].item()}")
    except Exception as e:
        # This branch is not expected to run for the numeric column
        print(f"Unexpected failure: {str(e)}")
    # Try invalid column (should fail with type error)
    print("\nTrying to convert invalid strings to float:")
    print("Expected: Should fail with type conversion error")
    try:
        result = df_nw.select([nw.col("mixed").cast(nw.Float64()).mean()])
        print("Unexpected success!")
    except Exception as e:
        print(f"Failed as expected: {str(e)}")
Testing Pandas backend:
==================================================
Validating numeric column with nulls:
Expected: Should compute mean and null count
Success - Mean: 2.3333333333333335
Success - Null count: 1
Trying to convert invalid strings to float:
Expected: Should fail with type conversion error
Failed as expected: could not convert string to float: 'bad'
Testing Polars backend:
==================================================
Validating numeric column with nulls:
Expected: Should compute mean and null count
Success - Mean: 2.3333333333333335
Success - Null count: 1
Trying to convert invalid strings to float:
Expected: Should fail with type conversion error
Failed as expected: conversion from `str` to `f64` failed in column 'mixed' for 1 out of 4 values: ["bad"]
Pattern 4: Feature Engineering & Collect-Then-Item Pattern¶
This pattern demonstrates how to build efficient feature engineering pipelines for ML workflows. A critical challenge in ML pipelines is handling both lazy backends (like Dask for large datasets) and eager backends (like Pandas for interactive development). The key is following a consistent materialization pattern when you need concrete values (like computing means for imputation).
Key principles for consistent feature engineering:
- Consistent Materialization: Always check whether collect() is needed before item(): use result.item() for eager frames and result.collect().item() for lazy frames. This pattern ensures your code works correctly whether using Dask for large-scale processing or Pandas for development.
- Evaluation Strategy: Use eager_only=True for functions that compute statistics (means, medians, etc.) since these require materialized values. Note that eager_only=True restricts which inputs the decorator converts; it does not call collect() for you, which is why the example still checks for collect() explicitly. Keep other transformations (type casting, null filling) lazy to let Narwhals optimize them.
- Backend Independence: By following the collect-then-item pattern, your functions work automatically with any backend: they use collect() when needed (Dask) and skip it when unnecessary (Pandas).
- Memory Efficiency: Only use the materialization pattern when you absolutely need concrete values (like computing statistics). Let other operations stay lazy so Narwhals can optimize them together.
# Create sample data with preprocessing needs
data = {
    "integer_feature": [1, 2, None, 4, 5],  # Needs mean imputation
    "category_messy": ["a", "B", None, "c", "b"],  # Needs standardization
}
# Test across backends
backends = {
    "Pandas": pd.DataFrame(data),
    "Polars": pl.DataFrame(data),
    "Dask": dd.from_pandas(pd.DataFrame(data), npartitions=2),  # Lazy Dask frame
}
@nw.narwhalify(eager_only=True)  # Intended for functions that need concrete values
def process_numeric_feature(df: FrameT, column: str) -> FrameT:
    """Process a numeric feature for ML.

    Common ML transformations:
    - Convert to float
    - Fill nulls with mean
    - Standardize format
    """
    # Get mean first - handle both lazy and eager cases
    result = df.select([nw.col(column).cast(nw.Float64()).mean()])
    mean_val = result.item() if not hasattr(result, "collect") else result.collect().item()
    # Then use it for filling nulls
    return df.select([nw.col(column).cast(nw.Float64()).fill_null(mean_val).alias(column)])
@nw.narwhalify  # Default lazy evaluation for chainable operations
def process_categorical_feature(df: FrameT, column: str) -> FrameT:
    """Process a categorical feature for ML.

    Common ML transformations:
    - Handle nulls first
    - Standardize case
    - Ensure string format
    """
    return df.select(
        [
            nw.col(column)
            .fill_null("UNKNOWN")  # Handle nulls before casting
            .cast(nw.String())
            .str.to_uppercase()
            .alias(column)
        ]
    )
# Test features across backends
for name, df in backends.items():
    print(f"\n{name} backend:")
    print("=" * 50)
    df_nw = nw.from_native(df)  # Works with any backend (lazy or eager)
    print("\nProcessing numeric feature:")
    print("Expected: Nulls filled with mean value, cast to Float64")
    numeric_result = process_numeric_feature(df_nw, "integer_feature")
    print(numeric_result)
    print("\nProcessing categorical feature:")
    print("Expected: Uppercase strings, nulls filled with UNKNOWN")
    categorical_result = process_categorical_feature(df_nw, "category_messy")
    print(categorical_result)
Pandas backend:
==================================================
Processing numeric feature:
Expected: Nulls filled with mean value, cast to Float64
integer_feature
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Processing categorical feature:
Expected: Uppercase strings, nulls filled with UNKNOWN
category_messy
0 A
1 B
2 UNKNOWN
3 C
4 B
Polars backend:
==================================================
Processing numeric feature:
Expected: Nulls filled with mean value, cast to Float64
shape: (5, 1)
┌─────────────────┐
│ integer_feature │
│ --- │
│ f64 │
╞═════════════════╡
│ 1.0 │
│ 2.0 │
│ 3.0 │
│ 4.0 │
│ 5.0 │
└─────────────────┘
Processing categorical feature:
Expected: Uppercase strings, nulls filled with UNKNOWN
shape: (5, 1)
┌────────────────┐
│ category_messy │
│ --- │
│ str │
╞════════════════╡
│ A │
│ B │
│ UNKNOWN │
│ C │
│ B │
└────────────────┘
Dask backend:
==================================================
Processing numeric feature:
Expected: Nulls filled with mean value, cast to Float64
Dask DataFrame Structure:
integer_feature
npartitions=2
0 float64
3 ...
4 ...
Dask Name: getitem, 9 expressions
Expr=Assign(frame=df)[['integer_feature']]
Processing categorical feature:
Expected: Uppercase strings, nulls filled with UNKNOWN
Dask DataFrame Structure:
category_messy
npartitions=2
0 string
3 ...
4 ...
Dask Name: getitem, 11 expressions
Expr=Assign(frame=df)[['category_messy']]
Pattern 5: Time Series Validation & Materialization¶
This pattern demonstrates how to validate temporal data quality in ML/DS workflows. Time series data processing often requires complex transformations (interpolation, resampling, windowing) that need to work efficiently across different scales - from interactive analysis in Pandas to large-scale processing in Dask. By handling lazy and eager evaluation consistently, you can write temporal validation code once and use it across your entire ML pipeline.
Key principles for time series validation:
- Consistent Materialization: Follow the collect-then-item pattern when materializing results, ensuring consistent behavior across Pandas (eager) and Dask (lazy) backends
- Entity Grouping: Handle multiple time series efficiently by grouping operations, letting Narwhals optimize the execution plan
- Data Quality: Detect and report duplicate timestamps that could skew temporal analysis, using lazy evaluation for memory efficiency
- Backend Independence: Write validation functions that work identically whether processing historical data with Dask or real-time data with Pandas
The output shows the same validation working seamlessly across backends (Pandas, Polars, Dask), with each backend displaying results in its native format while maintaining consistent behavior. This enables developers to focus on temporal logic rather than backend-specific implementations.
# Generate hourly timestamps with duplicates
base_timestamps = pd.date_range(start="2023-01-01", periods=3, freq="h")
timestamps = [
    base_timestamps[0],  # First timestamp for id=1
    base_timestamps[0],  # Duplicate for id=1
    base_timestamps[1],
    base_timestamps[0],  # Same hour as id=1, but unique within id=2
    base_timestamps[2],
]
# Create synthetic dataset with known properties
data = {
    # Entity identifier and temporal index
    "id": [1, 1, 1, 2, 2],
    "timestamp": timestamps,
    # Numeric features for validation
    "feature1": [1.0, 2.0, None, 4.0, 5.0],  # Float with missing values
    "feature2": [1.5, 2.5, 3.5, None, 5.5],  # Float with missing values
    "feature3": [10, 20, 30, 40, 50],  # Integer without missing values
}
@nw.narwhalify
def validate_temporal_uniqueness(df: FrameT, id_col: str, time_col: str) -> FrameT:
    """Validate temporal uniqueness within entity groups.

    Parameters
    ----------
    df : FrameT
        Input DataFrame with time series data
    id_col : str
        Column name for entity identifier
    time_col : str
        Column name for temporal index

    Returns
    -------
    FrameT
        DataFrame containing any duplicate timestamps found
    """
    # Group by entity and timestamp - stays lazy for efficiency
    counts = df.group_by([id_col, time_col]).agg([nw.col(time_col).count().alias("count")])
    # Filter for duplicates - still lazy until results needed
    return counts.filter(nw.col("count") > 1)
# Initialize DataFrames for each backend
df_pd = pd.DataFrame(data) # Pandas backend
df_pl = pl.DataFrame(data) # Polars backend
df_dask = dd.from_pandas(df_pd, npartitions=2) # Dask backend
# Test across backends
print("Temporal Uniqueness Validation Results")
print("=" * 50)
for name, df in [("Pandas", df_pd), ("Polars", df_pl), ("Dask", df_dask)]:
    print(f"\n{name} Backend Results:")
    print("-" * 30)
    df_nw = nw.from_native(df)
    result = validate_temporal_uniqueness(df_nw, "id", "timestamp")
    print(result)
Temporal Uniqueness Validation Results
==================================================
Pandas Backend Results:
------------------------------
id timestamp count
0 1 2023-01-01 2
Polars Backend Results:
------------------------------
shape: (1, 3)
┌─────┬─────────────────────┬───────┐
│ id ┆ timestamp ┆ count │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ u32 │
╞═════╪═════════════════════╪═══════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 2 │
└─────┴─────────────────────┴───────┘
Dask Backend Results:
------------------------------
Dask DataFrame Structure:
id timestamp count
npartitions=1
int64 datetime64[ns] int64
... ... ...
Dask Name: loc, 10 expressions
Expr=Loc(frame=ResetIndex(frame=ColumnsSetter(frame=(GroupbyAggregation(frame=df, arg=defaultdict(<class 'list'>, {'timestamp': ['count']}), observed=True, dropna=False))[[('timestamp', 'count')]], columns=('count',))), iindexer=RenameSeries(frame=RenameSeries(frame=(ResetIndex(frame=ColumnsSetter(frame=(GroupbyAggregation(frame=df, arg=defaultdict(<class 'list'>, {'timestamp': ['count']}), observed=True, dropna=False))[[('timestamp', 'count')]], columns=('count',))))['count'] > 1, index='count'), index='count'))
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 1_narwhals_ml_patterns.ipynb
© 2024 Philip Ndikum