Understanding Narwhals Patterns for Backend-Agnostic Code
This tutorial demonstrates how to write robust, backend-agnostic DataFrame operations using Narwhals. We'll cover:
- Core Narwhals Concepts
- Expression-Based Operations
- Type Safety and Validation
- Lazy vs Eager Evaluation
Core Narwhals Concepts
Narwhals provides three key mechanisms for backend-agnostic operations:
@nw.narwhalify Decorator:
- Automatically handles backend conversion
- Combines from_native() and to_native()
- Supports eager_only=True for multiple inputs
# Example: Basic narwhalify usage
@nw.narwhalify
def process(df: FrameT) -> FrameT:
    return df.select([...])

# Example: Eager execution for multiple inputs
@nw.narwhalify(eager_only=True)
def process_multi(df: FrameT, values: List) -> int:
    # Narwhals DataFrames have no .count() method; len() gives the row count
    return len(df.filter(...))
Manual Conversion (Testing Only):
- nw.from_native(): Convert to Narwhals format
- nw.to_native(): Convert back to original backend
- Only needed for testing or debugging
# Example: Manual conversion (testing only)
df_nw = nw.from_native(df_pd)
result = df_nw.select([...])
df_pd = result.to_native()
Execution Modes:
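- Lazy: Operations are deferred until materialized; this applies when the input is a lazy backend (for example a Polars LazyFrame or a Dask DataFrame), which @nw.narwhalify accepts by default
- Eager: Immediate execution; eager_only=True restricts a function to eager inputs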
- Mixed: Handle both with compute()/collect()
# Example: Lazy evaluation
result = df.select([...])  # Operation is deferred

# Example: Mixed mode handling
if hasattr(result, "compute"):
    result = result.compute()
elif hasattr(result, "collect"):
    result = result.collect()
Let's see these patterns in action with real examples.
import narwhals as nw
from narwhals.typing import FrameT, DataFrameT
from typing import Dict, List, Optional, Union, Any, Literal
# Pattern 1: Basic Column Operations
@nw.narwhalify
def calculate_stats(df: FrameT, value_col: str) -> FrameT:
    """Calculate statistics using Narwhals expressions.

    Key Pattern: Expression Chaining
    - Use nw.col() for column references
    - Chain operations for readability
    - Cast results to specific types

    Common Pitfalls:
    - Don't use df[col] - use nw.col(col)
    - Don't forget to cast aggregation results
    - Don't chain operations after select()

    Example:
    ```python
    # Good: Chain within select()
    df.select([nw.col("value").mean().cast(nw.Float64())])
    # Bad: Chain after select()
    df.select([nw.col("value")]).mean()  # May fail
    ```
    """
    return df.select(
        [
            # Mean with proper type casting
            nw.col(value_col)
            .mean()
            .cast(nw.Float64())  # Always cast aggregations
            .alias("mean"),  # Always alias results
            # Count nulls with type casting
            nw.col(value_col)
            .is_null()  # Check nulls first
            .sum()  # Then aggregate
            .cast(nw.Int64())  # Cast to integer
            .alias("null_count"),  # Clear alias
        ]
    )
# Pattern 2: Group-by Operations
@nw.narwhalify
def group_and_aggregate(df: FrameT, group_col: str, value_col: str) -> FrameT:
    """Group-by operations with proper aggregation.

    Key Pattern: Group-by with Aggregation
    - Use group_by() for grouping
    - Chain agg() for aggregations
    - Sort results if needed

    Performance Notes:
    - group_by() materializes results
    - Sort after aggregation
    - Multiple aggregations in one agg()

    Example:
    ```python
    # Good: Multiple aggregations in one call
    df.group_by(group_col).agg([nw.col(value_col).mean(), nw.col(value_col).std()])
    # Bad: Multiple group_by calls
    means = df.group_by(group_col).agg([...])
    stds = df.group_by(group_col).agg([...])  # Inefficient
    ```
    """
    return (
        df.group_by(group_col)
        .agg(
            [
                # Multiple aggregations in one call
                nw.col(value_col).mean().alias("mean"),
                nw.col(value_col).std().alias("std"),
                nw.col(value_col).count().alias("count"),
            ]
        )
        .sort(group_col)  # Sort after aggregation
    )
# Pattern 3: Horizontal Operations
@nw.narwhalify
def combine_columns(df: FrameT, col1: str, col2: str) -> FrameT:
    """Combine columns horizontally.

    Key Pattern: Horizontal Operations
    - Use with_columns for new columns
    - Use sum_horizontal for row-wise ops
    - Handle multiple columns together

    Backend Notes:
    - sum_horizontal works across backends
    - Column arithmetic (+, -, etc.) may vary
    - Check null handling per backend

    Example:
    ```python
    # Good: Use sum_horizontal
    df.with_columns([nw.sum_horizontal("a", "b")])
    # Also works: Column arithmetic
    df.with_columns([nw.col("a") + nw.col("b")])
    ```
    """
    return df.with_columns(
        [
            # Two ways to combine columns
            nw.sum_horizontal(col1, col2).alias("sum"),  # Preferred: nulls are skipped
            (nw.col(col1) + nw.col(col2)).alias("sum_alt"),  # Alternative: nulls propagate
        ]
    )
# Pattern 4: Multiple Inputs (Eager Only)
@nw.narwhalify(eager_only=True)
def filter_by_values(df: DataFrameT, values: List[Any], col_name: str) -> int:
    """Filter DataFrame using external values.

    Key Pattern: Eager Execution
    - Use eager_only=True for multiple inputs
    - Return Python types (int, float, etc.)
    - Handle external data structures

    Why eager_only=True?
    - Multiple inputs need immediate results
    - Can't defer with external data
    - Better for interactive analysis

    Example:
    ```python
    # Good: Eager execution with multiple inputs
    @nw.narwhalify(eager_only=True)
    def func(df: FrameT, values: List) -> int:
        return len(df.filter(...))  # Row count of the filtered frame
    # Bad: Lazy execution with multiple inputs
    @nw.narwhalify  # May fail or give unexpected results
    def func(df: FrameT, values: List) -> FrameT:
        return df.filter(...)
    ```
    """
    return (
        df.filter(
            nw.col(col_name).is_in(values)  # Filter using external values
        )
        .select(
            [
                nw.col(col_name).count().alias("count")  # Count matches
            ]
        )
        .item()  # Get scalar result
    )
Testing the Patterns
Let's test these patterns with different backends to understand:
- How Narwhals handles conversion
- Backend-specific behavior
- Error handling differences
import pandas as pd
import polars as pl
import pyarrow as pa
# Create test data with edge cases
data = {
"group": ["A", "A", "B", "B", "C"], # Groups for aggregation
"value1": [10, None, 30, 40, 50], # Has null value
"value2": [1, 2, 3, None, 5], # Has null value
}
# Test with different backends
df_pd = pd.DataFrame(data) # Pandas backend
df_pl = pl.DataFrame(data) # Polars backend
# Test 1: Basic Stats - Shows null handling
print("Basic Stats (Pandas):")
print(calculate_stats(df_pd, "value1"))
print("\nBasic Stats (Polars):")
print(calculate_stats(df_pl, "value1"))
# Test 2: Group-by - Shows aggregation
print("\nGroup Aggregation (Pandas):")
print(group_and_aggregate(df_pd, "group", "value1"))
# Test 3: Horizontal Ops - Shows null propagation
print("\nHorizontal Operations (Pandas):")
print(combine_columns(df_pd, "value1", "value2"))
# Test 4: Eager Operation - Shows immediate execution
print("\nFiltered Count (Pandas):")
print(filter_by_values(df_pd, ["A", "B"], "group"))
Basic Stats (Pandas):
   mean  null_count
0  32.5           1

Basic Stats (Polars):
shape: (1, 2)
┌──────┬────────────┐
│ mean ┆ null_count │
│ ---  ┆ ---        │
│ f64  ┆ i64        │
╞══════╪════════════╡
│ 32.5 ┆ 1          │
└──────┴────────────┘

Group Aggregation (Pandas):
  group  mean       std  count
0     A  10.0       NaN      1
1     B  35.0  7.071068      2
2     C  50.0       NaN      1

Horizontal Operations (Pandas):
  group  value1  value2   sum  sum_alt
0     A    10.0     1.0  11.0     11.0
1     A     NaN     2.0   2.0      NaN
2     B    30.0     3.0  33.0     33.0
3     B    40.0     NaN  40.0      NaN
4     C    50.0     5.0  55.0     55.0

Filtered Count (Pandas):
4
Lazy vs Eager Evaluation
Narwhals supports both lazy and eager evaluation modes, each with specific use cases:
Lazy Evaluation (Default)
- Operations are deferred until needed
- Supports optimization across operations
- Use compute() (native Dask) or collect() (Polars and Narwhals LazyFrames) to materialize
# Example: Lazy chain of operations
result = (
    df.select([...])  # Deferred
    .filter([...])  # Still deferred
    .sort([...])  # Still deferred
)
final = result.collect()  # Now executes (a native Dask frame would use compute())
Eager Evaluation
- Use eager_only=True for immediate execution
- Required for multiple inputs
- Better for interactive analysis
# Example: Eager execution
@nw.narwhalify(eager_only=True)
def get_count(df: FrameT) -> int:
    return df.select([...]).item()
Mixed Mode Handling
- Check hasattr(df, "compute") or hasattr(df, "collect")
- Handle both modes gracefully
- Materialize only when needed
# Example: Handle both modes
if hasattr(df, "compute"):
    df = df.compute()  # Dask style
elif hasattr(df, "collect"):
    df = df.collect()  # Polars style
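The attribute checks above can be wrapped in a small function so that eager inputs pass through untouched. This is a sketch for native (non-Narwhals) objects. `materialize_native` and the `FakeLazy` stand-in class are hypothetical and exist only so the example runs without Dask or Polars installed.

```python
import pandas as pd


def materialize_native(obj):
    """Hypothetical helper: duck-typed materialisation for native objects."""
    if hasattr(obj, "compute"):  # Dask style
        return obj.compute()
    if hasattr(obj, "collect"):  # Polars LazyFrame style
        return obj.collect()
    return obj  # already eager, e.g. a pandas DataFrame


class FakeLazy:
    """Stand-in for a lazy frame; not a real library class."""

    def __init__(self, frame):
        self._frame = frame

    def collect(self):
        return self._frame


eager = pd.DataFrame({"x": [1, 2]})
from_eager = materialize_native(eager)
from_lazy = materialize_native(FakeLazy(eager))

print(from_eager is eager)
print(from_lazy is eager)
```

The final `return obj` branch is what makes the helper safe to call unconditionally, since a pandas DataFrame has neither method.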
Key Takeaways
Core Operations
- select() for transformations
- with_columns() for new columns
- group_by().agg() for aggregations
Type Safety
- Use class-based dtypes (nw.Int64, nw.Float64)
- Cast aggregation results explicitly
- Handle PyArrow scalars with .as_py()
Execution Modes
- Default to lazy evaluation
- Use eager_only when needed
- Handle both modes safely
Error Handling
- Validate inputs early
- Use type hints properly
- Handle backend differences
These patterns form the foundation for implementing TemporalScope's core utilities using Narwhals.
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 2_narwhals_general_patterns.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum