Understanding Lazy vs Eager Evaluation in Narwhals
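This notebook demonstrates when and why to use lazy vs eager evaluation in Narwhals, with practical examples from TemporalScope. We'll explore when lazy evaluation is the right fit, when eager evaluation is required, and how the two combine in one pipeline.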
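When to Use Lazy Evaluation (Default)
By default, a function decorated with @nw.narwhalify accepts both eager and lazy frames, so it has to stick to operations that a lazy backend can defer. Given a lazy input (for example a Polars LazyFrame), those operations don't execute until you collect the results; given an eager input (a pandas or Polars DataFrame), each step runs immediately. Writing functions this way is great for: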
DataFrame Transformations
- Chain multiple operations together
- Let the backend's lazy engine (for example Polars) optimize the execution
- Example: TemporalScope's target shifting operations
Memory Efficiency
- Avoid storing intermediate results
- Better for large datasets
- Example: Processing time series data
When to Use Eager Evaluation (eager_only=True)
Sometimes you need results immediately. Use eager evaluation when:
Computing Metrics
- Need scalar values (counts, means)
- Validation checks
- Example: Checking null values before processing
External Data
- Working with Python lists/dicts
- Need immediate Python types
- Example: Validation functions that return counts
Let's see these patterns in action with real examples from TemporalScope.
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
from typing import Tuple

# Sample data with missing values
data = {
    "temporal_index": range(1, 6),
    "observation": [10, None, 30, None, 50],  # Has null values
    "stratum": ["A", "A", "B", "B", "C"],  # Groups for aggregation
}

# Create DataFrames with different backends
df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

# Example 1: Lazy Evaluation in Transformation Chains
@nw.narwhalify
def analyze_temporal_sequence(df: FrameT) -> FrameT:
    """Implementation of a lazy evaluation chain for temporal data analysis.

    This function demonstrates the advantages of operation fusion in temporal
    sequence processing. When the input is a lazy frame, the entire
    transformation chain is optimized as a single execution plan; with an
    eager input, each step runs as soon as it is called.

    Parameters
    ----------
    df : FrameT
        Input DataFrame containing temporal sequences

    Returns
    -------
    FrameT
        Transformed DataFrame with computed metrics
    """
    print("Initializing transformation chain (deferred execution)")

    # Stage 1: Feature Engineering (deferred when df is lazy)
    result = df.select(
        [
            nw.col("observation").alias("raw_value"),
            (nw.col("observation") * 2).alias("scaled_value"),
        ]
    )

    # Stage 2: Missing Value Handling (deferred when df is lazy)
    result = result.filter(~nw.col("raw_value").is_null())

    print("Transformation chain constructed but not yet executed")
    return result

# Example 2: Eager Evaluation for Validation
@nw.narwhalify(eager_only=True)
def validate_temporal_sequence(df: FrameT, metric_col: str) -> Tuple[int, float]:
    """Implementation of eager evaluation for immediate metric computation.

    This function demonstrates the necessity of eager evaluation when computing
    validation metrics that require immediate materialization.

    Parameters
    ----------
    df : FrameT
        Input DataFrame containing temporal sequences
    metric_col : str
        Column name for metric computation

    Returns
    -------
    Tuple[int, float]
        Count of missing values and mean of valid observations
    """
    # Immediate computation of validation metrics
    metrics = df.select(
        [
            nw.col(metric_col).is_null().sum().cast(nw.Int64).alias("null_count"),
            nw.col(metric_col).mean().cast(nw.Float64).alias("mean_value"),
        ]
    )
    # .item() extracts the single value of a 1x1 frame as a Python scalar
    return (
        metrics.select([nw.col("null_count")]).item(),
        metrics.select([nw.col("mean_value")]).item(),
    )

# Example 3: Hybrid Approach (Mixed Evaluation)
@nw.narwhalify
def compute_temporal_aggregates(df: FrameT) -> FrameT:
    """Implementation of a hybrid evaluation strategy for temporal aggregation.

    This function demonstrates a practical combination of lazy and eager
    evaluation patterns in temporal data processing. Both stages use only
    lazy-compatible operations, so the same code runs immediately on an
    eager input and is deferred on a lazy one.

    Parameters
    ----------
    df : FrameT
        Input DataFrame containing temporal sequences

    Returns
    -------
    FrameT
        Aggregated results by stratum
    """
    # Stage 1: Lazy-compatible Transformation
    result = df.select([nw.col("observation").alias("value"), nw.col("stratum").alias("group")])

    # Stage 2: Lazy-compatible Aggregation
    return result.group_by("group").agg(
        [nw.col("value").mean().alias("stratum_mean"), nw.col("value").std().alias("stratum_std")]
    )

# Let's see these patterns in action
print("Example 1: Lazy Evaluation - Watch the Execution Order")
print("----------------------------------------")

# Note: df_pd is an eager pandas DataFrame, so the chain runs inside this call.
# Pass a lazy frame (e.g. pl.LazyFrame) to defer the work until .collect().
lazy_result = analyze_temporal_sequence(df_pd)
print("\nNow we trigger computation (Pandas):")
print(lazy_result)  # Displays the already-computed pandas result

print("\nSame behavior with Polars:")
print(analyze_temporal_sequence(df_pl))  # Works the same way

print("\nExample 2: Eager Evaluation - Immediate Results")
print("----------------------------------------")

# These execute right away because we need the values
pandas_metrics = validate_temporal_sequence(df_pd, "observation")
polars_metrics = validate_temporal_sequence(df_pl, "observation")
print(f"Pandas Metrics - Missing: {pandas_metrics[0]}, Mean: {pandas_metrics[1]:.2f}")
print(f"Polars Metrics - Missing: {polars_metrics[0]}, Mean: {polars_metrics[1]:.2f}")

print("\nExample 3: Hybrid Approach - Best of Both Worlds")
print("----------------------------------------")

# Lazy-compatible code; with this eager pandas input it runs immediately
temporal_results = compute_temporal_aggregates(df_pd)
print("\nFinal Results:")
print(temporal_results)

Example 1: Lazy Evaluation - Watch the Execution Order
----------------------------------------
Initializing transformation chain (deferred execution)
Transformation chain constructed but not yet executed

Now we trigger computation (Pandas):
   raw_value  scaled_value
0       10.0          20.0
2       30.0          60.0
4       50.0         100.0

Same behavior with Polars:
Initializing transformation chain (deferred execution)
Transformation chain constructed but not yet executed
shape: (3, 2)
┌───────────┬──────────────┐
│ raw_value ┆ scaled_value │
│ ---       ┆ ---          │
│ i64       ┆ i64          │
╞═══════════╪══════════════╡
│ 10        ┆ 20           │
│ 30        ┆ 60           │
│ 50        ┆ 100          │
└───────────┴──────────────┘

Example 2: Eager Evaluation - Immediate Results
----------------------------------------
Pandas Metrics - Missing: 2, Mean: 30.00
Polars Metrics - Missing: 2, Mean: 30.00

Example 3: Hybrid Approach - Best of Both Worlds
----------------------------------------

Final Results:
  group  stratum_mean  stratum_std
0     A          10.0          NaN
1     B          30.0          NaN
2     C          50.0          NaN

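The NaN values in the stratum_std column are expected rather than a bug: once nulls are skipped, every stratum in the sample data holds a single valid observation, and a sample standard deviation (one delta degree of freedom) is undefined for one value. A small pandas-only check for stratum "A", assuming the default ddof=1 is what produced the output above:

```python
import pandas as pd

# Stratum "A" from the sample data: one valid observation and one null.
group_a = pd.Series([10.0, None])

# pandas skips nulls and defaults to ddof=1, so the divisor n - 1 is 0 -> NaN.
sample_std = group_a.std()

# With ddof=0 (population standard deviation) a single value has zero spread.
population_std = group_a.std(ddof=0)

print(sample_std, population_std)
```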
Key Takeaways
Use Lazy Evaluation When:
- Chaining DataFrame operations
- Need optimization across operations
- Working with large datasets
- Example: TemporalScope's target shifting uses lazy evaluation for efficient data transformations
Use Eager Evaluation When:
- Need immediate scalar values
- Doing validation checks
- Working with external data
- Example: TemporalScope's validation functions use eager evaluation to check data quality
Combine Both When:
- Complex pipelines need both patterns
- Some operations need immediate results
- Others can be optimized together
- Example: TemporalScope's data loading combines validation (eager) with transformations (lazy)
Remember: The choice between lazy and eager evaluation can significantly impact your code's performance. Choose based on whether you need immediate results (eager) or can benefit from optimization (lazy).
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 3_narwhals_lazy_vs_eager.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum