Complex Narwhals Functions: Time Operations and Aggregations
This notebook demonstrates advanced Narwhals patterns used in TemporalScope, focusing on:
Time-Based Operations
- Converting between datetime and numeric representations
- Validating temporal uniqueness
- Handling mixed-frequency time series data
Complex Aggregations
- Efficient null/NaN checking with lazy evaluation
- Multi-column type validation
- Time-based sorting with different backends
We'll show both lazy and eager evaluation patterns for each operation type.
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
from typing import Dict, List, Literal, Optional
# Create sample daily time series data with null values in the "value" column
data = {
    "patient_id": [1, 1, 1, 2, 2],
    "timestamp": pd.date_range("2023-01-01", periods=5, freq="D"),
    "value": [10.0, None, 30.0, None, 50.0],
    "category": ["A", "A", "B", "B", "C"],
}
# Create DataFrames with different backends
df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)
# Example 1: Time Column Conversion (Lazy)
@nw.narwhalify
def convert_time_column(df: FrameT, time_col: str, to_type: Literal["numeric", "datetime"]) -> FrameT:
    """Convert time column between datetime and numeric formats.

    This demonstrates a lazy-friendly pattern for time column conversion.
    The function only adds expressions, so a lazy input (for example a
    Polars LazyFrame) defers the work until the result is collected.
    Eager inputs such as pandas DataFrames are computed immediately.

    Why Lazy?
    - Time conversions are often part of a larger transformation chain
    - Lets a lazy backend optimize the entire operation sequence
    - No need for immediate results, can be computed when needed
    """
    print(f"Initializing {to_type} conversion (deferred execution)")
    if to_type == "numeric":
        # Convert datetime to Unix timestamp (microseconds), stored as float
        result = df.with_columns([nw.col(time_col).dt.timestamp(time_unit="us").cast(nw.Float64()).alias(time_col)])
    else:
        # Convert numeric to datetime
        result = df.with_columns([nw.col(time_col).cast(nw.Datetime()).alias(time_col)])
    print("Conversion chain constructed but not yet executed")
    return result
# Example 2: Null Value Checking (Eager)
@nw.narwhalify(eager_only=True)
def check_column_nulls(df: FrameT, columns: List[str]) -> Dict[str, int]:
    """Check for null values in specified columns.

    This demonstrates eager evaluation for immediate validation results.
    Results are computed immediately due to eager_only=True.

    Why Eager?
    - Validation needs immediate results to proceed
    - Returns Python types (Dict[str, int]) that can't be lazy
    - Used in TemporalScope's validation checks before processing
    """
    result = {}
    for col in columns:
        # Immediate computation of null counts (a 1x1 frame per column)
        null_check = df.select([nw.col(col).is_null().sum().cast(nw.Int64).alias("null_count")])
        # Extract the single scalar from the 1x1 frame
        result[col] = null_check.item()
    return result
# Example 3: Temporal Uniqueness Validation (Mixed)
@nw.narwhalify
def validate_unique_timestamps(df: FrameT, time_col: str, group_col: Optional[str] = None) -> FrameT:
    """Validate and report temporal uniqueness.

    This demonstrates a hybrid approach:
    - Lazy: Group and count operations (can be optimized by a lazy backend)
    - Eager: Final validation check (done by the caller on the returned frame)

    Why Mixed?
    - Heavy computations (grouping) benefit from lazy optimization
    - Final validation needs immediate results
    - Common pattern in TemporalScope's data loading
    """
    print("Starting temporal uniqueness check (mixed evaluation)")
    # Stage 1: Group by time (and optionally group_col) - Lazy
    group_cols = [time_col] if group_col is None else [group_col, time_col]
    counts = df.group_by(group_cols).agg([nw.col(time_col).count().alias("count")])
    # Stage 2: Filter for duplicates - Lazy
    duplicates = counts.filter(nw.col("count") > 1)
    print("Uniqueness check chain constructed")
    return duplicates  # An empty frame means every timestamp is unique
# Let's demonstrate these patterns with real examples
print("Example 1: Time Column Conversion (Lazy)")
print("-" * 50)
# Convert to numeric (deferred only for lazy inputs; pandas computes it here)
numeric_time = convert_time_column(df_pd, "timestamp", "numeric")
print("\nNumeric Timestamp Result (Pandas):")
print(numeric_time)  # Already computed: pandas is an eager backend
# Convert back to datetime (again computed immediately for pandas)
datetime_time = convert_time_column(numeric_time, "timestamp", "datetime")
print("\nDatetime Result (Pandas):")
print(datetime_time)  # Already computed
print("\nExample 2: Null Value Checking (Eager)")
print("-" * 50)
# These execute immediately because we need the results
pandas_nulls = check_column_nulls(df_pd, ["value", "category"])
polars_nulls = check_column_nulls(df_pl, ["value", "category"])
print(f"Pandas Nulls: {pandas_nulls}")
print(f"Polars Nulls: {polars_nulls}")
print("\nExample 3: Temporal Uniqueness (Mixed)")
print("-" * 50)
# Check uniqueness by patient (a lazy input would defer this; pandas computes it here)
duplicates = validate_unique_timestamps(df_pd, "timestamp", "patient_id")
print("\nDuplicate Timestamps by Patient:")
print(duplicates)  # Empty frame: no patient has a repeated timestamp
Example 1: Time Column Conversion (Lazy)
--------------------------------------------------
Initializing numeric conversion (deferred execution)
Conversion chain constructed but not yet executed
Numeric Timestamp Result (Pandas):
   patient_id     timestamp  value  category
0           1  1.672531e+15   10.0         A
1           1  1.672618e+15    NaN         A
2           1  1.672704e+15   30.0         B
3           2  1.672790e+15    NaN         B
4           2  1.672877e+15   50.0         C
Initializing datetime conversion (deferred execution)
Conversion chain constructed but not yet executed
Datetime Result (Pandas):
   patient_id   timestamp  value  category
0           1  2023-01-01   10.0         A
1           1  2023-01-02    NaN         A
2           1  2023-01-03   30.0         B
3           2  2023-01-04    NaN         B
4           2  2023-01-05   50.0         C
Example 2: Null Value Checking (Eager)
--------------------------------------------------
Pandas Nulls: {'value': np.int64(2), 'category': np.int64(0)}
Polars Nulls: {'value': 2, 'category': 0}
Example 3: Temporal Uniqueness (Mixed)
--------------------------------------------------
Starting temporal uniqueness check (mixed evaluation)
Uniqueness check chain constructed
Duplicate Timestamps by Patient:
Empty DataFrame
Columns: [patient_id, timestamp, count]
Index: []
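The numeric timestamps in the Example 1 output can be checked by hand. The following is a minimal sketch using only the standard library, assuming the sample timestamps are interpreted as UTC; the helper names `to_unix_micros` and `from_unix_micros` are hypothetical and are not part of TemporalScope or Narwhals.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_micros(ts: datetime) -> int:
    """Return whole microseconds since the Unix epoch (hypothetical helper)."""
    # Dividing one timedelta by another with // yields an exact integer.
    return (ts - EPOCH) // timedelta(microseconds=1)


def from_unix_micros(micros: int) -> datetime:
    """Inverse of to_unix_micros (hypothetical helper)."""
    return EPOCH + timedelta(microseconds=micros)


first_day = datetime(2023, 1, 1, tzinfo=timezone.utc)
micros = to_unix_micros(first_day)
print(micros)                # 1672531200000000
print(f"{float(micros):e}")  # 1.672531e+15, as in the first output row
```

A 64-bit float represents integers exactly up to 2**53 (about 9.0e15), so microsecond timestamps of this magnitude should survive the cast to Float64 without losing precision.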
Key Takeaways
Time Operations (Lazy Evaluation)
- Use lazy evaluation for time conversions to optimize chains of operations
- Let Narwhals handle backend-specific datetime implementations
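- With a lazy backend (for example a Polars LazyFrame), operations execute only when results are collected; eager backends such as pandas compute each step immediately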
- Example: TemporalScope's time column conversions are lazy to allow optimization
Validation Checks (Eager Evaluation)
- Use eager evaluation when immediate results are needed (like null checks)
- Perfect for validation that must happen before proceeding
- Results are computed right away
- Example: TemporalScope's data validation needs immediate results
Complex Operations (Mixed Evaluation)
- Combine lazy and eager patterns for optimal performance
- Use lazy for heavy computations that can be optimized
- Use eager for final validation or results
- Example: TemporalScope's data loading combines validation (eager) with transformations (lazy)
These patterns are used throughout TemporalScope to handle temporal data efficiently and correctly across different backends. The choice between lazy and eager evaluation affects performance:
- Lazy Evaluation: Better for large datasets and complex transformations
- Eager Evaluation: Better for validation and immediate results
- Mixed Approach: Best for real-world scenarios with both needs
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 4_narwhals_complex_functions.ipynb
© 2024 Philip Ndikum