Tutorial 6: Time Series Operations with Narwhals
This notebook demonstrates essential patterns for working with time series data using Narwhals, focusing on:
Group-by Time Series Operations
- Efficient grouping by ID columns
- Temporal aggregations within groups
- Mixed frequency handling
Time Series Validation
- Temporal uniqueness checks
- Group-level validation
- Data quality assurance
Pattern 1: Time Series Group-by Operations
NOTICE: The following examples are provided AS-IS for academic purposes and are subject to the user's specific requirements and use cases. Always refer to the latest API documentation for production use.
Common time series tasks require grouping by an ID column and performing temporal operations within each group. These patterns are essential for ML workflows in various domains.
The following table shows key use cases where these patterns are critical:
| Use Case | Description |
|---|---|
| Healthcare Analytics | Patient vitals monitoring over time (patient_id), treatment response analysis, longitudinal health studies |
| Quantitative Finance | Multi-asset portfolio analysis (ticker_id), market regime detection, cross-sectional momentum studies |
| Signal Processing | Multi-sensor data fusion (sensor_id), anomaly detection across sensor networks, distributed system monitoring |
Key Narwhals patterns for time series operations:
| Operation | ✅ Good Pattern | ❌ Bad Pattern |
|---|---|---|
| Group By | df.group_by(id_col).agg([nw.col("value").mean()]) | df.groupby(id_col).agg({"value": "mean"}) |
| Rolling | df.with_columns([nw.col("value").rolling_mean(2)]) | df["value"].rolling(2).mean() |
| Time Sort | df.sort([time_col]) | df.sort_values(time_col) |
| Null Check | nw.col("value").is_null().sum() | df["value"].isnull().sum() |
Implementation considerations for robust time series processing:
| Consideration | Description |
|---|---|
| Lazy Evaluation | • Chain operations for optimization • Let Narwhals handle backend-specific optimizations • Avoid unnecessary materializations |
| Temporal Ordering | • Ensure proper time-based sorting • Handle mixed frequencies • Maintain group boundaries |
| Backend Compatibility | • Use elementary aggregations • Follow Narwhals patterns for operations • Let Narwhals handle backend-specific details |
| Data Quality | • Validate temporal uniqueness • Handle missing values • Check frequency consistency |
These patterns ensure robust time series processing across different ML scenarios and DataFrame backends. Some DataFrame implementations have specific requirements because of their distributed nature; Dask, for example, requires known divisions for rolling operations. Narwhals does not work around such requirements: it surfaces the backend's error, as the Dask rolling example in this section shows. Consult the latest API documentation if you need to customize error handling or implement special cases.
Best Practices Summary:
- ✅ Use elementary operations that work across all backends
- ✅ Let Narwhals handle backend-specific optimizations
- ✅ Pre-compute complex operations before grouping
- ✅ Document backend-specific requirements in docstrings
- ❌ Don't try to handle backend-specific details in your code
- ❌ Don't chain operations after group-by
- ❌ Don't use backend-specific operations
The recommended approach is to use Narwhals patterns and let it manage backend-specific optimizations and limitations. This ensures your code remains maintainable and works consistently across different DataFrame implementations.
# Import required libraries
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
import dask.dataframe as dd
from typing import Dict, List, Optional, Union, Any, Literal
@nw.narwhalify
def compute_group_metrics(df: FrameT, id_col: str, time_col: str, value_col: str) -> FrameT:
"""Compute time series metrics by group.
This function demonstrates proper Narwhals patterns for universal backend support:
1. Uses only elementary aggregations (min, max, mean, count, sum)
2. Works consistently across Pandas, Polars, and Dask
3. Properly handles missing values and time ranges
Returns a DataFrame with metrics per group:
- start_time: First timestamp in group
- end_time: Last timestamp in group
- avg_value: Mean value in group
- total_records: Count of non-null values (count() skips nulls, so rows with a missing value are not included)
- missing_records: Count of nulls
"""
# Pre-compute null indicators
df_prep = df.with_columns([nw.col(value_col).is_null().alias("is_null")])
# Use elementary aggregations
return df_prep.group_by(id_col).agg(
[
# Time range metrics (elementary)
nw.col(time_col).min().alias("start_time"),
nw.col(time_col).max().alias("end_time"),
# Value statistics (elementary)
nw.col(value_col).mean().alias("avg_value"),
# Count metrics (elementary)
nw.col(value_col).count().alias("total_records"),
nw.col("is_null").sum().alias("missing_records"),
]
)
@nw.narwhalify
def compute_rolling_stats(df: FrameT, id_col: str, time_col: str, value_col: str, window: int) -> FrameT:
"""Compute rolling statistics within groups.
This function demonstrates proper handling of complex operations:
1. Pre-computes rolling means before grouping
2. Uses elementary aggregations for group operations
3. Lets Narwhals handle backend-specific details
Note: Rolling operations have backend-specific requirements:
- Works with Pandas and Polars out of the box
- Dask requires known divisions (see Dask documentation)
- Let Narwhals handle these requirements through its error handling
Returns a DataFrame with rolling metrics per group:
- rolling_mean_{window}: Mean of rolling window values
"""
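# First sort and compute rolling means.
# Caution: the window runs over the whole sorted frame, not per group, so the
# first (window - 1) rows of each group also see values from the previous group.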
df_prep = df.sort([id_col, time_col]).with_columns([nw.col(value_col).rolling_mean(window).alias("rolling_value")])
# Then do elementary group-by aggregation
return df_prep.group_by(id_col).agg([nw.col("rolling_value").mean().alias(f"rolling_mean_{window}")])
# Test data
test_data = {
"id": [1, 1, 2, 2, 3],
"timestamp": pd.date_range("2023-01-01", periods=5, freq="D"),
"value": [10.0, 20.0, 30.0, None, 50.0],
}
# Create DataFrames
df_pd = pd.DataFrame(test_data)
df_pl = pl.DataFrame(test_data)
df_dd = dd.from_pandas(df_pd, npartitions=2)
# Test functions
print("Testing compute_group_metrics:")
print(compute_group_metrics(nw.from_native(df_pd), "id", "timestamp", "value"))
print(compute_group_metrics(nw.from_native(df_pl), "id", "timestamp", "value"))
print(compute_group_metrics(nw.from_native(df_dd), "id", "timestamp", "value"))
print("\nTesting compute_rolling_stats:")
window = 2
print(compute_rolling_stats(nw.from_native(df_pd), "id", "timestamp", "value", window))
print(compute_rolling_stats(nw.from_native(df_pl), "id", "timestamp", "value", window))
try:
print(compute_rolling_stats(nw.from_native(df_dd), "id", "timestamp", "value", window))
except ValueError as e:
print(f"\nDask rolling operations require known divisions:\n{str(e)}")
Testing compute_group_metrics:
id start_time end_time avg_value total_records missing_records
0 1 2023-01-01 2023-01-02 15.0 2 0
1 2 2023-01-03 2023-01-04 30.0 1 1
2 3 2023-01-05 2023-01-05 50.0 1 0
shape: (3, 6)
┌─────┬─────────────────────┬─────────────────────┬───────────┬───────────────┬─────────────────┐
│ id ┆ start_time ┆ end_time ┆ avg_value ┆ total_records ┆ missing_records │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ns] ┆ datetime[ns] ┆ f64 ┆ u32 ┆ u32 │
╞═════╪═════════════════════╪═════════════════════╪═══════════╪═══════════════╪═════════════════╡
│ 2 ┆ 2023-01-03 00:00:00 ┆ 2023-01-04 00:00:00 ┆ 30.0 ┆ 1 ┆ 1 │
│ 3 ┆ 2023-01-05 00:00:00 ┆ 2023-01-05 00:00:00 ┆ 50.0 ┆ 1 ┆ 0 │
│ 1 ┆ 2023-01-01 00:00:00 ┆ 2023-01-02 00:00:00 ┆ 15.0 ┆ 2 ┆ 0 │
└─────┴─────────────────────┴─────────────────────┴───────────┴───────────────┴─────────────────┘
Dask DataFrame Structure:
id start_time end_time avg_value total_records missing_records
npartitions=1
int64 datetime64[ns] datetime64[ns] float64 int64 int64
... ... ... ... ... ...
Dask Name: reset_index, 10 expressions
Expr=ResetIndex(frame=ColumnsSetter(frame=(GroupbyAggregation(frame=Assign(frame=df), arg=defaultdict(<class 'list'>, {'timestamp': ['min', 'max'], 'value': ['mean', 'count'], 'is_null': ['sum']}), observed=True, dropna=False))[[('timestamp', 'min'), ('timestamp', 'max'), ('value', 'mean'), ('value', 'count'), ('is_null', 'sum')]], columns=('start_time', 'end_time', 'avg_value', 'total_records', 'missing_records')))
Testing compute_rolling_stats:
id rolling_mean_2
0 1 15.0
1 2 25.0
2 3 NaN
shape: (3, 2)
┌─────┬────────────────┐
│ id ┆ rolling_mean_2 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪════════════════╡
│ 1 ┆ 15.0 │
│ 2 ┆ 25.0 │
│ 3 ┆ null │
└─────┴────────────────┘
Dask rolling operations require known divisions:
Can only rolling dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
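The rolling_mean_2 of 25.0 reported for id 2 above comes from a window that spans the boundary between id 1 and id 2 (the values 20.0 and 30.0). The following is a minimal plain-Python sketch of that arithmetic on the same five test values. The rolling_mean helper is hypothetical, not part of Narwhals; it only mimics a full-window rolling mean that yields no value when the window is incomplete or contains a missing entry.

```python
from itertools import groupby

# (id, value) pairs from the test data, already sorted by id and timestamp.
rows = [(1, 10.0), (1, 20.0), (2, 30.0), (2, None), (3, 50.0)]


def rolling_mean(values, window):
    """Hypothetical helper: full-window rolling mean, None when incomplete or missing."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


window = 2

# Rolling over the whole sorted frame: the third row averages 20.0 (id 1) and 30.0 (id 2).
global_roll = rolling_mean([v for _, v in rows], window)
print(global_roll)  # [None, 15.0, 25.0, None, None]

# Rolling restarted inside each group: id 2 never gets a complete, non-missing window.
per_group_roll = []
for _, group in groupby(rows, key=lambda r: r[0]):
    per_group_roll.extend(rolling_mean([v for _, v in group], window))
print(per_group_roll)  # [None, 15.0, None, None, None]
```

If the rolling statistic is meant to respect group boundaries, the two results differ for id 2, so it is worth deciding explicitly which behavior a given analysis needs.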
Pattern 2: Time Series Validation
Time series data often requires validation to ensure quality and consistency. The functions below show two key validation patterns:
# Import required libraries
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
import dask.dataframe as dd
from typing import Dict, List, Optional, Union, Any, Literal
@nw.narwhalify
def validate_temporal_uniqueness(df: FrameT, id_col: str, time_col: str) -> FrameT:
"""Validate and report temporal uniqueness.
This demonstrates lazy evaluation for validation:
- Group and count operations can be optimized
- Return DataFrame for further processing
- Let caller decide when to materialize
"""
# Stage 1: Group by ID and time
counts = df.group_by([id_col, time_col]).agg([nw.col(time_col).count().alias("count")])
# Stage 2: Filter for duplicates
return counts.filter(nw.col("count") > 1)
@nw.narwhalify
def validate_uniform_frequency(df: FrameT, id_col: str, time_col: str) -> FrameT:
"""Validate time frequency consistency.
This demonstrates lazy evaluation for validation:
- Group and compute time spans
- Return DataFrame for further processing
- Let caller decide when to materialize
"""
# Stage 1: Group by ID and compute time spans
return df.group_by(id_col).agg(
[
nw.col(time_col).min().alias("start_time"),
nw.col(time_col).max().alias("end_time"),
nw.col(time_col).count().alias("points"),
]
)
# Test data
test_data = {
"id": [1, 1, 1, 2, 2, 2],
"timestamp": [
pd.Timestamp("2023-01-01"),
pd.Timestamp("2023-01-01"), # Duplicate
pd.Timestamp("2023-01-02"),
pd.Timestamp("2023-01-01"),
pd.Timestamp("2023-01-02"),
pd.Timestamp("2023-01-04"), # Non-uniform gap
],
"value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
}
# Test with different backends
print("Testing temporal uniqueness validation:")
print("-" * 50)
for backend, df in [
("Pandas", pd.DataFrame(test_data)),
("Polars", pl.DataFrame(test_data)),
("Dask", dd.from_pandas(pd.DataFrame(test_data), npartitions=2)),
]:
print(f"\n{backend} Result:")
try:
result = validate_temporal_uniqueness(nw.from_native(df), "id", "timestamp")
if hasattr(result, "compute"):
result = result.compute()
print(result)
except Exception as e:
print(f"Error: {str(e)}")
print("\nTesting uniform frequency validation:")
print("-" * 50)
for backend, df in [
("Pandas", pd.DataFrame(test_data)),
("Polars", pl.DataFrame(test_data)),
("Dask", dd.from_pandas(pd.DataFrame(test_data), npartitions=2)),
]:
print(f"\n{backend} Result:")
try:
result = validate_uniform_frequency(nw.from_native(df), "id", "timestamp")
if hasattr(result, "compute"):
result = result.compute()
print(result)
except Exception as e:
print(f"Error: {str(e)}")
Testing temporal uniqueness validation:
--------------------------------------------------
Pandas Result:
id timestamp count
0 1 2023-01-01 2
Polars Result:
shape: (1, 3)
┌─────┬─────────────────────┬───────┐
│ id ┆ timestamp ┆ count │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ u32 │
╞═════╪═════════════════════╪═══════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 2 │
└─────┴─────────────────────┴───────┘
Dask Result:
id timestamp count
0 1 2023-01-01 2
Testing uniform frequency validation:
--------------------------------------------------
Pandas Result:
id start_time end_time points
0 1 2023-01-01 2023-01-02 3
1 2 2023-01-01 2023-01-04 3
Polars Result:
shape: (2, 4)
┌─────┬─────────────────────┬─────────────────────┬────────┐
│ id ┆ start_time ┆ end_time ┆ points │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ u32 │
╞═════╪═════════════════════╪═════════════════════╪════════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 2023-01-02 00:00:00 ┆ 3 │
│ 2 ┆ 2023-01-01 00:00:00 ┆ 2023-01-04 00:00:00 ┆ 3 │
└─────┴─────────────────────┴─────────────────────┴────────┘
Dask Result:
id start_time end_time points
0 1 2023-01-01 2023-01-02 3
1 2 2023-01-01 2023-01-04 3
Key Takeaways - Time Series Validation Patterns
This example demonstrates robust patterns for validating time series data across different DataFrame backends.
Key Patterns
Keep Operations Lazy
- ✅ Return DataFrames for further processing
- ✅ Use elementary operations (group_by, agg)
- ✅ Let caller decide when to materialize
- ❌ Don't extract scalars in validation functions
- ❌ Don't use over() for window operations
Backend Compatibility
- ✅ Use group_by and agg for aggregations
- ✅ Handle compute() at caller level
- ✅ Use only Narwhals operations
- ❌ Don't use backend-specific operations
- ❌ Don't assume eager evaluation
Validation Results
- ✅ Return DataFrames with validation details
- ✅ Include all relevant information
- ✅ Allow further processing if needed
- ❌ Don't force immediate materialization
- ❌ Don't return only boolean results
Example Results
- Temporal Uniqueness
id timestamp count
0 1 2023-01-01 2 # id 1 has two rows at 2023-01-01
- Frequency Validation
id start_time end_time points
0 1 2023-01-01 2023-01-02 3 # 3 points over a 2-day span (includes the duplicate)
1 2 2023-01-01 2023-01-04 3 # 3 points over a 4-day span (2023-01-03 is missing)
Why This Pattern Works
Universal Backend Support
- Works with Pandas, Polars, and Dask
- No backend-specific code needed
- Consistent results across backends
Efficient Processing
- Lazy evaluation enables optimization
- No unnecessary materializations
- Chainable with other operations
Rich Validation Results
- Full details for analysis
- Supports further processing
- Clear validation outcomes
Pattern 3: Mixed Frequency Handling
Time series data often has mixed frequencies that need special handling:
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 6_narwhals_ai_time_series.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum