Converting Columns with Narwhals: ML-Ready Data Types
This notebook demonstrates how to convert columns to specific data types using Narwhals' native types. This is crucial for ML preprocessing where specific column types are required. We'll cover:
Native Narwhals Types
- Using nw.Int64, nw.Float64, etc.
- Why string-based types ('int64', 'float64') should be avoided
- Backend-agnostic type safety
Column Type Conversion
- Converting single columns
- Handling multiple columns with different types
- Validating column types for ML
ML Preprocessing Patterns
- Ensuring numeric features
- Handling categorical columns
- Validating feature types
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
from typing import Dict, List, Optional, Union, Any, Literal
# Create sample data with mixed types
data = {
    "int_col": ["1", "2", "3", "4", "5"],  # Strings that should be integers
    "float_col": [1, 2, None, 4, 5],  # Integers that should be floats
    "mixed_col": [1.5, None, "3.0", 4, 5.5],  # Mixed numeric types
    "cat_col": ["A", "B", "A", "C", "B"],  # Categorical
}
# Create DataFrames with different backends
df_pd = pd.DataFrame(data)
# For Polars, convert mixed types to strings first
pl_data = {
    "int_col": [str(x) for x in data["int_col"]],
    "float_col": data["float_col"],
    "mixed_col": [str(x) if x is not None else None for x in data["mixed_col"]],
    "cat_col": data["cat_col"],
}
df_pl = pl.DataFrame(pl_data)
# Example 1: Single Column Type Conversion
@nw.narwhalify
def convert_to_numeric(df: FrameT, col: str, target_type: Union[nw.Int64, nw.Float64]) -> FrameT:
    """Convert a column to a specific numeric type using Narwhals native types.

    Pass the dtype class (e.g. nw.Int64): the log line reads target_type.__name__.

    Why Native Types?
    - Backend-agnostic type safety
    - Consistent behavior across implementations
    - Better error messages for type mismatches
    """
    print(f"Converting {col} to {target_type.__name__}")
    # Try direct cast first
    try:
        result = df.with_columns([nw.col(col).cast(target_type).alias(col)])
        return result
    except Exception:
        # If direct cast fails, go through strings: strip every comma
        # (str.replace would only remove the first one), then cast
        result = df.with_columns(
            [nw.col(col).cast(nw.String()).str.replace_all(",", "").cast(nw.Float64()).cast(target_type).alias(col)]
        )
        return result
# Example 2: Multi-Column Type Conversion
@nw.narwhalify
def convert_feature_columns(df: FrameT, type_map: Dict[str, Any]) -> FrameT:
    """Convert multiple columns to specific types for ML preprocessing.

    Why This Pattern?
    - Common ML requirement: specific types for features
    - Handles multiple columns efficiently
    - Maintains type safety across operations
    """
    result = df
    for col, target_type in type_map.items():
        # Try direct cast first
        try:
            result = result.with_columns([nw.col(col).cast(target_type).alias(col)])
        except Exception:
            # If direct cast fails, try string conversion for numeric types
            if target_type in (nw.Int64(), nw.Float64()):
                # Strip every comma (str.replace would only remove the first one)
                result = result.with_columns(
                    [nw.col(col).cast(nw.String()).str.replace_all(",", "").cast(nw.Float64()).cast(target_type).alias(col)]
                )
            else:
                # For non-numeric types, just try string cast
                result = result.with_columns([nw.col(col).cast(nw.String()).alias(col)])
    return result
# Example 3: ML Feature Type Validation
@nw.narwhalify(eager_only=True)
def validate_feature_types(df: FrameT, required_types: Dict[str, Any]) -> Dict[str, bool]:
    """Validate that columns have correct types for ML.

    Why Eager?
    - Type validation needs immediate results
    - Used before model training
    - Returns Python dict for easy checking
    """
    result = {}
    for col, required_type in required_types.items():
        # Check if column can be cast to required type
        try:
            if required_type in (nw.Int64(), nw.Float64()):
                # Try numeric conversion via strings, stripping every comma
                _ = df.select(
                    [nw.col(col).cast(nw.String()).str.replace_all(",", "").cast(nw.Float64()).cast(required_type)]
                )
            else:
                # Try direct cast for other types
                _ = df.select([nw.col(col).cast(required_type)])
            result[col] = True
        except Exception:
            result[col] = False
    return result
# Let's demonstrate these patterns
print("Example 1: Single Column Type Conversion")
print("-" * 50)
# Convert string integers to actual integers
int_df = convert_to_numeric(df_pd, "int_col", nw.Int64)
print("\nConverted int_col:")
print(int_df["int_col"].dtype)
print(int_df["int_col"])
print("\nExample 2: Multi-Column Type Conversion")
print("-" * 50)
# Define required types for ML features
type_map = {"int_col": nw.Int64(), "float_col": nw.Float64(), "mixed_col": nw.Float64()}
ml_ready = convert_feature_columns(df_pd, type_map)
print("\nML-Ready DataFrame:")
print(ml_ready.dtypes)
print("\nExample 3: ML Feature Type Validation")
print("-" * 50)
# Check if columns can be used for ML
required_types = {"int_col": nw.Int64(), "float_col": nw.Float64(), "mixed_col": nw.Float64(), "cat_col": nw.String()}
validation = validate_feature_types(df_pd, required_types)
print("\nFeature Type Validation:")
for col, is_valid in validation.items():
    print(f"{col}: {'✓' if is_valid else '✗'}")
Example 1: Single Column Type Conversion
--------------------------------------------------
Converting int_col to Int64

Converted int_col:
int64
0    1
1    2
2    3
3    4
4    5
Name: int_col, dtype: int64

Example 2: Multi-Column Type Conversion
--------------------------------------------------

ML-Ready DataFrame:
int_col        int64
float_col    float64
mixed_col    float64
cat_col       object
dtype: object

Example 3: ML Feature Type Validation
--------------------------------------------------

Feature Type Validation:
int_col: ✓
float_col: ✓
mixed_col: ✗
cat_col: ✓
Key Takeaways
Use Native Narwhals Types
- Always use nw.Int64(), nw.Float64(), etc.
- Never use string-based types like 'int64'
- Ensures backend-agnostic type safety
Column Type Conversion
- Handle string-to-numeric conversion safely
- Convert multiple columns efficiently
- Use type maps for clear requirements
ML Preprocessing
- Validate types before model training
- Handle mixed-type columns properly
- Use eager evaluation for validation
These patterns show how to use Narwhals for ML-specific column type handling, which is crucial for:
- Feature preprocessing
- Model input validation
- Type safety across backends
This completes our Narwhals tutorial series:
- General Patterns (1_narwhals_patterns.ipynb)
- Lazy vs Eager (2_narwhals_lazy_vs_eager.ipynb)
- Complex Functions (3_narwhals_complex_functions.ipynb)
- Column Types (4_narwhals_converting_columns.ipynb)
With these patterns, TemporalScope can eliminate DataFrame-to-DataFrame conversion functions and use Narwhals' native types for all data handling.
Info
This tutorial was auto-generated from the TemporalScope repository.
If you would like to suggest enhancements or report issues, please submit a Pull Request following the contribution guidelines.
Source notebook: 7_narwhals_converting_columns.ipynb
Disclaimer & Copyright
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
THIS SOFTWARE IS INTENDED FOR ACADEMIC AND INFORMATIONAL PURPOSES ONLY. IT SHOULD NOT BE USED IN PRODUCTION ENVIRONMENTS OR FOR CRITICAL DECISION-MAKING WITHOUT PROPER VALIDATION. ANY USE OF THIS SOFTWARE IS AT THE USER'S OWN RISK.
© 2024 Philip Ndikum