Awesome talk
Feature Engineering Techniques Used in the Tutorial
Categorical Feature Encoding:
Target Encoding: Encodes categorical features using the mean target value per category.
Smoothing: Reduces overfitting by blending the global mean with the per-category mean, weighted by the number of observations in each category.
K-Fold Target Encoding: Avoids data leakage by encoding using out-of-fold statistics.
Categorify: Converts categorical variables into integers, optionally grouping low-frequency categories into a single "other" bucket (see the sketch below).
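A minimal pandas sketch of the smoothed, out-of-fold target encoding described above; the column names (item_id, label) and the smoothing weight are illustrative placeholders, not taken from the tutorial:

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col, target, n_splits=5, smooth=20.0):
    """Out-of-fold target encoding with additive smoothing toward the global mean."""
    global_mean = df[target].mean()
    encoded = pd.Series(global_mean, index=df.index, dtype="float64")
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(df):
        stats = df.iloc[train_idx].groupby(col)[target].agg(["mean", "count"])
        # blend the category mean with the global mean, weighted by observation count
        blended = (stats["mean"] * stats["count"] + global_mean * smooth) / (stats["count"] + smooth)
        encoded.iloc[valid_idx] = (
            df.iloc[valid_idx][col].map(blended).fillna(global_mean).to_numpy()
        )
    return encoded

# illustrative usage with placeholder columns
df = pd.DataFrame({"item_id": ["a", "a", "b", "b", "c", "a"],
                   "label":   [1,   0,   1,   1,   0,   1]})
df["item_id_te"] = kfold_target_encode(df, "item_id", "label", n_splits=3)
```

Because each row is encoded only with statistics from the other folds, the new feature does not leak its own label into training.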
Combining Features:
Feature Combination: Creates new features by concatenating values of two or more categorical columns.
Group-by Aggregations: Calculates counts or other statistics for combined features.
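A small pandas sketch of combining two categorical columns and count-encoding the result; user_id and category are placeholder column names:

```python
import pandas as pd

df = pd.DataFrame({"user_id":  [1, 1, 2, 2, 2],
                   "category": ["shoes", "shoes", "bags", "shoes", "bags"]})

# feature combination: concatenate two categorical columns into one
df["user_category"] = df["user_id"].astype(str) + "_" + df["category"]

# group-by aggregation: count encoding of the combined feature
df["user_category_count"] = (
    df.groupby("user_category")["user_category"].transform("count")
)
```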
Numerical Feature Transformations:
Normalization:
Standard Scaling: Standardizes features to have a mean of 0 and standard deviation of 1.
Log Transformation: Normalizes skewed data using logarithmic scaling.
Min-Max Scaling: Scales features to a 0-1 range.
Gauss Rank Transformation: Maps an arbitrarily distributed feature to an approximately Gaussian one by ranking the values and applying the inverse error function.
Binning: Groups continuous variables into discrete bins, either fixed or category-specific.
Example: price binning by quantiles within each product category (see the sketch below).
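The numerical transformations above, sketched with pandas and NumPy on a made-up price column; the Gauss rank step is a simple rank-plus-erfinv approximation, not the tutorial's exact implementation:

```python
import numpy as np
import pandas as pd
from scipy.special import erfinv

df = pd.DataFrame({"category": ["a", "a", "a", "b", "b", "b"],
                   "price":    [3.0, 12.5, 7.2, 150.0, 0.99, 45.0]})

# standard scaling: zero mean, unit standard deviation
df["price_std"] = (df["price"] - df["price"].mean()) / df["price"].std()

# min-max scaling to the 0-1 range
df["price_minmax"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# log transformation for skewed values (log1p is safe at zero)
df["price_log"] = np.log1p(df["price"])

# Gauss rank: rank -> scale into (-1, 1) -> inverse error function -> roughly Gaussian
ranks = df["price"].rank().to_numpy()
scaled = 2.0 * (ranks - 0.5) / len(df) - 1.0
df["price_gauss"] = erfinv(np.clip(scaled, -0.999, 0.999))

# category-specific binning: price quantiles computed separately per category
df["price_bin"] = df.groupby("category")["price"].transform(
    lambda s: pd.qcut(s, q=2, labels=False, duplicates="drop")
)
```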
Time Series Feature Engineering:
Rolling Window Features: Aggregates historical data within a specified time window (e.g., 3 days, 7 days).
Difference Features: Computes differences between current and historical values (e.g., price changes over time).
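A rough pandas sketch of a 3-day rolling mean and a difference feature per item; it assumes rows are sorted by item and date, and the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05",
                            "2020-01-01", "2020-01-02"]),
    "price": [10.0, 12.0, 11.0, 5.0, 6.0],
}).sort_values(["item_id", "date"]).reset_index(drop=True)

# rolling 3-day mean per item; because rows are already sorted by item and date,
# the positional values line up with the original frame
rolled = (df.set_index("date")
            .groupby("item_id")["price"]
            .rolling("3D").mean())
df["price_roll_3d"] = rolled.to_numpy()

# difference feature: change versus the previous observation of the same item
df["price_diff"] = df.groupby("item_id")["price"].diff()
```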
Sparse Feature Handling:
Handling Missing Data: Fills missing categorical values with a placeholder ("unknown") and missing numerical values with the mean or median.
Low-Frequency Categories: Groups rare categories into a single "other" category.
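A short pandas sketch of both ideas, filling missing values and folding rare categories into "other"; the data and frequency threshold are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"brand": ["acme", None, "acme", "nike", "rarebrand"],
                   "price": [10.0, None, 7.0, 12.0, 3.0]})

# missing data: placeholder for categoricals, median for numericals
df["brand"] = df["brand"].fillna("unknown")
df["price"] = df["price"].fillna(df["price"].median())

# low-frequency categories: fold anything seen fewer than min_count times into "other"
min_count = 2
counts = df["brand"].value_counts()
rare = counts[counts < min_count].index
df["brand"] = df["brand"].where(~df["brand"].isin(rare), "other")
```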
Performance Optimization:
GPU Acceleration:
Leveraging RAPIDS libraries (cuDF, dask-cuDF) for faster computation on GPUs.
Distributed Computation:
Scaling workflows using Dask for parallelism across large datasets.
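For illustration, the same kind of group-by can run on a GPU with cuDF and scale out with Dask-cuDF; this is a generic sketch assuming a RAPIDS installation and an NVIDIA GPU, not code from the tutorial:

```python
# Assumes an NVIDIA GPU with RAPIDS (cudf, dask_cudf) installed.
import cudf
import dask_cudf

# single GPU: cuDF mirrors the pandas API
gdf = cudf.DataFrame({"item_id": [1, 1, 2, 2], "price": [10.0, 12.0, 5.0, 6.0]})
item_mean = gdf.groupby("item_id")["price"].mean()

# multiple GPUs / larger-than-memory data: the same logic partitioned with Dask
ddf = dask_cudf.from_cudf(gdf, npartitions=2)
item_mean_dist = ddf.groupby("item_id")["price"].mean().compute()
```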
Integration with Frameworks:
NVIDIA NVTabular:
Streamlines feature engineering pipelines for recommendation systems.
Provides pre-built operators for feature transformations and supports large-scale data processing.
Data Loaders:
Optimized data loaders feed batches into training frameworks such as TensorFlow, PyTorch, and XGBoost.
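A hypothetical NVTabular workflow combining several of the operators above; the op names follow the NVTabular API, but the columns, thresholds, and paths are placeholders and details may differ across library versions:

```python
import nvtabular as nvt
from nvtabular import ops

# categorical pipeline: integer-encode IDs, folding rare values together,
# plus a smoothed, k-fold target encoding of item_id against the label
cats = ["user_id", "item_id"] >> ops.Categorify(freq_threshold=5)
te = ["item_id"] >> ops.TargetEncoding("label", kfold=5, p_smooth=20)

# continuous pipeline: fill missing values, log-transform, then normalize
conts = ["price"] >> ops.FillMissing() >> ops.LogOp() >> ops.Normalize()

workflow = nvt.Workflow(cats + te + conts)

dataset = nvt.Dataset("train/*.parquet")      # placeholder path
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("train_processed/")
```

The processed Parquet output can then be fed to the framework-specific data loaders mentioned above.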
This systematic, GPU-accelerated approach enables fast experimentation and scales to production-level recommender systems.
Great tutorial. May I know where the source code can be downloaded?
Up
It's inside the talk: github.com/rapidsai/deeplearning/tree/main/RecSys2020Tutorial
@@silversnow111 You are the best. 🙏