Very High Quality Content! You deserve a lot more Subs!
Hi, the distribution of absence hours in most samples is highly right-skewed. Would it help to log-transform y first in your example? Or could you train a Poisson or gamma regression model instead, which is likely to fit the data better?
Hi there, could you please tell me how the "workload" variable has been coded? For example, what does workload/day = 253 actually mean?
Is there a way to ensure that we scale only the numerical columns and not the dummy variables? I feel that leaving the dummies as they are might increase the R-squared.
In my experience, scaling the dummies does nothing to inhibit the performance of the model. What is important is that the relative differences between the values are maintained, not the actual values themselves.
However, if you want to scale only the numerics, just split the DataFrame into two sections (numeric and categorical):
numeric_columns = []  # fill in the names of your numeric columns
numeric_data = df.loc[:, numeric_columns].copy()
categorical_data = df.drop(numeric_columns, axis=1).copy()
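Then apply the scaler's fit_transform method to the numeric data only. It returns a NumPy array, so wrap the result back in a DataFrame before concatenating:
scaler = StandardScaler()  # assuming scikit-learn's StandardScaler (from sklearn.preprocessing import StandardScaler); use whichever scaler you prefer
numeric_data = pd.DataFrame(scaler.fit_transform(numeric_data), columns=numeric_columns, index=df.index)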
Finally, concatenate the DataFrames back together:
df = pd.concat([numeric_data, categorical_data], axis=1)
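If you would rather not split and re-join by hand, scikit-learn's ColumnTransformer can do the same thing in one step. This is only a minimal sketch: the toy DataFrame, the column names, the helper name scale_numeric_only, and the choice of StandardScaler are all assumptions for illustration, not taken from the video.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns and two dummy columns.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "workload": [240.0, 253.0, 270.0, 310.0],
    "reason_1": [1, 0, 0, 1],
    "reason_2": [0, 1, 0, 0],
})
numeric_columns = ["age", "workload"]


def scale_numeric_only(df, numeric_columns):
    """Standardize numeric_columns and pass all other columns through unchanged."""
    other_columns = [c for c in df.columns if c not in numeric_columns]
    transformer = ColumnTransformer(
        [("num", StandardScaler(), numeric_columns)],
        remainder="passthrough",  # dummies are left untouched
    )
    scaled = transformer.fit_transform(df)
    # Output order is: transformed columns first, then the passthrough columns.
    return pd.DataFrame(scaled, columns=numeric_columns + other_columns, index=df.index)


scaled_df = scale_numeric_only(df, numeric_columns)
print(scaled_df)
```

Note that the result lists the scaled numeric columns first and the dummies after them, and that the dummies come back as floats (0.0 and 1.0) because everything ends up in a single array.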
Hope this helps! :)