Python ML #08: Sales Forecast Tutorial with Linear Regression Model
- Published on 15 Oct 2024
- In this machine learning tutorial, you will learn how to forecast sales and compare actual and forecasted sales using different metrics such as mean squared error, mean absolute error and R2 score using Linear Regression model.
We are going to use sales data from different stores from 2013 to 2017
[ items sold per day ].
**Google Colab** is being used in this tutorial instead of VS Code.
✨Download the dataset file : github.com/Bek...
✨ GitHub Repo: github.com/Bek...
🔗 Social Media
--------------------------
Facebook : / bekbrace
Twitter : / bekbrace
Instagram : / bek_brace
Tech Blog : https://dev.to/bekbrace
GitHub profile : github.com/Bek...
Website : bekbrace.com
Join this channel to get access to perks:
/ @bekbrace
The CSV file is available for you guys to download : github.com/BekBrace/Sales-Forecast-data-csv
The Repo Link: github.com/BekBrace/Sales-Forecast/tree/main
This is awesome! Just one quick note that you've probably caught already but around 22 minute mark you mistake the total for Feb 2013 for January. In fact January 2013's data was dropped after you created the sales_diff column and then dropped null values in the next line. January 2013 would have been null because it was the first row in the data (there would have been no value to call the difference). Anyway, not a big deal just wanted to point that out in case it tripped anyone up. Also, at 22:30 I think you meant to plot monthly_sales['sales_diff'] but you actually just re-plotted monthly sales. Regardless, still a great tutorial for figuring out the correct syntax.
Thank you for the kind words and for pointing that out! You're absolutely right-January 2013's data was dropped after creating the sales_diff column and dropping null values because it was the first row with no previous value to calculate the difference. Also, good catch on the plotting mistake at 22:30; it should have been monthly_sales['sales_diff'] instead of re-plotting monthly_sales. Appreciate the feedback!
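To see why the first month disappears, here is a minimal sketch with made-up monthly totals (the numbers are assumptions, not the tutorial's data). It shows `diff()` leaving a NaN in the first row and `dropna()` then removing January 2013.

```python
import pandas as pd

# Hypothetical monthly totals; the numbers are made up for illustration.
monthly_sales = pd.DataFrame({
    "date": pd.period_range("2013-01", periods=4, freq="M").to_timestamp(),
    "sales": [100, 130, 120, 160],
})

# diff() subtracts the previous row, so the first row has nothing to subtract from.
monthly_sales["sales_diff"] = monthly_sales["sales"].diff()

# Dropping NaN rows removes January 2013, leaving February as the first row.
monthly_sales = monthly_sales.dropna().reset_index(drop=True)
first_date = monthly_sales["date"].iloc[0]
print(monthly_sales)
```

After this, plotting `monthly_sales['sales_diff']` rather than `monthly_sales['sales']` gives the difference chart the comments refer to.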
This is great! The methodology I was searching for to solve my problem. Thanks a lot!
I'm glad I could help
Awesome tutorial Bek.
Just to point out, the second plot still showed the sales trend and not sales_diff. I could notice cos I was looking out for a change in the increasing trend after you performed the sales difference.
Heyyy ! Oh that must be an error from my side , sorry for that !
Hi Bek, was the ratio of your train set and test set approximately 66:34?
Yup, I changed the code to:
# Visualisation
plt.figure(figsize=(15,5))
plt.bar(monthly_sales['date'], monthly_sales['sales_diff'], width=12)
plt.xlabel("Date")
plt.ylabel("Sales")
plt.title("Monthly Customer Sales Difference")
plt.show()
Notice how I changed the plot type to bar also; I prefer that vis :)
And the Actual vs Predicted to:
# Vis the predictions vs the actual sales
plt.figure(figsize=(15,5))
# Actual Sales:
plt.bar(monthly_sales['date'], monthly_sales['sales'], width=10)
# Predicted Sales:
plt.plot(predict_df['date'], predict_df['Linear Prediction'], color='red')
plt.xlabel("Date")
plt.ylabel("Sales")
plt.title("Predictions vs Actual")
plt.show()
Great vid @BekBrace 😀
I was busy looking for examples / tutorials
Found this video
Turns out im using the same dataset as you😂
Lets gooooo
AWESOOOOOOOOOOOOOOME
Awesome tutorial ! Keep the videos coming !
I wanted to ask: do these algorithms work on qualitative data? For example, what if the value in "store" is not "1" but "Amazon"?
Thanks !
Thank you very much.
I am really not sure whether you would need to tweak the code or not; I will find out and let you know
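One common way to feed text categories to a linear regression is one-hot encoding. The sketch below is not from the video; the store names, the `price` column, and the numbers are all made up, and it only shows the general idea with `pd.get_dummies`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: the store is a name instead of a number.
df = pd.DataFrame({
    "store": ["Amazon", "Walmart", "Amazon", "Target", "Walmart", "Target"],
    "price": [10.0, 12.0, 11.0, 9.0, 13.0, 8.0],
    "sales": [200, 150, 210, 120, 140, 130],
})

# One-hot encode the text column: one 0/1 column per store name.
X = pd.get_dummies(df[["store", "price"]], columns=["store"], dtype=float)
y = df["sales"]

# The model now only sees numeric columns.
model = LinearRegression().fit(X, y)
print(X.columns.tolist())
```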
Hi @Bek, great video.
I see a few other people also asking the same question as mine. How can we use the fit model to predict sales for upcoming days? The sample data is at the day level so let's assume predicting daily sales for the upcoming month. Maybe you can record a new video as that will really add a lot of value.
Thanks.
Hey Hey 👋 thank you
I might create a follow-up video on this specific point / Thanks for the suggestion
But to complete the loop and get an answer to the question in hand, do you have any recommendations for how to predict the upcoming days/weeks? Thanks again.
@@gurtejbains Hey, I'm really interested to know how to forecast next days, months with this method as well !! Did you manage to find a solution?
@@BekBrace
Hi, @Bek Brace
lr_model=LinearRegression()
lr_model.fit(x_train,y_train)
I am getting this error after putting in that code, what could be the issue
ValueError: Found input variables with inconsistent numbers of samples: [1, 33]
@@BekBrace Hi there mate... thanks a lot for the video, it's amazing. what about this other video to show how to make the predictions for the upcoming days... this is actually what matters as there is no sense in predicting something is already passed. You are a great teacher and pass the info clearly wed love to have this video from you continuing with the explanation, please.
Thank you, Bek for this nice job, helping others!
Thank you for your support, Paulo 🙏
Hi Bek, it was a great tutorial but i have question, why you calculated lr_mse, lr_mae, lr_r2 vars as you are not using them anywhere?
Salam Saqib. Thank you for watching, brother, there was supposed to be a second part for the tutorial, unfortunately I haven't had the chance to finish it, that's why. Hope you're not disappointed, and thank you for being a good friend for the channel's 🙂🙏
Hello, i'm trying to apply this for university project but i'm not sure about what the process would be to make the predictions for the following months that we don't have information, could you help me? Many thanks
Sure
Good question Luis. @bek, any answer for how to achieve this? Use the fit model to predict sales for the upcoming quarter?
Same question
good question mate... did you find out how to do it?
You need time series analysis for this, or you can enter the values for the next quarters manually, but keep in mind that extrapolation is not always reliable.
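For anyone who wants something concrete to start from, here is a rough sketch of one way to forecast months that are not in the data: predict one month, then feed that prediction back in as the newest lag. It is not the code from the video. The data is synthetic, scaling is left out to keep it short, and `N_LAGS` and `HORIZON` are assumed values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

N_LAGS = 12   # assumption: 12 lagged months as inputs
HORIZON = 3   # assumption: forecast the next 3 months

# Synthetic monthly sales (trend + yearly seasonality); stands in for real data.
months = pd.date_range("2013-01-01", periods=60, freq="MS")
t = np.arange(60)
sales = pd.Series(1000 + 10 * t + 100 * np.sin(2 * np.pi * t / 12), index=months)

# Work on month-to-month differences.
diff = sales.diff().dropna()

# Supervised table: column 0 is the target, columns 1..N_LAGS are the lags.
table = pd.concat([diff.shift(i) for i in range(N_LAGS + 1)], axis=1).dropna()
X, y = table.iloc[:, 1:].to_numpy(), table.iloc[:, 0].to_numpy()

model = LinearRegression().fit(X, y)

# Recursive forecast: each prediction becomes the newest lag for the next step.
history = list(diff.to_numpy())
level = sales.iloc[-1]
forecast = []
for _ in range(HORIZON):
    # Take the last N_LAGS differences, most recent first, as one row.
    lags = np.array(history[-1:-N_LAGS - 1:-1]).reshape(1, -1)
    next_diff = model.predict(lags)[0]
    history.append(next_diff)
    level = level + next_diff          # undo the differencing
    forecast.append(level)

future_months = pd.date_range(months[-1] + pd.offsets.MonthBegin(1),
                              periods=HORIZON, freq="MS")
forecast = pd.Series(forecast, index=future_months)
print(forecast)
```

Errors accumulate with each step because later predictions are built on earlier ones, so the further out you go, the less you should trust the numbers.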
Bek, thank you for the tutorial. I have one question: why do you take 13 rows for the actual sales and not 12?
Bek, I want to ask: did you drop 'date' and 'sales' when you made supervised_data?
Yes, the 'date' and 'sales' columns were not used directly in the supervised data. Instead, the 'sales' column was transformed into 'sales_diff' to capture the monthly sales differences. The 'date' column was not included in the supervised data.
I wish there was a tutorial for forecasting demand by items and by stores with this same dataset.
Noted
Awesome video, thanks so much for putting this out. Please I’m working on a project to predict sales for 28 days for Walmart store. Is it possible to follow this code format?
Thanks man, I'd say yes 🙂
@@BekBrace please how do i do the foreast for just 28 days? what should i do please
@@ehiztheo166 hi there bro... above @SHUVRO AHMED said its about the threshold for the loop, check it out and try diminishing your as he pointed. It seems @Bek Brace is too busy to reply so many questions... hehehehe
Hello,
Can I use the same method you used here, in yearly gross production data?
Thanks in advance
Yes of course you can 👍
Hello, Bek, thank you for making this video, it helps me alot.
But, I want to ask something. When I first create the linear regression, at the point where I pass x_train and y_train to the model's fit, it says, "Found input variables with inconsistent numbers of samples." Any clue?
I noticed that when I look at range 1-13 for the sales differences of each store each month, my results are different from yours.
This is odd. I have got to find the time to check out the code, but please feel free to ask the friends on the channel, they might be able to answer you quicker
I tried re-running and re-checking everything and found something odd: my supervised_data for sales_diff is totally different from yours. Mine starts with 3130 while yours starts with a negative value. @@BekBrace Any clue?
I’m glad to hear the video has been helpful! Regarding the error you encountered-“Found input variables with inconsistent numbers of samples”-this typically occurs when the X_train and y_train datasets do not have the same number of rows. Here’s how you can address this issue:
* Check Lengths: Make sure that both X_train and y_train have the same number of rows. You can check this by printing their shapes:
print(X_train.shape)
print(y_train.shape)
* Synchronize Data: Ensure that during your data preparation phase, when you split the data or create features, you keep the dataset synchronized. For example, if you're creating lagged features or handling missing values, make sure each operation maintains alignment between your features (X_train) and targets (y_train).
* Handling Missing Data: If your preprocessing steps (like calculating differences or dropping rows) introduce missing values, ensure that you handle these consistently across both feature and target datasets. For instance, if you drop rows with NaN values in X_train, do the same for y_train:
# Assuming you've identified rows with NaNs in X_train
X_train = X_train.dropna()
y_train = y_train.loc[X_train.index] # Align y_train with X_train
* Review Data Preparation: Go back and review the steps where you prepare X_train and y_train. There might be a step where the data gets out of sync, such as when splitting the data or creating features.
By ensuring that both X_train and y_train are correctly aligned and contain the same number of samples, you should be able to resolve this error. If the issue persists, feel free to share more details about how you are preparing your datasets, and I’ll help you debug further!
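The checks above can be sketched as follows. This assumes the supervised data is a NumPy array with the target in column 0 and 12 lag features after it; that layout and the random numbers are assumptions for illustration, so adapt the slicing to your own array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical supervised array: 35 rows, column 0 = target, columns 1-12 = lag features.
rng = np.random.default_rng(0)
train_data = rng.normal(size=(35, 13))

# Slice columns, not rows, so every row keeps its own target.
x_train = train_data[:, 1:]      # shape (35, 12)
y_train = train_data[:, 0]       # shape (35,)

# The first dimension (number of samples) must match before calling fit().
assert x_train.shape[0] == y_train.shape[0]

lr_model = LinearRegression().fit(x_train, y_train)

# A common slip: train_data[0:1] selects one ROW, which gives only 1 sample.
wrong_x = train_data[0:1]
print(x_train.shape, y_train.shape, wrong_x.shape)
```

An error such as "inconsistent numbers of samples: [1, 33]" usually points to a row slice like `wrong_x` being passed where a column slice was intended.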
Hello, Bek. I am trying to follow your code but with different data, and I have a problem in the 'preparing supervised data' step. When I run it, every row has NaN values, so I end up with nothing (they all get dropped). What should I do about that problem? Can you give some insight?
Btw, awesome tutorial, Bek. Thank you for sharing this with us.
# sorry if you not understand what i am saying, english is not my first language.
Hey hey 👋 your English is perfect 👍 and i understand your problem. Only one thing, when you try to clean the data from NaN, what do you get ?
@@BekBrace i got nothing just a column name like yours.
btw, thank you for responding
@SHUVRO AHMED nice you replied him, otherwise he'd still be lost... do you also get the need to have the predictions for the upcoming days? as its not part of this tutorial... im kinda lost of what this is for without the prediction for the upcoming days... if you do, can you share with me?
When you want to plot the sales_difference you forgot to write (at Y axis data) difference. So the plot is wrong at 23:09
Thanks for the heads up
First of all thanks alot for awesome tutorial.
Could you please answer how to apply the model to predict for next year, in this case 2019?
I will probably create a whole video to explain that, thanks for the suggestion my friend :)
This video is already done?@@BekBrace
Thanks a lot Bek, I saw your video on React and Fast API (FARM Stack) in freeCodeCamp, thanks a lot for that video. I am here to request you a video on Next Js and Fast API authentication. I am really waiting for your video and reply on this topic. Have a great day :)
Thank you very much for you kind words.
Your request is taken in consideration :)
@@BekBrace I'll be waiting.
When I go to plot the chart, I get an error: TypeError: float() argument must be a string or a real number, not 'Period'.
Can you please help me?
Sure. The error message indicates that matplotlib is trying to convert a pandas 'Period' object to a float, which is not possible. To resolve this, convert the 'Period' values in the date column to timestamps before plotting, using the .dt.to_timestamp() accessor:
monthly_sales['date'] = monthly_sales['date'].dt.to_timestamp()
Make sure to check your code and ensure that you apply this conversion before the plotting call.
Your voice is awsm
Thank you 😊
indeed... really smooth and you know how to explain it well. Congrats... just wish you could have a video showing how to predict the upcoming sales for the next 3 months.
Excellent tutorial! I have a couple of questions. In the graphs you presented, "Monthly Customer Sales" and "Monthly Customer Sales Difference," they appear to be identical. Shouldn't the second graph include the "Sales Difference" column instead of "Sales" on the y-axis? I apologize for the confusion, but I would greatly appreciate it if you could clarify this.
Hi, thank you so much for watching :) - Yes, and that was a mistake from my side
Nice sample, but if I need to make predictions to 01-2019, 02-2019 … What I need to change?
Hi, Bek! How could I predict for the next months using the same methodology?
Good question! That may trigger a future video to explain in details
@@BekBrace i need this too Sir
Thanks Bek! 🔥
Thanks a lot for the support ☺️
Hi, I have a dataset whose granularity is monthly: I receive data for multiple items and stores, but only monthly, i.e. on the 1st of each month. How can I adapt the code accordingly and forecast shares?
hey, how did you do it , cause I have the same issue
I have a question about how to interpret the supervised data:
I'm following the code and getting the same data as you no errors, but I'm confused on how the supervised_data ended up with 47 rows. How do these rows represent the sales of each store number if we dropped the store number in the very start of the video????
I also have this same question
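The 47 rows can be reproduced with a small sketch. It assumes 60 monthly totals (2013-01 to 2017-12), one differencing step, and 12 lag columns; the sales values are placeholders and the `month_i` column names are hypothetical. Each row is a month of total sales across all stores, not a store.

```python
import numpy as np
import pandas as pd

# 2013-01 to 2017-12 gives 60 monthly totals (values are placeholders).
monthly = pd.DataFrame({
    "date": pd.date_range("2013-01-01", "2017-12-01", freq="MS"),
    "sales": np.arange(60, dtype=float) ** 2,
})
monthly["sales_diff"] = monthly["sales"].diff()
monthly = monthly.dropna()                      # 60 - 1 = 59 rows

supervised = monthly.drop(columns=["date", "sales"])
for i in range(1, 13):                          # assumption: 12 lag columns
    supervised[f"month_{i}"] = supervised["sales_diff"].shift(i)

# The first 12 rows have no complete history, so they are dropped: 59 - 12 = 47 rows.
supervised = supervised.dropna().reset_index(drop=True)
print(supervised.shape)
```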
How come you are no longer grouping with the store sales df but just with monthly sales? That got me confused. I suggest that next time you let the code run so you can see what you are getting; you just run it but don't check whether the results are what you want before moving on to the next step.
Hi Bek! How do I predict the future with this method, for example for the next 3 or 6 months?
Depends
Hi there mate... did you find out how to do it?
Sir, what is the accuracy percentage of this project? I mean, how accurate is the prediction?
Hi there.
The accuracy of a sales forecast in this tutorial depends on various factors, including the quality and relevance of the data, the appropriateness of the assumptions made, and the complexity of the sales patterns being modeled. As you saw, I used linear regression which is a commonly used technique for sales forecasting because it provides a straightforward way to model the relationship between independent variables (e.g., time, marketing spend) and sales.
However, the accuracy of the predictions produced by a linear regression model can vary. It is important to evaluate the model's performance using appropriate metrics such as mean squared error (MSE), mean absolute error (MAE), or R-squared (coefficient of determination). These metrics quantify the difference between the predicted sales values and the actual sales values.
I hope this answers your question, and don't leave the channel, as I am soon going to be doing a Credit Card Fraud detection analysis tutorial.
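As a quick illustration of the metrics mentioned above, here is a sketch using scikit-learn on made-up actual and predicted values (the numbers are not from the tutorial).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted monthly sales.
actual = np.array([100.0, 120.0, 90.0, 110.0])
predicted = np.array([110.0, 115.0, 95.0, 100.0])

mse = mean_squared_error(actual, predicted)    # mean of squared errors
rmse = np.sqrt(mse)                            # same units as sales
mae = mean_absolute_error(actual, predicted)   # mean of absolute errors
r2 = r2_score(actual, predicted)               # note the order: (y_true, y_pred)

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae, "R2:", r2)
```

MSE and MAE are symmetric in their arguments, but `r2_score` is not, so the actual values should go first.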
how could the sum of sales for 2013-02-01 be 459417?
This total likely represents the aggregated sales across all stores and items for the entire month of January 2013. In data preprocessing, the dates might be shifted or labeled to reflect the period they represent, such as using the first day of the following month to indicate the total sales of the previous month. Always ensure that the date handling aligns with how your data is structured and aggregated.
Thank you ❤
Thanks for your effort, I really appreciate it, but I am stuck figuring out the logic behind the supervised data. Can someone please explain it?
Thank you :)
The supervised data creation is about structuring the dataset for supervised learning. We transform the time series data into a supervised learning problem by creating input-output pairs :)
Shift the Data: We use the shift method to create lagged versions of the data. For example, if you want to predict sales based on the previous month, you shift the sales data by one month.
Concatenate the Data: Combine these lagged features with the original data, aligning them properly to ensure each row contains the sales data for the current and previous months.
Drop NaNs: Any rows that have NaN values (which occur because of shifting) are dropped to maintain a consistent dataset.
This results in a dataset where each row can be used as an input-output pair for training a model. The input features are the lagged sales data, and the output is the sales for the current month.
Here's a small code snippet to illustrate this:
supervised_data = pd.concat([monthly_sales.shift(i) for i in range(1, n+1)], axis=1)
supervised_data.dropna(inplace=True)
beware my friend, n is the number of lagged months you want to use as input features.
it's a long answer, but hopefully this cleared any mysteries for you :)
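To make the input-output pairs visible, here is a tiny sketch with five made-up difference values and n = 2. The `month_1`, `month_2` column names are only illustrative. Unlike the one-liner above, it keeps the current month as the first column so each row holds the output followed by its inputs.

```python
import pandas as pd

n = 2  # number of lagged months used as inputs (kept small for readability)
sales_diff = pd.Series([5.0, -3.0, 8.0, 2.0, -1.0], name="sales_diff")

# Column 0 is the current month (the output); the rest are the previous months (the inputs).
lags = [sales_diff.shift(i).rename(f"month_{i}") for i in range(1, n + 1)]
supervised_data = pd.concat([sales_diff] + lags, axis=1)

# The first n rows contain NaN because there is no earlier month to shift from.
supervised_data = supervised_data.dropna().reset_index(drop=True)
print(supervised_data)
```

The first remaining row reads: output 8.0, previous month -3.0, two months back 5.0.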
In splitdata into train and test:
The MinMaxScaler code with feature range (-1,1) shows an error: found array with 0 sample(s) while a minimum of 1 is required by MinMaxScaler. How do I fix this ValueError?
Hey friend!
The error you're encountering means that the array you are passing to the MinMaxScaler is empty (it has zero rows), and the MinMaxScaler requires at least one sample to determine the scaling parameters.
To fix this issue, check the shape of your train and test arrays before scaling; an empty array often means that an earlier step, such as dropping NaN rows or slicing the data, removed every row.
just wanna add, if you face some error with this code line
monthly_sales = store_sales.groupby('date').sum().reset_index()
change to this
monthly_sales = store_sales.groupby('date').agg({'sales':'sum'}).reset_index()
Hi! If in my dataset, I've gotten a negative R2 score what does it mean?
Hi Julian :) - well, the R-squared (R2) score is a statistical measure that indicates the proportion of the variance in the dependent variable that can be explained by the independent variables in a regression model. It typically ranges from 0 to 1, with higher values indicating a better fit of the model to the data, but it can drop below 0 when the model fits very poorly.
If you obtained a negative R2 score, it means that the regression model you used performed worse than a horizontal line (i.e., a constant model that ignores the independent variables) in explaining the variance in the dependent variable. In other words, the model's predictions are even worse than simply using the mean value of the dependent variable as a constant.
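A small sketch of that comparison, with made-up numbers: predicting the mean everywhere gives an R2 of 0, and predictions that move in the wrong direction give a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

actual = np.array([100.0, 120.0, 90.0, 110.0])

# Baseline: always predict the mean of the actual values, which gives R2 = 0.
baseline = np.full_like(actual, actual.mean())
r2_baseline = r2_score(actual, baseline)

# Predictions that move in the wrong direction are worse than the baseline, so R2 < 0.
bad = np.array([120.0, 90.0, 120.0, 90.0])
r2_bad = r2_score(actual, bad)

print(r2_baseline, r2_bad)
```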
What changes should I imply to predict for the next 3 years?
I can't read the csv file,
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 17: invalid start byte
What am i doing wrong?
Hey friend.
This error occurs when the CSV file you're trying to read contains bytes that are not valid UTF-8, which usually means the file was saved in a different encoding. I think one way to solve it is to tell pandas which encoding to use when reading the file:
import pandas as pd
# 'iso-8859-1' (Latin-1) is a guess; use the encoding your file was actually saved in
df = pd.read_csv('file.csv', encoding='iso-8859-1')
Try that and let me know
is this code applicable for multiple linear regression?
Yes, the code can be adapted for multiple linear regression. Multiple linear regression simply involves more input features.
from sklearn.linear_model import LinearRegression
import numpy as np
# Assuming 'train_data' is your training data with multiple features
X_train = train_data.drop('target', axis=1)
y_train = train_data['target']
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Prepare the most recent row of features for prediction (2D: one row, one column per feature)
# 'last_12_months_data' is a placeholder for your own values
last_12_months_features = np.array([last_12_months_data])
# Prepare list to store predictions
future_predictions = []
for _ in range(12):
    # Predict next month's target (predict returns an array, so take its first element)
    next_month_prediction = model.predict(last_12_months_features)[0]
    # Append the prediction to future_predictions
    future_predictions.append(next_month_prediction)
    # Update last_12_months_features for next prediction
    # 'update_features' is a placeholder for a helper you write yourself
    last_12_months_features = update_features(last_12_months_features, next_month_prediction)
# future_predictions now contains the forecast for the next 12 months
I don't get the logic behind monthly_Sales = df.groupby('Date').sum().reset_index()
Why group monthly_Sales by month when you convert the date back to a timestamp later on?
This just gives you the total sales for a particular month ...first grouped by month and then take the sum of all the sales in that month...He changes the data type of the 'date' for the sake of time series plot .
Thank you Naren for the answer
lr_mse = np.sqrt(mean_squared_error(predict_df['Linear Prediction'], monthly_sales['sales'][-12:]))
lr_mae = mean_absolute_error(predict_df['Linear Prediction'], monthly_sales['sales'][-12:])
lr_r2 = r2_score = (predict_df['Linear Prediction'], monthly_sales['sales'][-12:])
print("Linear Regression MSE", lr_mse)
print("Linear Regression MAE", lr_mae)
print("Linear Regression R2", lr_r2)
Bro, I have run this code but the accuracy is not displaying .
I follow what you said to cut this specific code and give runall and again i paste and run the code it again show like that only not show the accuracy?What to do now?
i have the same problem as yours
Hi @user-ev8cs6yw4r and @saloualakhdar6659
You need to change this line from:
lr_r2 = r2_score = (predict_df['Linear Prediction'], monthly_sales['sales'][-12:])
To:
lr_r2 = r2_score(predict_df['Linear Prediction'], monthly_sales['sales'][-12:])
The change in the video happens at 52:24 -> 52:25, but it isn't mentioned ;)
Why does mine say 'DataFrame' object has no attribute 'reset'? How do I solve it?
The error "'DataFrame' object has no attribute 'reset'" likely occurs due to a typo. The correct method is reset_index(). Ensure you use:
df.reset_index(drop=True)
Double-check your DataFrame's name to avoid referencing errors.
Hi Nice explanation, can you give google colab file to me?
Unfortunately it was lost and I didn't keep a copy of it. I'll look through my old files though and keep you posted. Thank you for watching 🙂
How to use this model to predict 2018 forecast
# Assuming 'model' is your trained linear regression model and 'scaler' is your Min-Max scaler,
# fitted on a single column of sales values (adapt the reshapes if yours was fitted differently)
import numpy as np
# The most recent monthly sales up to Dec 2017 (the model expects as many values as it had lag features)
input_features = np.array([Dec_2016_sales, Jan_2017_sales, ..., Nov_2017_sales, Dec_2017_sales])
# Scale the features as the model expects scaled input, then arrange them as one row
scaled_features = scaler.transform(input_features.reshape(-1, 1)).reshape(1, -1)
# Make prediction (the input is already 2D, so no extra brackets are needed)
predicted_sales_Jan_2018 = model.predict(scaled_features)
# Inverse scale if the output was scaled
predicted_sales_Jan_2018 = scaler.inverse_transform(predicted_sales_Jan_2018.reshape(-1, 1))
# Use predicted_sales_Jan_2018 to update your dataset for the next prediction if necessary
at 16:31 i m getting an error 'DatetimeProperties' object has no attribute 'to_timestamp'
help me please
The error message you are seeing, "'DatetimeProperties' object has no attribute 'to_timestamp'", means that your 'date' column still has the datetime64 dtype when you call .dt.to_timestamp() on it.
In pandas, .dt.to_timestamp() is only available for columns with a Period dtype; for a datetime64 column the .dt accessor is a DatetimeProperties object, which has no such method.
Make sure you ran the earlier step that converts the dates to monthly periods with .dt.to_period('M') before this line. If your column already holds datetimes, you can skip the to_timestamp() call.
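The order of the conversions matters here, so below is a short sketch of the whole chain on a few made-up daily rows: string to datetime, datetime to monthly Period, sum per month, then Period back to Timestamp for plotting. The column names follow the ones used in the comments; the values are assumptions.

```python
import pandas as pd

# Hypothetical daily data with string dates.
store_sales = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-15", "2013-02-03", "2013-02-20"],
    "sales": [10, 20, 30, 40],
})

# 1) Strings to datetime64. At this point .dt has no to_timestamp().
store_sales["date"] = pd.to_datetime(store_sales["date"])

# 2) datetime64 to monthly Period, then sum the sales per month.
store_sales["date"] = store_sales["date"].dt.to_period("M")
monthly_sales = store_sales.groupby("date").agg({"sales": "sum"}).reset_index()

# 3) Period to Timestamp (first day of the month), which matplotlib can plot.
monthly_sales["date"] = monthly_sales["date"].dt.to_timestamp()
print(monthly_sales)
```

Skipping step 2 leads to the 'DatetimeProperties' error, while skipping step 3 leads to the "not 'Period'" TypeError when plotting.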
collab link ?
pinned ?
can you post the codes that you are using?
It's pinned
Can you provide the total code if possible
Sorry, but I lost the code somehow
We didn't use XGBoost and Random Forest as we intended at first
True
where can I find the code?
pinned
awasome sir, can I ask for the code?
Yes, sure
I am not able to download this train dataset from github, if anyone could please guide me…
Click on the file, then click view raw, then copy the data and paste it into an excel file saved under csv file extension
Done, now while writing the code, I am facing issues while downloading library for tensorflow
@@shreyagoyal2847what's the error ?
In your visualization of the sales difference, the xlabel is date but for the ylabel you took sales. I suppose it should be sales_diff. Please clarify?
how to predict 1 year in the future after this?
import numpy as np
# Assuming 'model' is your trained model and 'scaler' is your Min-Max scaler,
# fitted on a single column of sales values (adapt the reshapes if yours was fitted differently)
# Last 12 months sales data as a flat 1D array
input_features = np.array(last_12_months_sales, dtype=float)
# Prepare list to store predictions
future_predictions = []
for _ in range(12):
    # Scale the 12 values as one column, then arrange them as a single row of 12 features
    scaled_features = scaler.transform(input_features.reshape(-1, 1)).reshape(1, -1)
    # Predict next month's sales
    next_month_prediction = model.predict(scaled_features)
    # Inverse scale the prediction and take the single value out of the 2D result
    next_month_sales = scaler.inverse_transform(np.asarray(next_month_prediction).reshape(-1, 1))[0, 0]
    # Store the prediction
    future_predictions.append(next_month_sales)
    # Update input features for next prediction: drop the oldest month, append the new one
    input_features = np.append(input_features[1:], next_month_sales)
# future_predictions now contains the sales forecast for the next 12 months
@@BekBrace the result of future prediction is like that, is that true?
future_predictions
[array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706]),
array([0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706,
0.123706, 0.123706, 0.123706, 0.123706, 0.123706, 0.123706])]
How can you make this into a website feature?
That needs a deeper look. I do not have a ready answer now, but I suspect it is very possible to convert the algorithms into an interactive web app for deployment
Hi Bek, could you also share the test dataset, please? Thank you.
Yeah about that, unfortunately i cannot do it for the moment 😔 but i promise to do that later today