Multivariate Time Series Data Preprocessing with Pandas in Python | Machine Learning Tutorial

  • Published on 22 Jul 2024
  • 🎓 Prepare for the Machine Learning interview: mlexpert.io
    🔔 Subscribe: bit.ly/venelin-subscribe
    📖 Get SH*T Done with PyTorch Book: bit.ly/gtd-with-pytorch
    🗓️ 1:1 Consultation Session With Me: calendly.com/venelin-valkov/c...
    🔣 GitHub: github.com/curiousily/Getting...
    Learn how to prepare data for Time Series forecasting. We'll convert minute-by-minute Bitcoin trading data (stored in a CSV file) into sequences. We'll scale the data and split it into training and test sets.
    #TimeSeries #LSTM #PyTorch #Python #Transformer
  • Science & Technology

Comments • 54

  • @MaximeAntoine97
    @MaximeAntoine97 2 years ago +9

    This video is a gold mine for multivariate time series data. After searching for hours online, you were the ONLY person that was capable of explaining everything in a simple way.
    Thank you!

  • @Asparuh.Emilov
    @Asparuh.Emilov 2 years ago +7

    This is by far one of the best videos I have seen about data preprocessing for Time Series Data. Keep up the good work please!

  • @AlistairWalsh
    @AlistairWalsh 3 years ago +12

    8:32 - iterating over rows in pandas is usually much slower than doing a column-wise operation.
    Instead of this:
    df["close_change"] = df.progress_apply(
        lambda row: 0 if np.isnan(row.prev_close) else row.close - row.prev_close,
        axis="columns",
    )
    Try this:
    # Vectorized subtraction; the first row has no previous close, so its NaN is filled with 0.
    df["close_change"] = (df["close"] - df["prev_close"]).fillna(0)

    • @maxmohamed9878
      @maxmohamed9878 2 years ago

      Maybe Venelin was trying to show how to use progress_apply. But the way you calculated close_change is the best way.

    • @sinabirecik
      @sinabirecik 2 years ago

      If you only want the change in the value, you can also use diff()
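The diff() suggestion can be sketched as follows. This is a minimal illustration, not code from the video: the prices are made up, and the column names close, prev_close and close_change simply follow the snippet earlier in this thread.

```python
import pandas as pd

# Hypothetical closing prices; real data would come from the Bitcoin CSV.
df = pd.DataFrame({"close": [100.0, 102.5, 101.0, 105.0]})

# Shift-based version: build the previous close, then subtract.
df["prev_close"] = df["close"].shift(1)
shift_based = (df["close"] - df["prev_close"]).fillna(0)

# diff-based version: the same row-to-row change in a single call.
df["close_change"] = df["close"].diff().fillna(0)

print(df["close_change"].tolist())
```

Both versions leave a NaN in the first row (there is no earlier row to compare with), which is why fillna(0) is still needed.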

  • @paulntalo1425
    @paulntalo1425 3 years ago

    Thank you for this wonderful video showcasing PyTorch for LSTM time series

  • @MayssaRekik
    @MayssaRekik 2 months ago

    Love the sweet song you shared in your notebook. Been vibin to Common's music while going through the code. Great stuff thanks for sharing!

  • @Deepakkumar-sn6tr
    @Deepakkumar-sn6tr 3 years ago +2

    great job Venelin!!...waiting for a video on fine-tuning Transformer based recommender :)

  • @ephi124
    @ephi124 3 years ago

    probably the best video on time series.

  • @kadourkadouri3505
    @kadourkadouri3505 1 year ago +1

    The scaler part is a huge weakness in the model: by using a min-max scaler you are assuming that the historical ATH (all-time high) price will never be exceeded, which is a fundamental mistake, as (asset) prices are continuous. Therefore, the model will likely not be able to predict a resistance.
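The concern about min-max scaling can be illustrated with a small sketch. It assumes the scaler is fitted on the training window only and uses a hand-written min-max formula with a (-1, 1) range rather than any particular library class; fit_min_max and the prices are hypothetical.

```python
import numpy as np

def fit_min_max(train, low=-1.0, high=1.0):
    """Return a function that scales values using the min/max of `train` only."""
    t_min, t_max = float(np.min(train)), float(np.max(train))

    def transform(values):
        # Map [t_min, t_max] onto [0, 1], then stretch to [low, high].
        unit = (np.asarray(values, dtype=float) - t_min) / (t_max - t_min)
        return unit * (high - low) + low

    return transform

# Hypothetical prices: the training window tops out at 200.
train_prices = np.array([100.0, 150.0, 200.0])
scale = fit_min_max(train_prices)

scaled_train = scale(train_prices)   # stays inside [-1, 1]
scaled_new_high = scale([250.0])     # a later all-time high lands above 1

print(scaled_train, scaled_new_high)
```

Any price above the training maximum is mapped outside the range the model saw during training, which is the situation the comment warns about.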

  • @mehmetnaml5073
    @mehmetnaml5073 3 years ago +3

    Thanks for the video, Venelin. It is really good for learning the coding side of things. For those who want to do real-life projects, I suggest not scaling every feature the same way. I might be wrong, but I don't think it is a good idea to scale days of the week (0-6 range), months, etc. with a MinMaxScaler range of (-1, 1). They are not numerical features like the price or volume. They are categorical data, if I am not wrong, and scaling them this way will confuse the algorithms.
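Two commonly used alternatives to min-max scaling for calendar features are sketched below. Neither comes from the video; the column names dow_sin and dow_cos and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical day-of-week values, 0 = Monday ... 6 = Sunday.
features = pd.DataFrame({"day_of_week": [0, 1, 2, 3, 4, 5, 6]})

# Option 1: one-hot encoding treats each day as its own category.
one_hot = pd.get_dummies(features["day_of_week"], prefix="dow")

# Option 2: sine/cosine encoding keeps the weekly cycle, so Sunday (6)
# ends up next to Monday (0) instead of at the opposite end of the scale.
angle = 2 * np.pi * features["day_of_week"] / 7
features["dow_sin"] = np.sin(angle)
features["dow_cos"] = np.cos(angle)

print(one_hot.shape)
print(features.round(3))
```

Which option works better depends on the model and the data, so this is something to test rather than assume.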

  • @antonbozhinov
    @antonbozhinov 2 years ago +1

    Great content! Thanks for your efforts!

  • @zenfascist
    @zenfascist 3 years ago

    Excellent tutorial! Thanks a lot!

  • @Rody2013
    @Rody2013 2 years ago

    Thank you for the very informative video. May I ask why we need to convert our pandas DataFrame into sequences?

  • @priodyutipradhan66
    @priodyutipradhan66 2 years ago

    Great videos! Thanks for sharing!

  • @mohammadfadel6447
    @mohammadfadel6447 1 year ago

    This is a very high-quality video, thanks!!
    Have you done any anomaly detection on a multivariate time series?

  • @gregjuva
    @gregjuva 3 years ago +3

    Great tutorial! Thanks! One comment in the preprocessing step. Iterating over each row to create a dictionary and appending those dictionaries to a list is much much slower than copying the dataframe and creating the columns you need like so:
    features_df = df.copy()
    features_df['day_of_week'] = features_df['date'].dt.dayofweek
    features_df['day_of_month'] = features_df['date'].dt.day
    features_df['week_of_year'] = features_df['date'].dt.week
    features_df['month'] = features_df['date'].dt.month

    • @Aegilops
      @Aegilops 2 years ago

      Great suggestion Greg, and agree it felt faster. Interestingly got a deprecation warning on .week, so went with features_df['date'].dt.isocalendar().week
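Putting the two comments above together, a runnable sketch of the column-wise feature creation could look like this. The timestamps are made up and stand in for the CSV's date column; the week-of-year line uses isocalendar() because the .week accessor is deprecated and no longer available in recent pandas releases.

```python
import pandas as pd

# Hypothetical minute-level timestamps standing in for the CSV's date column.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-04 00:00", "2021-01-04 00:01", "2021-03-15 12:30"])}
)

features_df = df.copy()
features_df["day_of_week"] = features_df["date"].dt.dayofweek
features_df["day_of_month"] = features_df["date"].dt.day
# isocalendar() returns a DataFrame with year, week and day columns;
# its week column takes the place of the old Series.dt.week accessor.
features_df["week_of_year"] = features_df["date"].dt.isocalendar().week.astype("int64")
features_df["month"] = features_df["date"].dt.month

print(features_df)
```

The cast to int64 is optional; it only replaces the unsigned nullable dtype that isocalendar() returns with a plain integer column.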

  • @gj2u
    @gj2u 1 year ago

    Venelin, hey! Very nice video! Can you comment on why you picked the range from -1 to 1 for scaling?

  • @SaudBako
    @SaudBako 2 years ago

    20:14 is where we write the create_sequences function

  • @marlonlopezpereyra
    @marlonlopezpereyra 6 months ago +1

    Hey man, you are awesome. Thank you so much for your easy and understandable video, this is the best of the best, thank you so much 👍👍👍👍👍👍👍👍👍👍

  • @tattwadarshipanda491
    @tattwadarshipanda491 2 years ago

    Beautiful explanation

  • @alteshaus3149
    @alteshaus3149 3 years ago

    thank you for this great video. very helpful

  • @Ks-oj6tc
    @Ks-oj6tc 3 years ago

    Thanks Venelin.

  • @Wissam-rk7tv
    @Wissam-rk7tv 1 year ago

    Thank you very much for this video. I have a question, please: how do we prepare our data for a multivariate analysis when the dates are repeated, for example when the Symbol variable has different values (BTC, ETH, LTC, ...)? (So we don't have a unique key.)

  • @SP-wt9lo
    @SP-wt9lo 2 years ago

    Hi, can we use this approach for a time series problem such as predicting employee attendance for the next 30 days?

  • @mp3311
    @mp3311 2 years ago

    Great video! What is the meaning behind creating the sequences?

  • @ibadrather
    @ibadrather 2 years ago +2

    Complete Working Code:
    github.com/ibadrather/pytorch_learn/blob/main/Part%2013%20-%20Multivariate_Time_Series_Data_Preprocessing_with_Pandas.ipynb

    • @saurabhvarshneya4639
      @saurabhvarshneya4639 2 years ago +1

      Thank you very much ;) I was scrolling through all the comments to find this

  • @vanish6839
    @vanish6839 3 years ago

    Great Video!!

  • @HipHop-cz6os
    @HipHop-cz6os 3 years ago

    Amazing video 😍

  • @justin9915
    @justin9915 2 years ago

    It won't let me import pytorch_lightning as pl. It says "ModuleNotFoundError: No module named 'torchtext.legacy'". What do I do???

  • @hi_brante3
    @hi_brante3 2 years ago

    I like the play button. What IDE are you using?

  • @piramid53
    @piramid53 1 year ago

    Thank you for your kindness, it's a nice video

  • @dhavalpatel5595
    @dhavalpatel5595 3 years ago

    Thank you for your video. I do have one question: how should the data be preprocessed if the series in train_data have variable lengths?

  • @bahadrbasaran8908
    @bahadrbasaran8908 3 years ago +3

    Great video, Venelin! I guess there is a small mistake at 15:00. It should be [ : train] and [train : ], shouldn't it? In your case, e.g. x = [1,2,3,4,5] and train_size = 2 ->
    x1, x2 = x[ : 2] , x[3 : ] -> x1 = [1,2] and x2 = [4,5] -> element '3' is missing.
    I have one question about the create_sequences function (23:45):
    If the reason behind creating (sequence, label) pairs is teaching our model by showing "if you see such a sequence like ..., its label is ..." , shouldn't "label_position" be equal to (i + sequence_length -1) ?
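The split issue described above can be reproduced with plain list slicing. This sketch only mirrors the toy example from the comment (x and train_size); it is not the video's code, and the other variable names are made up.

```python
# Toy data mirroring the example in the comment above.
x = [1, 2, 3, 4, 5]
train_size = 2

# Off-by-one split: starting the test slice at train_size + 1 drops element 3.
bad_train, bad_test = x[:train_size], x[train_size + 1:]

# Slices are end-exclusive, so reusing the same index on both sides loses nothing.
train, test = x[:train_size], x[train_size:]

print(bad_train, bad_test)
print(train, test)
```

The same end-exclusive rule applies to positional slicing of a DataFrame with .iloc, so the corrected form carries over directly.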

    • @mehmetnaml5073
      @mehmetnaml5073 3 years ago +2

      I can answer your second question. In fact, we are creating a sequence to predict the label of the next row. For example, you take 60 rows to predict the closing price of Bitcoin on the 61st row. That is not explained in the video, but this should be the reason.

    • @Aegilops
      @Aegilops 2 years ago

      @@mehmetnaml5073 This was really helpful Mehmet; I was getting confused by the video at that point but your explanation makes it much clearer, thanks. It is a bit more obvious how the function works if you add a couple of extra values to the sample_data dataframe, e.g. sample_data = pd.DataFrame(dict(feature=[1, 2, 3, 4, 5, 6, 7], label=[6, 7, 8, 9, 10, 11, 12])). That way you can see it's more of a sliding window function, not a simple "split in the middle" as I originally thought it might be
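The sliding-window behaviour discussed in this thread can be sketched as below. This is a reconstruction under assumptions, not the function from the video: the signature and the internals are guesses, and only the names create_sequences, label_position and sample_data come from the comments.

```python
import pandas as pd

def create_sequences(data, target_column, sequence_length):
    """Slide a fixed-size window over `data` and label each window with the
    target value of the row that comes right after it."""
    sequences = []
    for i in range(len(data) - sequence_length):
        window = data.iloc[i:i + sequence_length]
        label_position = i + sequence_length  # first row after the window
        label = data.iloc[label_position][target_column]
        sequences.append((window, label))
    return sequences

sample_data = pd.DataFrame(dict(feature=[1, 2, 3, 4, 5, 6, 7],
                                label=[6, 7, 8, 9, 10, 11, 12]))
sample_sequences = create_sequences(sample_data, "label", sequence_length=3)

for window, label in sample_sequences:
    print(window["feature"].tolist(), "->", label)
```

With label_position = i + sequence_length the model learns to predict the row after the window; using i + sequence_length - 1 would instead hand it a label taken from a row it has already seen.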

  • @mikheilmgebrishvili9571
    @mikheilmgebrishvili9571 3 years ago

    Hi, great video, but I have a few questions regarding the notebook:
    What is the difference between Google Colab and Jupyter Notebook, which one is better to use, and is it possible to have auto fillers in Jupyter Notebook?

    • @venelin_valkov
      @venelin_valkov 3 years ago

      Hey,
      Google Colab is a Jupyter-like environment which gives you free compute (CPU & GPU). It is open source:
      github.com/googlecolab/colabtools
      And you can read more about it:
      research.google.com/colaboratory/faq.html
      Don't know what auto fillers are. IMO, Jupyter lab is better.
      Thanks for watching!

    • @paulntalo1425
      @paulntalo1425 3 years ago

      I was inspired to start using Colab Jupyter notebooks through this channel, but the challenge you will find is how to save and access project files on your Google Drive. It's not as straightforward as on your local machine. Hopefully one day Venelin will create a video for that task

    • @mikheilmgebrishvili9571
      @mikheilmgebrishvili9571 3 years ago

      @@venelin_valkov Thanks very much

    • @mikheilmgebrishvili9571
      @mikheilmgebrishvili9571 3 years ago

      @@paulntalo1425 Thanks very much

    • @venelin_valkov
      @venelin_valkov 3 years ago

      @@paulntalo1425 yes, Colab needs some form of permanent storage (like Google Drive) for your files. What problems/questions do you have regarding that?

  • @lyudmilabilerminiagavrilov9132
    @lyudmilabilerminiagavrilov9132 1 month ago

    bless u

  • @charmz973
    @charmz973 2 years ago

    I'm importing a DataFrame from a CSV file, but cannot access its columns by name. What's going on? df.head() returns all the columns in the DataFrame but df.columns returns only Index

    • @mp3311
      @mp3311 2 years ago +1

      It's because the first row of the Excel file contains the index; I just removed the first row by hand :)

    • @charmz973
      @charmz973 2 years ago

      @@mp3311 thanks
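Instead of deleting the first row by hand, the extra line can be skipped at load time. The sketch below assumes one reading of the problem, namely a stray line sitting above the real header; the file contents are invented for illustration, and pd.read_csv is pointed at an in-memory string rather than the actual file.

```python
import io
import pandas as pd

# Hypothetical file contents: a stray first line sits above the real header.
raw_csv = "index\ndate,close,volume\n2021-01-01,100.0,5\n2021-01-02,101.5,7\n"

# skiprows=1 ignores the stray line, so the second line is used as the header.
df = pd.read_csv(io.StringIO(raw_csv), skiprows=1)

print(df.columns.tolist())
print(df["close"].tolist())
```

If the real file looks different (for example, an unnamed index column rather than an extra row), the index_col parameter of read_csv may be the better fit.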