How to Use Pandas With Pandera to Validate Your Data in Python

  • Published on Jun 26, 2024
  • Type hints and annotations are not enough when you are using pandas for data analysis in Python. You need validation! Today I’ll show you how to work with Pandera to quickly and easily validate your dataframes.
    Git Repo ➡️ github.com/ArjanCodes/2023-pa...
    ✍🏻 Take a quiz on this topic: www.learntail.com/quiz/ibqaex
    🚀 Next-Level Python Skillshare Class: skl.sh/3ZQkUEN
    👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
    💻 ArjanCodes Blog: www.arjancodes.com/blog
    🎓 Courses:
    The Software Designer Mindset: www.arjancodes.com/mindset
    The Software Designer Mindset Team Packages: www.arjancodes.com/sas
    The Software Architect Mindset: Pre-register now! www.arjancodes.com/architect
    Next Level Python: Become a Python Expert: www.arjancodes.com/next-level...
    The 30-Day Design Challenge: www.arjancodes.com/30ddc
    🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
    👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
    💬 Discord: discord.arjan.codes
    🐦Twitter: / arjancodes
    🌍LinkedIn: / arjancodes
    🕵Facebook: / arjancodes
    📱Instagram: / arjancodes
    ♪ Tiktok: / arjancodes
    👀 Code reviewers:
    - Yoriz
    - Ryan Laursen
    - Dale Hagglund
    🎥 Video edited by Mark Bacskai: / bacskaimark
    💻 Code example by Henrique Branco: / henriqueajnb
    🔖 Chapters:
    0:00 Intro
    0:47 Type annotations with pandas
    3:11 Pandera validation
    4:23 Pandera dtypes
    4:43 Pandera integration
    5:00 Code examples
    10:48 Outro
    #arjancodes #softwaredesign #python
    DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments • 72

  • @ArjanCodes
    @ArjanCodes  8 months ago

    👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis

  • @jorgesilva932
    @jorgesilva932 1 year ago +22

    Great video, Arjan! It would be great to see the integration with SQLModel, since you often want to save the data to a DB without repeating the schemas.
    Thank you for the content!

  • @ShinSoulTC
    @ShinSoulTC 1 year ago +21

    FastAPI integration pls!!

  • @ronaldokun
    @ronaldokun 1 year ago +9

    Great tutorial. Clean presentation and motivation for use. Pandera was already in my toolbox for a pandas project. I'll follow up with this clean setup using Pydantic. I'd be interested in the integration with FastAPI.
    Thank You!

  • @xlr32x
    @xlr32x 1 year ago +2

    Love your videos, always simple, short, and to the point

  • @davidl3383
    @davidl3383 1 year ago

    Just discovered this library last week. Amazing. Thank you

  • @AbdolaMike
    @AbdolaMike 2 months ago

    Love this! I would like more examples of integration with Hypothesis too!

  • @lpadgett23
    @lpadgett23 1 year ago

    This is super useful. Thank you

  • @modoudiao9660
    @modoudiao9660 1 year ago +2

    Wooow thanks for sharing

  • @CaptainCsaba
    @CaptainCsaba 1 year ago

    I always envied C#'s FluentValidation package. It made validating data objects so easy and readable. Glad to see Python has something similar with Pandera!

  • @fzfgru4508
    @fzfgru4508 1 year ago +2

    Sounds very useful. Thanks for sharing.

    • @ArjanCodes
      @ArjanCodes  1 year ago

      Thanks for watching!

  • @rafaelagd0
    @rafaelagd0 1 year ago

    Very useful stuff! Thank you!

    • @ArjanCodes
      @ArjanCodes  1 year ago +1

      Glad you think so!

  • @brunosompreee
    @brunosompreee 1 year ago

    This is the only channel where I use the Super Thanks. Your channel is amazing and helps me grow as a Python developer. Thanks!

    • @ArjanCodes
      @ArjanCodes  1 year ago

      Thank you so much, Bruno!

  • @colonellucasl
    @colonellucasl 1 year ago

    As always a great video. Thanks a lot :)

  • @silaseul3186
    @silaseul3186 1 year ago

    Great video! :)
    I would love a tutorial about the Pint package for working with physical/scientific units, including your take on type hinting and validating that function inputs are correct.

  • @ramonsantiago4573
    @ramonsantiago4573 1 year ago

    Excellent content!

  • @MinhVu-ym4tk
    @MinhVu-ym4tk 1 year ago

    As always !!! your video is interesting and helpful !!! I really want to deep dive into the integration with FastAPI!

  • @RatafakRatafak
    @RatafakRatafak 1 year ago +6

    A Series can be not only a row but also a column of a DataFrame.

    • @Dara-lj8rk
      @Dara-lj8rk 1 year ago

      True

    • @dispatch1347
      @dispatch1347 1 year ago

      which is actually really annoying sometimes

    • @RatafakRatafak
      @RatafakRatafak 1 year ago +1

      @dispatch1347 I think it’s quite logical.
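
The point made in this thread can be checked directly in pandas. The sketch below uses a made-up two-column dataframe; selecting a column and selecting a row both produce a Series, but the column keeps a single dtype while the mixed-type row falls back to a generic one.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

column = df["age"]  # one column: a Series holding a single dtype
row = df.iloc[0]    # one row: also a Series, indexed by the column names

print(type(column).__name__, column.dtype)
print(type(row).__name__, row.dtype, list(row.index))
```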

  • @hudabdulwahab2499
    @hudabdulwahab2499 1 year ago +2

    Wonderful series! Adding FastAPI is a good shout. Or perhaps an ORM into some SQL database? Not sure if that makes sense. In any case - VALIDATED 🔥

  • @2010edward1978
    @2010edward1978 1 year ago

    Thanks, it was indeed useful for me. I did not know about pandera

    • @ArjanCodes
      @ArjanCodes  1 year ago

      You're welcome Eduard!

  • @codevincas
    @codevincas 1 year ago

    Love your production quality! Are you using a teleprompter? Your camera presence in the intro is sooo good!

  • @shivaharip6281
    @shivaharip6281 1 year ago

    This is super useful. Thank you for sharing.
    I have a question about 8:57 in the video, though: you mentioned that the DataFrame columns can be set as an instance variable of the OutputSchema class. Is it an instance variable or a class variable?

  • @JuanseVargas
    @JuanseVargas 1 year ago

    It would be nice if you could go deeper into the validation you do with Pandera. Maybe show some logs of dataframe examples, one that complies with the schema and one that doesn't. And maybe show when you use it, too. Do you use it when reading from CSVs, or to test that a transformed dataframe complies? Thanks :)

  • @sharabhshukla7918
    @sharabhshukla7918 1 year ago

    Good, this is a very useful video

  • @dmitrykuleshov7134
    @dmitrykuleshov7134 8 months ago

    Well, hmmm, interesting)
    Would be great to see more on integrations

  • @uszr1
    @uszr1 1 year ago

    Best IT channel ever❤

  • @ThuBomb
    @ThuBomb 11 months ago

    Glad to see an integration of Pydantic with this. A schema file was not practical for a new developer coming into the codebase. The downside is that we rely on two libraries, but I believe it's worth it for now

  • @mpfmorawski
    @mpfmorawski 1 year ago

    A video on pandera and FastAPI integration would be great!

  • @aaronsayeb6566
    @aaronsayeb6566 1 year ago

    Thanks for the great video, Arjan! How can you integrate this with BigQuery?

  • @Dara-lj8rk
    @Dara-lj8rk 1 year ago

    Thanks for the great video. MLFlow would be a good topic for a video in my opinion. Not many good vids out there.

  • @alexandrodisla6285
    @alexandrodisla6285 1 year ago +3

    You can have 500 columns in a dataframe. OK, then you will write a lot.
    The inferred schema might help

  • @Dendus90
    @Dendus90 1 year ago +2

    Hello Arjan,
    Thank you for this great overview. I have a couple of follow-up questions.
    What kind of validation does `pandera` support?
    Can I have
    1) fuzzy checks, something like: I expect the value not to be NULL, but I accept a few of them?
    2) multicolumn checks? If df["column_a"] == xx then df["column_b"] must be int, otherwise float?
    3) expectations regarding the shape of the data, using a Z-test to compare it with a given distribution?
    Otherwise, this library is pretty useless. I can implement similar checks in a few minutes on my own ;)

  • @ErikS-
    @ErikS- 1 year ago +1

    I have changed quite a bit in my programming techniques since I started watching your series of videos. Among other things, I now add type hints when I define new methods.
    I am also a pandas user. But when I see the type hints that you propose for pandas dataframes, I get the feeling that this is a bit over the top for me. I can understand that it may be valuable in a professional software development department. But for an amateur programmer, this is a bit too much, I think.
    I also think you could change your title to include "Pydantic", since in the end you propose using Pydantic instead of (or combined with) Pandera.

  • @gabrielcanuto3321
    @gabrielcanuto3321 1 year ago

    Hi Arjan, very nice video!
    Do you know if DataFrameSchema works well with the new pyarrow dtypes from pandas 2.0.0?
    Thanks in advance :D

  • @roccococolombo2044
    @roccococolombo2044 1 year ago +2

    Is a Series not a column of a DataFrame?

  • @luizmatias737
    @luizmatias737 1 year ago +4

    Thanks for sharing! I would like to know how to integrate with FastAPI. 😄

  • @kosmonautofficial296
    @kosmonautofficial296 1 year ago

    Great video! I am just about to start a larger project working with a REST API, so I will be using some Pydantic. I didn't know you could also validate pandas like this, pretty interesting. What I am kind of stuck on right now is how to design my classes. There is one class for handling OAuth authentication, and it contains two others for get and set methods. That way I can do restapi.get.systeminfo() or restapi.set.locationinfo(), but they are starting to get large, and I am thinking about keeping those get and set classes as bases and extending each of them in other files to separate things more.
    My thought is to store some of this data with SQLite. Would there be benefits to using pandas with some of this? Right now I am thinking of using Pydantic for API response validation and user input validation, then storing the data structures internally in classes, and using pandas to export to xlsx as one of the output formats.

  • @MrTomkan
    @MrTomkan 1 year ago

    Wouldn’t import pandera as pa because of the confusion with pyarrow. Also what if you don’t know the column names beforehand? But you do know the structure? Can you do regex matching? And can you repeat the structure for multiple columns?

  • @dmitryutkin9864
    @dmitryutkin9864 6 months ago

    Could you please review Polars?

  • @lorenzotagliaferri2923
    @lorenzotagliaferri2923 1 year ago

    Superb video! But how to validate an email address with pandera? Thank you in advance

  • @commentmachine1457
    @commentmachine1457 1 year ago

    what about the performance impact?

  • @timelschner8451
    @timelschner8451 1 year ago

    Can't you use the @validator function decorator with Pydantic? Very nice video again! Thanks a lot!

    • @aflous
      @aflous 1 year ago

      I don't think you can achieve this with Pydantic validators without some additional work, because Pydantic is not specifically designed to work with DataFrames out of the box. You'll need to convert the DataFrame into a format that Pydantic can understand, such as a list of dictionaries.

  • @marcolaube2957
    @marcolaube2957 1 year ago

    Great video! I'd love to see how to combine pandera with fastAPI

    • @ArjanCodes
      @ArjanCodes  1 year ago

      Great suggestion!

    • @yasmina6318
      @yasmina6318 9 months ago

      Or with Django too 🙏 @ArjanCodes

  • @avinashsingh7698
    @avinashsingh7698 1 year ago

    I will be waiting for the fastapi integration.🙏🙏🙏

  • @murphygreen8484
    @murphygreen8484 1 year ago

    Is there a way yet to only keep rows that meet the criteria?

  • @mudbone7706
    @mudbone7706 3 months ago

    Is there some way to pass the decorator check_output(schema) a schema that is not imported in global scope? Suppose you load the schema from file in main() and then want to call retrieve_retail_products() while passing it the validation schema. Can this be made to work?

  • @albertodivita
    @albertodivita 1 year ago

    Hi. I have a question; I don't know if you accept this kind of request.
    I have a huge table, several GB in size. It gains about 4 thousand rows an hour. Every day, every 5 minutes, I have to populate/modify another, much smaller table with a calculation: the sum of 3 or 4 columns from the larger database, from 00:01 to the time when the check is performed. So, if I check at 8:00, I have to sum the values generated from 00:01 to 08:00. If I check at 13:00, the sum is for the values between 00:01 and 13:00.
    I was wondering, since I only work on data from that specific day (and so, max 30k of rows), does it make sense to create a temporary sqlite db in memory? Or rather a Pandas dataframe? Or are there other faster solutions? Something that does query caching (like SELECT * FROM XYZ WHERE TIME > "00:01" AND TIME

  • @SP-db6sh
    @SP-db6sh 1 year ago

    attrs or Pydantic is good; Databricks' inferSchema method looks similar

  • @sambroderick5156
    @sambroderick5156 1 year ago

    What about pola-rs?! It has schemas built in!

  • @torsteinsrnes4872
    @torsteinsrnes4872 1 year ago

    Would it not be true to say that a pandas Series is a column of a table, instead of a row of a table? A row usually has multiple data types. A pandas Series is usually of one data type, a single column.

  • @dann1kid
    @dann1kid 1 year ago

    Arjan is spying on me again. I'm just making a project with TensorFlow now, lmao

  • @samuel.ibarra
    @samuel.ibarra 1 year ago

    Use polars.

  • @learndatabending182
    @learndatabending182 1 year ago

    You may not know that Polars outperforms pandas, and Peaks is preparing to outperform Polars.

  • @tsoier
    @tsoier 1 year ago

    For DS and ML purposes Pandera seems to be useless. It definitely slows down your EDA and ml-model development. Furthermore, it's hard to imagine a situation where data validation in production needs to be done in this way. If you receive invalid data in production, it's likely that you have larger problems with other services and components. Such situations can be detected with the help of monitoring systems and services.

  • @artistv1
    @artistv1 1 year ago

    en.wikipedia.org/wiki/Design_by_contract