PyFroid: Scaling Data Preparation Using Database

  • Published on Aug 27, 2024
  • Speaker: Venkatesh Emani, Senior Scientist at Microsoft
    Python has become overwhelmingly popular for ad-hoc data analysis, and Pandas dataframes have quickly become the de facto standard API for data preparation. However, the performance and scalability limitations of Pandas on large datasets are well known.
    In this session, you will hear about PyFroid, a system that uses databases to significantly scale and speed up Pandas workloads, whether on a commodity workstation or in a cloud warehouse, by automatically translating them into Ibis expressions and subsequently into SQL queries. Not all Pandas operations can be translated into SQL; some still require the Pandas engine. To handle this split, PyFroid uses lazy evaluation to push translatable Pandas operations into SQL, and imperative statement batching to lazily evaluate the remaining operations in the Pandas engine. Our early evaluations suggest that PyFroid can provide significant speedups for data in warehouses and enable analysis of 3x to 5x more data on commodity machines than Pandas, while consuming far fewer resources than competing frameworks.
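    The idea of deferring dataframe-style operations and pushing them into SQL can be illustrated with a minimal sketch. This is not PyFroid's API or implementation: the `LazyFrame` class, its method names, and the `trips` table are hypothetical, and the sketch compiles directly to SQL on Python's built-in `sqlite3` module rather than going through Ibis as the talk describes. It only shows the general pattern of recording operations and running a single query when results are requested.

```python
import sqlite3


class LazyFrame:
    """Hypothetical lazy wrapper: records operations instead of running them."""

    def __init__(self, conn, table, columns=None, predicates=None, params=None):
        self.conn = conn
        self.table = table
        self.columns = columns or ["*"]
        self.predicates = predicates or []
        self.params = params or []

    def filter(self, column, op, value):
        # Only a small whitelist of comparison operators is translatable here.
        if op not in {"=", "<", ">", "<=", ">=", "!="}:
            raise ValueError(f"unsupported operator: {op}")
        return LazyFrame(self.conn, self.table, self.columns,
                         self.predicates + [f'"{column}" {op} ?'],
                         self.params + [value])

    def select(self, *columns):
        # Projection is recorded too; nothing is read from the database yet.
        return LazyFrame(self.conn, self.table, [f'"{c}"' for c in columns],
                         self.predicates, self.params)

    def to_sql(self):
        # Compile every recorded operation into one SQL statement.
        sql = f'SELECT {", ".join(self.columns)} FROM "{self.table}"'
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

    def collect(self):
        # Materialization point: the database does the filtering and projection.
        return self.conn.execute(self.to_sql(), self.params).fetchall()


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance REAL, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("Pune", 12.5, 300.0), ("Delhi", 3.0, 90.0), ("Pune", 1.2, 40.0)],
)

query = (
    LazyFrame(conn, "trips")
    .filter("city", "=", "Pune")
    .filter("distance", ">", 2)
    .select("city", "fare")
)
print(query.to_sql())
rows = query.collect()
print(rows)
```

    Each method returns a new `LazyFrame`, so no data moves until `collect()` is called, and only the matching rows leave the database. An operation with no SQL equivalent would have to run on the collected rows in Python, which loosely mirrors the fallback to the Pandas engine mentioned above.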
