Photon for Dummies: How Does this New Execution Engine Actually Work?

  • Published on 9 Jun 2024
  • Did you finish the Photon whitepaper and think, wait, what? I know I did; it’s my job to understand it, explain it, and then use it. If your role involves using Apache Spark™ on Databricks, then you need to know about Photon and where to use it. Join me, chief dummy, nay "supreme" dummy, as I break down this whitepaper into easy-to-understand explanations that don’t require a computer science degree. Together we will unravel mysteries such as:
    - Why is a Java Virtual Machine the current bottleneck for Spark enhancements?
    - What does vectorized even mean? And how was it done before?
    - Why is the relationship status between Spark and Photon "complicated?"
    In this session, we’ll start with the basics of Apache Spark, the details we pretend to know, and where those performance cracks are starting to show through. Only then will we start to look at Photon, how it’s different, where the clever design choices are and how you can make the most of this in your own workloads. I’ve spent over 50 hours going over the paper in excruciating detail; every reference, and in some instances, the references of the references so that you don’t have to.
    Talk by: Holly Smith
    Connect with us: Website: databricks.com
    Twitter: / databricks
    LinkedIn: / databricks
    Instagram: / databricksinc
    Facebook: / databricksinc
  • Science & Technology

Comments • 8

  • @datasmithing_holly 6 months ago +6

    Hi everyone! Thanks for watching this video. Unfortunately the sources and credits were cut off at the end, so here they are if you would like to do any further reading.
    [Paper] Alexander Behm, Shoumik Palkar, Utkarsh Agarwal, Timothy Armstrong, David Cashman, Ankur Dave, Todd Greenstein, Shant Hovsepian, Ryan Johnson, Arvind Sai Krishnan, Paul Leventis, Ala Luszczak, Prashanth Menon, Mostafa Mokhtar, Gene Pang, Sameer Paranjpye, Greg Rahn, Bart Samwel, Tom van Bussel, Herman van Hovell, Maryann Xue, Reynold Xin, Matei Zaharia. Photon: A Fast Query Engine for Lakehouse Systems. SIGMOD ’22
    [Paper] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. ACM SIGMOD
    [Paper] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. 2018. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask.
    [Lectures] CMU 15-721 Advanced Database Systems. 20 - Databricks Photon / Spark SQL, Andrew Pavlo
    [Book] Code: The Hidden Language of Computer Hardware and Software, Charles Petzold
    With special thanks to fact checkers and early reviewers: Alexander Behm, Sriram Krishnamurthy, Utkarsh Agarwal, Kent Marten, Tim Dikland, Grzegorz Rusin, Yassine Essawabi, Youssef Mrini, Erika Fonseca, Eoin O'Flanagan and Michael O'Kane

  • @lezwon 8 months ago +3

    Wow! This was one of the best and most fun talks I've listened to in a long time. I loved how Holly simplified the entire talk, so that even dummies like me can understand. Kudos to her 👏 Great job, from starting with the basics of how Spark and the system work to relating it to Photon.
    Thank you for the presentation Holly. This was very helpful. 🙏

  • @rakeshreddy6630 8 months ago +2

    Holly Smith's voice is amazing..
    The explanation is given so effectively...

  • @allthingsdata 7 months ago +2

    fantastic, probably gonna steal some slides for internal training

  • @youssefb.7406 9 months ago +1

    Thanks a lot. It could be interesting to showcase the performance increase from Photon acceleration.

    • @datasmithing_holly 6 months ago +3

      Hey Youssef, I toyed with the idea of including them, but the problem is that performance depends heavily on the workload, on feature coverage and on when the test is run. If I were cherry picking, I would point to the 37x speed up for some text functions. On the other hand, not all workloads are photon-isable, so it could make no difference whatsoever. In general, as of 2023 I'd expect to see 2-3x speed up in a compatible workload, but by 2024 I'm anticipating 3-4x.
      Benchmarks can be useful, but what matters is the ETL pipelines you're actually running. At 37:57 there's a list of good candidates to start with. I'd recommend testing Photon with those, and seeing what kind of a difference it makes.
      Happy testing!

  • @maximerivest3501 7 months ago

    Seems like lots of the problems could have been resolved by using Julia instead of Scala

  • @ScienceMinisterZero 6 months ago +2

    The JVM is for boomers, rewrite it in Rust.