Apache Parquet: Parquet file internals and inspecting Parquet file structure

Melvin L

56,460 views


Published on 11 Sep 2024

Comments • 20

@markevogt 4 years ago ⁺⁴
Interesting video showing a single RowGroup...
You present well, and clearly have a solid grasp of the Parquet file format.
If you're interested in preparing a sequel to your video...
... consider showing a diagram of MULTIPLE row groups, each stored on a different disk in a different node in a cluster, so that a RowGroup represents the "sharding" of a logical table (splitting it across rows in its logical representation), with the shards-as-RowGroups distributed across DIFFERENT nodes.
Then you could explore what happens during a query like "What is average square ft in ZIP Code 60542?"
This query can & will be PARALLELIZED into 1 query on each disk where a portion of the larger (logical) table has been stored.
What's COOL about parquet is this:
- in a ROW-based storage format, to get the ZIP from a single record I have to read EACH row, FIND the ZIP field and return it.
- therefore in a row-based table containing (say) 10,000 rows across (say) 10 disks (so 1,000 rows per disk) I have to make 10,000 READS across different regions of my disks... VERY INEFFICIENT just to get a SINGLE field (sqft) :-(
- in a COLUMN-based storage format I simply have to make 1 single read, starting where the sqft data begins and stopping where this field ends. And in a SINGLE read (NOT 1,000) I have ALL the sqft values in that shard representing those rows in my larger (logical) table :-)
- MEANWHILE on my other (say 10) disks also containing this (logical) table, there is also only 1 READ per disk.
The result?
Instead of 10,000 reads across 10 disks just to get 10,000 measly values of sqft to average...
... the parquet format lets me make only 10 reads and get the same 10,000 values :-O
Illustrate THIS in your next video ;-)
You'll be a hero :-)
-Mark in North Aurora IL ...
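The read counts in the comment above can be checked with a small toy model. This is only a sketch of the argument, not of Parquet itself: the record fields, the helper names (`make_shard`, `scan_row_shard`, `scan_column_shard`), and the "one read per record" versus "one read per column chunk" accounting are all assumptions made for illustration, and each disk is assumed to hold exactly one row group.

```python
from statistics import mean

ROWS_PER_DISK = 1_000
DISKS = 10

def make_shard(n_rows, start):
    # Hypothetical records for one disk: every row has a ZIP and a sqft value.
    return [{"zip": "60542", "sqft": 1000 + (start + i) % 500} for i in range(n_rows)]

def scan_row_shard(shard):
    # Row layout: each record is visited separately to pull out one field.
    reads, values = 0, []
    for record in shard:
        reads += 1
        values.append(record["sqft"])
    return values, reads

def to_columnar(shard):
    # Column layout: all values of one field are stored together.
    return {name: [r[name] for r in shard] for name in ("zip", "sqft")}

def scan_column_shard(column_shard):
    # One contiguous read returns the whole sqft column chunk for this shard.
    return column_shard["sqft"], 1

shards = [make_shard(ROWS_PER_DISK, d * ROWS_PER_DISK) for d in range(DISKS)]

row_values, row_reads = [], 0
col_values, col_reads = [], 0
for shard in shards:
    values, reads = scan_row_shard(shard)
    row_values += values
    row_reads += reads
    values, reads = scan_column_shard(to_columnar(shard))
    col_values += values
    col_reads += reads

print("row layout reads:", row_reads)      # 10 disks x 1,000 records
print("column layout reads:", col_reads)   # 10 disks x 1 column chunk
print("same average:", mean(row_values) == mean(col_values))
```

Both scans produce the same average. Only the number of separate reads differs, and that is the whole of the commenter's point.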
@srividyaus 4 years ago ⁺³
Best explanation of the Parquet file and columnar file format I have come across so far. Thank you very much
@flwi 4 years ago ⁺²
Great overview! Thanks for taking the time to record it!
@abhijeetzagade3349 3 years ago
best explanation of columnar storage format
@nkantkumar 6 years ago ⁺⁴
Excellent talk!!
@charanjeetsingh1100 6 years ago ⁺¹
Very nice. Brilliant. Thanks.
@debashishkheti5010 7 years ago ⁺²
Nice Explanation !!
@meditating010 6 years ago ⁺¹
crazy good videos .... you are godly
@melvinl5797 6 years ago
Wow..thanks 😀
@aniruddhnathani5518 10 months ago
Nice video, but I don't see any row group tuning parameter directly. It is tuned via block.size itself. Is my understanding correct?
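In the Java writer (parquet-mr), `parquet.block.size` is the setting that targets the row group size, which agrees with the reading in the comment above. A rough sketch of what that implies for the number of row groups follows. The helper `estimate_row_groups` is hypothetical and the 128 MiB default is an assumption about the writer in use. Real writers also flush row groups based on buffered-size estimates, so actual counts can differ.

```python
import math

# Assumed default for parquet.block.size; check the value your writer uses.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024

def estimate_row_groups(data_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Rough count of row groups when each one targets block_size bytes."""
    if data_bytes <= 0:
        return 0
    return math.ceil(data_bytes / block_size)

one_gib = 1024 * 1024 * 1024
print(estimate_row_groups(one_gib))                      # with the assumed default
print(estimate_row_groups(one_gib, 256 * 1024 * 1024))   # larger block size, fewer row groups
```

Raising the block size gives fewer, larger row groups. Lowering it gives more, smaller ones.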
@aharonwsmith 5 years ago ⁺¹
Good lecture. Play at 1.25x
@rambabuchamakuri1780 5 years ago
excellent..
@karthikgolagani6844 7 years ago ⁺¹
learnt new things
@sunilmali8483 6 months ago
Hi all, I am looking for a way to load a Parquet file in parts rather than all in one go. How can I achieve this in Java? Any implementation reference would be highly appreciated. I have gone through a few articles, but they were not up to the mark.
@rogermenezes 6 years ago ⁺³
Awesome talk. Melvin, can you share your slides? via Slideshare or something.
@melvinl5797 6 years ago ⁺¹
Thanks! Unfortunately I don't have the slides anymore. The images used in the slides were sourced from the official Parquet site: parquet.apache.org/documentation/latest/
@djibb.7876 7 years ago
Great talk!
I set up a Spark cluster with 2 workers. I save a DataFrame in Parquet format using partitionBy("column x") to some path on each worker. The problem is that I am able to save it, but if I want to read it back I get these errors: - Could not read footer for file file´status ...... - unable to specify Schema ... Any suggestions?
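When a reader reports that it cannot read a footer, one quick sanity check is whether the file in question has valid Parquet framing at all. The sketch below assumes a plain (unencrypted) Parquet file, which starts and ends with the 4-byte magic `PAR1` and stores the footer length as a 4-byte little-endian integer just before the trailing magic. The function name `footer_length` is hypothetical. The check does not diagnose the specific cluster setup described in the comment above.

```python
import struct

MAGIC = b"PAR1"

def footer_length(data: bytes) -> int:
    """Return the footer (file metadata) length of an unencrypted Parquet file.

    Raises ValueError if the bytes do not have valid Parquet framing.
    """
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return length

# Synthetic bytes with Parquet-style framing (not a real Parquet file):
fake = MAGIC + b"x" * 20 + b"m" * 7 + struct.pack("<I", 7) + MAGIC
print(footer_length(fake))

# An empty marker file or a truncated write fails the check:
try:
    footer_length(b"")
except ValueError as err:
    print("not Parquet:", err)
```

For a file on disk, the same function can be applied to `open(path, "rb").read()`, although for large files it is cheaper to read only the first 4 and last 8 bytes.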
@brianz2011 5 years ago
Why does Parquet store the data in a row layout (row groups)? Does it store the columns side by side?
@__dio 5 years ago
What happens if I write a Parquet file that has 2 row groups?
@rajatsharma1570 4 years ago
Parquet-tools not working..
