@andypavlo: I had a question about the it's-not-worth-recompressing-intermediate-data discussion at the end. I agree that what you say makes sense for compute-bound operators in the query plan. However, if there are barriers in the physical operator plan (e.g., a join or a shuffle that must wait for inputs from all of its predecessor map stages), could there be an advantage in recompressing the data? In that case the data is not streamed from operator to operator anyway; it is persisted to ephemeral storage and has to be rescanned. In that sense the intermediate data is no different from input data, so there may be a case for recompression. Would appreciate your thoughts.
Thanks for the great content @andypavlo. At 33:09, I did not understand why I need to perform a sequential scan. If I am looking for names corresponding to encoded values between 10 and 20, why can't I just extract those corresponding names? From the compressed data, I know which records they are. How is this different from getting DISTINCT names? I am sure I am missing something. I would appreciate it if you could help clarify.
For "SELECT name FROM users WHERE name LIKE 'And%'", we do a sequential scan, operate directly on the encoded data (enabled by query rewriting), figure out all the rows that match the predicate, and return that many rows in the output. The number of rows in the output must match the number of rows in the "users" table that satisfy the predicate. For "SELECT DISTINCT name FROM users WHERE name LIKE 'And%'", we only need the distinct names satisfying the predicate, which we can get from the dictionary itself without doing a sequential scan.
My understanding: 1) For a DISTINCT query it is sufficient to scan just the dictionary and list the matching entries, so there is no need to scan the actual table. 2) Here the order is required. As a consequence, the table needs to be scanned by applying the BETWEEN predicate.
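To make the difference discussed in the replies above concrete, here is a minimal Python sketch, assuming an order-preserving dictionary (sorted distinct values, code = position in the sorted list). The function names and the sample data are hypothetical and only for illustration; this is not how any particular system implements it. The point is that the non-DISTINCT query must visit every encoded row because the dictionary does not record how many times, or in which rows, each value occurs, while the DISTINCT query can be answered from the dictionary alone.

```python
from bisect import bisect_left


def build_dictionary(values):
    """Build an order-preserving dictionary and the encoded column.

    Assumption for this sketch: code = index in the sorted list of
    distinct values, so code order matches value order.
    """
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    column = [code_of[v] for v in values]
    return dictionary, column


def prefix_code_range(dictionary, prefix):
    """Return the half-open code range [lo, hi) whose entries start with prefix.

    This plays the role of rewriting LIKE 'And%' into a range predicate
    on the encoded values.
    """
    lo = bisect_left(dictionary, prefix)
    hi = lo
    while hi < len(dictionary) and dictionary[hi].startswith(prefix):
        hi += 1
    return lo, hi


def select_name_like(dictionary, column, prefix):
    """SELECT name ... LIKE 'prefix%': needs a sequential scan of the column."""
    lo, hi = prefix_code_range(dictionary, prefix)
    # Every row is visited; the comparison is done on integer codes,
    # and only matching rows are decoded.
    return [dictionary[code] for code in column if lo <= code < hi]


def select_distinct_name_like(dictionary, prefix):
    """SELECT DISTINCT name ... LIKE 'prefix%': the dictionary is enough."""
    lo, hi = prefix_code_range(dictionary, prefix)
    return dictionary[lo:hi]


if __name__ == "__main__":
    names = ["Andy", "Andrea", "Matt", "Andy", "Prashanth", "Andy"]
    dictionary, column = build_dictionary(names)
    print(select_name_like(dictionary, column, "And"))
    print(select_distinct_name_like(dictionary, "And"))
```

The first query returns one output row per matching input row (duplicates included), which is why the encoded column still has to be scanned; the second never touches the column.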
wtf is this intro music 😂
@nathanballance2347 I don't understand. Can you explain what you mean?