Another big thing too: if you're writing SQL queries that could potentially scan petabytes of data, you should always use the EXPLAIN keyword first to see what the database intends to do, before you lock up a production, dev, or test DB with reads.
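For example (PostgreSQL/MySQL-style syntax; the table and column names here are made up):

```sql
EXPLAIN
SELECT user_id, created_at
FROM events
WHERE created_at >= '2023-01-01';
-- A "Seq Scan" (Postgres) or type=ALL (MySQL) on a big table means
-- a full scan: stop and rethink before running the real thing.
```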
maybe also just test it on a TEST database!
@@ArthurvanH0udt Test DBs usually don't contain enough data. But at least with EXPLAIN you can see if a query causes a full scan (or scans too much).
@@kraageshmikosevaar794 If a test DB doesn't have enough data, then migrate a backup of the production DB to the test DB, or populate the test DB.
That's the whole point of a testing environment: to test before you potentially break production.
And let's say you broke production, but the logs don't clearly identify the source of the issue, so you'd have to debug. If you try to debug against the production DB, you're just gonna cause the same issue again. It's always better to invest the time in doing a migration than to expect everything will be fine on its own.
@kraageshmikosevaar794 Then you guys need to build a real test environment or sandbox. Stop justifying this behavior. Data architects despise this sloppy corner-cutting, because forgetting to EXPLAIN stops production. If you stop production, I am likely firing you. End of discussion. We *_will_* find someone who understands test, dev, and prod environments, since you don't.
Luckily Calvin did the estimation first. Be like Calvin.
haha agreed!
It's a very important skill to calm the fuck down and think about how good the 'perfect idea' really is
Devops etiquette: super simple stuff.
Luckily he listened to Hobbes!
They can save even more $ by:
4. adding appropriate indexes on the columns in the WHERE clause
5. testing and analyzing with EXPLAIN prior to pushing to production.
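For instance (hypothetical table and column names; standard RDBMS syntax, since BigQuery itself uses clustering rather than indexes):

```sql
-- Index the column the WHERE clause filters on
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- Then re-run EXPLAIN and confirm the plan shows an index/range scan
EXPLAIN
SELECT order_id, total
FROM orders
WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31';
```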
Yeah for sure, that's definitely good advice to follow before pushing anything out to production, to understand the query more thoroughly.
It always turns on a yellow warning in my mind when I see a SELECT * FROM statement, especially if I find it in production code. Whenever possible, query ONLY the required columns, people. Reduced network traffic between database and application always pays off.
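For example (made-up schema, just to show the difference):

```sql
-- Avoid: drags every column over the wire
SELECT * FROM orders WHERE status = 'shipped';

-- Prefer: only the columns the application actually reads
SELECT order_id, customer_id, shipped_at
FROM orders
WHERE status = 'shipped';
```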
Not always an issue because I wrote a cronjob that exports the entire table on a monthly basis (scans about 50GB) using "SELECT *".
@@sfcs3743 In modern setups where memory and CPU are billed by usage, it's always an issue. It might not impact performance, but it WILL impact costs. So it's good practice regardless.
@@GeckoOBac The specifics of my use case demand that I export the entire table for analytical purposes (it's an ML dataset used to calculate KL scores for drift detection). I chose SELECT * because it made sense from a maintenance perspective: I have to do nothing extra in the event a new column is added. Cost-wise, it's identical.
This is an edge case where I'm not filtering subsets inside the application and am using the entire table/view. You two make a good general rule of thumb, but it's still a *general* rule that isn't applicable in all instances.
@@sfcs3743 Ofc specific trumps general so no issue there
@@sfcs3743 I mean, if your use case requires all the columns, then obviously you have to use * to query them all, but what the original commenter was talking about was people using * when they’re not using all the columns returned by the query (this happens way too often, unfortunately)
YouTube’s algo already started recommending. You’re going to blow up real soon, keep at it.
thank you really appreciate it!
Running data pipelines in the cloud is really costly, it seems, if you don't optimize properly
Shopify does everything in Ruby and shit too lol. RIP CPU/RAM.
Very interesting video! I'm working on Google Cloud myself and I've never used BigQuery or needed clustering, but this video perfectly explained why clustering is so important.
Thanks!
Glad it was helpful! Yeah it's quite amazing how clustering can help reduce the amount of data being scanned.
@@kikisbytes Is it though? It's basically the equivalent of doing a full scan vs. querying on an indexed column in an RDBMS. The performance boost is exactly what you'd expect, and that's why in an RDBMS you'd generally assume indexing on FKs as a default.
Hey man, people are saying it's a “Kevin Fang” copy, but it's fine, man. You clearly took inspiration, so just credit him.
When I design UI for frontend, I pretty much copy (and change a little) other websites, and I always credit the website I copied the design from.
Strongly recommend.
Brilliant video, well edited and explained! You’re gonna get big.
Thank you, that's a really good suggestion! Updated the description to give proper credit.
Yeah, but even if he has the exact same intro style, he doesn't need to credit anyone. It's not like my man Kevin has a copyright on it. The important thing is that the content is original. He can take inspiration from anyone he likes, and that's how a lot of content is made. Copy-pasting stuff (with little changes) is completely different from making original content that is only inspired by someone else...
@@kikisbytes Honestly, I don't think it's just done in the same style. I like Kevin Fang though, so I wouldn't mind having more videos on CS, SWE, cloud, cyber, data, et cetera in that style
I just wanted to point out that the value ChatGPT returned for how many gigabytes are in a petabyte was wrong; it's actually 1,000,000 GB. This also gets more complicated, as GB and GiB are often confused and the acronyms used interchangeably even though they are different values.
For instance, Windows actually uses GiB to compute the total volume of drives; however, Windows labels these values as GB, not GiB.
GiB (base 2): 1024 GiB = 1 TiB
GB (base 10): 1000 GB = 1 TB
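Spelled out side by side:

```
1 PB  = 1000 TB  = 1,000,000 GB     (decimal / SI)
1 PiB = 1024 TiB = 1,048,576 GiB    (binary)
1000 x 1024 = 1,024,000             (the mix-up: matches neither)
```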
As you said, mixing up GiB and GB is not that uncommon, but combining both in a single calculation makes me want to scream. It's either 1000*1000 GB or 1024*1024 GiB, but NEVER is there a situation where you should use 1000*1024 as ChatGPT did here
You have a great way of making the information simple, relevant and engaging. Subscribed!
Good video! But man, if you work with that kind of traffic and you don't even know the basic rules of database optimization (like never use SELECT *, put only a few fields in your index, and try to make them integers only), you deserve to pay $1B a month! Now I'm curious which university they went to: I remember these things being taught in the database class (at least in my country)
Very interesting topic - I guess this is why it's important to understand the fundamentals well, so you can later identify how to optimize data pipelines. Now pay the dev team a million dollars in bonuses lol
i love the kevin fang style videos
The story of why every dev should know how their DB of choice works.
Wow, I loved this video man, you presented everything in such a manner that it didn't get boring for a second. Keep it up!
That means a lot thank you for the kind words!
Needed someone to pick up Kevin Fang's slack
4:20 - Being able to handle 60 requests per minute is very different from actually having an average of 60 requests per minute???
Any SaaS service that has unbounded costs should be treated with paranoia-level caution. One mistake in code can explode your bills to bankruptcy levels.
Log aggregation services are another example. It just takes one developer adding a debug log in a tight loop and you can be pushing tens of thousands of dollars per hour in a busy production system.
I wouldn't use any unbounded-cost SaaS service unless it had financial controls to limit its costs to $xx/minute (or hour) before shutting down.
I mean, seriously?? This seems like absolute basics, doesn't it?! Did they really “save” that much if all they did was do things the normal, proper way??
great stuff! really helpful for someone trying to get into DE
That's awesome to hear! Keep me updated on the progress!
Great video! Keep up the good work!
Maybe it's called a cluster in Google BigQuery, but isn't that just an index in other RDBMSs?
This also looks like a problem a time-series database could solve, because of the ranges it searches by. TS DBs and extensions partition for you and make the workflow less error-prone.
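For example, with the TimescaleDB extension for PostgreSQL (a sketch with made-up table and column names):

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
  time      TIMESTAMPTZ      NOT NULL,
  device_id TEXT             NOT NULL,
  value     DOUBLE PRECISION
);

-- Converts the table into a hypertable that is automatically
-- partitioned ("chunked") by the time column
SELECT create_hypertable('metrics', 'time');
```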
No, a cluster and an index are two different things
great video! you deserve way more subs
I appreciate that!
Cool & informative video for someone who is starting in software
That's awesome! Are you currently in school or working right now?
Great video man! And another thing: when you said the word “subscribe”, the subscribe button on your channel got highlighted. I don't know if it was intentional, but it was so cool that I subscribed. Lol!
nooo way I didn't know that was a thing
It's criminal how this channel hasn't blown up yet
Thanks for the informative video!
Good video, but in the end the question was unanswered: what was the single query? And did it actually cost them this much?
You got clickbaited, friend.
Great video. I like how you explained everything that was happening in the query and how clustering can help.
Also, I'm guessing you're a fellow Canadian? I don't see Canadian provinces often in most examples, sometimes it feels like we don't exist to Americans hahaha
hahaha for sure, I also never see our provinces mentioned in any tech videos, so ofc I had to.
amazing vid need more
Loved it, keep up the good work
thank you!
subscribed -- great explanation
Good video, but I would have appreciated this more if the intro skit wasn't a direct copy of Kevin Fang's Gitlab video intro.
Thank you for the feedback, promise to be more original next time!
There must be more cost to clustering the data by timestamp and geography. Is this negligible?
That's a good question! I'm not too sure about this one, but I would imagine that there would be a bit of additional cost. However, I think the main cost of the whole operation is still the querying aspect of it.
I don't get it. They used the same query before and after, so what exactly did they change?
They clustered their data and sorted it by some fields like timestamp and geography. Since the data is sorted, BigQuery doesn't have to scan everything. Instead it can figure out where to start and where to stop. Hope this makes it more clear.
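If you're curious, setting that up looks roughly like this in BigQuery DDL (dataset/table/column names are illustrative, not from the video):

```sql
CREATE TABLE mydataset.events_clustered
PARTITION BY DATE(event_timestamp)    -- coarse pruning by day
CLUSTER BY event_timestamp, province  -- sorts storage blocks within partitions
AS
SELECT * FROM mydataset.events;
```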
@@kikisbytes Okay this makes sense! So like pre-sorted the db? Thank you for answering
@@Eurotool yeah exactly!
@@Ic3q4 Because the condition is pretty specific. Together with clustered indexes it's much easier to optimize and look up, since rows are stored in the same order on disk. Additionally, selecting indexed columns also means the DB doesn't even have to look into the table if the given query only touches indexed columns.
It isn't magic, it's common practice when working with large datasets in RDBMSs, of which Google BigQuery is one.
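A minimal conventional-RDBMS sketch of that index-only behavior (made-up names):

```sql
-- The index holds every column the query below touches
CREATE INDEX idx_sales_region_amount ON sales (region, amount);

-- So e.g. Postgres can answer this with an "Index Only Scan",
-- never reading the table's own pages
SELECT region, SUM(amount)
FROM sales
WHERE region = 'BC'
GROUP BY region;
```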
Nice video, a bit too close to Kevin Fang’s style though. Look forward to seeing more.
Yeah definitely took the intro inspiration from him. Thank you for the feedback, will try to be more original next time!
great video man, keep it up
I literally got a Shopify ad on this video 😭
ohhhh nooo was it an ad to just use shopify for online business?
update where....
The Slack notification causes me trauma, pls no.
Good to know that a billion-dollar company's DB sucks as much as my todos DB
Thanks good work
That's really interesting!
or just index the columns you're searching with...
surprise
takeaway: call whatever mistake you made optimization
hahaha can't optimize something that isn't broken 😉
There is a calculation error here: you multiply 60 RPM by 60 minutes, which effectively gives you 3600 queries per hour and 2,592,000 per month.
There would be only 2,592,000 / 60 = 43,200 queries a month. The rest of your reasoning remains valid, just the numbers are 60 times too large.
Which tools did you use for making the videos?
I use a combination of tools: Audacity, DaVinci Resolve, Motion Canvas
Always cluster and partition!!
gj great audio
Clustering or indexing?
BigQuery clustering is a bit different from indexing.
"Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.
In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution." - from BigQuery's definition of clustering.
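In practice that means a query filtering on the clustered columns prunes storage blocks. A hypothetical example, assuming a table clustered by (event_timestamp, province):

```sql
SELECT province, COUNT(*) AS orders
FROM mydataset.events_clustered
WHERE event_timestamp >= TIMESTAMP '2023-01-01 00:00:00 UTC'
  AND province = 'ON'
GROUP BY province;
-- Only blocks whose event_timestamp/province ranges can match the
-- predicate get scanned, instead of the entire table or partition.
```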
@@kikisbytes very insightful, thanks.
So, they didn't have a covering index.
yeah I think it may not have been added originally.
among us
😂
Bootleg Kevin Fang
appreciate the feedback, I will be more original next time with the intro.
@@kikisbytes Feedback? No, that's a good thing.
skill issue
Are you Indian American?
Quality content