This single query costs $1,000,000 a month!

แชร์
ฝัง
  • เผยแพร่เมื่อ 23 พ.ย. 2024

ความคิดเห็น •

  • @rallokkcaz
    @rallokkcaz 10 หลายเดือนก่อน +305

    Another big thing too, if you're writing SQL queries that could possibly index petabytes of data. You should always use the EXPLAIN keyword before running to see what the database intends to run before you lock up a production, dev or test DB with reads.

    • @ArthurvanH0udt
      @ArthurvanH0udt 10 หลายเดือนก่อน +30

      maybe also just test it on a TEST database!

    • @kraageshmikosevaar794
      @kraageshmikosevaar794 10 หลายเดือนก่อน +28

      ​@@ArthurvanH0udt test DBs usually do not contain enough data. But at least with EXPLAIN you can see if query causes full scan (or scans too much).

    • @BigHalfSteps
      @BigHalfSteps 9 หลายเดือนก่อน +7

      @@kraageshmikosevaar794 If a Test DB doesn't have enough data, then migrate a backup of the production DB to the Test DB or populate the Test DB.
      It's the whole point of a testing environment to test before potentially break the production.
      And let's say you broke production, but the logs are not clearly defining the source of the issue, you'd have to debug. If you try to debug with the production DB, you're just gonna cause the same issue. It's universally always better to invest time in doing a migration than expecting everything will be fine on its own.

    • @goofballbiscuits3647
      @goofballbiscuits3647 9 หลายเดือนก่อน +1

      ​@kraageshmikosevaar794 Then you guys need to build a real test environment or sandbox. Stop justifying this behavior. Data architects despise this sloppy corner-cutting because forgetting that explain stops production. If you stop production, I am likely firing you. End of discussion. We *_will_* find someone who understands test, dev and prod environments since you don't.

  • @dikatok
    @dikatok 10 หลายเดือนก่อน +366

    Luckily Calvin did the estimation first. Be like Calvin.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +13

      haha agreed!

    • @DopeTropic
      @DopeTropic 10 หลายเดือนก่อน +12

      It's a very important skill to calm the fuck down and think about how good the 'perfect idea' really is

    • @jerichaux9219
      @jerichaux9219 10 หลายเดือนก่อน

      Devops etiquette: super simple stuff.

    • @oz_jones
      @oz_jones 9 หลายเดือนก่อน

      Luckily he listened to Hobbes!

  • @ytlongbeach
    @ytlongbeach 10 หลายเดือนก่อน +56

    They can save even more $ by:
    4. adding appropriate indexes on the columns in the WHERE clause
    5. test and analyze using EXPLAIN prior to pushing to production.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +5

      Yeah for sure that's definitely a good advice to follow prior to pushing anything out to production to understand the query more thoroughly.

  • @chpsilva
    @chpsilva 10 หลายเดือนก่อน +100

    It always turns a yellow warning in my mind when I see a SELECT * FROM statement- specially if I find it in production code. Whenever possible query ONLY by the required columns, people. Reduced network traffic between database and application always pays back.

    • @sfcs3743
      @sfcs3743 10 หลายเดือนก่อน +1

      Not always an issue because I wrote a cronjob that exports the entire table on a monthly basis (scans about 50GB) using "SELECT *".

    • @GeckoOBac
      @GeckoOBac 10 หลายเดือนก่อน +3

      @@sfcs3743 In modern setups where memory and CPU usage are paid for by usage, it's always an issue. It might not impact performance but it WILL impact costs. So it's good practice regardless.

    • @sfcs3743
      @sfcs3743 10 หลายเดือนก่อน +3

      @@GeckoOBac The specifics of my use case demands that I need the entire table exported for analytical purposes (its an ML dataset that would be used to calculate KL scores for drift detection). I chose to use SELECT * because it made sense from a maintenance perspective because I have to do nothing extra in the event of a new column were to be added. Costwise, its identical.
      This is an edge-case, where I'm not filtering subsets inside the application and using the entire table/view. You two make a good general rule of thumb, but it's still a *general* rule that is not applicable for all instances.

    • @GeckoOBac
      @GeckoOBac 10 หลายเดือนก่อน +4

      @@sfcs3743 Ofc specific trumps general so no issue there

    • @sapphicgaze
      @sapphicgaze 10 หลายเดือนก่อน

      @@sfcs3743i mean if your use case requires all the columns, then obviously you have to use * to query all, but what the original commenter was talking about was people using * when they’re not using all the columns returned from the query (this happens way too often unfortunately)

  • @gabrieltaveira
    @gabrieltaveira 10 หลายเดือนก่อน +16

    TH-cam’s algo already started recommending. You’re going to blow up real soon, keep at it.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +2

      thank you really appreciate it!

  • @pradiptarakshit7751
    @pradiptarakshit7751 10 หลายเดือนก่อน +44

    running data pipelines in cloud is really costly it seems if you dont optimize properly

    • @rallokkcaz
      @rallokkcaz 10 หลายเดือนก่อน +6

      Shopify does everything in Ruby and shit too lol. RIP CPU/RAM.

  • @thehawik
    @thehawik 10 หลายเดือนก่อน +51

    Very interesting vide I am working on google cloud myself and I never used big query or needed clustering but this video perfectly explained why clustering is so important.
    Thanks!

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      Glad it was helpful! Yeah it's quite amazing how clustering can help reduce the amount of data being scanned.

    • @GeckoOBac
      @GeckoOBac 10 หลายเดือนก่อน +2

      @@kikisbytes Is it though? It's basically the equivalent of doing a full scan vs querying on an Indexed column on a RDBMS. The performance boost is exactly that you expect and that's why in RDBMS you'd generally assume indexing on FKs as a default.

  • @HaynesX
    @HaynesX 10 หลายเดือนก่อน +63

    Hey man, people are saying “Kevin Fang” copy, but it’s fine man. You took inspiration which is clear - so just credit him.
    When I design UI for frontend, I pretty much copy (and change a little) other websites, and I always credit the website I copied the design from.
    Strongly recommend.
    Brilliant video, well edited and explained! You’re gonna get big.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +20

      Thank you that's a really good suggestion! Updated the description to give proper credit.

    • @_yumeo_
      @_yumeo_ 10 หลายเดือนก่อน +9

      Yeah but even if he has the exact same intro style he doesn't need to credit anyone. It's not like my man Kevin has the copyright for it. The important thing is that the content is original. He can take inspiration from anyone he likes and that's how a lot of content is made. Copy pasting stuff (with little changes) is completely different from making original content that is only inspired by someone else...

    • @b3at1
      @b3at1 10 หลายเดือนก่อน +3

      @@kikisbyteshonestly I don't think it's just done in the same style. I like kevin fang though so wouldn't mind having more videos on cs, swe, cloud, cyber, data, et cetera. in that style

  • @Butterwhales
    @Butterwhales 9 หลายเดือนก่อน +3

    I just wanted to point out that the value chat gpt returned for how many gigabytes in a petabyte was wrong its actually 1,000,000 GB. This also gets more complicated as GB and GIB are often confused and the acronyms are used interchangeably even though they are different values.
    For instance Windows actually uses GiB to represent the total volume of drives however Windows also shows these values as GB and not GiB
    GiB base 2 1024 GIB = 1 TIB
    GB Base 10 1000 GB = 1 TB

    • @paulsieben9718
      @paulsieben9718 9 หลายเดือนก่อน +2

      As you said using GiB and GB is not that uncommon, but combining both in a single calculation makes me want to scream. It's either 1000*1000GB or 1024*1024GiB, but NEVER is there a situation where you should use 1000*1024 as chatgpt did here

  • @bill1520
    @bill1520 10 หลายเดือนก่อน +3

    You have a great way of making the information simple, relevant and engaging. Subscribed!

  • @hominusprogramming
    @hominusprogramming 10 หลายเดือนก่อน +6

    Good video! But man, if you work with that kind of traffic and you don't even know the basic rules of optimization with database (like never use select *, put in your index only few fields and try to make them only integer) you deserve to pay 1b at month! Now i'm curious in which university they went: I remember that this things are taught in the database class (at least in my country)

  • @glitchy_weasel
    @glitchy_weasel 9 หลายเดือนก่อน +1

    Very interesting topic - I guess this is why is important to understand the fundamentals well, so you can later identify how to optimize data pipelines. Now pay dev team a million dollars in bonus lol

  • @reset5899
    @reset5899 10 หลายเดือนก่อน +1

    i love the kevin fang style videos

  • @Steelrat1994
    @Steelrat1994 10 หลายเดือนก่อน +2

    The story of why every dev should know how their DB of choice works.

  • @ketaminefairy
    @ketaminefairy 10 หลายเดือนก่อน +10

    wow I loved this video man, you presented everything in such manner that it didn't get boring for a second. keep it up !

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      That means a lot thank you for the kind words!

  • @nixietubes
    @nixietubes 10 หลายเดือนก่อน +2

    Needed someone to pick up kevin fang's slack

  • @zelllers
    @zelllers 10 หลายเดือนก่อน +2

    4:20 - Being able to handle 60 requests per minute is very different from actually having an average of 60 requests per minute???

  • @beepbop6697
    @beepbop6697 9 หลายเดือนก่อน

    Any SaaS service that has unbounded costs should be treated to paranoia levels. One mistake in code can explode your bills to bankruptcy levels.
    Log aggregation services are another example. Just takes one developer to add a debug log in a tight loop and you can be pushing tens of thousands of dollars per hour in a busy production system.
    I wouldn't use any unbounded costing SaaS service unless it had financial controls to limit its costs to $xx/minute (or hour) before shutting down.

  • @JOHNSMITH-ve3rq
    @JOHNSMITH-ve3rq 10 หลายเดือนก่อน +1

    I mean seriously?? This seems like absolute basics doesn’t it?! Did they really “save” that much if they just did things in a normal proper way??

  • @vatanak8146
    @vatanak8146 10 หลายเดือนก่อน +1

    great stuff! really helpful for someone trying to get into DE

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      that's awesome to hear! Keep me updated on the progess!

  • @lokylee7872
    @lokylee7872 10 หลายเดือนก่อน +4

    Great video! Keep up the good work!

  • @rallokkcaz
    @rallokkcaz 10 หลายเดือนก่อน +4

    Maybe it's called cluster in google big query, but isn't that just an index in other RDBMS?

    • @rallokkcaz
      @rallokkcaz 10 หลายเดือนก่อน

      This also looks like a problem a time series database could solve because of the ranges it searched by. TS DBs and extensions partition for you and make the workflow less error prone.

    • @warrenarnold
      @warrenarnold 10 หลายเดือนก่อน

      No a cluster and an index are 2 different things

  • @StevenTohme
    @StevenTohme 10 หลายเดือนก่อน +3

    great video! you deserve way more subs

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      I appreciate that!

  • @GeorgeTan-is9lr
    @GeorgeTan-is9lr 10 หลายเดือนก่อน

    Cool & informative video for someone who is starting in software

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      That's awesome! Are you currently in school or working right now?

  • @solacenight7553
    @solacenight7553 10 หลายเดือนก่อน

    Great video man! And another thing when you said the word "subscribe" the subcribe on your channel was highlighted, I dont know if it was intentional but it was so cool that I subcribe. Lol!

    • @kikisbytes
      @kikisbytes  8 หลายเดือนก่อน

      nooo way I didn't know that was a thing

  • @andydataguy
    @andydataguy 10 หลายเดือนก่อน +3

    Its criminal how this channel hasnt blown up yet

  • @GeorgeTan-is9lr
    @GeorgeTan-is9lr 10 หลายเดือนก่อน +1

    Thanks for the informative video!

  • @fritzfahrmann4730
    @fritzfahrmann4730 10 หลายเดือนก่อน

    Good video, but in the end the question was unanswered: waht was the single query? and did it actually cost them this much?

    • @Toleich
      @Toleich 10 หลายเดือนก่อน +1

      You got clickbaited, friend.

  • @Jeanseb23
    @Jeanseb23 10 หลายเดือนก่อน +4

    Great video. I like how you explained everything that was happening in the query and how clustering can help.
    Also, I'm guessing you're a fellow Canadian? I don't see Canadian provinces often in most examples, sometimes it feels like we don't exist to Americans hahaha

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      hahaha for sure I also never see our provinces mention in any tech videos so ofc I had to.

  • @youKnowMe4real
    @youKnowMe4real 9 หลายเดือนก่อน

    amazing vid need more

  • @luismelloleite
    @luismelloleite 10 หลายเดือนก่อน

    Loved it, keep up the good work

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      thank you!

  • @aloufin
    @aloufin 10 หลายเดือนก่อน

    subscribed -- great explanation

  • @joshuapare4304
    @joshuapare4304 10 หลายเดือนก่อน +1

    Good video, but I would have appreciated this more if the intro skit wasn't a direct copy of Kevin Fang's Gitlab video intro.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      Thank you for the feedback, promise to be more original next time!

  • @nyesExpress
    @nyesExpress 10 หลายเดือนก่อน +1

    There must be more cost to clustering the data by timestamp and geography. Is this negligible?

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      That's a good question! I'm not too sure about this one, but I would imagine that there would be a bit of additional cost. However, I think the main cost of the whole operation is still the querying aspect of it.

  • @Eurotool
    @Eurotool 10 หลายเดือนก่อน +1

    I don't get it. They used the same query before and after, so what exactly did they change?

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +7

      They clustered their data and sorted it by some fields like timestamp and geography. Since the data is sorted, BigQuery doesn't have to scan everything. Instead it can figure out where to start and where to stop. Hope this makes it more clear.

    • @Eurotool
      @Eurotool 10 หลายเดือนก่อน

      @@kikisbytes Okay this makes sense! So like pre-sorted the db? Thank you for answering

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +2

      @@Eurotool yeah exactly!

    • @dealloc
      @dealloc 10 หลายเดือนก่อน

      @@Ic3q4 Because the condition is pretty specific. Together with clustered indexes it's much easier to optimize and lookup since they are stored in the same order on disk. Additionally, selecting indexed columns also means the DB doesn't even have to look into the table if the given query only deals with indexed columns.
      It isn't magic, it's common practice when working with large datasets in RDBMS' of which Google BigQuery is.

  • @eric-id6bk
    @eric-id6bk 10 หลายเดือนก่อน +13

    Nice video, a bit too close to Kevin Fang’s style though. Look forward to seeing more.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      Yeah definitely took the intro inspiration from him. Thank you for the feedback, will try to be more original next time!

  • @siveroo7493
    @siveroo7493 10 หลายเดือนก่อน

    great video man, keep it up

  • @deadgunner3625
    @deadgunner3625 10 หลายเดือนก่อน

    I literally got a Shopify ad on this video 😭

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      ohhhh nooo was it an ad to just use shopify for online business?

  • @xfritschix1
    @xfritschix1 9 หลายเดือนก่อน

    update where....

  • @GRAYgauss
    @GRAYgauss 9 หลายเดือนก่อน

    The slack notification causes me trauma pls no.

  • @wlockuz4467
    @wlockuz4467 10 หลายเดือนก่อน +1

    Good to know that a billion dollar company's DB suck as much as my Todos DB

  • @workingguy3166
    @workingguy3166 10 หลายเดือนก่อน

    Thanks good work

  • @kjcao_
    @kjcao_ 10 หลายเดือนก่อน

    That's really interesting!

  • @haithem8906
    @haithem8906 10 หลายเดือนก่อน

    or just index the columns you're searching with...
    surprise

  • @chauchau0825
    @chauchau0825 10 หลายเดือนก่อน

    takeaway: call whatever mistake you made optimization

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      hahaha can't optimize something that isn't broken 😉

  • @ToMeq87
    @ToMeq87 10 หลายเดือนก่อน

    There is a calculation error here: you multiply 60RPM by 60minutes, which effectively gives you 3600 queries per hour and 2592000 per month.
    There would be only 2590000/60=43200 queries a month. The rest of your reasoning remains valid, just the numbers are 60times too large.

  • @madhavanand756
    @madhavanand756 10 หลายเดือนก่อน

    Which tool you used for making videos ?

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      I use a combination of tools: audacity, DaVinci Resolve, motion canvas

  • @alexander191297
    @alexander191297 10 หลายเดือนก่อน

    Always cluster and partition!!

  • @BrentMalice
    @BrentMalice 10 หลายเดือนก่อน

    gj great audio

  • @chatGPT7
    @chatGPT7 10 หลายเดือนก่อน

    Clustering or indexing?

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      BigQuery clustering is a bit different than that of indexing.
      "Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.
      In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution." - from BigQuery's definition of clustering.

    • @chatGPT7
      @chatGPT7 10 หลายเดือนก่อน

      @@kikisbytes very insightful, thanks.

  • @askegg
    @askegg 10 หลายเดือนก่อน

    So, they didn't have a covering index.

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      yeah I think it may not have been added originally.

  • @RedStone576
    @RedStone576 10 หลายเดือนก่อน +1

    among us

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน

      😂

  • @s0kulite
    @s0kulite 10 หลายเดือนก่อน

    Bootleg Kevin Fang

    • @kikisbytes
      @kikisbytes  10 หลายเดือนก่อน +1

      appreciate the feedback, I will be more original next time with the intro.

    • @worldwide6626
      @worldwide6626 10 หลายเดือนก่อน

      @@kikisbytesfeedback? No that’s a good thing.

  • @hezuikn
    @hezuikn 10 หลายเดือนก่อน +1

    skill issue

  • @tsunningwah3471
    @tsunningwah3471 7 หลายเดือนก่อน

    m,

  • @worldwide6626
    @worldwide6626 10 หลายเดือนก่อน

    Are you Indian American?

  • @imainjett8904
    @imainjett8904 10 หลายเดือนก่อน

    Quality content