Stop using COUNT(id) to count rows
- Published on Jul 8, 2024
- 📚 Learn more about PlanetScale at planetscale.com/youtube.
------------------
00:00 Intro
01:04 Origins of the myth
02:00 What does COUNT(*) mean?
02:30 COUNT(*) example
04:00 Primary and secondary indexes in MySQL
05:20 COUNT(id) example
------------------
💬 Follow PlanetScale on social media
• Twitter: / planetscaledata
• Discord: / discord
• TikTok: / planetscale
• Twitch: / planetscale
• LinkedIn: / planetscale - Science & Technology
What about counting in OurSQL?
I can always count on you! YouSQL
😆@@aarondfrancis
Chill comrade, we don't need to expropriate the database engine for the proletariat
in soviet Russia you don't count(*) , * counts you
Did you mean YQL used in YDB?
I had a strong feeling that something like this was going on in the DB engine, but the only "argument" I had was pretty weak: "It would be very stupid if the DB engine pulled all data from all columns just to pass it to the count() function and return a single int", and it would be very easy to implement such an optimization. So I was using count(*) while simply trusting the DB to do the right thing.
Thanks to this great video, now I have a confirmation and also know exactly how the DB decides how to count rows in an optimal way. Great video :)
Nice to have your gut feeling proven right! Glad you liked it.
This was maybe true far far in a past, but many years ago count(*) was optimized in DB engines to not use whole columns.
I suggest using ‘count(1)’ to specifically show that no data is used. It eliminates any confusion.
I think this is one of the most important skills in programming. Knowing you are not the only smart person around :) If it's so obvious to us, developers of the engine might have thought of it as well. Modesty W
The most important take from the video is not about count(), but generically that (if speed is critical) you should always review the execution plan. Easier to do while at the prototyping phase, but can still be accomplished in production, just needs more testing and QA.
The problem is the execution plan the database will use while in development with very little data will be different to the execution plan in production when the query is now traversing millions of rows. So you just have to make educated guesses and use experience to know when a query will work or not and then iterate.
I don't think you got the main point. It's about shutting down your uncle when everyone is around.
If speed is critical you should always write your queries in the most simple and easy to understand way. Optimizing the execution plan is one of the main purposes of the DB engine. Just let it do its job without interfering. When performances degrade, the first response is to update statistics to let the engine have an accurate view of the data. If the engine still fails at finding the fastest path, rewrite your query in a better way. And ultimately give it directions as a last resort solution.
@@christianbarnay2499 "If the engine still fails at finding the fastest path, rewrite your query in a better way". That's what i said. "Review the execution plan" not "optimize the execution plan". If you can see that the DB engine can't make heads from tails of what you want, THEN you need to think on what/where to change so that it can. And waiting for degradation is a bad idea, because it can take long enough that when you NEED to make changes you might find out you CAN'T make them anymore (without breaking stuff).
Also, most people writing queries are software developers with not that much DB background. You can't expect them to write "excellent queries" or design "excellent schema"... As a "dumb software developer" i count my blessings when i have really good DB people around so i can show them my stuff and they can keep me from shooting my own feet :D
@@ErazerPT Databases are so central to software that I can't label someone a software developer if they can't write decent SQL requests. And SQL is so easy to understand it only takes a couple hours to get the basic knowledge that will fulfill 90% of your needs.
This is actually incredibly educational. Thanks, planetscale!
"todos" also means "everything" in Spanish. So, "Select * from todos" means "select everything from everything". Terribly inefficient but it'll be great fun writing the code to sort out what we want from the results 😜
- a grizzled vet.
I get that, also works in portuguese 😂
Two thumbs up to you. My TL was at times bugging me on some reviews. Even when I showed him SQL docs saying that most optimisation is already being done by the MySQL there's no need to over engineer the solution.
With count(1) you get the performance boost without all the confusion
Yup! Counting with a constant is totally a viable option. See 06:09.
It's what I always do, because I feel it is the most expressive.
I also like to do that. No data is needed from the rows to count them.
This is what I've been doing for ages. Using count(*) at least has been slower in some older or obscure database engines. And I've worked with several.
"You'll have won the argument, which is what the holidays are all about." 🤣
Great stuff! Love that you mix in a bit of fun with the content, it's what got me to subscribe!
Just subscribed. Hard to find people who actually talk real facts these days.
A couple of corrections:
The COUNT() instruction can receive as parameter an EXPRESSION or a wildcard, that is to say, you could write: COUNT(*) or COUNT(1), COUNT(pepito), COUNT(id) or COUNT(99999) which will give you the same, the inside is considered a wildcard, the "*" is used as a wildcard by convention , but we could use any character because by definition, it doesn't use information about any particular column (and including for the COUNT the rows that contains NULL values in any column).
In the case that you comment the COUNT(id) and the COUNT(*) bring the same result because the "id" is declared as if it was a wildcard so the behavior is the same and the server takes the license to optimize the process as you have explained in the video.
But, if you really wanted to count the values of a field, the correct way would be to specify COUNT(ALL id) and this expression does have a difference with respect to the COUNT(id), and it is because it will only consider for the count the NON NULL values inside that field In the case of the example of the video COUNT(id) and COUNT(ALL id) should return the same result, since the "id" field, being a primary key, would never be empty, but the difference would be that you would force the server to use the index of the primary key to execute the COUNT(ALL id).
Finally, while it is true that the server often saves us from ourselves, it is not exactly true that it always makes the best decisions. As a DBA with over 10 years of experience, I have found myself in several situations where, after checking the execution plan, I realized that the server was picking a suboptimal index for the requested statement and I had to direct it to the correct one. This is seen quite often in big data queries.
I partially disagree with your second paragraph. I put it to the test. I downloaded the open source 'world database' for MySQL and ran 2 queries. select count(*) from country; which gives a reply: 239. The second query is: select count(IndepYear) from country; which gives a reply: 192. IndepYear is not a primary key and has several NULL values. If you are wondering: select count(ALL IndepYear); returns the value 192 as well. Hence, in MySQL 'ALL' is optional.
I felt my brain growing as I was reading your comment
Why is someone liking this absolutely wrong answer? count() depends on particular columns. count(id) will count only NON NULL ones. COUNT(id) and COUNT(ALL id) are absolutely the same expression, as count(id) is implicitly ALL.
@@DmitriyYankin read the damn documentation before you say something. Also, obviously some databases work a bit differently from what I said; if you use another SQL database, read your own documentation 🙄.
@@user-yb6rd1fm5e yep, read it yourself. dev mysql com: "COUNT(expr) Returns a count of the number of non-NULL values of expr in the rows retrieved by a SELECT statement." ... "COUNT(*) is somewhat different in that it returns a count of the number of rows retrieved, whether or not they contain NULL values." ...
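The NULL-handling rule quoted from the MySQL docs above can be checked in a few lines. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere), and the `country` table and its rows are made up for illustration, not taken from the world database.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption); the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT NOT NULL, indep_year INTEGER)")
conn.executemany(
    "INSERT INTO country (name, indep_year) VALUES (?, ?)",
    [("A", None), ("B", 1975), ("C", None), ("D", 1912)],
)

# COUNT(*) and COUNT(1) count rows; COUNT(column) counts non-NULL values only.
count_star, count_const, count_column = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(indep_year) FROM country"
).fetchone()

print(count_star, count_const, count_column)
```

With two of the four rows holding NULL in `indep_year`, the first two counts cover every row while the third skips the NULLs, which is the same pattern as the 239 versus 192 result reported above.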
Thanks for this. I've had this argument so many times. I think some early SQL engines did look at all columns for count(*), but I believe pretty much all of them have this optimization at this point.
I believe it is a key point. Some early SQL engines did not have that optimization; now they have it. Drawing a conclusion with a knowing look now is not exactly right. So one should just do it automatically and not even spend time on it: always use `count(1)` and don't worry about the performance or search for confirmations. That solution works everywhere.
This is the type of content I didn't know I wanted. More please.
Great video. Short, helpful, and straight to the point.
Always nice to see how the optimiser is working under the covers. I’ve seen a few cases where the original program had done something dumb but the optimiser picked up the issue and optimised it away. Still makes me uneasy relying on it though.
I might call myself a somewhat senior programmer. Sometimes query optimizers did not realize to use a clustered or whatever else indexes of a table when using count (*). This happened at least on Oracle 8 and the workaround was to use count ([indexed column]) where [indexed column] = something. Count (*) caused full table scan in some cases, at least if table contained lobs. So there might really be a reason why some grey beards warn on count(*). When in doubt, check the execution plan.
I think the real takeaway here is that Oracle sucks
@@TheGreatAtario Sir you are most certainly correct on Oracle, but there are same kind of stupid behaviors in almost every database as far as I know. Not this fault but many more and different. The real takeaway is that always check the execution plan :)
It is always a challenge to figure out if something was done on purpose or by lack of knowledge or any other reason
Why count should use index?
@@DmitriyYankin it is up to the query optimizer whether to use an index or not, but in general it uses the cheaper solution (fewer IO operations)
Love it! Thanks guys for sharing
What a great video. Earned a subscribe. Looking forward to more!
primary keys are clustered, aligned to disk clusters, physically, so counting them means traversing the disk to gather the count, and if the order on disk is not adjacent, then the index is fragmented so counting can take a lot of time, while non primary keys or indexes are no clustered, meaning they don't need to follow the disk physical alignment so they are most often stored off-table in a much more compact data structure, which even when it gets fragmented, the data is still going to be close to each other because all the data structure holds is index records, not every row in the table, like what each clustered index follows.
Correct!
Very cool to know.
FYI: I think for some databases this is not true though.
Clustering can be set differently than primary keys.
For db2, for example primary indexes are not by default clustered.
For databases like snowflake, they do not index the primary key.
Each DB may be different. Still very cool. Thanks!
@@jvapr27 correct, you can have one single index that can be clustered because that orders the records of that table physically. But that index doesn't have to be the primary key though some dbs might enforce that.
count(*) and count (id) are semantically equivalent, the database should not behave differently for those queries.
@@debasishraychawdhuri well, no. One is explicit, the other is implicit. Implicit means suggested but not expressly stated. Thus they are completely different. That is logical deduction.
for specific purposes (e.g. extremely large table, statistics, etc) i set up a count-table with a single row and column, holding the information of the count of rows of the "parent table". this requires setting up triggers on the parent table insert+delete procedures to increment+decrement the value of the count-table.
keep in mind that this setup slows down the process of writing data, but data is usually read many times in contrast to written.
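The count-table setup described above might look roughly like the following. It is a sketch under stated assumptions, not the commenter's actual schema: SQLite (via Python's sqlite3) replaces MySQL, and the names `todos`, `todos_count`, and both triggers are hypothetical. SQLite triggers fire once per affected row, so a multi-row DELETE is handled without extra looping; other engines may behave differently.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- Single-row, single-column table holding the cached row count.
CREATE TABLE todos_count (n INTEGER NOT NULL);
INSERT INTO todos_count (n) VALUES (0);

-- Row-level triggers keep the cached count in step with the parent table.
CREATE TRIGGER todos_after_insert AFTER INSERT ON todos
BEGIN
    UPDATE todos_count SET n = n + 1;
END;

CREATE TRIGGER todos_after_delete AFTER DELETE ON todos
BEGIN
    UPDATE todos_count SET n = n - 1;
END;
""")

conn.executemany("INSERT INTO todos (title) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM todos WHERE title IN ('a', 'b')")  # removes two rows

cached = conn.execute("SELECT n FROM todos_count").fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM todos").fetchone()[0]
print(cached, actual)
```

Every write to `todos` now also updates the single counter row, which is the write-side cost mentioned above.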
That seems extremely 'hacky' and you end up doing the same thing an auto_increment lock does (with more hoops) and if your system gets busy or needs to scale you basically lose any concurrency. Also makes your application a whole lot less portable and you could make a mistake (eg. most examples I've seen online, do not take into account that an INSERT or DELETE can target multiple rows, but the trigger only gets called once, so now you need looping logic or you have a bug). Not sure if you actually "need" a count if you're working with that many records, but most database engines can provide an estimate or perhaps, you may be able to use a different database system altogether that is better optimized for providing statistical information.
yes and at the same time you need to keep on separate columns all the conditions combinations for filtering, also good luck finding programmers that actually use things like triggers since it hides the application business logic in the database
I don't know for MySQL but most DB engines already do that on their own. They have a "table information" table that contains all the metadata of each table, including the row count.
select count(anything not nullable) without a where clause will automatically trigger a lookup in that table to get the current count of rows of the selected table.
I know about count(*) but I didn't know how the Optimizer decides about what index to use.
One learns new things every day.
Thanks Aaron!
Thanks for sharing ur knowledge 😊
Great explanation. Thanks that's very helpful.
Please make more education SQL content. This was fantastic.
Maaan, I didn't know this channel is posting such content :0 I liked it, subscribed and activated notifications.
Very nice! Thank you!
Great video, thanks!
Cools, I am using MySQL for years but never heard of this one. Please more videos like this :)
which also means it needs a secondary non null index to function the way you indicated.
redo the trial without such an index to see what it does
I actually learned something new today! Thanks!
Thanks so much for providing this great information.
Holy count, I just found an awesome channel to subscribe to. Love the humor at the end!
Holy count 😂
I love content like this, I'm going to have to check this out for my self!
Good video, thanks
Best ad ever, keep them coming
Wow, using mysql for decades and never knew this. Thanks man.
Incredible video!
"You'll have won the argument, which is what the holidays are all about" - best sentence I've ever heard. :D
Very interesting, thank you!
This makes sense when you select all rows from a table, but throw in any WHERE clause or other filtering and the advantage might evaporate. The hardest thing I found was displaying results with filtering. For example, you might want to show how many todos are done out of the total. This specific table does not have a 'done' field, but if it did, and that field was not part of any key, the query would result in a table scan.
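The filtered-count concern above can be explored with a small experiment. This sketch assumes a hypothetical `todos` table with a `done` column and uses SQLite through Python's sqlite3 rather than MySQL, so the plan text differs from MySQL's EXPLAIN output; the index name `idx_todos_done` and the helper `plan` are made up for illustration.

```python
import sqlite3

# Hypothetical table; SQLite stands in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT NOT NULL, done INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO todos (title, done) VALUES (?, ?)",
    [(f"todo {i}", int(i % 3 == 0)) for i in range(30)],
)


def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT COUNT(*) FROM todos WHERE done = 1"
plan_before = plan(query)  # no index on done yet
conn.execute("CREATE INDEX idx_todos_done ON todos (done)")
plan_after = plan(query)   # the filter can now be answered from the index

# "Done out of total" in a single pass using conditional aggregation.
done_count, total = conn.execute(
    "SELECT COUNT(CASE WHEN done = 1 THEN 1 END), COUNT(*) FROM todos"
).fetchone()

print(plan_before)
print(plan_after)
print(done_count, total)
```

Comparing the two plans shows whether the filter column is covered by an index, which is the same check the video performs with EXPLAIN in MySQL.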
Hey dude, your way to explain this topic is very well !! Congrats
Hey, thanks!
This is also true in MS-SQL. It will use the narrowest non-clustered index on a table. If there is none, the table's clustered index has to be checked, which is slow. But I think in MS-SQL you can use sys.sysindexes to look up the row count even faster.
A nice gotcha indeed! One question : this "optimization" -- is it applicable ONLY to MySQL or is also the case with say, Postgresql ??
in postgres the star is not necessarily the fastest, it also depends if an index is used or just a scan is used, due to the amount of data in the table, and if auto vacuum was successful recently.
It applies also to oracle. Count(*) counts the rows not the data. But keep in mind that inline-views or subselect still have to fetch data for the sql to even work.
Using a constant instead of * (most often 1) is also a common, viable alternative.
thanks for this info
I love your style!
Thank you!
Like the EXISTS function where the "SELECT * FROM ..." is just a predicate for the function to work and only the WHERE clause is meaningful
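The EXISTS comparison above can be illustrated the same way. A minimal sketch, again assuming a hypothetical `todos` table and SQLite via Python's sqlite3 in place of MySQL: the select list inside EXISTS does not change the answer, only the WHERE clause does.

```python
import sqlite3

# Hypothetical table; SQLite stands in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER NOT NULL)")
conn.executemany("INSERT INTO todos (title, done) VALUES (?, ?)", [("a", 0), ("b", 1)])

# Same predicate, different select lists: both report whether a matching row exists.
with_star = conn.execute(
    "SELECT EXISTS (SELECT * FROM todos WHERE done = 1)"
).fetchone()[0]
with_const = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM todos WHERE done = 1)"
).fetchone()[0]

# No row has done = 2, so EXISTS is false whatever is selected.
none_match = conn.execute(
    "SELECT EXISTS (SELECT * FROM todos WHERE done = 2)"
).fetchone()[0]

print(with_star, with_const, none_match)
```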
Just curious, what if you copied the id field as a secondary key so whenever id gets a value this copy would also get a copy, just for this purpose?
very informative!
Yes, very often the problem you are trying to solve is more generic, and can be expressed in more generic terms.
Using language features and trusting that smarter people have put more effort in optimizing the language itself is often more optimised than what we can come up with ourselves.
One classic C trick was to not use multiply/divide and instead add or subtract bitshifted values. However, in modern systems that no longer take dozens of clock cycles to multiply, the compiler knows better and will just replace your whole expression with a multiply.
The compiler may even optimize your division into multiplication or bit shifts, and all other kinds of fun wizardry. It really is more important to have readable code in many cases nowadays.
@@Demonslay335 In some cases it doesn't do what you want it to do; but only then is it worth looking into optimization. We should always be aware of the performance impact our code has, but there's no need to go crazy about optimization before you even have any performance information to work with.
usually compiler isn't as smart as you think. It can do simple optimizations like the one you mentioned, but anything slightly more complex it fails at.
@@hwstar9416 That is because you are using more dynamic languages. With more strict rules for memory safety and type setting, newer languages like Rust and Zig are doing wonders. It doesn't save you from algorithm problems with Big O of n cube tho
@@harrytsang1501 I don't use dynamically typed langs, I use C/C++.
People often overestimate how optimizing the compiler is, it's not as impressive as you think
The advice to avoid count(*) predates mysql. It may have even been true in mysql at some point or some obscure schema. As you said, count(*) relies on the optimizer to do the right thing. I use count(0) myself, but I wouldn't be surprised if sometimes you have to be more specific to get the right query plan. I think the biggest lesson in this video is not to rely on advice, but to check the plan and know that counting an index can be faster--which is great advice!
When I got a job that involved Oracle 7 in 1999 the DBA told me to use count(*) because the old hack with count(1) wasn't needed anymore. So it was presumably true at some point in the 90s.
I was teaching performance tuning on behalf of a database vendor in 1989. COUNT(*) was the recommended approach for SQL Server (both Microsoft & Sybase), Oracle, DB2, & Ingres.
So yes it predates mySQL. The syntax alternative was to specify a column name, But that was only if you wanted to find the count of non-null fields in that column or expression.
Count (constant) was never necessary in any platform I've used. Yet lots of people suggested it. Most had minimal clue about DB internals or query optimisations.
@@ivanskyttejrgensen7464 I have used it on oracle6 so at least even back then the advice to use count(1) was already outdated. Likely it was something for pre-ansi sql.
But all queries rely on the optimizer. The optimizer is the core of the querying engine. As long as you don't mess with the execution plan by forcing a path through hints, the optimizer will always kick in and do its job.
@@christianbarnay2499 a big difference is that count(*) semantics are defined by the SQL standard itself, so its optimisation is a lot more likely than count(constant) being recognised as equivalent... to count(*).
Wow, thank you!
Your thanksgiving conversations sound interesting
I love how you explain things, is there any course from you that teach from the ground up
Check out our MySQL for Developers course: planetscale.com/learn/courses/mysql-for-developers/introduction/course-introduction
One of the best videos I have ever watched on SQL, tq
Thank you! Love hearing that
I stopped using count(*) decades ago when I ran into a problem with our database engine at the time where there was some catalog corruption and this was erroring out with "column not found error". Lately I've been using count(1). Likely not a great reason to continue not using it (and I'm expecting the execution plan to be the same in any case).
I knew the db optimizes the query, but didn't know details like this. I would be surprised if after 20+ years of development, it would interpret count(*) as "load everything from the table and count it". What about when you don't have a secondary index, or doing a join query? Probably still counts the returned rows, or some optimization with joined indexes?
This changes by database server. This is true for MySQL but SELECT count(*) is much slower in SQL Server.
In SQL Server the way I learned to do it was SELECT count(1)
“When you’re arguing with your family…” haha nice one.
thanks for the educational content
You could also use the „rows“ from the explain query. In some use cases this is already good enough ^^
Interesting, thanks ✌️
you are amazing at explaining things man
Thank you!
Great video
Thanx
I didn't know this. Thanks! Has it always been this way?
You never finish learning
Great explanation 👍
"Tell your family on thanksgiving" It would take me about 5 winters to explain this to my family.
This video should have more views and likes ❤
if you remove secondary indexes, and run count again, how long it takes?
Nice video, thank you!
And what's the time complexity of this operation in MySQL? Is it linear? How would you go about counting records in a table with millions of rows?
O(n) but not all O(n) algorithms take the same time in practice. The constant factor of n is probably several times different.
I've been using COUNT(1) for a while now. I wonder if the behavior is the same in other databases such as CockroachDB
Is 'select count' still always going to be a table-scan query? Or are there internal optimizations, that maintain a count of active rows in a table.. maybe an in-memory cache that's updated atomically with insert/delete operations?
Is this in ANSI SQL too? or what would be the advantage of supplying the column name in COUNT()? doesn't COUNT() skip the rows that have NULL in the specified column?
Love this video. Love the way he teaches.
❤️ thank you so much
I wonder if it's supposed to be the same for oracle because at one point we had a significant performance drop when using count(*) compared to count(id).
Maybe just another of those optimizer issues.
You should also create index for columns that are used for whatever is usually in "where" ;)
on mysql
Even in SQLite COUNT(*) takes less opcodes compared to COUNT(1) and COUNT(id) as it will just read value that is stored already, instead of having to aggregate.
Great video, It was incredibly well-presented.
By any chance, would it be possible to remove the credit card requirement for creating free DB? Thanks a bunch!
Well said
count star is asking recordcount and that is table property. When select * is used in columnar database then it is not optimal since in case columnar databases columns are retrieved separate and then records are put together. In case of columnar databases for null allowed columns also distinct count is also column property sometimes also min and max etc. Vertica has for example no indexes at all.
Does this mean adding a non-null secondary index will improve count performance on tables that don't already have one?
You could create a secondary index on the same column(s) as the primary index. That would then speed up counting, unless the DB engine is doing some clever stuff under the hood that means you don't necessarily have to do that. But given what he's said in this video, creating a secondary index on the same column(s) as the primary is certainly a good workaround. Maybe someone who knows more than I do could clarify this point.
To answer your question directly, having ANY index on a table is way better than none at all, both in the case of searching and counting. So having a simple non-null non-unique index would be the minimal requirement for fast counting. It also matters if the columns are fixed or variable length datatypes, eg. int is fixed length, varchar is variable. If all columns are fixed, then the database can also do a fast count without any indices because it knows that the length of each row is fixed, and it knows the full size of the entire table, thus what the number of rows must be.
Thank you for saving everyone's Thanksgiving! 😂
Any queries with dynamic attributes , functions , etc... Don't cache on db engine level if you use caching there, so in that case it does effect the performance.
That is not true - not only does this behaviour depend on the DB but also which function is used: Most DBs have some sort of notion of pure functions (functions who always return the same result for the same input and have no outside sideeffects).
Heck we are using them. A couple thousand of those functions even.
great info. However, I'd also like some more insights into count(*) in case we have a where clause in the query.
Since count(*) uses the smallest secondary non null key, will it be "slower" if I'm counting rows where a column value is null (or perhaps some other where clause combination which might include nulls) ?
It won't be slower, but it will only count rows where that column isn't null!
I should create a database engine where COUNT(*) is "multiply number of columns by the number of rows", COUNT(/) is "divide rows by columns", COUNT(+) is ... (you get the idea)
more content about SQL, please!
Would be interesting if the same is true for other SQL engines like SQLite, Postgres and SQL Server
I'm a vet and will use *. I don't care if it bothers anyone, because I want to just move on. However, I Love this video and the knowledge you share.
I can't wait to go to the next thanksgiving.
Looking forward for the next argument with my family about the performance of the SQL count operation. Everyone will be so excited
Hopefully you win! "Happy Thanksgiving, y'all don't know anything!" - Peter, probably
How is someone talking about SQL this charming
Now I'm ready for the holidays 💪🏽
Go get em!
I never got into using SQL databases, but having written a few on-disk and over-the-network data structures, I'd expect the count(*) to be smart enough to use some cached "total_length" value, especially considering that a lot of effort went into writing query optimizers. I guess, people would think that because of lack of experience of writing a data store yourself?
would be easy enough on a simple table query, in fact count() and indexes probably do do that. However whenever you call a function or a view or whatever script with at least a little logic, the final length is unknown
I never wrote a data store, but worked with relational databases for almost 25 years. Still, assuming COUNT(*) wouldn't be fast and it would process the whole record sounds utterly absurd to me. (And, honestly, I never heard that myth)
actually, WAY back in the day in Oracle (1990s), it was recommended to use SELECT COUNT(1) and not COUNT(*) because it actually did make a difference. but they fixed that. but some grizzled old devs kept that convention.
Yeah, I've been using count(1) on Oracle for decades. I've suspected for a while that recent versions have a smart enough query optimizer to do the right thing with count(*), but I haven't taken the time to verify. And for MS SQL Server, the "culture" has always been to use count(*), so I've assumed it's ok there.
I always use COUNT(1). Because COUNT(*) depends totally on the engine's optimisation.
0:55 >optimized
This optimization is specifically MySQL case, in other DB it may not be so and can even depend on DB version. So generally for SQL count(id) is better (if there is suitable index of course).
new thing for me. TQ
I verified the explain plan of count(*) on a table, optimiser even picked a secondary index on a null column.
With a proper SQL implementation, this should not matter; the compiler should handle this.
@@lawrencechiasson975 *which is part of the compiler.
Fire video ❤
Thanks 🔥