I happen to realize about this about 2 weeks ago when I created an index on a column with "status" and saw no increase of performance at all, as the matter fact it actually decreased so I guess I learned it the hard way 😂
I was just troubleshooting a long running query and found that one of the tables being joined was missing an index. The SQL EXPLAIN showed exactly what index should be added. Added it and bam, 40s query down to 1s
Great video Hussein, hope u have a great vacation. I think two more ways indexes can be useless if it's on a column that changes value very very often where maintaining a rebalanced tree will become challenging and introduce a lot of overhead. Also if the rows are not that many then index might not provide a lot of benefits.
I had questions regarding the propagation of query from backend to database - why do we introduce so much overhead in this part? We use textual commands and pass it to the database for lexing *for every single query* (queries get cached but this only helps if query is exactly the same), then parsing, then actually doing its actual job. But since most of the query is programmatically generated, why don't we have a binary format for this, where we just send a blob C-style structure over the network. It would have all the information database needs without needing to filter through the human junk. Also, I am not an engineer (yet!) so no real world experience, but do databases run on different machines than the server? For all examples and usecases encountered by me, they've always been on the same machine, so using the network (especially TCP for localhost networking!) seems unnecessarily expensive. There should be an alternative method communication that doesn't require network but uses some Unix-y message passing of some sorts.
yes, databases are best kept away from your server. This way it's decoupled and you can scale each other independently. A server should do exactly what it's name suggests: serve your clients. Whatever requests your client has, it should note it down and send it to the "kitchen" where "food" is prepped, cooked and sent to the clients. Sometimes "food" is also cached :). Coming to the network problem: it's not always a full TCP connection establishment, there is a connection pool from your server to the database which is almost always active and your server can just use these existing connections to quickly send over the raw query. Yes, there is still a little overhead but it's still better than keeping your database on the server itself when we talk about scale. Finally, we don't send "parsed" queries over the network because then you are parsing it on your Server. As explained by Hussein in this video, the parsing is dynamic, i.e, the database looks at a query and comes up with the most efficient plan based on statistics which it stores in the database itself. So you're loosing all the good database optimizations if you just send the parsed query directly to the database. Also, if tomorrow the database changes it's parser (maybe an upgrade), you will have to upgrade all your server instances to have the same version of the parser. It's a headache nobody wants. Let the database do it's job of caching, looking at your query to plan and executing your query.
To beat the server query planner, you'll need actual knowledge of the data. AFAIK, only SQLite allows you to do that by embedding the entire logic into its driver; making it a classic serverless RDBMS. note: Of course there is cloud serverless db like cosmos or aurora, but those are a bit more obscure in implementation due to several abstraction layer. SQLite use simplified model where db interface is treated as part of the application itself. Regarding your last question: Yes. It is advisable to separate db with backend due to predictability factor. Roundtrip latency is generally an acceptable cost compared to resource exhaustion during peak load (since on single-machine scenario, every additional load will be multiplied by at least two: backend processing and db access). And latency is usually much more predictable compared to load. Having said that, for localhost you could use UNIX socket- or named pipes on Windows- to avoid TCP "overhead". However, on obscure edge cases where you need do bug hunting, TCP-based connection generally easier to debug. And using TCP from the get go means you're not locked to single scenario, or worse, made wrong assumptions due to difference in system behavior.
I had a question about composite indexes in nosql/mongo How does mongo know what's left and what's right ? And let's say I have a index for userid property but _id (inbuilt) is also indexed right ? So will this affect performance?
I'm not sure of this but this might be the case (it can be wrong) - When u only do select * from employees where condition; This * can cause the db to do heap scan instead of index scan because it anyways need to fetch every column from heap, so to skip two hops it goes to table scan in the first place. Please correct me if it's wrong.
Doesn't sound logical to me. If you have million rows and you do "SELECT * WHERE id = 42069" I don't see way it will scan million rows instead of using the index then getting this 1 row from the heap. From what I know select * is only bad for network traffic and CPU/RAM/IO for fetching/formatting the output. Except probably if you select *only* from the indexes then you don't need the heap at all and skip this extra step?
The index doesn't store all columns values so it doesn't really matter what columns you're looking for, instead what matters in this case is the where clause column, the DB will scan this index based on the where column, get the primary key, then it has access to all columns related to this primary key, so specifying one column or * won't make much of a difference if you have an index that will be used to filter your rows. Someone can correct me if I'm wrong :)
When the index is implied but not mentioned.. Like object type(having index) is car and we are searching for BMW but we never mentioned object type to be car in the query.
Trying to learn kotlin by just reading the documentation. Ugggggh. I like JavaScript but need to leave my comfort zone. Hope your family is doing well. Thanks for a great channel !
Another one is when you do a LIKE query with percent sign on the left of your where clause. E.g, select sth from xx where xxx LIKE %q% Since % means all possibilities, the database has no way to apply any index scan.
Yo, big fan of your outage content. Could we get some content on the recent roblox haloween outage? Have you heard anything about that? Edit: Didn't see the part about vacation, hopefully we get some good outage content when hes back
head to database.husseinnasser.com to get a discount coupon for my Introduction to Database Engineering course
I happen to realize about this about 2 weeks ago when I created an index on a column with "status" and saw no increase of performance at all, as the matter fact it actually decreased so I guess I learned it the hard way 😂
I was just troubleshooting a long running query and found that one of the tables being joined was missing an index. The SQL EXPLAIN showed exactly what index should be added. Added it and bam, 40s query down to 1s
the way you said bahrain made me so happy
Just bought your course Yesterday, looking forward to it
Great video. This helped my understand indexing a bit deeper. Thank you
Great video Hussein, hope u have a great vacation.
I think two more ways indexes can be useless if it's on a column that changes value very very often where maintaining a rebalanced tree will become challenging and introduce a lot of overhead. Also if the rows are not that many then index might not provide a lot of benefits.
exactly.
one of the reasons we generate random test data, is to force the DB engine to use the index.
explained very well thank you Hussein Nasir for such awesome content we really appreciate your content & hard work keep it up 👌
I had questions regarding the propagation of query from backend to database - why do we introduce so much overhead in this part?
We use textual commands and pass it to the database for lexing *for every single query* (queries get cached but this only helps if query is exactly the same), then parsing, then actually doing its actual job. But since most of the query is programmatically generated, why don't we have a binary format for this, where we just send a blob C-style structure over the network. It would have all the information database needs without needing to filter through the human junk.
Also, I am not an engineer (yet!) so no real world experience, but do databases run on different machines than the server? For all examples and usecases encountered by me, they've always been on the same machine, so using the network (especially TCP for localhost networking!) seems unnecessarily expensive. There should be an alternative method communication that doesn't require network but uses some Unix-y message passing of some sorts.
yes, databases are best kept away from your server. This way it's decoupled and you can scale each other independently. A server should do exactly what it's name suggests: serve your clients. Whatever requests your client has, it should note it down and send it to the "kitchen" where "food" is prepped, cooked and sent to the clients. Sometimes "food" is also cached :).
Coming to the network problem: it's not always a full TCP connection establishment, there is a connection pool from your server to the database which is almost always active and your server can just use these existing connections to quickly send over the raw query. Yes, there is still a little overhead but it's still better than keeping your database on the server itself when we talk about scale.
Finally, we don't send "parsed" queries over the network because then you are parsing it on your Server. As explained by Hussein in this video, the parsing is dynamic, i.e, the database looks at a query and comes up with the most efficient plan based on statistics which it stores in the database itself. So you're loosing all the good database optimizations if you just send the parsed query directly to the database. Also, if tomorrow the database changes it's parser (maybe an upgrade), you will have to upgrade all your server instances to have the same version of the parser. It's a headache nobody wants. Let the database do it's job of caching, looking at your query to plan and executing your query.
To beat the server query planner, you'll need actual knowledge of the data. AFAIK, only SQLite allows you to do that by embedding the entire logic into its driver; making it a classic serverless RDBMS.
note: Of course there is cloud serverless db like cosmos or aurora, but those are a bit more obscure in implementation due to several abstraction layer. SQLite use simplified model where db interface is treated as part of the application itself.
Regarding your last question: Yes. It is advisable to separate db with backend due to predictability factor. Roundtrip latency is generally an acceptable cost compared to resource exhaustion during peak load (since on single-machine scenario, every additional load will be multiplied by at least two: backend processing and db access). And latency is usually much more predictable compared to load.
Having said that, for localhost you could use UNIX socket- or named pipes on Windows- to avoid TCP "overhead". However, on obscure edge cases where you need do bug hunting, TCP-based connection generally easier to debug. And using TCP from the get go means you're not locked to single scenario, or worse, made wrong assumptions due to difference in system behavior.
on point 2, i had figured out this one a few months ago. I was totally shocked.
The composite index thing is really interesting
and how does indexes works when we make a join by a PK it will search the results by Index and then search in the other table? great video btw
I had a question about composite indexes in nosql/mongo
How does mongo know what's left and what's right ?
And let's say I have a index for userid property but _id (inbuilt) is also indexed right ?
So will this affect performance?
I'm not sure of this but this might be the case (it can be wrong) -
When u only do
select * from employees where condition;
This * can cause the db to do heap scan instead of index scan because it anyways need to fetch every column from heap, so to skip two hops it goes to table scan in the first place.
Please correct me if it's wrong.
Doesn't sound logical to me. If you have million rows and you do "SELECT * WHERE id = 42069" I don't see way it will scan million rows instead of using the index then getting this 1 row from the heap. From what I know select * is only bad for network traffic and CPU/RAM/IO for fetching/formatting the output. Except probably if you select *only* from the indexes then you don't need the heap at all and skip this extra step?
The index doesn't store all columns values so it doesn't really matter what columns you're looking for, instead what matters in this case is the where clause column, the DB will scan this index based on the where column, get the primary key, then it has access to all columns related to this primary key, so specifying one column or * won't make much of a difference if you have an index that will be used to filter your rows.
Someone can correct me if I'm wrong :)
Yes I guess, it'll still use the index scan 99% of the time, it's just bad for the I/O.
selecting only happens after finding the rows.
Regarding the first name case - can we not just create an index on a normalized version of the first name table?
Similar situation!
Saving in watch later
can you please make a video on HLS ? and more importantly how you explore/read untouched technology ?
I think this will help ruclips.net/video/1-KmLc0c2sk/видео.html
hi hussein
love from egypt man ❤️
Enjoy your vacation bro, we will all be here when you return.
Regards from Costa Rica.
⭐ Thanks Man, You are #1 ⭐
good video.. even better to display with 1.5 speed
When the index is implied but not mentioned.. Like object type(having index) is car and we are searching for BMW but we never mentioned object type to be car in the query.
Trying to learn kotlin by just reading the documentation. Ugggggh. I like JavaScript but need to leave my comfort zone.
Hope your family is doing well. Thanks for a great channel !
Another one is when you do a LIKE query with percent sign on the left of your where clause. E.g, select sth from xx where xxx LIKE %q%
Since % means all possibilities, the database has no way to apply any index scan.
also, iLike with/without the leading % will cause a full table scan.
Yo, big fan of your outage content.
Could we get some content on the recent roblox haloween outage?
Have you heard anything about that?
Edit: Didn't see the part about vacation, hopefully we get some good outage content when hes back
Thank you sir for the video :)
How you make it ?!!
really when i face a problem , i just find that you talking about in newly videos!! 😂😂
Thank you Hussein
I read minds 😍
please give sql code or benchmark result when it useless
Hey! Let us know how life is back home ! That sounds fantastic!
commenting for the sake of RUclips algorithm, to help him monetize from the video.
500th like ❤️
Welcome back to bahrain in Advance
Don't we say indices instead of indexes? 😁