  • Published: 10 Sep 2024

Comments • 31

  • @DHARSHANSASTRYBN
    @DHARSHANSASTRYBN A year ago +7

    Great video Arpit!!
    One more step is missing after cutover: before we prune old records from the source database, we either need to stop the Kafka consumer listening for the delete CDC events, or make an entry so the consumer filters/ignores those records; otherwise we may end up applying these deletes on the target database as well :)
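A minimal sketch of the step this comment describes (all names hypothetical): while the prune job deletes old rows from the source, the CDC consumer must skip the resulting delete events, or it would replay those deletes on the target.

```python
# Shops whose source rows are being pruned after cutover (hypothetical).
PRUNED_SHOP_IDS = {"shop_m"}

def should_apply(event: dict) -> bool:
    """Decide whether a CDC event should be applied to the target DB."""
    if event["op"] == "delete" and event["shop_id"] in PRUNED_SHOP_IDS:
        return False  # cleanup delete emitted by the prune job: ignore it
    return True

events = [
    {"op": "insert", "shop_id": "shop_a", "row": 1},
    {"op": "delete", "shop_id": "shop_m", "row": 2},  # prune-job delete
    {"op": "delete", "shop_id": "shop_a", "row": 3},  # genuine user delete
]
applied = [e for e in events if should_apply(e)]
```

Only the prune-job delete is dropped; genuine user deletes still flow through.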

    • @AsliEngineering
      @AsliEngineering  A year ago +1

      Hahhahaha 😂😂😂 true. We should definitely switch off the CDC once everything has caught up 😀😀
      It once happened to me and I was left wondering where the data went???? 😂😂😂😂😂

    • @DHARSHANSASTRYBN
      @DHARSHANSASTRYBN A year ago

      @@AsliEngineering 😅

  • @adianimesh
    @adianimesh A year ago +4

    Another approach could be that instead of moving the shop which was making the shard hot, we move the other shops' rows to rebalance the load!

  • @mohammedabbas6404
    @mohammedabbas6404 A year ago +1

    Very well explained👏🏼. Thank you Arpit.

  • @akshay_madeit
    @akshay_madeit A year ago +1

    Great notes and content. Awesome bro. Keep rocking

  • @PriyankaYadav-tb3pw
    @PriyankaYadav-tb3pw A year ago +1

    Great video Arpit. Thanks 🙏🏻

  • @rishabhgoel1877
    @rishabhgoel1877 A year ago

    Good stuff, but shouldn't this be done in a phased manner, like first stopping reads from DB1 (writes still going on) and then stopping writes, or having some sort of phased treatment/dial-up?

  • @sagarmehta3515
    @sagarmehta3515 A year ago

    Very well explained

  • @koteshwarraomaripudi1080
    @koteshwarraomaripudi1080 A year ago

    I absolutely love the content of this channel. Thank you, Arpit.
    Is there any other channel out there that posts similar content about engineering challenges and how they are solved? If there are any, please comment them. Thanks in advance.

  • @malepatiyogeshkumar3855
    @malepatiyogeshkumar3855 A year ago

    Hello @arpit, thanks for the great video. Won't they have a replica / data warehouse to store the data in case of some failure? If yes, can't we do a bulk load from there instead of reading from the current DB?

  • @rjarora
    @rjarora A year ago

    I have one question. Won't they already have replicas? Can't they copy from the replicas instead of putting load on the master?

  • @sagarnikam8001
    @sagarnikam8001 A year ago +1

    Hi Arpit, great video and I really appreciate your efforts...
    I have a few doubts; it would be great if you or anyone here could clarify them.
    1. If we move a particular shop (which was making the shard hot) from one shard to another, how will that solve the issue? I mean, wouldn't it just make the other shard (where we copied the data) hot?
    2. If we batch-copy data from the hot shard, wouldn't that increase the load on the shard and maybe bring it down? I don't know exactly how much load the batch copying brings in...
    3. What would be the ideal threshold of load on a shard (e.g. 80% or 70% resource utilisation) after which we should think about moving data? Or is it when we see a large difference in resource utilisation among the shards?
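On doubt 2 above, a rough sketch (all names and numbers hypothetical) of why batch copying need not overload the hot shard: the copy is broken into bounded chunks with a pause between them, so the migration adds only a capped amount of extra load per unit time.

```python
import time

def copy_in_batches(source_rows, batch_size=1000, pause_s=0.0):
    """Copy rows in bounded chunks, sleeping between chunks to throttle load."""
    target = []
    for i in range(0, len(source_rows), batch_size):
        # One bounded read from the source and write to the target per batch.
        target.extend(source_rows[i:i + batch_size])
        time.sleep(pause_s)  # throttle: caps the migration's extra load
    return target

rows = list(range(2500))
copied = copy_in_batches(rows, batch_size=1000, pause_s=0.0)
```

Tuning `batch_size` and `pause_s` trades migration speed against the load added to the source shard.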

    • @rohan_devarc
      @rohan_devarc A year ago +1

      A hot partition problem is a subjective problem and is often related to database internals.
      - It involves dealing with any kind of mandatory operation that can't wait any longer, say backup, compaction/VACUUM, write repair, or read repair; these operations add extra IO and networking on top of the already occupied CPU cores, after which the DB goes into an imbalanced state.
      - Shifting shards involves redistributing the data volume as well as offloading the mentioned operations to other cores/NUMA sockets.
      - So rectifying the hot partition problem is more about rectifying the imbalance.
      - Don't rely on a database unless its shard-balancing mechanism matches the application's needs.
      From the DB's perspective, only replicas can suffer from hot partition problems. However, this is not true for many applications, and hence Shopify has an application-space driver; "batch copying" is a lot more economical, especially when you have more control over the driver that triggers the batch operation. This is in contrast with DB tooling, which can only offer some amount of abstract control.
      "There is no threshold until you start having one", i.e. an application-space driver for shard migration would work on application-specific load patterns.

    • @LeoLeo-nx5gi
      @LeoLeo-nx5gi A year ago

      Firstly, thanks Arpit bhaiya for explaining with such ease!!
      Now I also wanted to ask about the second point which Sagar Nikam mentioned: won't it put more load on our initial shard when we copy data from it to another one?

    • @dharins1636
      @dharins1636 A year ago +1

      1. The non-hot shops are being moved to reduce the load on the pod X DB, which has a higher load due to shop M.
      2. This is done via the binlog: in MySQL, all transactions are logged in a "file" which can be used to replicate the data to different databases or messaging systems (see Debezium).
      3. That really depends on the use case and the impact of the load on the system, which comes down to trying out different configurations (and, as mentioned in the video, it's driven by the analytics team).
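A toy model of point 2 above (not Debezium's actual API): because binlog entries are strictly ordered, replaying them on the target reproduces every change, including updates that land after a row was batch-copied.

```python
def apply_binlog(target: dict, binlog: list) -> dict:
    """Replay ordered change events onto a target key-value store."""
    for entry in binlog:  # entries are applied serially, in commit order
        if entry["op"] in ("insert", "update"):
            target[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            target.pop(entry["key"], None)
    return target

target = {"row1": "old"}  # state right after the initial batch copy
binlog = [
    {"op": "update", "key": "row1", "value": "new"},  # update after the copy
    {"op": "insert", "key": "row2", "value": "x"},
    {"op": "delete", "key": "row2"},
]
apply_binlog(target, binlog)
```

After replay the target converges to the source's latest state, even though `row1` was copied before its update.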

    • @LeoLeo-nx5gi
      @LeoLeo-nx5gi A year ago

      @@dharins1636 Ohh thanks, didn't know about Debezium

  • @SubhamDas-dw3kk
    @SubhamDas-dw3kk A year ago

    Wondering what would happen during a Black Friday sale, when multiple stores (or shops) will be hot. Additionally, even a very small downtime (even half a minute) would be a great business loss.

  • @swagatochatterjee7104
    @swagatochatterjee7104 A year ago +2

    What if the lag never dies? That is, there are always new writes to the old DB.

    • @AsliEngineering
      @AsliEngineering  A year ago

      Lag has to wear off; it is a function of CPU utilisation.
      There is a possibility that lag keeps increasing because of an underlying hardware failure, and in that case you create a new replica from a recent snapshot, so it needs minimal time to catch up.

    • @swagatochatterjee7104
      @swagatochatterjee7104 A year ago

      @@AsliEngineering I understand the data pump would transfer the data from the new DB to the old one. However, NGINX is still sending the requests for shop 2 to the old DB. So there seems to be no guarantee that there wouldn't be any writes to the old DB. Or is there something I am completely missing?

    • @AsliEngineering
      @AsliEngineering  A year ago +2

      @Swagato Chatterjee writes are cut off from the old DB for some time through a knob. Covered in the video.
      You take a small downtime (not really a downtime, but a glitch with retries) and stop accepting writes for that shop altogether so that the other DB can catch up.
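A simplified sketch (hypothetical names) of the cutover knob described above: flip per-shop writes off, let the target drain the remaining replication lag, then switch routing; clients see a brief glitch with retries rather than lost writes.

```python
class ShopRouter:
    """Routes one shop's traffic; models the per-shop write knob."""

    def __init__(self):
        self.writes_enabled = True
        self.active_db = "old"

    def cutover(self, replication_lag_events: list):
        self.writes_enabled = False      # knob: stop accepting writes for this shop
        while replication_lag_events:    # target DB drains the remaining events
            replication_lag_events.pop(0)
        self.active_db = "new"           # flip routing only once lag is zero
        self.writes_enabled = True       # resume writes, now against the new DB

router = ShopRouter()
router.cutover(["evt1", "evt2"])
```

Because writes are disabled before the drain, no event can arrive at the old DB after the point the new DB is declared caught up.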

    • @swagatochatterjee7104
      @swagatochatterjee7104 A year ago

      @@AsliEngineering thanks, it makes sense now.

  • @ksansudeen
    @ksansudeen A year ago

    Hello @arpit, I see you are talking only about the data, but what about indexes? Does Shopify not use indexes on those tables? I don't think that could be the case.

    • @AsliEngineering
      @AsliEngineering  A year ago +2

      Indexes are implicitly managed by the database.

  • @tesla1772
    @tesla1772 A year ago

    What if some row gets updated after we have copied that row?

    • @AbhisarMohapatra
      @AbhisarMohapatra A year ago +5

      The binlog would take care of that

    • @dharins1636
      @dharins1636 A year ago

      The binlog is serial, hence it tracks all the changes in the order they were made :)

    • @tesla1772
      @tesla1772 A year ago

      @@dharins1636 Okay, so we have to replay that entire query on the new DB. Is that right?

    • @shantanutripathi
      @shantanutripathi A year ago

      @@tesla1772 Yes, we copy the binlog up to the current change to the new replica, and new requests are also queued up AFTER the binlog to be applied on the new replica. Because if we send new database requests (insert, delete) before applying the binlog, data integrity would be lost.
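A sketch (hypothetical structure) of the ordering guarantee in that last reply: queued new requests are applied only AFTER the binlog backlog, otherwise a newer write could be clobbered by an older binlog entry.

```python
def drain(target: dict, binlog_backlog: list, queued_requests: list) -> dict:
    """Apply the binlog backlog strictly before any newly queued requests."""
    for entry in binlog_backlog + queued_requests:  # binlog first, then new writes
        if entry["op"] == "delete":
            target.pop(entry["key"], None)
        else:  # insert or update
            target[entry["key"]] = entry["value"]
    return target

target = {}
binlog_backlog = [{"op": "insert", "key": "a", "value": "from_binlog"}]
queued_requests = [{"op": "update", "key": "a", "value": "newer_write"}]
drain(target, binlog_backlog, queued_requests)
```

Reversing the order would leave key `a` at the stale binlog value, which is exactly the data-integrity loss the comment warns about.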