The same architecture is used in the company I work for too, and I was always wondering why that design was chosen. This was a really great video which started from the basics and went on to the ideal state. Thanks!
Thanks Arpit, this helps in drawing parallels to other systems as well. And it's so nice to see that the fundamentals are quite the same in handling large-scale infra.
Such a great explanation, learned a lot from you, Arpit sir 😎, keep going 🔥
Similar to what we built at Oracle... Oracle Knowledge AI Search has a similar kind of architecture. We have also introduced vector search in Elasticsearch.
Love these stories of great engineering. Please bring more of these. Thanks a lot 🙂
Thanks Arpit! This was a great video! I had a question.
In the backfill process, how does the orchestrator know how many workers to spawn? How do you monitor and calculate the amount of data yet to be processed in HDFS?
If Kafka were used instead of HDFS, I know there's a way to calculate the consumer lag, which can be used to trigger the orchestrator's rules.
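For anyone curious, a minimal sketch of that consumer-lag signal using kafka-python; the topic, group, and the scaling rule at the end are all made-up assumptions:

```python
# Sketch: estimate total consumer lag, a signal an orchestrator could use
# to decide how many workers to spawn. Topic/group names are hypothetical.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="backfill-workers",
    enable_auto_commit=False,
)

partitions = [TopicPartition("tweets", p)
              for p in consumer.partitions_for_topic("tweets")]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0     # last committed offset
    total_lag += end_offsets[tp] - committed

# Arbitrary example rule: one worker per million pending messages.
workers_needed = max(1, total_lag // 1_000_000)
print(f"lag={total_lag}, workers={workers_needed}")
```

With HDFS there is no built-in lag metric, so the equivalent signal would have to come from counting the files or bytes not yet processed.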
Thanks for the great explanation. I have a basic question. What is backfill, and what is its job here?
Is it about parsing each tweet and doing analysis?
Backfilling updates the index with the latest data crawled from various sources.
But I guess if the read operation is an I/O-intensive one, like fetching a yearly orders report from ES, it shouldn't be a synchronous operation. Rather, it should follow the write flow you described, i.e. send the report's meta details as an event to a Kafka topic, and workers can later mail the reports asynchronously.
Here too, if the report is big, how we fetch it can be discussed.
But why use ES for analytical queries? Better to directly run a Spark job on S3/HDFS and refrain from using Elasticsearch for such use cases.
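For what it's worth, a rough sketch of the async-report idea above; the topic name and event fields are hypothetical:

```python
# Sketch: the read API only enqueues the report's metadata; a worker
# consumes the event, runs the heavy query, and mails the result later.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The synchronous API call returns right after this acknowledgement...
producer.send("report-requests", {
    "report_type": "yearly_orders",
    "year": 2023,
    "user_email": "user@example.com",
})
producer.flush()
# ...while a consumer of "report-requests" does the expensive ES/Spark
# work off the request path and emails the finished report.
```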
Hi Arpit, what difference does it make when we ingest using reducers versus the workers approach you explained? We can control both the number of reducers and the ingestion rate in an MR job as well. Please let me know what I am missing here.
Thanks Arpit for such informative content.
Very helpful! Thanks a lot sir.
Great video dude.
I wanted to ask you about your recording setup. Are you using OBS and screen-mirroring your iPad or something? Please mention any hardware/software you use for these videos.
OBS plus iPad. Nothing more.
@@AsliEngineering I see. So you screen-mirror the iPad and run OBS on the MacBook? And is the app Notability? Btw your handwriting is awesome!!
Why was HDFS used here? A simple queue (like SQS), or Kafka if Twitter wanted a retry mechanism, would have achieved the same.
Staging storage for subsequent consumption.
@@AsliEngineering Thanks for the reply! I wasn't expecting to get a reply here.
When backfill is not required, Twitter puts the data into Elasticsearch directly, and for backfill they put it into HDFS. I think the reason would be the storage constraints of Kafka or SQS; S3 and HDFS do not have those.
Hey Arpit, I am confused. Initially you said every team had their own cluster. Is the proxy a common service for all the clients of the different clusters, or will each cluster have its own proxy service?
A hybrid setup is a possibility.
There may be services that have an isolated proxy, while a few may share one.
This was a really crisp one.
Any HighLevel folks watching this?
It would be very similar to our eventing (& mongo-indexing) service, and the backfill is basically our snapshot service.
In database systems we can segregate writes and reads across different DBs and eventually make the read node consistent with the write node's data.
I don't know much about ES, but was that not an option in ES?
Hi Arpit, thanks for your videos. Sorry if my question is stupid. I have seen this video and your BookMyShow video too; in both, scaling always happens on the write path only. What about when a huge amount of traffic reads a particular API? How is API stability ensured? Kindly revert, please...
Replicas and Caching.
@@AsliEngineering Thank you
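To expand on "Replicas and Caching" a little, here is a toy cache-aside sketch; the key scheme and the fetch_from_replica helper are purely illustrative:

```python
# Sketch: cache-aside reads. Hot API responses are served from Redis;
# misses fall through to a read replica and populate the cache.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def fetch_from_replica(api_path: str) -> dict:
    # Placeholder for a query against a read replica.
    return {"path": api_path, "data": "..."}

def read_api(api_path: str, ttl_seconds: int = 60) -> dict:
    cached = cache.get(api_path)
    if cached is not None:
        return json.loads(cached)           # cache hit: replica untouched
    result = fetch_from_replica(api_path)   # cache miss: one replica read
    cache.setex(api_path, ttl_seconds, json.dumps(result))
    return result
```

Read traffic then scales with the cache hit rate plus the number of replicas, independently of the write path.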
Very nice!
Thanks Arpit for making this video!
I had a follow-up, curious question:
- What happens when there is too much data on Kafka due to backpressure while indexing?
- Can MapReduce create an Elasticsearch-understandable file which can be used for bulk insertion? Since in the current architecture the worker will again be making 1:1 calls.
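On the bulk-insertion question above: Elasticsearch's _bulk API does accept a newline-delimited (NDJSON) payload of action-plus-document pairs, so a reducer could in principle emit that format for a worker to ship in one call. A minimal sketch with made-up documents and index name:

```python
# Sketch: build an Elasticsearch _bulk payload so a whole batch is
# indexed in one request instead of 1:1 calls per document.
import json
import requests

docs = [
    {"id": "1", "text": "first tweet"},
    {"id": "2", "text": "second tweet"},
]

# _bulk format: an action/metadata line, then the document source line,
# one JSON object per line, with a terminating newline.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "tweets", "_id": doc["id"]}}))
    lines.append(json.dumps({"text": doc["text"]}))
payload = "\n".join(lines) + "\n"

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()
print(resp.json()["errors"])  # False when every item indexed cleanly
```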
Since the write happens asynchronously, that particular tweet wouldn't be reflected in his tweets immediately, right? So how will the user immediately see his tweet?
How likely is it that the user will search for his/her own tweet immediately after posting it?
@@AsliEngineering How do we handle such a use case, if there is one?
@@GandesiriSanith search systems are never designed to be strongly consistent.
But if you want strong consistency then your API will have to synchronously write to the DB and to the search engine. A massive overkill tbh.
@@AsliEngineering Yeah got it.
Hi sir, I'm a 1st-year student. Should I buy your system design course?
Not at all. It is meant for folks with more than 2 years of work experience.
Why doesn't the API server write directly to Kafka instead of the proxy?
Because it was a system rewrite and they did not want to change any upstream.
Also, to add on to Arpit: I would assume the proxy would still have authority over the rate of requests, plus some kind of auth. In case of a strange burst, we could avoid pushing a lot of unwanted data to Kafka.
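A toy token-bucket sketch of the kind of rate limiting such a proxy could enforce before anything reaches Kafka; all the numbers here are arbitrary:

```python
# Sketch: token-bucket rate limiter. A burst beyond capacity gets shed
# at the proxy instead of flooding Kafka with unwanted data.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # request rejected, e.g. with HTTP 429

bucket = TokenBucket(rate=100, capacity=200)  # ~100 req/s, bursts to 200
if not bucket.allow():
    print("429 Too Many Requests")
```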
What if Kafka gets too many messages? Will it drop some?
Back pressure.
No, the beauty of Kafka is its append-only log: it will add the message and you just have to consume it. You can then configure the topics to delete the "older" data based on the configuration (bytes or time or both). Of course there are compacted topics, but that's another way of "reducing" the data space (and it has its own problems :) )
@@dharins1636 Thank you
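For reference, a hedged sketch of the retention settings mentioned above, using kafka-python's admin client; the topic name and values are made up, and note that alter_configs replaces the topic's existing config wholesale:

```python
# Sketch: configure a topic to delete "older" data by time and/or size.
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "tweets",
        configs={
            "retention.ms": "604800000",        # drop segments older than 7 days
            "retention.bytes": "107374182400",  # ...or once a partition exceeds 100 GiB
            "cleanup.policy": "delete",         # "compact" keeps only the latest per key
        },
    )
])
```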
Are the worker nodes Spark jobs which stream from Kafka and write to Elasticsearch at a particular window or interval? @arpit @asliengineering
Could be. The implementation can be anything: raw consumers, or Spark jobs.
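If the workers were Spark jobs, a Structured Streaming version might look roughly like this; it assumes the spark-sql-kafka and elasticsearch-hadoop connector packages are on the classpath, and the topic and index names are made up:

```python
# Sketch: micro-batch stream from Kafka into Elasticsearch at a fixed
# trigger interval (the "window or interval" from the question above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-es").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "tweets")               # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS doc")  # raw message payload
)

query = (
    stream.writeStream
    .format("es")                                # elasticsearch-hadoop sink
    .option("checkpointLocation", "/tmp/es-ckpt")
    .option("es.nodes", "localhost:9200")
    .trigger(processingTime="30 seconds")        # flush every 30s
    .start("tweets")                             # target index
)
query.awaitTermination()
```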