I have seen so many videos, but yours is the best on all topics. Precise, and it covers almost all the interview questions.
Thanks Priyanka... I am happy that you like it
Based on the problem statement, my answer would be: first use Apache Kafka or Amazon Kinesis to handle the streaming data and dump it into AWS S3, since S3 acts as a data lake. After that, use Apache Spark to do the essential data processing, then ingest the data into AWS Redshift or any other data warehouse, using AWS Glue for ETL.
1. Common current working task?
2. What type of problems have you faced in your current task?
3. Loading data into the data lake: which challenges do you face?
4. How do you handle incremental data, by batch or stream? How much data do you process daily?
5. Scenario: sales data grouped by product category per hour. The report needs a result that is half historical and half real-time data?
6. Which tools can be used for the above scenario? Kafka, Event Hub
7. How do you do transformations in Kafka?
****Hive****
8. What are the key differences between Hive external and internal tables? Give a use case.
9. When do you use static/dynamic partitioning in a Hive table?
10. For a daily transactional table with year and date columns, we can use either one of them; is it static or dynamic partitioning?
Solution: we partition on the date column, which is dynamic. Each day's data is placed in a daily partition.
***Spark***
11. Which language do you use in Spark, and why? Describe its benefits.
12. Do you use DataFrames and Datasets? Any runtime errors in DataFrame/Dataset? Give an example.
13. Spark encoders?
14. 1 TB of data processed by Spark: how do you distribute memory across cores, driver, and executors?
15. Difference between a Scala case class and a regular class?
***DB***
16. Have you worked with any non-relational DB?
17. Given a table with three columns, show one row of data using grouping.
CREATE DATABASE big_data;
USE big_data;
CREATE TABLE user_info
(user_name NVARCHAR(255), user_age INT, user_loc NVARCHAR(255));
INSERT INTO user_info (user_name, user_age, user_loc) VALUES ('ansar', 30, 'bang'), ('ansar', 30, 'fsk');
SELECT * FROM user_info;
SELECT DISTINCT user_name, user_age FROM user_info;
-- user_loc must be aggregated (or grouped) to be selected alongside GROUP BY; MIN picks one location per group
SELECT user_name, user_age, MIN(user_loc) AS user_loc FROM user_info
GROUP BY user_name, user_age;
Some scenarios where you would drop a schema but not the data: 1) reorganizing the database structure, 2) cleaning up unused schemas, 3) rebuilding the schema from scratch.
Can you please provide AWS questions for data engineers? It would be helpful for us, thanks 🙏
Very informative. Thanks for the video. 🙏
These types of videos are extremely helpful. If you could prepare a video about Scala interview questions, that would be of great help!!
Sure Nikhil... That is already in plan.. it is just difficult to find volunteers for Mock Interview
Can anyone please help me with sample resumes for Scala? It's very hard for me to find Scala resumes on the internet.
In this scenario-based question, can we create an end-to-end pipeline using Kafka and a Power BI dashboard? For example, we connect to the database with a source connector, use ksqlDB for the business-level transformations, store the result in a Kafka topic, and then connect Power BI for the dashboard.
@dataSavvy or someone, can you check whether my thinking is right?
Good initiative, Data Savvy. This will be really helpful for those who are preparing for interviews. One suggestion: could you create a follow-up video where you explain which answers the candidate got right or wrong, and what answer the candidate should give to be better received by the interviewer? A kind of analysis of this interview.
Thanks Atanu... I will plan for that... Your suggestion is very valuable
Exactly
Point noted... If any of you can volunteer, it will help me create these kinds of videos...
@DataSavvy Hello sir,
I am interested in volunteering, but I am a fresher (2020 pass-out). If that is OK, I can volunteer.
@DataSavvy I am interested in volunteering..
Great video. But the candidate seems to have ETL Informatica developer experience, not data engineering experience. He was not able to answer the major questions on Spark. 😀 But good initiative, Data Savvy; it helps me test my knowledge of Spark and big data, and I am able to answer many of the questions.
Thank you so much for sharing this kind of video. I really understand now how interviews happen 🙏
I wish I had a senior like you so I could learn more ❤️ Nice questions ⁉️ ..... Sir, you should provide the answers too, mostly for the problematic ones ❤️❤️❤️❤️
I found this video to be really helpful sir....Please create more such videos🙏
Really helpful. thanks a lot for the video.
Hi sir..
Please do more videos like this..
Sir, can you make a Kafka community?
Thanks a lot, team, and especially Harjeet. This video boosted my confidence for interviews. Please post more interview videos.
Thanks Sai... Yes I plan to create more videos
Harjit, please let me know if you are taking any training sessions.
What would be the use case for dropping the schema instead of truncating the complete table? Is it only to restore the data in the future, or is there another major reason?
The major use case for dropping the schema, or for creating an external table, is when your storage is outside Hadoop, e.g. the client wants the data stored in S3, or the data is stored in MongoDB.
@RakeshGupta23 Thanks bro.. One more question: what will happen if we delete the external table's file folder?
This is usually done when more than one team consumes the same data, each using different tech to consume it.
@manojkumar-oc1sp You mean you are keeping the external table schema but deleting the folder and files from HDFS? In that case you won't be able to access the data: it looks like a database or table when you create it, but under the hood in HDFS it is always a file or folder.
@RakeshGupta23 Thanks, Rakesh, for your quick response
Hi Sir,
How important is it to know Snowflake for a big data engineer?
What is the function for grouping on the distinct columns and selecting a random value from the other column?
You can use first()
@ashutoshsamanta4244 Ah thanks man. I was trying to put a max filter and whatnot
In the last question, to combine the emp data on name and age and select a random location, if we use groupBy and collect_list, won't it create a list of all locations for each group of emp name and age? Wouldn't another function like max, first, etc. help in this scenario?
What is the SQL question he is asking? I could not understand it completely.
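The difference raised here can be sketched with Python's built-in sqlite3 module standing in for Spark; that substitution is an assumption made only so the example runs without a cluster. GROUP_CONCAT plays the role of collect_list (one row per group, but still holding every location), while MIN returns a single location per group, similar in effect to picking one value with first() or max(). The table and values are the ones from the SQL question in this thread.

```python
import sqlite3

# In-memory database; sqlite3 is a stand-in for Spark SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_info (user_name TEXT, user_age INTEGER, user_loc TEXT)"
)
conn.executemany(
    "INSERT INTO user_info VALUES (?, ?, ?)",
    [("ansar", 30, "bang"), ("ansar", 30, "fsk")],
)

# collect_list-style: one row per (name, age), but every location is kept.
all_locs = conn.execute(
    "SELECT user_name, user_age, GROUP_CONCAT(user_loc) "
    "FROM user_info GROUP BY user_name, user_age"
).fetchall()

# first()/max()-style: one row per (name, age) with a single location.
one_loc = conn.execute(
    "SELECT user_name, user_age, MIN(user_loc) "
    "FROM user_info GROUP BY user_name, user_age"
).fetchall()

print(all_locs)
print(one_loc)
```

Both queries return one row per group; only the second reduces the location column to a single value, which is what the question asks for.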
LAMBDA ARCHITECTURE
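Applied to the hourly sales-by-category question, a lambda architecture keeps a batch view computed over historical data and a speed view computed over recent events, and merges the two at query time. Below is a minimal pure-Python sketch of that merge. The function names (`aggregate`, `serve`), the event tuples, and the sample values are all hypothetical; in a real pipeline the batch layer would typically be a Spark job and the speed layer a streaming consumer.

```python
from collections import defaultdict

def aggregate(events):
    """Sum sales per (hour, category) key. Used by both layers in this sketch."""
    totals = defaultdict(float)
    for hour, category, amount in events:
        totals[(hour, category)] += amount
    return dict(totals)

def serve(batch_view, speed_view):
    """Serving layer: merge the batch and speed views at query time.

    Assumes the two views cover disjoint sets of events, so adding
    their totals does not double count.
    """
    merged = dict(batch_view)
    for key, amount in speed_view.items():
        merged[key] = merged.get(key, 0.0) + amount
    return merged

# Hypothetical events: (hour, product category, sale amount).
historical = [("2024-01-01T09", "toys", 100.0), ("2024-01-01T09", "books", 40.0)]
recent = [("2024-01-01T09", "toys", 25.0), ("2024-01-01T10", "toys", 10.0)]

batch_view = aggregate(historical)  # recomputed periodically over all history
speed_view = aggregate(recent)      # updated continuously from the stream
report = serve(batch_view, speed_view)
print(report)
```

The report then contains both the historical totals and the real-time increments under the same (hour, category) keys.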
Hi Harjit, you are doing a great job for the community. Is there a way I can connect with you on LinkedIn or via email?
Also, do you plan to conduct similar interviews about Spark Streaming/Kafka?
Hi Harjeet... I am Priyanka and I would like to volunteer for mock interview on BigData
Great Thanks.... Can you please create a video with answers for these questions ...It really helps... Or add your comments at the end of the video
That's a good suggestion... Let me look into this
Can you call some data analysts for mock interviews too, please?
I am finding it difficult to get volunteers... Let me explore that
@DataSavvy I would love to give mock interviews on all of your data engineering questions, in case you are looking for candidates with 12+ years of experience in PySpark, AWS, Spark SQL, Jenkins CI/CD, Glue, Kafka, Python, Hive, Athena, Presto, Bash, Airflow, Nifi.
Can we solve the sales problem using classification? I.e., we can train on our historical data with logistic regression and then predict the sales value using an evaluate function on new data.
Hi DataSavvy, please let me know when you are free so we can discuss a mock interview for me.
Hi @Data Savvy, can you plan a senior-level interview, maybe with people who have more than 16-20 years of experience?
@30.02 what would be the solution? Having count(*)>1 after group by?
Use first() on the column
The sir in the red T-shirt is making the interview questions very uninteresting, even though Spark itself is very interesting in terms of its concepts and working principles. Sorry, but you are ruining the interest in learning.