5. kpmg pyspark interview question & answer | databricks scenario based interview question & answer

  • Published Sep 5, 2024
  • #Databricks #PysparkInterviewQuestions #deltalake
    Azure Databricks #spark #pyspark #azuredatabricks #azure
    In this video, I discussed KPMG PySpark scenario based interview questions and answers.
    PySpark advanced interview questions answers?
    databricks interview questions and answers?
    kpmg pyspark interview questions and answers?
    Create dataframe:
    ======================================================
    #Employees Salary info
    data1=[(100,"Raj",None,1,"01-04-23",50000),
           (200,"Joanne",100,1,"01-04-23",4000),
           (200,"Joanne",100,1,"13-04-23",4500),
           (200,"Joanne",100,1,"14-04-23",4020)]
    schema1=["EmpId","EmpName","Mgrid","deptid","salarydt","salary"]
    df_salary=spark.createDataFrame(data1,schema1)
    display(df_salary)
    #department dataframe
    data2=[(1,"IT"),
    (2,"HR")]
    schema2=["deptid","deptname"]
    df_dept=spark.createDataFrame(data2,schema2)
    display(df_dept)
    -----------------------------------------------------------------------------------------------------------------------
    from pyspark.sql.functions import to_date
    df=df_salary.withColumn('Newsaldt',to_date('salarydt','dd-MM-yy'))
    display(df)
    ---------------------------------------------------------------------------------------------------------------------
    from pyspark.sql.functions import col
    df1=df.join(df_dept,['deptid'])
    #display(df1)
    df2=df1.alias('a').join(df1.alias('b'),col('a.Mgrid')==col('b.EmpId'),'left').select(
    col('a.deptname'),
    col('b.EmpName').alias('ManagerName'),
    col('a.EmpName'),
    col('a.Newsaldt'),
    col('a.salary')
    )
    display(df2)
    -------------------------------------------------------------------------------------------------------------------
    from pyspark.sql.functions import year,date_format
    df3=df2.groupBy('deptname','ManagerName','EmpName',year('Newsaldt').alias('Year'),date_format('Newsaldt','MMMM').alias('Month')).sum('salary')
    display(df3)
    ============================================================
    Learn PySpark, an interface for Apache Spark in Python. PySpark is often used for large-scale data processing and machine learning.
    Azure data factory tutorial playlist: Azure Data factory (adf)
    ADF interview question & answer
    1. pyspark introduction | pyspark tutorial for beginners | pyspark tutorial for data engineers
    2. what is dataframe in pyspark | dataframe in azure databricks | pyspark tutorial for data engineer
    3. How to read write csv file in PySpark | Databricks Tutorial | pyspark tutorial for data engineer
    4. Different types of write modes in Dataframe using PySpark | pyspark tutorial for data engineers
    5. read data from parquet file in pyspark | write data to parquet file in pyspark
    6. datatypes in PySpark | pyspark data types | pyspark tutorial for beginners
    7. how to define the schema in pyspark | structtype & structfield in pyspark | Pyspark tutorial
    8. how to read CSV file using PySpark | How to read csv file with schema option in pyspark
    9. read json file in pyspark | read nested json file in pyspark | read multiline json file
    10. add, modify, rename and drop columns in dataframe | withcolumn and withcolumnrename in pyspark
    11. filter in pyspark | how to filter dataframe using like operator | like in pyspark
    12. startswith in pyspark | endswith in pyspark | contains in pyspark | pyspark tutorial
    13. isin in pyspark and not isin in pyspark | in and not in in pyspark | pyspark tutorial
    14. select in PySpark | alias in pyspark | azure Databricks #spark #pyspark #azuredatabricks #azure
    15. when in pyspark | otherwise in pyspark | alias in pyspark | case statement in pyspark
    16. Null handling in pySpark DataFrame | isNull function in pyspark | isNotNull function in pyspark
    17. fill() & fillna() functions in PySpark | how to replace null values in pyspark | Azure Databrick
    18. GroupBy function in PySpark | agg function in pyspark | aggregate function in pyspark
    19. count function in pyspark | countDistinct function in pyspark | pyspark tutorial for beginners
    20. orderBy in pyspark | sort in pyspark | difference between orderby and sort in pyspark
    21. distinct and dropduplicates in pyspark | how to remove duplicate in pyspark | pyspark tutorial

Comments • 15

  • @roshnisingh7661 7 months ago +1

    Thanks SS Unitech, your videos are very easy to understand 😊

  • @pranaykiran1780 11 days ago

    The second argument of the to_date function should describe the format in which the first argument is stored. Please correct it.
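    The commenter's point generalizes: a parse pattern must describe how the input string is laid out. Spark's to_date follows the same rule as Python's datetime.strptime, sketched here with the sample date "01-04-23" from the video ('dd-MM-yy' in Spark pattern letters, "%d-%m-%y" in strptime):

    ```python
    from datetime import datetime

    # "01-04-23" is day-month-year, so the pattern must describe that order.
    good = datetime.strptime("01-04-23", "%d-%m-%y")  # analogous to Spark's 'dd-MM-yy'
    print(good.date())  # 2023-04-01

    # A pattern that merely *parses* can still be wrong: reading the same
    # string as year-month-day silently yields a different date.
    bad = datetime.strptime("01-04-23", "%y-%m-%d")
    print(bad.date())  # 2001-04-23
    ```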

  • @amritasingh1769 7 months ago +1

    Super video

  • @nikhilasrirama3829 7 months ago +1

    Thanks susheel

    • @ssunitech6890 7 months ago

      Thanks.
      Keep learning and sharing

  • @barmalini 1 month ago

    Thank you sir, this is very useful.
    I did not quite understand the purpose of joining a dataframe with itself, this part: df1.alias('a').join(df1.alias('b')
    Could someone explain?

    • @rajakumaranr2998 1 month ago

      To find the MgrName.
      Note that MgrId is the EmpId of another employee, so the manager's name is found by matching MgrId against EmpId. For example, if MgrId is 100, then MgrName is "Raj".
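      The self-join can be sketched in plain Python to show why joining the dataframe with itself works: one alias plays the "employee" side, the other the "manager" side, matched on Mgrid == EmpId (data below mirrors the example rows):

      ```python
      employees = [
          {"EmpId": 100, "EmpName": "Raj",    "Mgrid": None},
          {"EmpId": 200, "EmpName": "Joanne", "Mgrid": 100},
      ]

      # Index the "manager side" of the join by EmpId, then look up each
      # employee's Mgrid in it -- exactly what the a/b aliased left join does.
      by_id = {e["EmpId"]: e["EmpName"] for e in employees}
      result = [(e["EmpName"], by_id.get(e["Mgrid"])) for e in employees]
      print(result)  # [('Raj', None), ('Joanne', 'Raj')]
      ```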

  • @a2zhi976 6 months ago +1

    How do you handle a variable having different values in dev and prod? Can you please explain?

    • @ssunitech6890 6 months ago

      You can create a table (or a file) in each environment that stores the dev, QA and prod values. In the ADF pipeline, use a Lookup activity to read this table and store the result in a variable.
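      Outside ADF, the same pattern is just a lookup keyed by environment. A minimal sketch (the setting names and values here are hypothetical, not from the video):

      ```python
      import os

      # One entry per environment, as the lookup table would hold.
      CONFIG = {
          "dev":  {"storage_account": "devstore",  "batch_size": 10},
          "prod": {"storage_account": "prodstore", "batch_size": 1000},
      }

      # The environment name is injected from outside (pipeline parameter,
      # cluster env var, etc.); the code itself stays environment-agnostic.
      env = os.environ.get("ENV", "dev")
      settings = CONFIG[env]
      print(settings["storage_account"])
      ```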

  • @sivaani37 7 months ago +1

    What about the 7th highest salary ?

    • @ssunitech6890 7 months ago +1

      I already recorded a video on how to get the top N salaries in each department; please watch this video:
      th-cam.com/video/HnN-J8_u2Tc/w-d-xo.html

    • @sivaani37 7 months ago

      But the link you shared is based on a window. What if I have to find the seventh highest salary irrespective of the dept?

    • @rajakumaranr2998 1 month ago

      @sivaani37 you can use row_number without partitioning (note the descending sort, since we want the highest):
      from pyspark.sql.functions import col, row_number
      from pyspark.sql.window import Window
      w = Window.orderBy(col("salary").desc())
      df1 = df.withColumn("row_num", row_number().over(w))
      df1.filter(col("row_num") == 7).show()
      In that video he used only 4 records, so it's not possible to show the 7th highest salary.
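      The "n-th highest overall" logic can be checked in plain Python: sort descending and take the n-th row, exactly what row_number over an unpartitioned descending window produces (sample salaries below are made up for illustration):

      ```python
      salaries = [50000, 4000, 4500, 4020, 7000, 8000, 9000, 6000]

      # Equivalent of row_number() over Window.orderBy(salary desc):
      # position i in the sorted list gets row_num i + 1.
      ranked = sorted(salaries, reverse=True)
      seventh = ranked[6] if len(ranked) >= 7 else None
      print(seventh)  # 4020
      ```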