With the latest version of PySpark, null values are automatically excluded from the count. There are two scenarios: if you do count(columnName), the nulls are removed and only non-null rows are counted, but count(*) includes the null rows as well. Anyway, all the videos are super helpful, thanks!
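The two-scenario behaviour described above is standard SQL, not PySpark-specific, so it can be sketched without a Spark session using Python's built-in sqlite3 module (the table name and sample data here are made up to mirror the example in the comments below):

```python
import sqlite3

# COUNT(*) counts every row; COUNT(col) skips NULLs -- same semantics
# as Spark SQL's count(*) vs count(columnName).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "A", 23), (2, "B", None), (3, "C", 56), (4, None, None), (5, None, None)],
)

total, names, ages = conn.execute(
    "SELECT COUNT(*), COUNT(name), COUNT(age) FROM people"
).fetchone()
print(total, names, ages)  # 5 3 2
```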
Sagar sir... my solution in Spark SQL:
1) df_1 = spark.read.csv("dbfs:/FileStore/tables/Spark_Practise_1.csv", header=True)
2) df_1.createOrReplaceTempView("Sujoy_1")
3) %sql SELECT SUM(CASE WHEN ID LIKE 'null' THEN 1 ELSE 0 END) AS ID,
SUM(CASE WHEN Name LIKE 'null' THEN 1 ELSE 0 END) AS Name,
SUM(CASE WHEN Age LIKE 'null' THEN 1 ELSE 0 END) AS Age
FROM Sujoy_1
(Note: LIKE 'null' works here because the CSV was read without a nullValue option, so missing entries arrive as the literal string 'null' rather than real NULLs.)
from pyspark.sql.functions import col

df2 = df1.columns
column_counts = {}
for nums in df2:
    df3 = df1.filter(col(nums).isNull()).count()
    column_counts[nums] = df3
print(column_counts)
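For anyone without a Spark session handy, the same per-column loop can be sketched in plain Python over a made-up list of dicts that mirrors the sample data further down:

```python
# Pure-Python analogue of the loop above: for each column, count the
# rows where the value is None (Python's stand-in for null).
rows = [
    {"ID": 1, "Name": "A", "Age": 23},
    {"ID": 2, "Name": "B", "Age": None},
    {"ID": 3, "Name": "C", "Age": 56},
    {"ID": 4, "Name": None, "Age": None},
    {"ID": 5, "Name": None, "Age": None},
]

column_counts = {}
for column in rows[0]:
    column_counts[column] = sum(1 for r in rows if r[column] is None)

print(column_counts)  # {'ID': 0, 'Name': 2, 'Age': 3}
```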
nice
df = spark.read.option("nullValue", "null").csv("dbfs:/FileStore/testing.csv", header=True)
df.createOrReplaceTempView("temp")
display(spark.sql("select count(*) - count(id) as nullcount_for_id, count(*) - count(name) as nullcount_for_name, count(*) - count(age) as nullcount_for_age from temp"))
from pyspark.sql.functions import col, sum

df1 = df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])
df1.show()
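The cast-and-sum idea translates directly to plain Python, where booleans already behave as 0/1 when summed (the ages list here is made up to match the sample data):

```python
# Summing `v is None` counts the Nones, just like summing
# isNull() cast to int does in the Spark snippet above.
ages = [23, None, 56, None, None]
null_age_count = sum(v is None for v in ages)
print(null_age_count)  # 3
```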
data = [
(1, "A", 23),
(2, "B", None),
(3, "C", 56),
(4, None, None),
(5, None, None)
]
from pyspark.sql.functions import count

data_schema = ['ID', 'Name', 'Age']
df = spark.createDataFrame(data, data_schema)
df1 = df.select([(df.count() - count(i)).alias(i) for i in df.columns])
df1.show()
+---+----+---+
| ID|Name|Age|
+---+----+---+
| 0| 2| 3|
+---+----+---+
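As a quick sanity check on that output, the same total-minus-non-null subtraction can be reproduced from the sample data in plain Python, with no Spark needed:

```python
# Sample data from the comment above; None stands in for null.
data = [
    (1, "A", 23),
    (2, "B", None),
    (3, "C", 56),
    (4, None, None),
    (5, None, None),
]
columns = ["ID", "Name", "Age"]

# total rows minus non-null rows per column, mirroring df.count() - count(col)
null_counts = {
    name: len(data) - sum(1 for row in data if row[i] is not None)
    for i, name in enumerate(columns)
}
print(null_counts)  # {'ID': 0, 'Name': 2, 'Age': 3}
```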
Thank you