Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.
my approach : with cte as( select *,avg(rating)over(partition by hotel )as avg from hotel_ratings), cte2 as (select *, abs(avg-rating) as abs from cte), cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2) select hotel,year,rating from cte3 where excludeabs order by hotel desc
Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance: WITH cte1 AS ( SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg FROM hotel_ratings), cte2 AS ( SELECT *, ABS (rating - r_avg) AS diff FROM cte1 ) SELECT hotel, year, rating FROM cte2 WHERE diff
Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example: If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist. Right solution: Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers. If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate That being said thank you for your continuous efforts in helping us build strong SQL foundations.
@@gunasekhar8440you are absolutely right. Iqr rule is applied if the data is not gaussian/normal. Iqr method can also be applied here: we need to calculate the 25th (Q1) and 75th percentile (Q3) of the data (there are window functions for this which will be different in different rdbms i guess). IQR=75th percentile value - 25th percentile value (Q3-Q1). Then outliers will be either Q1-1.5*IQR or Q3+1.5*IQR. This logic can also be used in SQL to identify outliers. Hope this helps
@@gunasekhar8440 yes but in a general case iqr rule has to be applied. In these case there are only 4-6 records per hotel so the concept of percentile wouldn't be accurate for such low data points. But if the number of records were more, it would so you are right
with cte as(select *,rank() over(order by rating asc)as max_rnk,rank() over(order by rating desc)as min_rnk from hotel_ratings) select hotel,year,rating from cte where max_rnk 8 and max_rnk1 and min_rnk 1 order by hotel donot know whether it will be dynamic or not...
This is my alternate solution: select * from hotel_ratings where rating < (select avg(rating) from hotel_ratings where hotel='Radisson Blu') and hotel='Radisson Blu' union select * from hotel_ratings where rating > (select avg(rating) from hotel_ratings where hotel='InterContinental') and hotel='InterContinental' order by hotel desc
For a large enough data set , there could be multiple values above the mean value. So, the data point above the mean doesn't necessarily mean the highest dataset. The average height of a male is 5'8 (let's say) , doesn't necessarily signify that 5'9 is the tallest a man could ever be. It could be 6'5 , 7'3 or whatever.
with cte as ( select *,avg(rating)over(partition by hotel ) as avg_rating,round(STDDEV(rating)over(partition by hotel),4) as std from hotel_ratings), cte1 as (select *,abs((rating - avg_rating)/std) as out_detect from cte) select hotel,year,rating from cte1 where out_detect < 1.5 order by hotel desc,year
@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??
with cte as (select hotel , year, rating, average_rating, average_rating_difference, max(average_rating_difference) over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference from (select *, AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating, abs(rating - AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference from hotel_ratings)x ) select hotel, year, rating from cte where average_rating_difference max_difference
This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause. select avg(rating) over(partition by hotel) from hotel_ratings This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.
@TFQ Generalized solution that will work for many outliers WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS diff_rating , cast(avg_rating + std_dev as decimal(9,1)) as positive , cast(avg_rating - std_dev as decimal(9,1)) as negative from cte
),cte2 as ( Select *, CASE WHEN rating =negative THEN 1 else 0 end as flag from cte1 ) select hotel,year,rating from cte2 where flag 0 order by hotel desc, year asc
with cte as ( SELECT *, round(avg(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following ),2) as avg_rating FROM hotel), cte1 as (select *, abs(avg_rating-rating) as diif from cte), cte2 as (select *, dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc ) as rn from cte1) select hotel,year,rating from cte2 where rn>1
with cte as( select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings ), cte2 as (select *, abs(rating-rating2) as did from cte), cte3 as (select *, rank() over(partition by hotel order by did desc) as a from cte2) select hotel,year,rating from cte3 where a>1 order by hotel desc ,year;
Oracle SQL | Solution that will work for many outliers. WITH cte AS ( SELECT hotel, year, rating, AVG(rating) OVER (PARTITION BY hotel) AS average, rating - AVG(rating) OVER (PARTITION BY hotel) AS difference FROM hotel_ratings ), result_table AS ( SELECT hotel, year, rating, difference, AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff, CASE WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel) THEN 'remove' ELSE 'keep' END AS flag FROM cte ) SELECT hotel, year, rating FROM result_table WHERE flag = 'keep' ORDER BY hotel DESC, year;
WITH AVG_RATING_CTE AS (SELECT *, ROUND(AVG(RATING)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),1) AS AVG_RATING FROM HOTEL_RATINGS) ,DIFF_CTE AS(SELECT HOTEL, YEAR, RATING, AVG_RATING ,ABS(AVG_RATING - RATING) AS DIFF FROM AVG_RATING_CTE) ,MAX_CTE AS( SELECT HOTEL, YEAR, RATING, DIFF ,MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX FROM DIFF_CTE ) SELECT HOTEL, YEAR, RATING FROM MAX_CTE WHERE DIFF MAX ORDER BY HOTEL,YEAR
SQL server - WITH CTE AS (SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings), CTE_FINAL AS (SELECT *,CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE) SELECT hotel,year,rating FROM CTE_FINAL WHERE flag=0
with cte as( select *,avg(rating) over(partition by hotel ) avg_rating,abs(rating - avg(rating) over(partition by hotel )) diff_rating from hotel_ratings), cte2 as (SELECT *,max(diff_rating) over(partition by hotel) max_diff from cte ) select hotel,year,rating from cte2 where diff_ratingmax_diff order by hotel desc, year ;
with cte as( select hotel,avg(rating) as rat from hotel_ratings group by hotel) ,cte2 as(select h.hotel,h.year,rating,rank() over (partition by hotel order by abs(rating -rat) desc) as r from cte c inner join hotel_ratings h on c.hotel=h.hotel) select * from cte2 where r>1;
select hotel,year,rating from hotel_ratings where hotel = 'Radisson Blu' and year in('2021','2022','2023') union select hotel,year,rating from hotel_ratings where hotel = 'InterContinental' and year in('2020','2021','2023') order by hotel desc
select hotel,year, rating, delta, max_delta from ( select hotel,year, rating, delta, max(delta) over (partition by hotel ) max_delta from ( select hotel,year, rating, abs(rating-av) as delta from ( select hotel, year, rating, round(avg(rating) over (partition by hotel),2) av from hotel_ratings))) where delta max_delta order by hotel desc, year ;
WITH cte AS ( SELECT hr.*, ( SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.Hotel = hr.Hotel ) AS avrg FROM hotel_ratings hr ), cte1 AS ( SELECT Hotel, Year, rating, RANK() OVER ( PARTITION BY Hotel ORDER BY ABS(rating - avrg), Year ) AS rn FROM cte ) SELECT Hotel, Year, rating FROM cte1 WHERE rn < ( SELECT MAX(rn) FROM cte1 ) ORDER BY Hotel DESC, Year;
My solution: Correct me if am wrong WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS flag1 from cte
),cte2 as ( select *,max(flag1) over(partition by hotel order by hotel) as maxFlag from cte1 ) (select Hotel,year,rating from cte2 where flag1 maxFlag) ORDER BY hotel DESC
my solution : WITH rating_differences AS ( SELECT hotel,year,rating, ABS(rating - (SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.hotel = hr1.hotel)) AS rating_difference FROM hotel_ratings hr1), max_deviation AS ( SELECT hotel, MAX(rating_difference) AS max_rating_difference FROM rating_differences GROUP BY hotel) select fl.hotel, fl.year, fl.rating from(SELECT rd.hotel, rd.year, rd.rating, rd.rating_difference, CASE WHEN rd.rating_difference = md.max_rating_difference AND rd.hotel = md.hotel THEN 1 ELSE 0 END AS flag FROM rating_differences rd JOIN max_deviation md ON rd.hotel = md.hotel) fl where fl.flag != 1 thank you
Not a full proof solution as one partition might have more than one distinct outliers..Not sure how rank function will help there
Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.
Could you share the answer, please?
Thank you so much learnt new things ❤
my approach :
with cte as( select *,avg(rating)over(partition by hotel )as avg
from hotel_ratings),
cte2 as (select *, abs(avg-rating) as abs from cte),
cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2)
select hotel,year,rating from cte3
where excludeabs
order by hotel desc
Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance:
WITH cte1 AS (
SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg
FROM hotel_ratings),
cte2 AS (
SELECT *, ABS (rating - r_avg) AS diff
FROM cte1
)
SELECT hotel, year, rating
FROM cte2
WHERE diff
Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example:
If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist.
Right solution:
Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers.
If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate
That being said thank you for your continuous efforts in helping us build strong SQL foundations.
Hi Tushar ji, simply we have two types of famous methods in stats.
1. Z-score
2. IQR
Correct me if I'm wrong
@@gunasekhar8440you are absolutely right. Iqr rule is applied if the data is not gaussian/normal. Iqr method can also be applied here: we need to calculate the 25th (Q1) and 75th percentile (Q3) of the data (there are window functions for this which will be different in different rdbms i guess). IQR=75th percentile value - 25th percentile value (Q3-Q1). Then outliers will be either Q1-1.5*IQR or Q3+1.5*IQR.
This logic can also be used in SQL to identify outliers. Hope this helps
Yes I also felt the same. Approach is not correct, ranking will not suit for resolving the outliers issue. We need use IQR to find outliers I think
@@rakeshkumarreddymudda but we have only less number of records. Then if we do that procedure it doesn't make any sense
@@gunasekhar8440 yes but in a general case iqr rule has to be applied. In these case there are only 4-6 records per hotel so the concept of percentile wouldn't be accurate for such low data points. But if the number of records were more, it would so you are right
with cte as(select *,rank() over(order by rating asc)as max_rnk,rank() over(order by rating desc)as min_rnk from hotel_ratings)
select hotel,year,rating from cte where max_rnk 8 and max_rnk1 and min_rnk 1 order by hotel
donot know whether it will be dynamic or not...
This is my alternate solution:
select * from hotel_ratings where rating <
(select avg(rating) from hotel_ratings where hotel='Radisson Blu')
and hotel='Radisson Blu'
union
select * from hotel_ratings where rating >
(select avg(rating) from hotel_ratings where hotel='InterContinental')
and hotel='InterContinental' order by hotel desc
For a large enough data set , there could be multiple values above the mean value. So, the data point above the mean doesn't necessarily mean the highest dataset. The average height of a male is 5'8 (let's say) , doesn't necessarily signify that 5'9 is the tallest a man could ever be. It could be 6'5 , 7'3 or whatever.
Yes that is the issue with this solution
Thank You Sir!
Thank you so much
thank you,
with cte as
( select *,avg(rating)over(partition by hotel ) as avg_rating,round(STDDEV(rating)over(partition by hotel),4) as std
from hotel_ratings),
cte1 as
(select *,abs((rating - avg_rating)/std) as out_detect
from cte)
select hotel,year,rating from cte1 where out_detect < 1.5
order by hotel desc,year
Perfect!
why 1.5?
@@TheRaju991 it's just bcz of this problem , else it would have been 3
@@nileshpandey8247 Makes sense. Thanks!
@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??
sir, please put a video on different star pattern forming with mysql .
with cte as
(select hotel , year, rating, average_rating, average_rating_difference, max(average_rating_difference)
over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference
from (select *,
AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating,
abs(rating - AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference
from hotel_ratings)x )
select hotel, year, rating
from cte
where average_rating_difference max_difference
This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause.
select avg(rating) over(partition by hotel)
from hotel_ratings
This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.
He put Order by within partition to just maintain same order of records as final.
@TFQ Generalized solution that will work for many outliers
WITH cte AS (
SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev
FROM hotel_ratings
),
cte1 AS (
SELECT * ,ABS(rating - avg_rating) AS diff_rating , cast(avg_rating + std_dev as decimal(9,1)) as positive , cast(avg_rating - std_dev as decimal(9,1)) as negative from cte
),cte2
as
(
Select *, CASE WHEN rating =negative THEN 1 else 0 end as flag from cte1
)
select hotel,year,rating from cte2
where flag 0
order by hotel desc, year asc
too complex, you can simplify this code
Ok let me know your solution
with cte as (
SELECT *,
round(avg(rating) over(partition by hotel order by year
range between unbounded preceding and unbounded following ),2) as avg_rating
FROM hotel),
cte1 as (select *,
abs(avg_rating-rating) as diif
from cte),
cte2 as (select *,
dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc
) as rn
from cte1)
select hotel,year,rating from cte2
where rn>1
Bro , can you please explain in oracle
with cte as(
select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings
),
cte2 as
(select *, abs(rating-rating2) as did from cte),
cte3 as
(select *, rank() over(partition by hotel order by did desc) as a from cte2)
select hotel,year,rating from cte3
where a>1
order by hotel desc ,year;
Instead of ranking, is there any other logic to find outlier step at the last as step?
yes. you can also find the outlier using 5 number summary method.
you can find max avg value for hotel and get less value of max avg value
🔥❤
Oracle SQL | Solution that will work for many outliers.
WITH cte AS (
SELECT
hotel,
year,
rating,
AVG(rating) OVER (PARTITION BY hotel) AS average,
rating - AVG(rating) OVER (PARTITION BY hotel) AS difference
FROM
hotel_ratings
),
result_table AS (
SELECT
hotel,
year,
rating,
difference,
AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff,
CASE
WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel)
THEN 'remove'
ELSE 'keep'
END AS flag
FROM
cte
)
SELECT
hotel,
year,
rating
FROM
result_table
WHERE
flag = 'keep'
ORDER BY
hotel DESC,
year;
WITH AVG_RATING_CTE AS
(SELECT *,
ROUND(AVG(RATING)OVER(PARTITION BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),1) AS AVG_RATING
FROM HOTEL_RATINGS)
,DIFF_CTE AS(SELECT
HOTEL, YEAR, RATING, AVG_RATING ,ABS(AVG_RATING - RATING) AS DIFF
FROM AVG_RATING_CTE)
,MAX_CTE AS(
SELECT
HOTEL, YEAR, RATING, DIFF ,MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX
FROM DIFF_CTE
)
SELECT HOTEL, YEAR, RATING
FROM MAX_CTE
WHERE DIFF MAX
ORDER BY HOTEL,YEAR
i am getting avg rating of radisson blu as 3.83
SQL server -
WITH CTE AS
(SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings),
CTE_FINAL AS
(SELECT *,CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE)
SELECT hotel,year,rating FROM CTE_FINAL WHERE flag=0
with cte as(
select
*,avg(rating) over(partition by hotel ) avg_rating,abs(rating - avg(rating) over(partition by hotel )) diff_rating
from hotel_ratings),
cte2 as (SELECT *,max(diff_rating) over(partition by hotel) max_diff from cte
)
select hotel,year,rating from cte2
where diff_ratingmax_diff
order by hotel desc, year ;
with cte as(
select hotel,avg(rating) as rat
from hotel_ratings
group by hotel)
,cte2 as(select h.hotel,h.year,rating,rank() over (partition by hotel order by abs(rating -rat) desc) as r
from cte c
inner join hotel_ratings h
on c.hotel=h.hotel)
select * from cte2 where r>1;
select hotel,year,rating
from hotel_ratings
where hotel = 'Radisson Blu' and year in('2021','2022','2023')
union
select hotel,year,rating
from hotel_ratings
where hotel = 'InterContinental' and year in('2020','2021','2023')
order by hotel desc
this is a static way, not applicable for different cases. but what if the ratings changes? will you change the year names again? just think
select hotel,year, rating, delta, max_delta
from (
select hotel,year, rating, delta, max(delta) over (partition by hotel ) max_delta
from (
select hotel,year, rating, abs(rating-av) as delta
from (
select hotel, year, rating,
round(avg(rating) over (partition by hotel),2) av
from hotel_ratings)))
where delta max_delta
order by hotel desc, year
;
WITH cte AS (
SELECT
hr.*,
(
SELECT ROUND(AVG(rating), 2)
FROM hotel_ratings hr2
WHERE hr2.Hotel = hr.Hotel
) AS avrg
FROM
hotel_ratings hr
),
cte1 AS (
SELECT
Hotel,
Year,
rating,
RANK() OVER (
PARTITION BY Hotel
ORDER BY ABS(rating - avrg), Year
) AS rn
FROM
cte
)
SELECT
Hotel,
Year,
rating
FROM
cte1
WHERE
rn < (
SELECT MAX(rn)
FROM cte1
)
ORDER BY
Hotel DESC,
Year;
Thompson Edward Johnson Gary Johnson Karen
My solution:
Correct me if am wrong
WITH cte AS (
SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating
FROM hotel_ratings
),
cte1 AS (
SELECT * ,ABS(rating - avg_rating) AS flag1 from cte
),cte2 as
(
select *,max(flag1) over(partition by hotel order by hotel) as maxFlag from cte1
)
(select Hotel,year,rating from cte2
where flag1 maxFlag)
ORDER BY hotel DESC
my solution :
WITH rating_differences AS (
SELECT hotel,year,rating, ABS(rating - (SELECT ROUND(AVG(rating), 2)
FROM hotel_ratings hr2
WHERE hr2.hotel = hr1.hotel)) AS rating_difference
FROM hotel_ratings hr1),
max_deviation AS (
SELECT hotel, MAX(rating_difference) AS max_rating_difference
FROM rating_differences
GROUP BY hotel)
select fl.hotel, fl.year, fl.rating
from(SELECT rd.hotel, rd.year, rd.rating, rd.rating_difference,
CASE WHEN rd.rating_difference = md.max_rating_difference AND rd.hotel = md.hotel THEN 1 ELSE 0 END AS flag
FROM rating_differences rd
JOIN max_deviation md
ON rd.hotel = md.hotel) fl
where fl.flag != 1
thank you