Remove Outliers - SQL Interview Query 11 | SQL Problem Level "MEDIUM"

แชร์
ฝัง
  • เผยแพร่เมื่อ 1 ม.ค. 2025

ความคิดเห็น • 75

  • @arnabsarkar6159
    @arnabsarkar6159 9 หลายเดือนก่อน +6

    Not a full proof solution as one partition might have more than one distinct outliers..Not sure how rank function will help there

  • @7sandyData
    @7sandyData 9 หลายเดือนก่อน +5

    Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.

    • @camomile4085
      @camomile4085 6 หลายเดือนก่อน

      Could you share the answer, please?

  • @splendidabhi
    @splendidabhi 9 หลายเดือนก่อน +1

    Thank you so much learnt new things ❤

  • @13.AkashDas
    @13.AkashDas 4 หลายเดือนก่อน

    my approach :
    with cte as( select *,avg(rating)over(partition by hotel )as avg
    from hotel_ratings),
    cte2 as (select *, abs(avg-rating) as abs from cte),
    cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2)
    select hotel,year,rating from cte3
    where excludeabs
    order by hotel desc

  • @camomile4085
    @camomile4085 6 หลายเดือนก่อน

    Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance:
    WITH cte1 AS (
    SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg
    FROM hotel_ratings),
    cte2 AS (
    SELECT *, ABS (rating - r_avg) AS diff
    FROM cte1
    )
    SELECT hotel, year, rating
    FROM cte2
    WHERE diff

  • @Tusharchitrakar
    @Tusharchitrakar 9 หลายเดือนก่อน +9

    Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example:
    If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist.
    Right solution:
    Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers.
    If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate
    That being said thank you for your continuous efforts in helping us build strong SQL foundations.

    • @gunasekhar8440
      @gunasekhar8440 9 หลายเดือนก่อน

      Hi Tushar ji, simply we have two types of famous methods in stats.
      1. Z-score
      2. IQR
      Correct me if I'm wrong

    • @Tusharchitrakar
      @Tusharchitrakar 9 หลายเดือนก่อน

      ​​@@gunasekhar8440you are absolutely right. Iqr rule is applied if the data is not gaussian/normal. Iqr method can also be applied here: we need to calculate the 25th (Q1) and 75th percentile (Q3) of the data (there are window functions for this which will be different in different rdbms i guess). IQR=75th percentile value - 25th percentile value (Q3-Q1). Then outliers will be either Q1-1.5*IQR or Q3+1.5*IQR.
      This logic can also be used in SQL to identify outliers. Hope this helps

    • @rakeshkumarreddymudda
      @rakeshkumarreddymudda 9 หลายเดือนก่อน

      Yes I also felt the same. Approach is not correct, ranking will not suit for resolving the outliers issue. We need use IQR to find outliers I think

    • @gunasekhar8440
      @gunasekhar8440 9 หลายเดือนก่อน

      @@rakeshkumarreddymudda but we have only less number of records. Then if we do that procedure it doesn't make any sense

    • @Tusharchitrakar
      @Tusharchitrakar 9 หลายเดือนก่อน +2

      @@gunasekhar8440 yes but in a general case iqr rule has to be applied. In these case there are only 4-6 records per hotel so the concept of percentile wouldn't be accurate for such low data points. But if the number of records were more, it would so you are right

  • @anirbanbiswas7624
    @anirbanbiswas7624 6 หลายเดือนก่อน

    with cte as(select *,rank() over(order by rating asc)as max_rnk,rank() over(order by rating desc)as min_rnk from hotel_ratings)
    select hotel,year,rating from cte where max_rnk 8 and max_rnk1 and min_rnk 1 order by hotel
    donot know whether it will be dynamic or not...

  • @Ajaysaini081
    @Ajaysaini081 9 หลายเดือนก่อน

    This is my alternate solution:
    select * from hotel_ratings where rating <
    (select avg(rating) from hotel_ratings where hotel='Radisson Blu')
    and hotel='Radisson Blu'
    union
    select * from hotel_ratings where rating >
    (select avg(rating) from hotel_ratings where hotel='InterContinental')
    and hotel='InterContinental' order by hotel desc

    • @NischitSapkota
      @NischitSapkota 9 หลายเดือนก่อน +1

      For a large enough data set , there could be multiple values above the mean value. So, the data point above the mean doesn't necessarily mean the highest dataset. The average height of a male is 5'8 (let's say) , doesn't necessarily signify that 5'9 is the tallest a man could ever be. It could be 6'5 , 7'3 or whatever.

    • @Ajaysaini081
      @Ajaysaini081 9 หลายเดือนก่อน

      Yes that is the issue with this solution

  • @sat9sh
    @sat9sh 9 หลายเดือนก่อน

    Thank You Sir!

  • @MdZeeshan-m9u
    @MdZeeshan-m9u 9 หลายเดือนก่อน

    Thank you so much

  • @hassamulhaq7762
    @hassamulhaq7762 9 หลายเดือนก่อน

    thank you,

  • @nileshpandey8247
    @nileshpandey8247 9 หลายเดือนก่อน +2

    with cte as
    ( select *,avg(rating)over(partition by hotel ) as avg_rating,round(STDDEV(rating)over(partition by hotel),4) as std
    from hotel_ratings),
    cte1 as
    (select *,abs((rating - avg_rating)/std) as out_detect
    from cte)
    select hotel,year,rating from cte1 where out_detect < 1.5
    order by hotel desc,year

    • @mayerKhadar
      @mayerKhadar 9 หลายเดือนก่อน

      Perfect!

    • @TheRaju991
      @TheRaju991 9 หลายเดือนก่อน

      why 1.5?

    • @nileshpandey8247
      @nileshpandey8247 9 หลายเดือนก่อน +1

      @@TheRaju991 it's just bcz of this problem , else it would have been 3

    • @TheRaju991
      @TheRaju991 9 หลายเดือนก่อน

      @@nileshpandey8247 Makes sense. Thanks!

  • @AjinkyaGhadge-w5l
    @AjinkyaGhadge-w5l 6 หลายเดือนก่อน

    @techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??

  • @RAISHABANU-d2k
    @RAISHABANU-d2k 9 หลายเดือนก่อน

    sir, please put a video on different star pattern forming with mysql .

  • @vaibhavbatra9118
    @vaibhavbatra9118 7 หลายเดือนก่อน

    with cte as
    (select hotel , year, rating, average_rating, average_rating_difference, max(average_rating_difference)
    over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference
    from (select *,
    AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating,
    abs(rating - AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference
    from hotel_ratings)x )
    select hotel, year, rating
    from cte
    where average_rating_difference max_difference

  • @gphanisrinivasful
    @gphanisrinivasful 9 หลายเดือนก่อน

    This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause.
    select avg(rating) over(partition by hotel)
    from hotel_ratings
    This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.

    • @yogamayaa
      @yogamayaa 9 หลายเดือนก่อน

      He put Order by within partition to just maintain same order of records as final.

  • @navaneeth6414
    @navaneeth6414 9 หลายเดือนก่อน +1

    @TFQ Generalized solution that will work for many outliers
    WITH cte AS (
    SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev
    FROM hotel_ratings
    ),
    cte1 AS (
    SELECT * ,ABS(rating - avg_rating) AS diff_rating , cast(avg_rating + std_dev as decimal(9,1)) as positive , cast(avg_rating - std_dev as decimal(9,1)) as negative from cte

    ),cte2
    as
    (
    Select *, CASE WHEN rating =negative THEN 1 else 0 end as flag from cte1
    )
    select hotel,year,rating from cte2
    where flag 0
    order by hotel desc, year asc

    • @gouthamstar6558
      @gouthamstar6558 8 หลายเดือนก่อน

      too complex, you can simplify this code

    • @navaneeth6414
      @navaneeth6414 8 หลายเดือนก่อน

      Ok let me know your solution

  • @parmoddhiman678
    @parmoddhiman678 5 หลายเดือนก่อน

    with cte as (
    SELECT *,
    round(avg(rating) over(partition by hotel order by year
    range between unbounded preceding and unbounded following ),2) as avg_rating
    FROM hotel),
    cte1 as (select *,
    abs(avg_rating-rating) as diif
    from cte),
    cte2 as (select *,
    dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc
    ) as rn
    from cte1)
    select hotel,year,rating from cte2
    where rn>1

  • @ongole_boy6346
    @ongole_boy6346 9 หลายเดือนก่อน

    Bro , can you please explain in oracle

  • @gouthamstar6558
    @gouthamstar6558 8 หลายเดือนก่อน

    with cte as(
    select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings
    ),
    cte2 as
    (select *, abs(rating-rating2) as did from cte),
    cte3 as
    (select *, rank() over(partition by hotel order by did desc) as a from cte2)
    select hotel,year,rating from cte3
    where a>1
    order by hotel desc ,year;

  • @digitaltechconnect1318
    @digitaltechconnect1318 9 หลายเดือนก่อน +1

    Instead of ranking, is there any other logic to find outlier step at the last as step?

    • @akashchristopher
      @akashchristopher 9 หลายเดือนก่อน

      yes. you can also find the outlier using 5 number summary method.

    • @ErfanHossainShoaib
      @ErfanHossainShoaib 9 หลายเดือนก่อน

      you can find max avg value for hotel and get less value of max avg value

  • @anudeepreddy5559
    @anudeepreddy5559 3 หลายเดือนก่อน

    🔥❤

  • @ratheeshg1896
    @ratheeshg1896 6 หลายเดือนก่อน

    Oracle SQL | Solution that will work for many outliers.
    WITH cte AS (
    SELECT
    hotel,
    year,
    rating,
    AVG(rating) OVER (PARTITION BY hotel) AS average,
    rating - AVG(rating) OVER (PARTITION BY hotel) AS difference
    FROM
    hotel_ratings
    ),
    result_table AS (
    SELECT
    hotel,
    year,
    rating,
    difference,
    AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff,
    CASE
    WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel)
    THEN 'remove'
    ELSE 'keep'
    END AS flag
    FROM
    cte
    )
    SELECT
    hotel,
    year,
    rating
    FROM
    result_table
    WHERE
    flag = 'keep'
    ORDER BY
    hotel DESC,
    year;

  • @vivekshirsat9249
    @vivekshirsat9249 7 หลายเดือนก่อน

    WITH AVG_RATING_CTE AS
    (SELECT *,
    ROUND(AVG(RATING)OVER(PARTITION BY HOTEL ORDER BY YEAR
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),1) AS AVG_RATING
    FROM HOTEL_RATINGS)
    ,DIFF_CTE AS(SELECT
    HOTEL, YEAR, RATING, AVG_RATING ,ABS(AVG_RATING - RATING) AS DIFF
    FROM AVG_RATING_CTE)
    ,MAX_CTE AS(
    SELECT
    HOTEL, YEAR, RATING, DIFF ,MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR
    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX
    FROM DIFF_CTE
    )
    SELECT HOTEL, YEAR, RATING
    FROM MAX_CTE
    WHERE DIFF MAX
    ORDER BY HOTEL,YEAR

  • @chetanmaurya8557
    @chetanmaurya8557 8 หลายเดือนก่อน

    i am getting avg rating of radisson blu as 3.83

  • @DataSailor
    @DataSailor 8 หลายเดือนก่อน

    SQL server -
    WITH CTE AS
    (SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings),
    CTE_FINAL AS
    (SELECT *,CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE)
    SELECT hotel,year,rating FROM CTE_FINAL WHERE flag=0

  • @abhijitvernekar593
    @abhijitvernekar593 9 หลายเดือนก่อน

    with cte as(
    select
    *,avg(rating) over(partition by hotel ) avg_rating,abs(rating - avg(rating) over(partition by hotel )) diff_rating
    from hotel_ratings),
    cte2 as (SELECT *,max(diff_rating) over(partition by hotel) max_diff from cte
    )
    select hotel,year,rating from cte2
    where diff_ratingmax_diff
    order by hotel desc, year ;

  • @dumdum987
    @dumdum987 5 หลายเดือนก่อน

    with cte as(
    select hotel,avg(rating) as rat
    from hotel_ratings
    group by hotel)
    ,cte2 as(select h.hotel,h.year,rating,rank() over (partition by hotel order by abs(rating -rat) desc) as r
    from cte c
    inner join hotel_ratings h
    on c.hotel=h.hotel)
    select * from cte2 where r>1;

  • @SanjaySharma-zf4qw
    @SanjaySharma-zf4qw 8 หลายเดือนก่อน +1

    select hotel,year,rating
    from hotel_ratings
    where hotel = 'Radisson Blu' and year in('2021','2022','2023')
    union
    select hotel,year,rating
    from hotel_ratings
    where hotel = 'InterContinental' and year in('2020','2021','2023')
    order by hotel desc

    • @gouthamstar6558
      @gouthamstar6558 8 หลายเดือนก่อน +1

      this is a static way, not applicable for different cases. but what if the ratings changes? will you change the year names again? just think

  • @iswillia123
    @iswillia123 9 หลายเดือนก่อน

    select hotel,year, rating, delta, max_delta
    from (
    select hotel,year, rating, delta, max(delta) over (partition by hotel ) max_delta
    from (
    select hotel,year, rating, abs(rating-av) as delta
    from (
    select hotel, year, rating,
    round(avg(rating) over (partition by hotel),2) av
    from hotel_ratings)))
    where delta max_delta
    order by hotel desc, year
    ;

  • @akshaysalve195
    @akshaysalve195 4 หลายเดือนก่อน

    WITH cte AS (
    SELECT
    hr.*,
    (
    SELECT ROUND(AVG(rating), 2)
    FROM hotel_ratings hr2
    WHERE hr2.Hotel = hr.Hotel
    ) AS avrg
    FROM
    hotel_ratings hr
    ),
    cte1 AS (
    SELECT
    Hotel,
    Year,
    rating,
    RANK() OVER (
    PARTITION BY Hotel
    ORDER BY ABS(rating - avrg), Year
    ) AS rn
    FROM
    cte
    )
    SELECT
    Hotel,
    Year,
    rating
    FROM
    cte1
    WHERE
    rn < (
    SELECT MAX(rn)
    FROM cte1
    )
    ORDER BY
    Hotel DESC,
    Year;

  • @SylviaFerguson-u8g
    @SylviaFerguson-u8g 4 หลายเดือนก่อน

    Thompson Edward Johnson Gary Johnson Karen

  • @navaneeth6414
    @navaneeth6414 9 หลายเดือนก่อน

    My solution:
    Correct me if am wrong
    WITH cte AS (
    SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating
    FROM hotel_ratings
    ),
    cte1 AS (
    SELECT * ,ABS(rating - avg_rating) AS flag1 from cte

    ),cte2 as
    (
    select *,max(flag1) over(partition by hotel order by hotel) as maxFlag from cte1
    )
    (select Hotel,year,rating from cte2
    where flag1 maxFlag)
    ORDER BY hotel DESC

  • @sathvikdakappagari9122
    @sathvikdakappagari9122 5 หลายเดือนก่อน

    my solution :
    WITH rating_differences AS (
    SELECT hotel,year,rating, ABS(rating - (SELECT ROUND(AVG(rating), 2)
    FROM hotel_ratings hr2
    WHERE hr2.hotel = hr1.hotel)) AS rating_difference
    FROM hotel_ratings hr1),
    max_deviation AS (
    SELECT hotel, MAX(rating_difference) AS max_rating_difference
    FROM rating_differences
    GROUP BY hotel)
    select fl.hotel, fl.year, fl.rating
    from(SELECT rd.hotel, rd.year, rd.rating, rd.rating_difference,
    CASE WHEN rd.rating_difference = md.max_rating_difference AND rd.hotel = md.hotel THEN 1 ELSE 0 END AS flag
    FROM rating_differences rd
    JOIN max_deviation md
    ON rd.hotel = md.hotel) fl
    where fl.flag != 1
    thank you