Hey Bert. Thanks for this, brought back a bunch of unhappy memories :) My suggestion... Print the entire heap table to paper. By the 3rd box of paper the customer had to pay for, he will realize it will be cheaper to give you an OLAP instance and let you query from there >
@@zanonilabuschagne7628Indeed they are! I just showed a newer colleague yesterday about the speed differences in OR and an alternative (though not AND in this case). I had a fairly complex query written out, and we ran it a few times. Every time it executed in ~700ms. -- CteNonUniqueIDPair: -- ([ID1], [ID2]) -- 5 rows -- Tbl: -- ([ID1], [ID2], …[21OtherCols]) -- 130,000 rows I commented out this logic: -- ========== /* WHERE ( NOT EXISTS ( SELECT TOP (1) 1 FROM CteNonUniqueIDPair AS “Cte” WHERE ( Cte.[ID1] = Tbl.[ID1] ) ) AND NOT EXISTS ( SELECT TOP (1) 1 FROM CteNonUniqueIDPair AS “Cte” WHERE ( Cte.[ID2] = Tbl.[ID2] ) ) ) */ -- ========== And replaced it with something that LOOKS like it would be more performant: -- ========== WHERE ( NOT EXISTS ( SELECT TOP (1) 1 FROM CteNonUniqueIDPair AS “Cte” WHERE ( (Cte.[ID1] = Tbl.[ID1]) OR (Cte.[ID2] = Tbl.[ID2]) ) ) ) -- ========== And we executed it a couple of times. Now, the execution time had spiked from ~700ms to ~1900ms! It was nearly three times as slow!
Hi Bert. I had similar with a 3rd party audit table. Then whilst looking for a key noticed a lot of duplicates (> 60% !). Job step 1) inc control table (year) 2) copy a year of data into a child 3) remove that year from parent 4) dedup child 5) copy smaller child back into parent. And left the children to be manually deleted at the end (just in case). Took a 12 nights.
Sometimes you just have to be pragmatic and do it in a loop. I probably would have done it on a day to day basis but had 5 or 6 connections to the database.
I'm wondering if something like a window function would have helped.. So figure out ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) for each row and then do a modulo 3 to pluck out every 3rd row..?
Hey Bert just curious did you build your query out using a WHILE loop? I'm fairly new to learning SQL & would like to practice this scenario on some sample data.
Glad to have the videos back.
Hey Bert. Thanks for this, brought back a bunch of unhappy memories :)
My suggestion... Print the entire heap table to paper. By the 3rd box of paper the customer had to pay for, he will realize it will be cheaper to give you an OLAP instance and let you query from there >
p.s. it was on this scenario I learnt how much slower "OR" operators are than "AND" operators... life lessons :)
@@zanonilabuschagne7628Indeed they are! I just showed a newer colleague yesterday about the speed differences in OR and an alternative (though not AND in this case).
I had a fairly complex query written out, and we ran it a few times. Every time it executed in ~700ms.
-- CteNonUniqueIDPair:
-- ([ID1], [ID2])
-- 5 rows
-- Tbl:
-- ([ID1], [ID2], …[21OtherCols])
-- 130,000 rows
I commented out this logic:
-- ==========
/*
WHERE (
NOT EXISTS (
SELECT TOP (1) 1
FROM CteNonUniqueIDPair AS “Cte”
WHERE (
Cte.[ID1] = Tbl.[ID1]
)
)
AND NOT EXISTS (
SELECT TOP (1) 1
FROM CteNonUniqueIDPair AS “Cte”
WHERE (
Cte.[ID2] = Tbl.[ID2]
)
)
)
*/
-- ==========
And replaced it with something that LOOKS like it would be more performant:
-- ==========
WHERE (
NOT EXISTS (
SELECT TOP (1) 1
FROM CteNonUniqueIDPair AS “Cte”
WHERE (
(Cte.[ID1] = Tbl.[ID1])
OR (Cte.[ID2] = Tbl.[ID2])
)
)
)
-- ==========
And we executed it a couple of times. Now, the execution time had spiked from ~700ms to ~1900ms! It was nearly three times as slow!
Hi Bert. I had similar with a 3rd party audit table. Then whilst looking for a key noticed a lot of duplicates (> 60% !). Job step 1) inc control table (year) 2) copy a year of data into a child 3) remove that year from parent 4) dedup child 5) copy smaller child back into parent. And left the children to be manually deleted at the end (just in case). Took a 12 nights.
Glad to have the videos back. I would have talked it over at the next volleyball game.
Also often need to go with this approach 😅 manual job but it works
Sometimes you just have to be pragmatic and do it in a loop. I probably would have done it on a day to day basis but had 5 or 6 connections to the database.
I would have cried
I'm wondering if something like a window function would have helped.. So figure out ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) for each row and then do a modulo 3 to pluck out every 3rd row..?
Did you try to put this logic into python and run async ? I think i would take less time to copy the entire table because it will be copied parallelly
Hey Bert just curious did you build your query out using a WHILE loop? I'm fairly new to learning SQL & would like to practice this scenario on some sample data.
@@DataWithBert O even better, thank you!