The practice dataset and SQL statements for this video tutorial are available here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
One important thing to ask is "what causes duplicate records in the database?" In large applications where data comes from multiple sources, "de-duping" before insert is a real problem. It's good to know how to delete duplicates, but in some cases that's not a viable option. For example, if the table has 100 million rows, deleting duplicates could be quite expensive and shouldn't be run during business hours.
When I ask SQL questions in interviews, I'm not looking for "how do I use group by for de-duping." I want to see the candidate thinking about the larger problem and taking time to understand business needs. Cleaning up dupes after the fact doesn't scale, and candidates who ask "what is causing dupes and what's the impact" are the developers I want. A developer who only knows how to group by, but never bothers to ask "why is this happening and what is the root problem," isn't someone I will hire.
Love your thoughts! Thank you for sharing this with us 👍
Depends on what you are interviewing for. Asking a simple developer about business problems doesn't make much sense. If you want to hire a tech lead, team lead, project manager, architect or a consultant, then yes.
@@redguard128 if I'm interviewing an entry level developer, I still ask the question for a few reasons. The first one is to expose the candidate to important issues they will eventually have to deal with. The second is to emphasize that SQL isn't just for programming's sake; it's there to manage data and solve the functional needs of the application. If applications aren't checking for dupes before inserting data, your database is going to become a pile of garbage very quickly.
@@woolfel For me, a developer's role is execution. He/she has to do what I tell him/her to do. The "Why"s and "How"s aren't their concern. Sometimes the business decides that duplicates are what they want, so a developer that prematurely solves a problem actually does more damage than good.
Some businesses run on circular logic, repeating themselves, defining settings everywhere, modifying global variables in functions, running multiple databases with duplicate data, having too low spec or too high spec servers, etc.
@@redguard128 that's one way to do it. I work in the consulting world and growing our developers is very important to me. The faster the developer learns, the less hand holding I need to do, and it makes the entire team more productive. I've worked in the Fortune 500 world long enough to know that hiring a bunch of low level developers who can't grow causes more problems. In the healthcare and finance sectors, duplicate data causes huge data integrity issues. I would say 90% of the ETL work in the Fortune 500 deals with dirty data. When issues happen, it's because of dirty data (bad references, missing data and dupes).
Many of our customers waste 3-12 months dealing with dirty data every year. A developer that isn't thinking about these issues and constantly growing will become obsolete. I have seen full-time employees (aka not consultants) work this way. I question whether that is a good thing to teach people. Who wants to stay a low level engineer forever and be someone else's slave? Who wants to work at a job where the tech lead treats them like a pair of fingers?
Thanks for making this interview questions series.. cleared my major doubts
Glad to hear that! Thank you for your support.
Wonderful 😍 thanks a lot 🙏
Good way of teaching
Thank you
I appreciate your work mam and videos are good explanatory
Thank you
Excellent Mam😍
Thank you
Please post more SQL queries.
Will be posting more soon.
Can we use dense rank also?
You are amazing teacher
Thank you 🙏
Video is visible but explanation is superb
very good; thanks
Thank you
Very Helpful. Thank you
Thank you
Excellent. Thank you!
Glad it was helpful!
Best is to use row_number() and partition by to create sequence column on duplicates - applicable to all kind of duplicates ( identical rows especially)
Yes
True
Thanks u mam for sharing this.🙏
Thank you
please upload dataset as well to follow along.
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
Have you been asked a SQL query interview question that you couldn't answer?
Let us know in the comments below and we will answer those in our upcoming videos!
In my last interview, the interviewer asked me what the difference is between count(*) and count(1), and which one performs better. Can you please make a video on this?
Rollback; before commit;
You are a bless in my life 😘
Nice 👍helpful
Thank you
In the last approach you mentioned, we are deleting the values from the cte right ? Not from the main table ?
Deleting from CTE will delete it from the underlying table.
@@LearnatKnowstar got it. I wasn't aware of this fact about cte back then.
For finding the duplicate values we can use 'having count(*) > 1'. Wouldn't it be possible to use a delete here, with the having clause in a subquery?
It will delete all occurrences of the duplicate records. The method explained retains one occurrence of the duplicate records.
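A minimal sketch of the point made in this reply, assuming SQL Server syntax and the dbo.Employee table from the video with EmployeeID, FirstName and LastName columns. The GROUP BY ... HAVING query is fine for finding duplicated names, but a DELETE driven only by that list cannot tell the copies apart, so it removes all of them:

```sql
-- Assumed table: dbo.Employee(EmployeeID, FirstName, LastName); SQL Server syntax.

-- Lists each duplicated name once, with the number of rows that share it.
SELECT FirstName, LastName, COUNT(*) AS occurrences
FROM dbo.Employee
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1;

-- Deletes EVERY row whose name is duplicated, including the copy you meant to keep.
DELETE e
FROM dbo.Employee AS e
INNER JOIN (
    SELECT FirstName, LastName
    FROM dbo.Employee
    GROUP BY FirstName, LastName
    HAVING COUNT(*) > 1
) AS d
    ON d.FirstName = e.FirstName
   AND d.LastName  = e.LastName;
```

To keep one row per name, the delete needs something that distinguishes the copies, such as EmployeeID or a row_number() value.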
@@LearnatKnowstar thank u
So deleting data within a CTE will delete data in the table too? How? Does this happen with derived tables and views also?
Same question
Really appreciate your effort..
If possible, please add the table script as well; it will be helpful for beginners.
Thank you!
Thank you. We have started adding the table scripts in our latest videos!
The practice dataset and SQL statements are now available here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
From rank function it's better .. thanks
Hello there,
Can you please do a video on how to add a new large dataset (millions of rows) to an existing table without deleting the data already in the table? Assume the column names and data types are the same in the old table and the newly added data. Thanks
May not be the best method, but create a loop to insert the records in batches, 20k rows at a time. Make sure you set the loop to end once finished.
Best method: drop the indexes, create a loop based on the optimal batch size (typically 20k to 500k rows) using bulk insert, and re-apply the indexes once complete.
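A rough sketch of the batching loop described in these two replies, in SQL Server syntax. Everything named here is hypothetical: dbo.Employee_Staging holds the new rows, dbo.Employee is the existing table, and EmployeeID is assumed to be a unique, positive integer that is not an IDENTITY column in the target. Dropping and re-creating indexes is left out because it depends on the table.

```sql
-- Hypothetical tables: dbo.Employee_Staging (new rows) and dbo.Employee (existing rows),
-- both with EmployeeID, FirstName, LastName.
DECLARE @BatchSize int = 50000;   -- tune to your system
DECLARE @LastID    int = 0;       -- highest EmployeeID copied so far
DECLARE @MaxID     int;
DECLARE @NextID    int;

SELECT @MaxID = MAX(EmployeeID) FROM dbo.Employee_Staging;

-- If the staging table is empty, @MaxID is NULL and the loop never starts.
WHILE @LastID < @MaxID
BEGIN
    -- Upper EmployeeID bound of the next batch of at most @BatchSize rows.
    SELECT @NextID = MAX(EmployeeID)
    FROM (
        SELECT TOP (@BatchSize) EmployeeID
        FROM dbo.Employee_Staging
        WHERE EmployeeID > @LastID
        ORDER BY EmployeeID
    ) AS b;

    INSERT INTO dbo.Employee (EmployeeID, FirstName, LastName)
    SELECT EmployeeID, FirstName, LastName
    FROM dbo.Employee_Staging
    WHERE EmployeeID > @LastID
      AND EmployeeID <= @NextID;

    SET @LastID = @NextID;
END;
```

Walking the key range this way means each batch is a separate statement, so locks and log growth stay bounded by the batch size.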
Very good approaches have been mentioned in the comments. Thank you
Thanks
Thank you
How does deleting the records from the CTE delete the rows from the main table?
We can use row_number also, right, while deleting the duplicates in the CTE?
Yes
Your style is nice, sister.
The question is how to delete duplicates from a table, not how to display duplicates or how to display unique rows - two different things. The correct answer is DELETE FROM WHERE max() … , even though this is inefficient.
It’s about how to ingest data into a table without duplicates.
It is also about how much time the query takes to execute.
prashant verma Right, but execution time is a secondary problem; first comes the functionality, then the optimization.
@@danieljust295 absolutely right, if we just consider the test question then yes, your approach was the simplest.
prashant verma I'd also add that “deleting duplicates” is an ambiguous phrase. Does it mean completely removing the duplicated rows from the table, or leaving one unique row when there are multiple rows with the same values (duplicates)?
@@danieljust295 that's a question we should ask before proceeding with the query. However, in an interview we don't get the chance. So here we can assume it means removing the duplicates from the table. As she said in her video, we can keep the latest one.
When I try to duplicate your example on MySQL, I get the error "Error Code: 1288. The target table employye_cte of the DELETE is not updatable". This is the query I am trying to run:
with employye_cte as
(select firstname, lastname, employeenumber, row_number() over (partition by lastname order by employeenumber) as rownumber
from employees1)
delete from employye_cte where rownumber = '2'
What am I missing?
Thanks
This is also my problem. I found that MySQL can't delete from a subquery. Did you find a solution for this?
@@ivanbesando556 Not so far.
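One possible workaround, sketched for MySQL 8.0 or later (window functions are required). MySQL does not let a CTE be the target of a DELETE, but it does allow deleting from the base table while joining to a derived table that carries the row numbers. This assumes employeenumber is unique per row in employees1, and it uses rn > 1 rather than = '2' so that third and later copies are removed as well:

```sql
-- MySQL 8.0+. Table and columns are taken from the question; employeenumber is assumed unique.
DELETE e
FROM employees1 AS e
INNER JOIN (
    SELECT employeenumber,
           ROW_NUMBER() OVER (PARTITION BY lastname ORDER BY employeenumber) AS rn
    FROM employees1
) AS d
    ON d.employeenumber = e.employeenumber
WHERE d.rn > 1;   -- keeps the lowest employeenumber per lastname
```

Partitioning by lastname alone follows the original query; add firstname to the PARTITION BY if two people sharing a last name should not count as duplicates.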
Hi madam can you please make a video on how to recover accidentally deleted data from the table. Thanks in advance
That's a great question. Will definitely plan a video soon.
We can use roll back command, I think, if we delete the data accidentally. But this command can't be executed for DDL commands
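A small sketch of the rollback idea, assuming SQL Server and the dbo.Employee table from the video. Rollback only helps if the delete ran inside a transaction that has not been committed yet, so the safe habit is to open one explicitly before deleting:

```sql
-- SQL Server syntax; dbo.Employee(EmployeeID, FirstName, LastName) is assumed.
BEGIN TRANSACTION;

DELETE FROM dbo.Employee
WHERE EmployeeID NOT IN (
    SELECT MAX(EmployeeID)
    FROM dbo.Employee
    GROUP BY FirstName, LastName
);

-- Check what is left before making the change permanent.
SELECT * FROM dbo.Employee;

ROLLBACK TRANSACTION;    -- undo the delete
-- COMMIT TRANSACTION;   -- or run this instead once the result looks right
```

Once a delete has been committed, rollback no longer applies and recovery has to come from a backup.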
Very helpful. I am totally new. Learning SQL. Question, what platform is this where u are running sql queries?
This is SQL Server. You can download the software for free from Microsoft website.
Thanks for the video, can you pls tell me how the datasets used here can be found or accessed ?
The practice datasets are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
Mam...instead of employee id, can we use rowid here and order by rowid? In practice, many tables won't have a column like emp id.
You will need to choose a key column that identifies a unique record in the table
@@LearnatKnowstar can I not use rowid ?
Rowid would be unique for each row in the table. Each duplicate row will have its own rowid and hence rowid can not be used. You need to identify a key column that represents a unique record to business
with Employee_CTE as
(Select *,
RANK() over (partition by FirstName, Lastname order by EmployeeID desc) as Rank
from Employee);
delete from Employee_CTE where Rank > 1;
whenever I am typing this block of code in my oracle db (11g), I am getting an error
ORA-00923: FROM keyword not found where expected.
can somebody please help me in this matter?
You do not need to terminate the CTE with a semi colon.
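Removing the semicolon is only part of the fix on Oracle. ORA-00923 comes from `Select *, ...`: Oracle does not accept a bare `*` followed by more columns. Oracle also does not allow a DELETE to follow a WITH clause, so the CTE form from the video (SQL Server) does not carry over. A sketch of an Oracle equivalent, assuming the Employee table from the question, uses ROWID to point back at the rows to remove:

```sql
-- Oracle. Assumes Employee(EmployeeID, FirstName, LastName) as in the question.
DELETE FROM Employee
WHERE ROWID IN (
    SELECT rid
    FROM (
        SELECT ROWID AS rid,
               ROW_NUMBER() OVER (PARTITION BY FirstName, LastName
                                  ORDER BY EmployeeID DESC) AS rn
        FROM Employee
    )
    WHERE rn > 1   -- everything except the highest EmployeeID per name
);
```

ROW_NUMBER() is used instead of RANK() so that rows that tie on EmployeeID would still get distinct numbers.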
Hi, I have a query please reply to what is wrong in it,
my employ table contains,
id, name, sal, email
with dup_emp as (select *,dense_rank() over (partition by email order by id desc) as dens_rnk
from employ e)
delete from dup_emp where dens_rnk >1
now this code is showing this error,
SQL Error [42P01]: ERROR: relation "dup_emp" does not exist
Position: 127
i am selecting everything , including the cte, and then executing the query
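PostgreSQL does not treat a CTE as a deletable target, which is why the DELETE cannot find a relation called dup_emp. A sketch of one way to write it there, assuming id is unique in employ: keep the CTE for numbering the rows, but delete from the real table and join to the CTE with USING.

```sql
-- PostgreSQL. Table from the question: employ(id, name, sal, email); id is assumed unique.
WITH dup_emp AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id DESC) AS rn
    FROM employ
)
DELETE FROM employ AS e
USING dup_emp AS d
WHERE e.id = d.id
  AND d.rn > 1;   -- keeps the highest id per email
```

With a unique id in the ORDER BY, dense_rank() from the question and row_number() give the same numbers, so either can be used here.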
Hello,
This video is very helpful. But I have one question, The maximum employee ID is the duplicate one right? But you are deleting min of employee ID. Could you please clarify that?
yes, you can delete max of employee id considering it as a duplicate.
It was just assumed in the example that we want to retain the max employee id.
Deleting from CTE deletes data from the source table????????? How?
Yes, you can try the same.
WITH CTE_dup
AS
(
SELECT EmpID, FirstName, LastName, ROW_NUMBER() OVER (PARTITION BY FirstName, LastName ORDER BY EmpID) AS rownum
FROM [dbo].[tblDuplicate]
)
DELETE FROM CTE_dup
WHERE rownum > 1;
SELECT * FROM [dbo].[tblDuplicate]
please provide practice database ... that you have used in this video
The practice dataset and SQL statements are available here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
@@LearnatKnowstar thank you
What if there are more columns, and the rows differ only in a timestamp or user column but have duplicate values in the key columns? In that case, how can we delete the duplicates using row_number?
You just need to use key columns in partition by clause
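A sketch of what that can look like in SQL Server. The table is hypothetical: dbo.Orders with OrderNo and CustomerID as the business key, plus UpdatedBy and UpdatedAt columns that may differ between copies. Only the key columns go in PARTITION BY, and the timestamp decides which copy survives:

```sql
-- SQL Server. Hypothetical table: dbo.Orders(OrderNo, CustomerID, UpdatedBy, UpdatedAt).
WITH Ranked AS (
    SELECT OrderNo,
           CustomerID,
           UpdatedAt,
           ROW_NUMBER() OVER (PARTITION BY OrderNo, CustomerID
                              ORDER BY UpdatedAt DESC) AS rn
    FROM dbo.Orders
)
DELETE FROM Ranked
WHERE rn > 1;   -- keeps the most recently updated row per (OrderNo, CustomerID)
```

If two copies share the same UpdatedAt, which of them is kept is arbitrary; add another column to the ORDER BY if that matters.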
@@LearnatKnowstar 0
Can you please share the DDL of all the questions mentioned ?
The DDLs are available here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
To delete duplicates we can just use select distinct * from emp; that removes duplicates from the entire table.
Good knowledge Bro
Not really. In the example these records were not duplicates, as they differed in EmployeeID - distinct would work only if all fields were the same. There are other ways of getting rid of duplicates, such as using rowids, something like this (using the example where duplicates existed on first name, last name, phone and email):
DELETE FROM EMPLOYEE1
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM EMPLOYEE1
    GROUP BY FIRSTNAME, LASTNAME, PHONE, EMAIL
);
Where can I get these practice tables?
You can practice with tables in the Microsoft Adventure Works database.
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
If the data set is bigger, say 10,000 rows, how would you remove the duplicates without looking through it to see which rows are duplicates? Please tell me.
Where is dataset to download
👍👍👍
Thank you
What is difference between roll_number and rank function? Both can be used interchangeably??
Do you mean Row_Number instead of roll_number?
The output sequence of the query below should be -
select firstname, lastname, count(*) from employee group by firstname, lastname
Output -
firstname lastname count(*)
Adam ownes 2
Mark wills 1
natasha lee 2
ruley jones 1
Hi
Could you please share the sample data.Thanks
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
Does anyone know how i can have access to the datasources used in this video?
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
As per my understanding, if we have more than 2 duplicate records in a table, then rank() and dense_rank() will not work here; in this case we have to use row_number() only!!!
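A small check of this, runnable on SQL Server or PostgreSQL without any table. As far as the ranking functions go, what matters is not how many copies there are but whether the ORDER BY column can tell them apart. With a unique EmployeeID in the ORDER BY, rank() numbers the copies 1, 2, 3 just like row_number(). With fully identical rows, every copy ties:

```sql
-- Three fully identical rows, built inline; no table needed.
SELECT v,
       RANK()       OVER (PARTITION BY v ORDER BY v) AS rnk,
       DENSE_RANK() OVER (PARTITION BY v ORDER BY v) AS dense_rnk,
       ROW_NUMBER() OVER (PARTITION BY v ORDER BY v) AS row_num
FROM (VALUES ('a'), ('a'), ('a')) AS t(v);
-- rnk and dense_rnk are 1 on every row because the rows tie,
-- so a filter like "rnk > 1" matches nothing.
-- row_num is 1, 2, 3, so "row_num > 1" matches two of the three rows.
```

That is why row_number() is the safer default for de-duplication.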
Dear Mam, please zoom,writings are not readable. 🙏🏼
Sure.Noted. In latest videos, the font is enlarged.
Question:- You have created CTE and deleted records from CTE then how data got deleted from the original table? Will wait for answer from anyone.
This is a feature of CTE. If you delete from CTE , it will delete from the underlying table 👍
@@LearnatKnowstar thank you!
Doesnt work, table not updateable
I appreciate your way of explaining. I just loved it... I would say thank you, but I also want you to make a video on common table expressions, which are very important in SQL.
One more question for you. Suppose there are two tables, named Table A and Table B.
Table A:
ID | STUDENT NAME
1 | A
2 | B
3 | C
4 |
Table B:
ID | SUBJECT | MARKS
2 | ENGLISH | 40
4 | ENGLISH | 60
5 | MATHS | 100
6 | SCIENCE | 80
Find out student name who got max mark? I had been asked this question. plz solve here so other people can also get to know.
Thank you very much in advance Mam....I will keep on waiting for answer of above question.
Thank you.
Please see the below video- It has a similar query to the one in your comment.
th-cam.com/video/Z34X1a-zOyg/w-d-xo.html
What is in table A?
I hope my answer is right; if it's wrong, please let me know. Thanks.
Select A.student name from A inner join B on A.id = B.id where max(b.marks)
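The query above would not run as written, because an aggregate such as MAX() cannot be used directly as a WHERE condition; it has to be compared against a subquery. A possible corrected version, assuming the name column is actually called student_name (a column name with a space would need quoting) and that "max mark" means the highest mark among students who appear in both tables:

```sql
-- Assumed columns: A(id, student_name), B(id, subject, marks).
SELECT s.student_name, m.marks
FROM A AS s
INNER JOIN B AS m
    ON m.id = s.id
WHERE m.marks = (
    SELECT MAX(m2.marks)
    FROM B AS m2
    INNER JOIN A AS s2
        ON s2.id = m2.id
);
```

With the sample rows in the question, only IDs 2 and 4 exist in both tables, so the mark of 100 for ID 5 has no student to attach to. Whether that row should count is worth asking the interviewer.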
Select rank() over ( partition by firstname order by employeeid) as employeenumber, firstname,lastname,phone,email from employee;
Please correct me if i am wrong .
DEAR MADAM , PLEASE HELP.......
WITH NEW_TABLE AS
(
SELECT ID, F_N, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RANK
FROM EMPLOYEES
)
DELETE FROM NEW_TABLE
WHERE RANK>1;
----------------------------------------------------------------------------
ERROR MSG:
ORA-00928: missing SELECT keyword
00928. 00000 - "missing SELECT keyword"
*Cause:
*Action:
Error at Line: 6 Column: 1
You have used row num in beginning and rank command at the end. How it will execute. Use either rownum or rank.
The video is not clear... your voice is clear...
Not able to see words
Where is sql code to practice
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
PUT QUERY IN DESCRIPTION
The practice dataset and SQL statements are now available and you can access them here -
know-star.blogspot.com/2023/04/sql-query-how-to-delete-duplicates-from.html
My answer: Google
Google might lead you here 👍
Rownum is best instead of these
waste of my time
delete from dbo.employee
where employeeid not in
(select max(employeeid) from dbo.employee group by firstname, lastname)
Hi , Actually after having the query
SELECT *,
(RANK() OVER (PARTITION BY Firstname,Lastname ORDER BY EmployeeId asc)) AS Rank1
FROM dbo.Employee1,
I am unable to get the data in the order of EmployeeId, its getting in the order of Firstname alphabetical order, Can you please let me know the issue
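The ORDER BY inside OVER (...) only controls how the rank is assigned within each partition; it does not sort the result set. Without an ORDER BY on the query itself, the row order is whatever the engine happens to produce, which here follows the partitioning by name. A sketch of the same query with an explicit sort added:

```sql
-- SQL Server; same table and columns as in the question.
SELECT *,
       RANK() OVER (PARTITION BY FirstName, LastName
                    ORDER BY EmployeeId ASC) AS Rank1
FROM dbo.Employee1
ORDER BY EmployeeId;   -- this is what sorts the rows that come back
```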
DELETE FROM aliasB
FROM dbo.employee1 aliasA INNER JOIN dbo.employee1 aliasB
ON aliasA.FirstName = aliasB.FirstName
AND aliasA.LastName = aliasB.LastName
AND aliasA.EmployeeID < aliasB.EmployeeID -- ONLY valid if PK supports '
Thanks
Thank you