Using row_number() to remove duplicates
For those of you who are coming to Wagga for the SQL Code Camp (less than four weeks away), you'll hear this tip then. But for the rest of you, here's one of the great uses for row_number() in SQL2005.
Suppose you have a table of data, which includes some duplicates. It happened recently for a guy who posted to the SQLDownUnder mailing list. You want to get rid of duplicates, and keep only the first row for each subset of fields. For example, in the mailing list question, the example was that he wanted to filter to a single field, which I will call idfield, which contained duplicates. It's worth noting here that there is no primary key on the table. If there were, life would be much easier.
One option (without row_number()) is to use min() or max() for each field. Something like this:
select idfield, min(field1), max(field2)
from dbo.sometable
group by idfield
But some types, such as guids, cannot be used with min() and max(). So for that, let's try:
select t1.idfield, (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield), max(t1.field2)
from dbo.sometable t1
group by t1.idfield
This will do the trick, but it makes life tough for checking that you get fields from the same row. After all, max(field1) and max(field2) might be from different rows. So you can use top with order by to get the right row.
select t1.idfield, (select top 1 t2.field1 from dbo.sometable t2 where t2.idfield = t1.idfield order by field1, field2, field3, field4), (select top 1 t2.field2 from dbo.sometable t2 where t2.idfield = t1.idfield order by field1, field2, field3, field4)
from dbo.sometable t1
group by t1.idfield
And you can see, we're getting pretty ugly. So let's try with row_number(). If we partition over the field that we want to become unique, then our row_number() field starts again from 1 each time it changes. This is ideal, because if we filter to rows which have row_number() as 1, we get exactly what we want!
select * from
(select *, row_number() over (partition by idfield order by idfield, field1, field2) as rn
from dbo.sometable) t
where t.rn = 1
Lovely. Of course, you wouldn't use "select *", but you get the drift. :)