How to Find Duplicate Records with Microsoft Access

Microsoft Access is, and always will be, the Mack Daddy of desktop database applications.
Microsoft Access is, and always will be, the Mack Daddy of desktop database applications. | Source

Definition of a Duplicate

The hardest part about finding duplicate records is knowing what the definition of "Duplicate" really is for your situation. For example, suppose you have a contacts list with the following structure:

tblContacts

ContactID - Autonumber
LastName - Text 25
FirstName - Text 25
MI - Text 1
Phone - Text 11

Duplicate records could be any one of the following scenarios:

1. A person with two Phone numbers where one number is obsolete.
2. A person with two Phone numbers where both are valid phone numbers
3. Two people have the same name, but a different phone number


Duplicate Quiz

Quiz Explanation

Whether or not you took the quiz, the answer is: Only 1 scenario.

Here is a breakdown of the scenarios and whether or not they are a duplicate:

1. A person with two Phone numbers where one number is obsolete.
True: A person with an obsolete phone number is a duplicate, because the former phone number does not contain any value in the database and the obsolete record needs to be deleted.

2. A person with two Phone numbers where both are valid phone numbers
False: In this table example, unfortunately, the only way to represent a person with multiple valid phone numbers is to enter their information into more than one record. Thus, even if the records were found as a duplicate, there is nothing that can be done to remedy it.

3. Two people have the same name, but a different phone number
False: Just like Scenario 2, with the current structure, two people with the exact same name would not be a duplicate. They are two valid entries with two valid phone number. Without another field, such as Drivers License number, to differentiate one from another, then there is no further action that one can be taken.

More Realistic Example

Knowing that the simple table structure above is not realistic, fleshing it out will give it a little more usefulness for when you attempt to find a duplicate. Assume the new table structure is the following:

tblContacts
ContactID
Company
FirstName
LastName
MI
PhoneOffice
PhoneMobile

With this structure, a person can have two phone numbers, and if there are two people with the same name, then the Company field can be used to help differentiate. (Hopefully you can see that this doesn't really solve all problems, because a person can change companies, get a new mobile phone, or get married and change their last name. The possibilities are completely left to interpretation.)

Let's suppose that the business rules that define our duplicates are:

1. Any two records with the same FirstName, LastName, and Mobile phone
2. To resolve issues found by #1, observe the company name.

Based on these business rules, a query can be created to allow for further inspection.

We're off to see the Wizard

Since the untimely death of Clippy the Paperclip, the great sorcerer known as "The Wizard" has been a permanent fixture of many of the Office programs. They can be used for a wide variety of functionality, and creating queries is a well-used feature of them.

Depending on which version you have, triggering the query wizard can happen different ways. For whatever reason, this machine has Access 2007, and the Wizard is found on the ribbon under:

Create (tab) / Query Wizard ('other' section) / Find Duplicates Query Wizard

The Find Duplicates Query Wizard will prompt you for information about your business rules, and then use them to create a query that can retrieve the information based on the rules. (Despite the name, it can't magically understand your data and determine a definition of a duplicate.)

Question 1: Which table or query do you want to search for duplicate field values?
Notice that it allows either a table or a query. This is important, as you may find that to solve your duplicate problems, you may first need to write a query just to get to the point of being able to write this query. Perhaps you need to find duplicates within a certain date range. Writing a "pre-query", as only I have ever called it, will get you off on the correct foot.

In our case, select your contacts table (if you made the one above.)

Question 2: Which fields might contain the duplicate information?
Based on our business rules, we need to add First Name, Last Name and Mobile Phone.

Question 3: Do you want the query to show fields in addition to those with duplicate values?
Despite the fact that you cannot answer this Yes/No question, the way you do answer it is by adding any additional fields that can help you determine if the records is truly a duplicate. In this example, select the company name.

Question 4: What do you want to name your query?
Give it a name based on your naming convention. qryContacts_FindDupes.

Click Finish to view the results. (Obviously, if you haven't added any data to support this example, then the results will be empty.)

Find Duplicates SQL Breakdown

If you're like me, I don't care for magic, I want to know what the sparkling wand created. So, if you open the query in the SQL view, you will see something like the following:

SELECT tblContacts.FirstName, tblContacts.LastName, tblContacts.PhoneMobile, tblContacts.Company FROM tblContacts
WHERE (((tblContacts.FirstName) In

(
SELECT [FirstName] FROM [tblContacts] As Tmp
GROUP BY [FirstName],[LastName],[PhoneMobile]
HAVING Count(*)>1 And
[LastName] = [tblContacts].[LastName] And
[PhoneMobile] = [tblContacts].[PhoneMobile]
)
)
)

ORDER BY tblContacts.FirstName, tblContacts.LastName, tblContacts.PhoneMobile;

This is what's called a Correlated Subquery, whereas the values in a subquery are compared against values in the main query. In this case, the subquery finds any records where Contacts grouped by FirstName, LastName, and PhoneMobile return more than 1 record. HAVING is a where clause for when a GROUP BY is used. It means after the group by is applied, then find values that match this condition.

Duplicate Resolution Summary

According to the 80/20 Rule, after you've defined your business rules for duplication, and gone through the effort to find those that qualify, you're only 80% of the way there. The remaining 20% of the work starts with defining a methodology for remediating the found duplicates. (For example, can records simply be deleted, or is there logic involved to merge two, or more, records into only one.)

Armed with that plan, you can either employ a programmer to devise a clever way to deal with the records, or hire some minimum-wage temporary employees to slog through fixing each record manually.

More by this Author


Comments

No comments yet.

    Sign in or sign up and post using a HubPages Network account.

    0 of 8192 characters used
    Post Comment

    No HTML is allowed in comments, but URLs will be hyperlinked. Comments are not for promoting your articles or other sites.


    Click to Rate This Article
    working