How to use Perl to identify duplicate files

Welcome to another Perl tutorial!

What you will learn

If you have ever migrated your files from one computer to another, or if you have survived a computer crash and rebuild, or if you have been using a computer long enough, you will understand the frustration of realizing that many of the files on your filesystem are duplicates of other files. How can you know for sure? How can you identify which are duplicates?

This tutorial describes how to take the output of one utility (find) and feed it as an argument into another utility (cksum). The idea is to find all files of a certain type (such as .txt or .jpg) within a certain directory and its subdirectories, then compute a checksum on each file found to determine whether two files share the same content. Keep in mind this can get out of hand on large filesystems, because every byte of each file encountered must be read to compute the checksum. Caveat discipulus.

Some concepts encountered along the way are

  • how to read command line arguments within a Perl script
  • how to remove newlines from input
  • how to split delimited data into a list of scalars
  • how to store and retrieve a hash of arrays
  • how to use map to modify array arguments

Command line arguments

The Perl interpreter reads its arguments from the command line and stores them in the built-in array @ARGV. In the code listing below, the statement that begins my $type = shift pulls the first argument from @ARGV: when shift is called outside a subroutine with no array specified, it operates on @ARGV. For a much more thorough treatment of command line argument processing, read about Getopt::Long.
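As a minimal sketch of that default, the snippet below fills @ARGV by hand to stand in for a real command line. The argument values (jpg and extra) are made up for illustration and do not come from the script.

```perl
use strict;
use warnings;

# Simulate "perl script.pl jpg extra"; these argument values are hypothetical.
@ARGV = ('jpg', 'extra');

# Outside of a subroutine, shift with no argument operates on @ARGV.
my $type = shift;

# @ARGV now holds only the arguments that were not shifted off.
my $remaining = scalar @ARGV;

print "type=$type, remaining=$remaining\n";
```

Run as written, this prints type=jpg, remaining=1.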

Use chomp to trim newline from input

By default, the chomp function removes the trailing newline from its argument. The documentation for chomp reminds me of Perl's overlap with awk functionality: its reference to Perl's $INPUT_RECORD_SEPARATOR recalls the awk built-in variable RS. (Perl behavior can be tuned by tweaking its built-in variables.) The short name of this variable is $/ and its default value is newline; the long name is available only after use English. By changing the value of $/ you change what chomp removes.
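The following sketch shows chomp under the default $/ and under a temporarily changed one. The sample strings and the choice of a semicolon as separator are assumptions picked only for illustration.

```perl
use strict;
use warnings;

# Default behavior: $/ is "\n", so chomp removes one trailing newline.
my $line = "photo.jpg\n";
chomp $line;

# Change $/ inside a block with local so the change does not leak out.
my $record = "photo.jpg;";
{
    local $/ = ';';
    chomp $record;    # now chomp removes a trailing ";" instead
}

# chomp returns the number of characters it removed.
my $untouched = "photo.jpg";
my $removed   = chomp($untouched);    # nothing to remove here

print "$line $record $removed\n";
```

Using local keeps the modified $/ confined to the enclosing block, which matters because $/ also controls how the < > operator splits input into lines.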

Use split for delimited input

The split function takes a match pattern, an input string, and a maximum number of return values. All three are optional. If the string is omitted, split reads from the built-in $_ variable. If the pattern is omitted, it defaults to a single space (' '), which behaves like /\s+/ except that leading whitespace is skipped first. If the limit is omitted, split returns as many fields as it finds, minus any trailing empty ones.

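As with many built-in Perl functions, the return value of split depends on whether it is called in scalar or list context: in list context it returns the fields, and in scalar context it returns the number of fields. In the statement that assigns to ($cksum,$size,$name), the limit is set to three and the return value is assigned to a list of three scalars. In this case, if $cksumOut contains more than three whitespace-separated fields, the first two go to $cksum and $size, and the entire remainder of the string, embedded spaces included, goes to $name.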
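To see why the limit matters, here is a small sketch that splits one line shaped like cksum output (checksum, size, name). The checksum, size, and path are invented sample values, not real cksum output.

```perl
use strict;
use warnings;

# Hypothetical line in the "checksum size name" layout; the values are made up.
my $cksumOut = "4038471504 75 ./my holiday photos/beach 1.jpg";

# With a limit of 3, everything after the second field stays in $name.
my ($cksum, $size, $name) = split /\s+/, $cksumOut, 3;

# Without a limit, the spaces inside the file name produce extra fields.
my @all = split /\s+/, $cksumOut;

print "name with limit: $name\n";
print "fields without limit: ", scalar(@all), "\n";
```

With the limit, $name keeps the full path; without it, the same line breaks into six fields.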

Use lists as array, queue, or stack

As mentioned in a previous hub (see "Hash and tally" section), Perl's three main data types are scalar, array, and hash. An array holds an ordered list of scalars, and is designated with the @ sigil.

An array can be addressed by index, as in a for or foreach loop. For any array @arr, the corresponding variable $#arr holds the index of the last element, so the loop condition must use <= to reach that element.

for (my $i=0; $i <= $#arr; $i++) { print $arr[$i]; }

A queue holds values in first-in, first-out order. Perl facilitates queue behavior using the push and shift operators on a list. The push operator appends values to the "back" of a list, and the shift operator pulls the values from the "front" of a list.

A stack holds values in a last-in, first-out order, which Perl supports by using the push and pop operators on a list. As before, the push operator appends to the "back". The pop operator also pulls values from the "back" of a list.
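A short sketch of both behaviors, using the same three values so the difference in removal order is easy to see. The variable names here are illustrative only.

```perl
use strict;
use warnings;

# Queue: push onto the back, shift off the front (first in, first out).
my @queue;
push @queue, 'a', 'b', 'c';
my $first_out = shift @queue;    # removes 'a'

# Stack: push onto the back, pop off the back (last in, first out).
my @stack;
push @stack, 'a', 'b', 'c';
my $last_in = pop @stack;        # removes 'c'

print "queue gave $first_out, leaving @queue\n";
print "stack gave $last_in, leaving @stack\n";
```

The same array can serve either role; only the choice of shift or pop on removal differs.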

The value at any given index in an array or a hash is a scalar, but that scalar can be a reference to another variable. In the push statement inside the while loop, the push operator appends the filename (a scalar) from the find operation onto an array reference (also a scalar) that in turn is stored as a hash value, keyed by the checksum of the file in question.
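The hash-of-arrays pattern can be sketched without touching the filesystem. The checksums and paths below are placeholder values chosen for illustration.

```perl
use strict;
use warnings;

my %files;

# Hypothetical (checksum, filename) pairs standing in for cksum results.
my @pairs = (
    [ '111', './a.txt' ],
    [ '222', './b.txt' ],
    [ '111', './copy/a.txt' ],
);

for my $pair (@pairs) {
    my ($cksum, $file) = @$pair;
    # The array reference is created automatically on first use
    # (autovivification), so no explicit initialization is needed.
    push @{ $files{$cksum} }, $file;
}

# Retrieve: dereference the stored array reference.
my $count  = scalar @{ $files{'111'} };
my $second = $files{'111'}[1];

print "checksum 111 has $count files; the second is $second\n";
```

Two files share the checksum 111 in this sample, so that key ends up holding a two-element array.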

Perl data structures using lists

Data structure | insert | retrieve
queue | push @array, $value | $value = shift @array
stack | push @array, $value | $value = pop @array
array | push @array, $value | for ($index = 0; $index <= $#array; $index++) { $value = $array[$index] }

Summarizing the results

After the find operation completes, the child process closes its end of the pipe, the <$ph> read in the while loop returns undef at end of file, and the loop ends at its closing brace. For good housekeeping, we explicitly close the handle on the line after the loop.
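Here is a minimal sketch of reading from a child process until end of file. To avoid depending on find, the child is the running Perl interpreter itself ($^X) printing two fixed lines; that stand-in command is an assumption for illustration, and the list form of open shown here is assumed to be running on a Unix-like system.

```perl
use strict;
use warnings;

# Start a child process and read its standard output through $ph.
# $^X is the path of the perl currently running this script.
open(my $ph, '-|', $^X, '-e', 'print "one\ntwo\n"')
    or die "Cannot start child process: $!";

my @lines;
# <$ph> returns undef once the child has closed its output, ending the loop.
while (my $line = <$ph>) {
    chomp $line;
    push @lines, $line;
}

# close on a pipe handle waits for the child and reports a non-zero exit as failure.
close($ph) or warn "Child exited with status $?\n";

print scalar(@lines), " lines read\n";
```

The list form of open passes each argument to the child directly, with no shell in between, so characters such as quotes or spaces in an argument need no escaping.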

The keys function returns the keys of a hash in no particular order. In the for loop that follows the while loop, each key in the %files hash is stored in the variable $csum for the duration of one iteration. The scalar function forces its argument into scalar context, and an array in scalar context yields its element count. The effect in the next unless statement is that it returns the number of values in the array referenced by $files{$csum}. If there's only one value there, then by definition there is no duplicate.

The remaining print statement uses join and map to succinctly format the script's output. The join function takes every value in the list returned by the map function and concatenates those values, separated by newline (\n). The map function evaluates "\t$_" on every value in the array referenced by $files{$csum}. The end result is that each key of the %files hash, $csum, is printed in the left-most column of the standard output; every value stored in the array reference is preceded by a tab (\t); and each of these is followed by a newline (\n).
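The reporting step can be tried on its own with a hand-built hash. The checksums and paths are placeholder values, and the sort on the keys is an addition made here only so the output order is predictable.

```perl
use strict;
use warnings;

# Hypothetical results: checksum 111 has a duplicate, checksum 222 does not.
my %files = (
    '111' => [ './a.txt', './copy/a.txt' ],
    '222' => [ './b.txt' ],
);

my $report = '';
for my $csum (sort keys %files) {
    # An array in numeric comparison is in scalar context: its element count.
    next unless @{ $files{$csum} } > 1;
    # The checksum goes first, then each file name prefixed with a tab.
    $report .= join("\n", $csum, map { "\t$_" } @{ $files{$csum} }) . "\n";
}

print $report;
```

Only the checksum with two files is reported: the line 111 followed by two tab-indented paths.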

If you have any questions, comments, or suggestions, please post to the comment section below.

Quick review

find-duplicate.pl

#!/usr/bin/perl

# Author: Jeff Wilson
# Created: 2014
# License: GPL 3.0 ... no warranty, free to re-use in any way

use warnings;
use strict;

# read a command line argument, complain if it isn't set
my $type = shift || die "Expecting filetype suffix";

my %files;
# search through the file system, beginning in the directory where invoked
open(my $ph, '-|', 'find', '.', '-type', 'f', '-iname', "*.$type") or die "Cannot run find: $!";
while (<$ph>) {
  # grab the line of output from the < > operator, rename to $file
  my $file = $_;
  # eat the newline off the end of the line
  chomp $file;
  my $cksumOut;
  # pass this file into the cksum utility, chomp the output
  chomp ($cksumOut = `cksum "$file"`);
  # split the columns by whitespace
  my ($cksum,$size,$name) = split /\s+/,$cksumOut,3;
  # store filenames by their checksum to look for collisions
  push @{$files{$cksum}}, $file;
}
close $ph;

# walk through all the computed checksums
for my $csum (keys %files) {
  # skip any arrays with less than 2 entries
  next unless (scalar @{$files{$csum}} > 1);
  # report anything with 2 or more entries
  print join("\n",$csum,map { "\t$_" } @{$files{$csum}}),"\n";
}
Let me know in the comments if you're interested in this vim syntax config


Comments 1 comment

Chankey Pathak 21 months ago

Instead of invoking system command you could have used pure Perl solutions. Anyway, thanks for the article!
