How to use Perl to identify duplicate files

Updated on March 17, 2014
Welcome to another Perl tutorial!

What you will learn

If you have ever migrated your files from one computer to another, or if you have survived a computer crash and rebuild, or if you have been using a computer long enough, you will understand the frustration of realizing that many of the files on your filesystem are duplicates of other files. How can you know for sure? How can you identify which are duplicates?

This tutorial describes how to take the output of one utility (find) and feed it as an argument into another utility (cksum). The idea is to find all files of a certain type (such as .txt or .jpg) within a certain directory and its subdirectories, then compute a checksum on each file found; two files with the same checksum almost certainly share the same content. Keep in mind this can get out of hand on large filesystems, because every byte of each file encountered must be read to compute the checksum. Caveat discipulus.

Some concepts encountered along the way are

  • how to read command line arguments within a Perl script
  • how to remove newlines from input
  • how to split delimited data into a list of scalars
  • how to store and retrieve a hash of arrays
  • how to use map to modify array arguments

Command line arguments

The Perl interpreter reads its arguments from the command line and stores them in the built-in array @ARGV. In the code listing below, the statement my $type = shift pulls the first argument from @ARGV, because shift called outside a subroutine with no array specified operates on @ARGV. For a much more thorough treatment of command line argument processing, read about Getopt::Long.
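
As a minimal sketch of this behavior, the snippet below fills @ARGV by hand (the values 'jpg' and 'extra' and the helper first_arg are made up for illustration) and shows that a bare shift reads from @ARGV at file level but from @_ inside a subroutine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simulate a command line for this demonstration (hypothetical values);
# normally the interpreter fills @ARGV before the script starts.
@ARGV = ('jpg', 'extra');

# Outside of a subroutine, shift with no argument operates on @ARGV.
my $type = shift;

# Inside a subroutine, the same bare shift operates on @_ instead.
sub first_arg { return shift }
my $inside = first_arg('from the sub');

print "type:      $type\n";
print "remaining: @ARGV\n";
print "inside:    $inside\n";
```

After the shift, @ARGV holds only the arguments that were not consumed.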

Use chomp to trim newline from input

The default behavior of the chomp function is to remove the trailing newline from its input variable. The documentation for chomp reminds me of Perl's overlap with awk functionality. The reference to Perl's $INPUT_RECORD_SEPARATOR recalls the awk built-in variable RS. (Perl behavior can be tuned by tweaking its built-in variables.) The short name of this variable is $/ and its default value is newline. By changing the value of $/ you will change the way chomp behaves.
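
The following sketch, using made-up strings, illustrates both cases: chomp with the default $/ and chomp after $/ has been changed to a semicolon. Wrapping the change in a block with local is one common way to keep it from leaking into the rest of a program.

```perl
use strict;
use warnings;

# Default behavior: $/ is "\n", so chomp removes one trailing newline.
my $line = "photo.jpg\n";
chomp $line;

# Changing $/ changes what chomp removes. local limits the change
# to this block, so the rest of the program is unaffected.
my $record    = "alpha;";
my $untouched = "beta\n";
{
    local $/ = ';';
    chomp $record;       # removes the trailing ";"
    chomp $untouched;    # no trailing ";", so the newline stays
}

print "line=$line record=$record\n";
```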

Use split for delimited input

The split function takes a match pattern, an input string, and a maximum number of return values. Each argument is optional: if the input string is omitted it is taken from the built-in $_ variable; if the pattern is omitted it defaults to a single space (' '), which splits on runs of whitespace like /\s+/ but also ignores leading whitespace; and if the limit is omitted, split returns as many fields as it finds.

As with many built-in Perl functions, the return value of split depends on whether it is called in scalar or list context: in list context it returns the fields, and in scalar context it returns the number of fields. In the listing, the split call sets the limit to three and assigns the result to a list of three scalars. If $cksumOut contains more than three whitespace-separated fields, as happens when a file name contains spaces, everything after the first two fields is assigned to $name unsplit.
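
To make the effect of the limit concrete, here is a small sketch that splits a made-up line in the "checksum size name" layout, where the name contains spaces. The sample values are invented, and the scalar-context call assumes a reasonably modern Perl (5.12 or later).

```perl
use strict;
use warnings;

# A made-up line in "checksum size name" layout; the name contains spaces.
my $cksumOut = "4038471504 75 my holiday photo.jpg";

# With a limit of 3, the third field keeps the rest of the line intact.
my ($cksum, $size, $name) = split /\s+/, $cksumOut, 3;

# Without a limit, the name is broken apart at every run of whitespace.
my @fields = split /\s+/, $cksumOut;

# In scalar context, split returns the number of fields, not the fields.
my $count = split /\s+/, $cksumOut;

print "name with limit: $name\n";
print "fields without limit: $count\n";
```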

Use lists as array, queue, or stack

As mentioned in a previous hub (see "Hash and tally" section) Perl's three main data types are scalar, list, and hash. A list is an ordered set of scalars, and is designated with the @ sigil.

An array can be addressed by index, as in a for or foreach loop. For any array @arr, the corresponding variable $#arr holds the value of the highest defined index.

for (my $i = 0; $i <= $#arr; $i++) { print $arr[$i]; }

A queue holds values in first-in, first-out order. Perl facilitates queue behavior using the push and shift operators on a list. The push operator appends values to the "back" of a list, and the shift operator pulls the values from the "front" of a list.

A stack holds values in a last-in, first-out order, which Perl supports by using the push and pop operators on a list. As before, the push operator appends to the "back". The pop operator also pulls values from the "back" of a list.
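
The three usage patterns can be sketched side by side as follows. The variable names and values are made up for illustration.

```perl
use strict;
use warnings;

# Queue: first in, first out.
my @queue;
push @queue, 'a', 'b', 'c';      # append to the back
my $first_out = shift @queue;    # take from the front

# Stack: last in, first out.
my @stack;
push @stack, 'a', 'b', 'c';      # append to the back
my $last_in = pop @stack;        # take from the back

# Array: visit every remaining element by index.
# $#stack is the last valid index, so the test must be <=.
my @seen;
for (my $i = 0; $i <= $#stack; $i++) {
    push @seen, "$i=$stack[$i]";
}

print "queue gave $first_out, stack gave $last_in, left on stack: @seen\n";
```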

The value at any given index in an array or a hash is a scalar, but that scalar can be a reference to another variable. In the listing, the push statement inside the while loop appends the filename (a scalar) from the cksum operation to an array reference (also a scalar) that in turn is stored as a hash value, keyed by the checksum of the file in question.
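
A minimal sketch of this hash-of-arrays pattern, using invented checksum and file name pairs in place of real cksum output:

```perl
use strict;
use warnings;

# Invented (checksum, filename) pairs standing in for cksum output.
my @pairs = (
    [ '111', 'a.txt' ],
    [ '222', 'b.txt' ],
    [ '111', 'copy of a.txt' ],
);

my %files;
for my $pair (@pairs) {
    my ($cksum, $file) = @$pair;
    # The first push for a given key creates the array reference
    # automatically (autovivification); later pushes append to it.
    push @{ $files{$cksum} }, $file;
}

# Retrieve: dereference the stored array reference with @{ ... }.
print "111 => @{ $files{'111'} }\n";
```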

Perl data structures using lists

Data structure   insert                 retrieve
queue            push @array, $value    $value = shift @array
stack            push @array, $value    $value = pop @array
array            push @array, $value    for ($index = 0; $index <= $#array; $index++) { $value = $array[$index] }

Summarizing the results

When the find operation completes, the read from the $ph handle reaches EOF, which ends the while loop. For good housekeeping, we explicitly close the handle right after the loop.
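
The read-until-EOF pattern can be sketched with a stand-in child process. This example assumes a Unix-like system and, purely for illustration, runs the current perl interpreter (via $^X) as the child instead of find.

```perl
use strict;
use warnings;

# Stand-in child process: run this same perl to print three lines.
# ($^X is the path of the running interpreter; the real script runs find.)
open(my $ph, '-|', $^X, '-e', 'print "one\ntwo\nthree\n"')
    or die "Cannot start child process: $!";

my @lines;
while (my $line = <$ph>) {    # <$ph> returns undef at EOF, ending the loop
    chomp $line;
    push @lines, $line;
}

# Closing a pipe handle waits for the child; its exit status lands in $?.
close($ph) or warn "Child exited with status $?\n";

print scalar(@lines), " lines read\n";
```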

The keys function returns an unordered list of the keys defined in a hash. In the for loop near the end of the listing, each key in the %files hash is stored in the variable $csum for the duration of one iteration. The scalar function forces its argument into scalar context, and an array in scalar context evaluates to its number of elements. In the next unless test, it therefore returns the number of values in the array referenced by $files{$csum}. If there is only one value there, then by definition there is no duplicate.
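
A small sketch of that filtering step, using an invented %files hash. The keys are sorted here only to make the result predictable; the listing itself iterates in whatever order keys returns.

```perl
use strict;
use warnings;

# Invented data: checksum => array reference of file names.
my %files = (
    '111' => [ 'a.txt', 'copy of a.txt' ],
    '222' => [ 'b.txt' ],
);

my @duplicated;
for my $csum (sort keys %files) {
    # An array in scalar context yields its element count.
    my $count = scalar @{ $files{$csum} };
    next unless $count > 1;          # a single entry means no duplicate
    push @duplicated, $csum;
}

print "checksums with duplicates: @duplicated\n";
```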

The final print statement uses join and map to succinctly format the script's output. The join function takes $csum followed by every value in the list returned by the map function and concatenates them, separated by newline (\n). The map function evaluates "\t$_" on every value in the array referenced by $files{$csum}. The end result is that each key of the %files hash, $csum, is printed in the left-most column of the standard output; every value stored in the array reference is prefixed with a tab (\t); and each of these is followed by a newline (\n).
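
The formatting step can be tried in isolation. In this sketch the checksum and file names are invented, and the result is built into a string ($report, a name chosen for illustration) before printing so that it can be inspected.

```perl
use strict;
use warnings;

# Invented sample data for one checksum with two matching files.
my $csum  = '111';
my @names = ('a.txt', 'copy of a.txt');

# map prefixes each name with a tab; join glues the checksum and the
# prefixed names together with newlines; a final newline ends the record.
my $report = join("\n", $csum, map { "\t$_" } @names) . "\n";

print $report;
```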

If you have any questions, comments, or suggestions, please post to the comment section below.

find-duplicate.pl

#!/usr/bin/perl

# Author: Jeff Wilson
# Created: 2014
# License: GPL 3.0 ... no warranty, free to re-use in any way

use warnings;
use strict;

# read a command line argument, complain if it isn't set
my $type = shift || die "Expecting filetype suffix";

my %files;
# search through the file system, beginning in the directory where invoked
open(my $ph, '-|', 'find', '.', '-type', 'f', '-iname', "*.$type") or die "Cannot run find: $!";
while (<$ph>) {
  # grab the line of output from the < > operator, rename to $file
  my $file = $_;
  # eat the newline off the end of the line
  chomp $file;
  # pass this file into the cksum utility directly, so the shell never sees the name
  open(my $ch, '-|', 'cksum', $file) or die "Cannot run cksum: $!";
  chomp (my $cksumOut = <$ch> // '');
  # split the columns by whitespace
  my ($cksum,$size,$name) = split /\s+/,$cksumOut,3;
  # store filenames by their checksum to look for collisions (skip files cksum could not read)
  push @{$files{$cksum}}, $file if defined $cksum;
}
close $ph;

# walk through all the computed checksums
for my $csum (keys %files) {
  # skip any arrays with fewer than 2 entries
  next unless (scalar @{$files{$csum}} > 1);
  # report anything with 2 or more entries
  print join("\n",$csum,map { "\t$_" } @{$files{$csum}}),"\n";
}
Let me know in the comments if you're interested in this vim syntax config

Comments

      Chankey Pathak 

      3 years ago

      Instead of invoking system command you could have used pure Perl solutions. Anyway, thanks for the article!
