In one project I needed to calculate the MD5 checksums of files as fast as possible. For Perl 5 there is the module
Digest::MD5.
The suggested way to use this module is not the fastest, because the addfile() method does not use its read buffer optimally.
Below I benchmark four variants: the suggested addfile() approach, a buffer-optimized read loop, a
File::Map-based (memory-mapped) variant, and a system call to 'md5sum':
#!/usr/bin/env perl
# Benchmark: how fast is memory-mapped access compared to the other variants?
use strict;
use warnings;
use utf8;
use Benchmark qw(:all);
use File::Map qw(map_file);
use Digest::MD5;
use File::Slurp;

# Variant 1: map the file into memory and hash it with a single add() call
sub md5offile_mapped {
    my $fn = shift;
    map_file my $data, $fn, '<';
    my $md5obj = Digest::MD5->new;
    $md5obj->add($data);
    return $md5obj->hexdigest;
}

# Variant 2: read the file in chunks of the filesystem's preferred block size
sub md5offile_orig {
    my $fn = shift;
    open(my $fh, '<', $fn) || die("Can't open '$fn', $!");
    binmode($fh);
    # Element 11 of stat() is blksize; fall back to 64 KiB where it is unavailable
    my $blksize = (stat $fh)[11] || 65536;
    my $buffer;
    my $md5obj = Digest::MD5->new;
    while (read($fh, $buffer, $blksize)) {
        $md5obj->add($buffer);
    }
    # Parentheses are required: "close $fh || die" parses as close($fh || die ...)
    close($fh) || die("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

# Variant 3: the addfile() approach suggested by the Digest::MD5 documentation
sub md5offile_addfile {
    my $fn = shift;
    open(my $fh, '<', $fn) || die("Can't open '$fn', $!");
    binmode($fh);
    my $md5obj = Digest::MD5->new;
    $md5obj->addfile($fh);
    close($fh) || die("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

# Variant 4: call the external 'md5sum' (returns the exit status, not the digest)
sub md5offile_md5file {
    my $fn = shift;
    # Single-quote the filename so that spaces and shell metacharacters are safe
    (my $quoted = $fn) =~ s/'/'\\''/g;
    return system("md5sum '$quoted' >/dev/null 2>&1");
}

my $file = shift @ARGV;
die "usage: $0 FILE\n" unless defined $file;
read_file($file); # to warm cache
timethese(500, {
    'memory_mapped' => sub { md5offile_mapped($file); },
    'original'      => sub { md5offile_orig($file); },
    'add_file'      => sub { md5offile_addfile($file); },
    'system'        => sub { md5offile_md5file($file); },
});
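The variants are only comparable if they produce the same digest, so a quick check before benchmarking can be useful. The following is a minimal sketch using only core modules. The helper names md5_chunked and md5_addfile, the generated sample file and the chunk sizes are assumptions for illustration and are not part of the benchmark script.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical helper: hash a file by reading it in chunks of $chunk_size bytes.
sub md5_chunked {
    my ($fn, $chunk_size) = @_;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $md5 = Digest::MD5->new;
    my $buffer;
    while (1) {
        my $n = read($fh, $buffer, $chunk_size);
        die "Read error on '$fn': $!" unless defined $n;
        last if $n == 0;    # end of file
        $md5->add($buffer);
    }
    close($fh) or die "Could not close '$fn': $!";
    return $md5->hexdigest;
}

# Hypothetical helper: reference digest via addfile().
sub md5_addfile {
    my $fn = shift;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh) or die "Could not close '$fn': $!";
    return $digest;
}

# Sample file with roughly 1 MB of arbitrary bytes (size chosen arbitrarily).
my ($tmp_fh, $tmp_fn) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} pack('C*', map { $_ % 256 } 1 .. 1_000_003);
close($tmp_fh) or die "Could not close '$tmp_fn': $!";

my $reference = md5_addfile($tmp_fn);
my @sizes     = (4096, 65536, 1 << 20);
my @mismatch  = grep { md5_chunked($tmp_fn, $_) ne $reference } @sizes;

print @mismatch
    ? "MISMATCH for chunk sizes: @mismatch\n"
    : "all chunk sizes agree: $reference\n";
```

The chunk size only affects speed, never the digest, so any mismatch here would point to a bug in the read loop rather than in Digest::MD5.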
The "memory_mapped" variant is about 10% faster than the two other in-process variants, and the external 'md5sum' call is the slowest. Here is the result for checksumming a 13 MB DNG file on an NVMe device:
Benchmark: timing 500 iterations of add_file, memory_mapped, original, system...
add_file: 12 wallclock secs (10.68 usr + 1.09 sys = 11.77 CPU) @ 42.48/s (n=500)
memory_mapped: 10 wallclock secs (10.43 usr + 0.23 sys = 10.66 CPU) @ 46.90/s (n=500)
original: 12 wallclock secs (10.98 usr + 0.95 sys = 11.93 CPU) @ 41.91/s (n=500)
system: 16 wallclock secs ( 0.13 usr 0.27 sys + 13.97 cusr 1.34 csys = 15.71 CPU) @ 31.83/s (n=500)
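The relative differences do not have to be worked out by hand: Benchmark can print them as a table via cmpthese(), which accepts the hash reference returned by timethese(). The following is a minimal sketch of that call. The in-memory buffer, the variant names one_call and in_chunks and the iteration count are stand-ins chosen for illustration, not the file-based variants measured above.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use Digest::MD5 qw(md5_hex);

# Stand-in workload: 2 MB of in-memory data instead of a real file.
my $data  = 'x' x (2 * 1024 * 1024);
my $chunk = 65536;

# Style 'none' suppresses the per-variant timing lines;
# only the comparison table is wanted here.
my $results = timethese(200, {
    one_call  => sub { md5_hex($data) },
    in_chunks => sub {
        my $md5 = Digest::MD5->new;
        for (my $off = 0; $off < length $data; $off += $chunk) {
            $md5->add(substr($data, $off, $chunk));
        }
        $md5->hexdigest;
    },
}, 'none');

# Prints the rate of each variant and the pairwise differences in percent.
my $rows = cmpthese($results);
```

In the benchmark script the same effect could be had by passing the return value of its timethese() call to cmpthese().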
Unfortunately, there is a problem with large files: Digest::MD5 appears to calculate wrong values for scalars larger than 1 GB (see
https://rt.cpan.org/Public/Bug/Display.html?id=123185). For such files, the memory-mapped approach should not be used.
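One way to act on this is to choose the variant by file size. The following is a minimal sketch under stated assumptions: the wrapper md5_of_file and the 512 MB threshold are hypothetical, and whether that threshold is safe with respect to the linked bug report has not been verified here. File::Map is loaded at run time, so the sketch falls back to addfile() where the module is not installed.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Assumed threshold: stay well below the 1 GB scalar size mentioned above.
my $MAP_LIMIT = 512 * 1024 * 1024;

# File::Map is not a core module; use it only if it can be loaded.
my $HAVE_FILE_MAP = eval { require File::Map; 1 } ? 1 : 0;

# Hypothetical wrapper: memory-map small files, use addfile() otherwise.
sub md5_of_file {
    my ($fn, $limit) = @_;
    $limit = $MAP_LIMIT unless defined $limit;
    my $size = -s $fn;
    die "Can't stat '$fn': $!" unless defined $size;
    my $md5 = Digest::MD5->new;

    if ($HAVE_FILE_MAP && $size > 0 && $size <= $limit) {
        # The mapping is released when $data goes out of scope.
        File::Map::map_file(my $data, $fn, '<');
        $md5->add($data);
    }
    else {
        open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
        binmode($fh);
        $md5->addfile($fh);
        close($fh) or die "Could not close '$fn': $!";
    }
    return $md5->hexdigest;
}

# Usage: a limit of 0 forces the addfile() path for comparison.
my $content = "hello world\n" x 1000;
my ($tmp_fh, $tmp_fn) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} $content;
close($tmp_fh) or die "Could not close '$tmp_fn': $!";

my $via_default  = md5_of_file($tmp_fn);
my $via_fallback = md5_of_file($tmp_fn, 0);
print "default:  $via_default\n";
print "fallback: $via_fallback\n";
```

Both calls should print the same digest. Large files lose the speed advantage of the mapped variant, but the result no longer depends on how Digest::MD5 handles very large scalars.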