Wednesday, 13 November 2019

Improve md5 calculation -- an unexpected journey

In one project I needed to calculate the MD5 checksums of files as fast as possible. For Perl 5, the Digest::MD5 module does this.

The suggested way to use this module is not the fastest. A likely reason is that its addfile() method does not use the read buffer optimally.

Below I benchmark four variants: the suggested addfile() approach, a buffer-optimized read loop, a File::Map-based (memory-mapped) variant, and a system call to 'md5sum':


#!/usr/bin/env perl
# benchmark: how fast is memory-mapped access compared to buffered reads?

use strict;
use warnings;
use utf8;
use Benchmark qw(:all) ;
use File::Map qw( map_file);
use Digest::MD5;
use File::Slurp;

sub md5offile_mapped {
    my $fn = shift;
    map_file my $data, $fn, '<';
    my $md5obj = Digest::MD5->new;
    $md5obj->add($data);
    return $md5obj->hexdigest;
}

sub md5offile_orig {
    my $fn = shift;
    open(my $fh, '<', $fn) || die ("Can't open '$fn', $!");
    binmode($fh);
    # read in chunks of the preferred I/O block size (element 11 of stat);
    # the 4096-byte fallback is an assumption for systems that do not report it
    my $blksize = (stat $fh)[11] || 4096;
    my $buffer;
    my $md5obj = Digest::MD5->new;
    while (read($fh, $buffer, $blksize)) {
        $md5obj->add($buffer);
    }
    # parentheses are required: 'close $fh || die' parses as 'close($fh || die)'
    close($fh) || die ("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

sub md5offile_addfile {
    my $fn = shift;
    open(my $fh, '<', $fn) || die ("Can't open '$fn', $!");
    binmode($fh);
    my $md5obj = Digest::MD5->new;
    $md5obj->addfile( $fh );
    # parentheses are required: 'close $fh || die' parses as 'close($fh || die)'
    close($fh) || die ("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

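sub md5offile_md5file {
    my $fn = shift;
    # only the run time matters here: the digest is discarded and the exit
    # status of md5sum is returned; \Q...\E escapes shell metacharacters
    return system("md5sum \Q$fn\E >/dev/null 2>&1");
}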


my $file = shift(@ARGV) // die "Usage: $0 FILE\n";
read_file($file); # read the file once to warm the cache

timethese(500, {
        'memory_mapped' => sub{ md5offile_mapped( $file ); },
        'original'      => sub{ md5offile_orig( $file ); },
        'add_file'      => sub{ md5offile_addfile( $file ); },
        'system'        => sub{ md5offile_md5file( $file ); },
    }
);




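The "memory_mapped" variant is about 10% faster than the two other in-process variants (46.90/s versus 42.48/s and 41.91/s) and roughly 47% faster than the call to 'md5sum' (31.83/s), which also pays for starting a shell and an external process on every iteration. In this run the buffer-optimized read loop ("original") is no faster than addfile(). Here is the result for checksumming a 13 MB DNG file on an NVMe device: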

Benchmark: timing 500 iterations of add_file, memory_mapped, original, system...
     add_file: 12 wallclock secs (10.68 usr +  1.09 sys = 11.77 CPU) @ 42.48/s (n=500)
memory_mapped: 10 wallclock secs (10.43 usr +  0.23 sys = 10.66 CPU) @ 46.90/s (n=500)
     original: 12 wallclock secs (10.98 usr +  0.95 sys = 11.93 CPU) @ 41.91/s (n=500)
       system: 16 wallclock secs ( 0.13 usr 0.27 sys + 13.97 cusr  1.34 csys = 15.71 CPU) @ 31.83/s (n=500)
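
Relative differences such as the roughly 10% mentioned above can also be read directly from Benchmark's output: besides timethese(), the module exports cmpthese(), which prints a table of rates and pairwise percentage differences. The following is only a minimal sketch and not the script used for the numbers above. The helper names md5_addfile, md5_chunked and compare_variants are hypothetical, and the two buffer sizes are arbitrary values chosen for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Digest::MD5;

# Hypothetical helper: digest via addfile(), as in the 'add_file' variant.
sub md5_addfile {
    my ($fn) = @_;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh) or die "Can't close '$fn': $!";
    return $hex;
}

# Hypothetical helper: digest via a read loop with a caller-chosen buffer size.
sub md5_chunked {
    my ($fn, $bufsize) = @_;
    open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
    binmode($fh);
    my $md5 = Digest::MD5->new;
    my $buffer;
    while (read($fh, $buffer, $bufsize)) {
        $md5->add($buffer);
    }
    close($fh) or die "Can't close '$fn': $!";
    return $md5->hexdigest;
}

# Runs each variant $count times and prints a comparison table.
# The buffer sizes (64 KiB, 1 MiB) are assumptions, not tuned values.
sub compare_variants {
    my ($fn, $count) = @_;
    return cmpthese($count, {
        add_file  => sub { md5_addfile($fn) },
        chunk_64k => sub { md5_chunked($fn, 64 * 1024) },
        chunk_1m  => sub { md5_chunked($fn, 1024 * 1024) },
    });
}
```

A call such as compare_variants($ARGV[0], 500) would mirror the iteration count used above. The results depend on the file size, the storage device and the state of the cache, so they are only comparable within one run.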


Unfortunately, there is a problem with large files: Digest::MD5 apparently computes wrong digests for scalars larger than 1 GB (see https://rt.cpan.org/Public/Bug/Display.html?id=123185). For files of that size, the memory-mapped approach should not be used.
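
One way to act on this is to pick the variant by file size: map small files and fall back to addfile() for large ones. The following is a minimal sketch under stated assumptions, not a tested fix for the linked bug. The function name md5_of_file and the constant MAP_LIMIT are hypothetical, and the 1 GiB cut-off is simply taken from the size mentioned above. Empty files are routed to addfile() as well, so nothing is assumed about how File::Map treats them.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Map qw(map_file);

# Assumption: files of this size or larger are never handed to add() as
# one scalar. The value follows the ">1GB" remark above; adjust as needed.
use constant MAP_LIMIT => 1024 ** 3;

# Hypothetical helper: returns the hex MD5 digest of the file $fn.
# The optional second argument overrides the size limit (useful for tests).
sub md5_of_file {
    my ($fn, $limit) = @_;
    $limit //= MAP_LIMIT;

    my $size = -s $fn;
    die "Can't determine size of '$fn': $!" unless defined $size;

    my $md5 = Digest::MD5->new;
    if ($size > 0 && $size < $limit) {
        # small file: map it read-only and digest it in one call
        map_file my $data, $fn, '<';
        $md5->add($data);
    }
    else {
        # large (or empty) file: let Digest::MD5 read from the handle
        open(my $fh, '<', $fn) or die "Can't open '$fn': $!";
        binmode($fh);
        $md5->addfile($fh);
        close($fh) or die "Can't close '$fn': $!";
    }
    return $md5->hexdigest;
}
```

With this, md5_of_file($file) keeps the speed advantage of mapping for typical files, while files at or above the limit take the slower addfile() path.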