Art1Pirat Blog: Perl

Posts mit dem Label Perl werden angezeigt. Alle Posts anzeigen

Mittwoch, 13. November 2019

Improve md5 calculation -- an unexpected journey

In a project it was necessary to calculate the md5 checksums of files as fast as possible. Under Perl5 there is the module Digest::MD5.

The suggested way to use this module is not the fastest. The reason is that the method addfile() does not use the buffer optimally.

In the following I have tested all possible variants: the suggested addfile approach, the buffer optimized, the File::Map based and the system call to 'md5sum' variant:

#!/usr/bin/env perl 
# bench to check how fast is memory mapped access

use strict;
use warnings;
use utf8;
use Benchmark qw(:all) ;
use File::Map qw( map_file);
use Digest::MD5;
use File::Slurp;

sub md5offile_mapped {
    my $fn = shift;
    map_file my $data, $fn, '<';
    my $md5obj = Digest::MD5->new;
    $md5obj->add($data);
    return $md5obj->hexdigest;
}

sub md5offile_orig {
    my $fn = shift;
    my $fh;
    open($fh, '<', $fn) || die ("Can't open '$fn', $!");
    binmode($fh);
    my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
        $atime,$mtime,$ctime,$blksize,$blocks)
    = stat $fh;
    my $buffer;
    my $md5obj = Digest::MD5->new;
    while (read($fh, $buffer, $blksize)) {
        $md5obj->add($buffer);
    }
    close $fh || die ("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

sub md5offile_addfile {
    my $fn = shift;
    my $fh;
    open($fh, '<', $fn) || die ("Can't open '$fn', $!");
    binmode($fh);
    my $md5obj = Digest::MD5->new;
    $md5obj->addfile( $fh );
    close $fh || die ("could not close file '$fn', $!");
    return $md5obj->hexdigest;
}

sub md5offile_md5file {
    my $fn = shift;
    return system("md5sum $fn >/dev/null 2>&1");
}


my $file = shift @ARGV;
read_file($file); # to warm cache

timethese(500, {
        'memory_mapped' => sub{ md5offile_mapped( $file ); },
        'original'      => sub{ md5offile_orig( $file ); },
        'add_file'      => sub{ md5offile_addfile( $file ); },
        'system'        => sub{ md5offile_md5file( $file ); },
;}
);

The variant "memory-mapped" is about 10% faster than the others. Here a result for checksumming a DNG-file with size of 13MB on a NVME device:

Benchmark: timing 500 iterations of add_file, memory_mapped, original, system...
     add_file: 12 wallclock secs (10.68 usr +  1.09 sys = 11.77 CPU) @ 42.48/s (n=500)
memory_mapped: 10 wallclock secs (10.43 usr +  0.23 sys = 10.66 CPU) @ 46.90/s (n=500)
     original: 12 wallclock secs (10.98 usr +  0.95 sys = 11.93 CPU) @ 41.91/s (n=500)
       system: 16 wallclock secs ( 0.13 usr 0.27 sys + 13.97 cusr  1.34 csys = 15.71 CPU) @ 31.83/s (n=500)

Unfortunately there is a problem with large files. The Digest::MD5 probably calculates the values wrong for scalars >1GB (see https://rt.cpan.org/Public/Bug/Display.html?id=123185). In this case, the memory mapped approach should not be used.

Dienstag, 14. Mai 2019

Asciiart via Perl komprimieren

Als kleine Fingerübung habe ich ff. Perl-Script geschrieben, welches zur "Komprimierung" von AsciiArt verwendet werden kann.

Sei die Vorlage:

            /| |\
        ,  ( `-´ )
       /(  _\   /____      []
      |  >(__|9|_____)     ||
       )(    | |  ___  ___ ||
      -==-   | | / _ \/ _ \||
             |8|| (_) |(_) ||
            /   \\___/\___(%%)
           ( ,-. )        |''|
            \| |/         ||||
                          `--'

So erzeugt das Script folgenden Perl-Code:

#!/usr/bin/perl
use v5.10;
say ;
say " "x12,"/| |\\";
say " "x8,",  ( `-´ )";
say " "x7,"/(  _\\   /","_"x4," "x6,"[]";
say " "x6,"|  >(__|9|","_"x5,")"," "x5,"||";
say " "x7,")("," "x4,"| |  ___  ___ ||";
say " "x6,"-==-   | | / _ \\/ _ \\||";
say " "x13,"|8|| (_) |(_) ||";
say " "x12,"/   \\\\___/\\___(%%)";
say " "x11,"( ,-. )"," "x8,"|''|";
say " "x12,"\\| |/"," "x9,"|"x4;
say " "x26,"`--'";
say ;
say ;
say ;1;

Alternativ kann der Fuzzy-Mode eingeschaltet werden, der ff. Code erzeugt:


#!/usr/bin/perl

use v5.10;

used in fuzzy mode

say "\n"," "x12,"/| |\\\n"," "x8,","," "x2,"( `-´ )\n"," "x7,"/("," "x2,"_\\"," "x3,"/","_"x4," "x6,"[]\n"," "x6,"|"," "x2,">(","_"x2,"|9|","_"x5,")"," "x5,"|"x2,"\n"," "x7,")("," "x4,"| |"," "x2,"_"x3," "x2,"_"x3," ","|"x2,"\n"," "x6,"-","="x2,"-"," "x3,"| | / _ \\/ _ \\","|"x2,"\n"," "x13,"|8","|"x2," (_) |(_) ","|"x2,"\n"," "x12,"/"," "x3,"\\"x2,"_"x3,"/\\","_"x3,"(","%"x2,")\n"," "x11,"( ,-. )"," "x8,"|","'"x2,"|\n"," "x12,"\\| |/"," "x9,"|"x4,"\n"," "x26,"`","-"x2,"'\n","\n","\n",;1;

Hier das Script:

#!/usr/bin/env perl 
#===============================================================================
#
#         FILE: compress_asciiart.pl
#
#        USAGE: ./compress_asciiart.pl  
#
#  DESCRIPTION: compresses asciiart (or any text) to small perlcode using RLE
#
#      OPTIONS: -f (enables fuzzing)
# REQUIREMENTS: ---
#         BUGS: ---
#        NOTES: ---
#       AUTHOR: Andreas Romeyke
# ORGANIZATION: 
#      VERSION: 1.0
#      CREATED: 08.04.2019
#     REVISION: ---
#===============================================================================

use strict;
use warnings;
use utf8;

print "#!/usr/bin/perl\n";
print "use v5.10;\n";
my $fuzzy=0;
my $runlength=3;
if (defined $ARGV[0] && $ARGV[0] eq "-f") {
    print STDERR "used in fuzzy mode\n";
    $runlength=1;
    $fuzzy=1;
}

sub subs {
    my $s1 = shift;
    my $s2 = shift;
    my $l = length($s2)+1;
    $s1=~s/\\/\\\\/g;
    $s1=~s/"/\\"/g;
    $s2=~s/\\/\\\\/g;
    $s2=~s/"/\\"/g;
    my $v;
    if ($l > $runlength) {
        return 
        qq(").
        qq($s1).
        "\"x$l,";
    } else {
        return
        qq(").
        qq($s1$s2).
        qq(",)
        ;
    }
}
print "say ";
foreach my $line () {
    chomp $line;
    $line=~s/(.)(\1*)/my $l=length($2);subs($1, $2);/eg;
    if ($fuzzy) {
        $line =~ s/$/\"\\n\",/;
    } else {
        $line =~ s/$/;\nsay /;
    }

    while ($line=~s/"([^"]+)","([^"]+)",/"$1$2",/g) {};
    $line=~s/,;\n/;\n/g;
    print $line;
}
print ";1;";
1;

Wer noch weitere Vorschläge hat, wie man noch eleganteren Code erzeugen kann, immer her damit :)

Mittwoch, 17. Dezember 2014

Perl RegEx mit variable length look-behind um Römische Zahlen in Text bedingt zu ersetzen

Quelle: Wikimedia, Lizenznachweis sh. Link Bild

Das Problem

In einem Text sollen Römische Zahlen durch die selbstdefinierte Asciidoc-Notation roman::number[] ausgezeichnet werden. Asciidoc erlaubt zur Zeit allerdings nicht weitere Markups innerhalb einer Bildauszeichnung.

Sprich:

.Konrad roman::number[I]. von Wettin. Luitgard, Konrads roman::number[I]. Gemahlin.
[caption=""]
image:img/011_Konrad_Luitgard.svg[Konrad I. von Wettin. Luitgard, Konrads I. Gemahlin.]

funktioniert, aber

.Konrad roman::number[I]. von Wettin. Luitgard, Konrads roman::number[I]. Gemahlin.
[caption=""]
image:img/011_Konrad_Luitgard.svg[Konrad roman::number[I]. von Wettin. Luitgard, Konrads roman::number[I]. Gemahlin.]

nicht, da in Asciidoc Tags der Form image:foo.png[baz] innerhalb von baz keine anderen asciidoc-Tags enthalten darf.

Römische Zahlen finden

Wenn wir nach römischen Zahlen suchen, könnten wir Reguläre Ausdrücke (RegEx) benutzen. In Perl sähe eine RegEx dann beispielsweise so aus:

my $roman_number = qr{
(
    (I{1,3})| # I … III
    (I?V)|     # IV … V
    (VI{1,3})| # VI … VIII
    (I?X)      # IX … X
)
}x;

Teststring

Bevor wir weitermachen, sollten wir uns einen Teststring definieren, der alle möglichen Varianten enthält und es uns erlaubt die RegExes zu überprüfen:

my $string=<<TEST;
.Konrad I. von
.Konrad I. von Wettin. Luitgard, Konrads I. Gemahlin.
break::folding[test] .Konrad I. von Wettin. break::folding[test2] Luitgard, Konrads I. Gemahlin.
[caption=""]
image:img/011_Konrad_Luitgard.svg[Konrad I. von Wettin. Luitgard, Konrads I. Gemahlin.]
Im Frühjahr
fertig. Im Frühjahr
mahlin. Im Frühling 1147 nahm er mit vielen der ſächſiſchen Fürſten das
als der Imker
I. Foo
Foo I.
Foo II.
Foo III.
Foo IV.
Foo V.
Foo VII.
Foo VIII.
Foo IX.
Foo X.
Foo XIII.
Foo XIV.
Foo XX.
Foo XV.
BarI.
BarII.
IBaz
Vettel
Xanthippe
Baz IIII.
Baz IIV.
Baz IIIV.
Baz IIX.
Baz XIIII
Baz VV.
Baz VX.
Baz IXIX.
Baz IVX.
Baz IIX.
Baz VIIX.
Baz XIVI.
TEST

Lookbehind und Lookahead

Unsere oben definierte RegEx allein reicht für obiges Beispiel nicht, da wir zB. nicht innerhalb einer []-Klammer römische Zahlen ersetzen wollen. Eine Abhilfe wäre, wenn wir im String beliebig zurückschauen könnten, ob eine Klammer noch offen ist.

In Perl sind zwar beliebig tiefe lookahead-Bedingungen in Regulären Ausdrücken erlaubt, allerdings nicht in lookbehind-Bedingungen.

Sprich, mit (?<=foo)bar würden wir auf bar matchen, wenn vorher der fixe String foo auftaucht. Aber (?<=fo+)bar funktioniert unter Perl5 nicht.

Lookahead hat diese Einschränkungen nicht, wenn ich also auf foo matchen will, aber nur, wenn bar oder baar oder ba…ar folgen, dann kann ich foo(?=ba+r) schreiben.

variable length lookbehind über variable length look ahead

Über einen Post in einem Forum bin ich auf eine Lösung gestolpert: Drehe den String und die RegEx um. Statt:

"foobar" =~ s/(?<=fo+)bar/baz/; #funktioniert nicht in Perl5

also alles umdrehen zu:

"raboof" =~s/rab(?=o+f)/zab/; # zaboof -> reverse zaboof = foobaz

Die Lösung

Zurück zum Problem mit Asciidoc und römischen Zahlen, die Lösung sieht in Gänze nun wie folgt aus:

my $rstring = reverse $string;
my $roman_revregex=qr{
    (?<![IVXa-zſßäöü])(          #lookbehind
    (I{1,3})|            # I ... III
    (I{1,3}V)|           # VI ... VIII
    (VI{0,1})|           # IV ... V
    (XI{0,1})|           # IX .. X
    (I{1,3}X)|           # XI .. XIII
    (VI{0,1}X)|          # XIV ... XV
    (I{1,3}VX)|          # XVI ... XVIII
    (XX)                # XX
    )(?=\ )(?![^\[\]]*\[) # lookahead
    }x;
$rstring=~s#$roman_revregex#\]$1\[rebmun::namor#g;
$string = reverse $rstring;
print $string, "\n\n";

Dies ergibt dann folgenden String:

.Konrad roman::number[I]. von
.Konrad roman::number[I]. von Wettin. Luitgard, Konrads roman::number[I]. Gemahlin.
break::folding[test] .Konrad roman::number[I]. von Wettin. break::folding[test2] Luitgard, Konrads roman::number[I]. Gemahlin.
[caption=""]
image:img/011_Konrad_Luitgard.svg[Konrad I. von Wettin. Luitgard, Konrads I. Gemahlin.]
Im Frühjahr
fertig. Im Frühjahr
mahlin. Im Frühling 1147 nahm er mit vielen der ſächſiſchen Fürſten das
als der Imker
I. Foo
Foo roman::number[I].
Foo roman::number[II].
Foo roman::number[III].
Foo roman::number[IV].
Foo roman::number[V].
Foo roman::number[VII].
Foo roman::number[VIII].
Foo roman::number[IX].
Foo roman::number[X].
Foo roman::number[XIII].
Foo roman::number[XIV].
Foo roman::number[XX].
Foo roman::number[XV].
BarI.
BarII.
IBaz
Vettel
Xanthippe
Baz IIII.
Baz IIV.
Baz IIIV.
Baz IIX.
Baz XIIII
Baz VV.
Baz VX.
Baz IXIX.
Baz IVX.
Baz IIX.
Baz VIIX.
Baz XIVI.