EVOLUTION-MANAGER

Edit File: digest.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Create hash function digests for arbitrary R objects</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>

<table width="100%" summary="page for digest {digest}"><tr><td>digest {digest}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h2>Create hash function digests for arbitrary R objects</h2>

<h3>Description</h3>

<p>The <code>digest</code> function applies a cryptographical hash function to
arbitrary R objects. By default, the objects are internally
serialized, and either one of the currently implemented MD5 and SHA-1
hash functions algorithms can be used to compute a compact digest of
the serialized object.
</p>
<p>In order to compare this implementation with others, serialization of
the input argument can also be turned off in which the input argument
must be a character string for which its digest is returned.
</p>

<h3>Usage</h3>

<pre>
digest(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
       "xxhash32", "xxhash64", "murmur32", "spookyhash"), serialize=TRUE, file=FALSE,
       length=Inf, skip="auto", ascii=FALSE, raw=FALSE, seed=0,
       errormode=c("stop","warn","silent"),
       serializeVersion=.getSerializeVersion())
</pre>

<h3>Arguments</h3>

<table summary="R argblock">
<tr valign="top"><td><code>object</code></td>
<td>
<p>An arbitrary R object which will then be passed to the
<code><a href="../../base/html/serialize.html">serialize</a></code> function, unless the <code>serialize</code>
argument is set to <code>FALSE</code>.</p>
</td></tr>
<tr valign="top"><td><code>algo</code></td>
<td>
<p>The algorithms to be used; currently available choices are
<code>md5</code>, which is also the default, <code>sha1</code>, <code>crc32</code>,
<code>sha256</code>, <code>sha512</code>, <code>xxhash32</code>, <code>xxhash64</code>,
<code>murmur32</code> and <code>spookyhash</code>.</p>
</td></tr>
<tr valign="top"><td><code>serialize</code></td>
<td>
<p>A logical variable indicating whether the object
should be serialized using <code>serialize</code> (in ASCII
form). Setting this to <code>FALSE</code> allows to compare the digest
output of given character strings to known control output. It also
allows the use of raw vectors such as the output of non-ASCII
serialization.
</p>
</td></tr>
<tr valign="top"><td><code>file</code></td>
<td>
<p>A logical variable indicating whether the object is a file
name or a file name if <code>object</code> is not specified.</p>
</td></tr>
<tr valign="top"><td><code>length</code></td>
<td>
<p>Number of characters to process. By default, when
<code>length</code> is set to <code>Inf</code>, the whole string or file is
processed.</p>
</td></tr>
<tr valign="top"><td><code>skip</code></td>
<td>
<p>Number of input bytes to skip before calculating the
digest. Negative values are invalid and currently treated as zero.
Special value <code>"auto"</code> will cause serialization header to be
skipped if <code>serialize</code> is set to <code>TRUE</code> (the serialization
header contains the R version number thus skipping it allows the
comparison of hashes across platforms and some R versions).
</p>
</td></tr>
<tr valign="top"><td><code>ascii</code></td>
<td>
<p>This flag is passed to the <code>serialize</code> function if
<code>serialize</code> is set to <code>TRUE</code>, determining whether the hash
is computed on the ASCII or binary representation.</p>
</td></tr>
<tr valign="top"><td><code>raw</code></td>
<td>
<p>A logical variable with a default value of FALSE, implying
<code>digest</code> returns digest output as ASCII hex values. Set to TRUE
to return <code>digest</code> output in raw (binary) form. Note that this
option is supported by most but not all of the implemented hashing
algorithms</p>
</td></tr>
<tr valign="top"><td><code>seed</code></td>
<td>
<p>an integer to seed the random number generator.  This is only
used in the <code>xxhash32</code>, <code>xxhash64</code> and <code>murmur32</code> functions
and can be used to generate additional hashes for the same input if
desired.</p>
</td></tr>
<tr valign="top"><td><code>errormode</code></td>
<td>
<p>A character value denoting a choice for the behaviour in
the case of error: &lsquo;stop&rsquo; aborts (and is the default value),
&lsquo;warn&rsquo; emits a warning and returns <code>NULL</code> and
&lsquo;silent&rsquo; suppresses the error and returns an empty string.</p>
</td></tr>
<tr valign="top"><td><code>serializeVersion</code></td>
<td>
<p>An integer value specifying the internal
version of the serialization format, with 2 being the default;
see <code><a href="../../base/html/serialize.html">serialize</a></code> for details. The <code>serializeVersion</code>
field of <code><a href="../../base/html/options.html">option</a></code> can also be used to set a different
value.</p>
</td></tr>
</table>

<h3>Details</h3>

<p>Cryptographic hash functions are well researched and documented. The
MD5 algorithm by Ron Rivest is specified in RFC 1321. The SHA-1
algorithm is specified in FIPS-180-1, SHA-2 is described in
FIPS-180-2.
</p>
<p>For md5, sha-1 and sha-256, this R implementation relies on standalone
implementations in C by Christophe Devine. For crc32, code from the
zlib library by Jean-loup Gailly and Mark Adler is used.
</p>
<p>For sha-512, a standalone implementation from Aaron Gifford is used.
</p>
<p>For xxhash32 and xxhash64, the reference implementation by Yann Collet is used.
</p>
<p>For murmur32, the progressive implementation by Shane Day is used.
</p>
<p>For spookyhash, the original source code by Bob Jenkins is used. The R implementation
that integrates R's serialization directly with the algorithm allowing for
memory-efficient incremental calculation of the hash is by Gabe Becker.
</p>
<p>Please note that this package is not meant to be used for
cryptographic purposes for which more comprehensive (and widely
tested) libraries such as OpenSSL should be used. Also, it is known
that crc32 is not collision-proof. For sha-1, recent results indicate
certain cryptographic weaknesses as well. For more details, see for example
<a href="http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html">http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html</a>.
</p>

<h3>Value</h3>

<p>The <code>digest</code> function returns a character string of a fixed
length containing the requested digest of the supplied R object. This
string is of length 32 for MD5; of length 40 for SHA-1; of length 8
for CRC32 a string; of length 8 for for xxhash32; of length 16 for
xxhash64; and of length 8 for murmur32.
</p>

<h3>Change Management</h3>

<p>Version 0.6.16 of digest corrects an error in which <code>crc32</code> was not
guaranteeing an eight-character return. We now pad with zero to always
return eight characters. Should the previous behaviour be required,
set <code>option("digestOldCRC32Format"=TRUE)</code> and the output will be
consistent with prior version (but not be consistentnly eight characters).
</p>

<h3>Author(s)</h3>

<p>Dirk Eddelbuettel <a href="mailto:edd@debian.org">edd@debian.org</a> for the <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> interface;
Antoine Lucas for the integration of crc32; Jarek Tuszynski for the
file-based operations; Henrik Bengtsson and Simon Urbanek for improved
serialization patches; Christophe Devine for the hash function
implementations for sha-1, sha-256 and md5; Jean-Loup Gailly and Mark Adler
for crc32; Hannes Muehleisen for the integration of sha-512; Jim Hester for
the integration of xxhash32, xxhash64 and murmur32; Kendon Bell for
the integration of spookyhash using Gabe Becker's R package fastdigest.</p>

<h3>References</h3>

<p>MD5: <a href="http://www.ietf.org/rfc/rfc1321.txt">http://www.ietf.org/rfc/rfc1321.txt</a>.
</p>
<p>SHA-1: <a href="https://en.wikipedia.org/wiki/SHA-1">https://en.wikipedia.org/wiki/SHA-1</a>.
SHA-256: <a href="https://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf">https://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf</a>.
CRC32:  The original reference webpage at <code>rocksoft.com</code> has
vanished from the web; see
<a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">https://en.wikipedia.org/wiki/Cyclic_redundancy_check</a> for
general information on CRC algorithms.
</p>
<p><a href="https://www.aarongifford.com/computers/sha.html">https://www.aarongifford.com/computers/sha.html</a> for the
integrated C implementation of sha-512.
</p>
<p>The page for the code underlying the C functions used here for sha-1
and md5, and further references, is no longer accessible.  Please see
<a href="https://en.wikipedia.org/wiki/SHA-1">https://en.wikipedia.org/wiki/SHA-1</a> and
<a href="https://en.wikipedia.org/wiki/MD5">https://en.wikipedia.org/wiki/MD5</a>.
</p>
<p><a href="https://zlib.net">https://zlib.net</a> for documentation on the zlib library which
supplied the code for crc32.
</p>
<p><a href="https://en.wikipedia.org/wiki/SHA_hash_functions">https://en.wikipedia.org/wiki/SHA_hash_functions</a> for
documentation on the sha functions.
</p>
<p><a href="https://github.com/Cyan4973/xxHash">https://github.com/Cyan4973/xxHash</a> for documentation on the xxHash
functions.
</p>
<p><a href="https://github.com/aappleby/smhasher">https://github.com/aappleby/smhasher</a> for documentation on MurmurHash.
</p>
<p><a href="https://burtleburtle.net/bob/hash/spooky.html">https://burtleburtle.net/bob/hash/spooky.html</a> for the original source code of SpookyHash.
</p>

<p><code><a href="../../base/html/serialize.html">serialize</a></code>, <code><a href="../../tools/html/md5sum.html">md5sum</a></code></p>

<h3>Examples</h3>

<pre>

## Standard RFC 1321 test vectors
md5Input &lt;-
  c("",
    "a",
    "abc",
    "message digest",
    "abcdefghijklmnopqrstuvwxyz",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
    paste("12345678901234567890123456789012345678901234567890123456789012",
          "345678901234567890", sep=""))
md5Output &lt;-
  c("d41d8cd98f00b204e9800998ecf8427e",
    "0cc175b9c0f1b6a831c399e269772661",
    "900150983cd24fb0d6963f7d28e17f72",
    "f96b697d7cb7938d525a2f31aaf161d0",
    "c3fcd3d76192e4007dfb496cca67e13b",
    "d174ab98d277d9f5a5611c2c9f419d9f",
    "57edf4a22be3c955ac49da2e2107b67a")

for (i in seq(along=md5Input)) {
  md5 &lt;- digest(md5Input[i], serialize=FALSE)
  stopifnot(identical(md5, md5Output[i]))
}

sha1Input &lt;-
  c("abc", "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq")
sha1Output &lt;-
  c("a9993e364706816aba3e25717850c26c9cd0d89d",
    "84983e441c3bd26ebaae4aa1f95129e5e54670f1")

for (i in seq(along=sha1Input)) {
  sha1 &lt;- digest(sha1Input[i], algo="sha1", serialize=FALSE)
  stopifnot(identical(sha1, sha1Output[i]))
}

crc32Input &lt;-
  c("abc",
    "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq")
crc32Output &lt;-
  c("352441c2",
    "171a3f5f")

for (i in seq(along=crc32Input)) {
  crc32 &lt;- digest(crc32Input[i], algo="crc32", serialize=FALSE)
  stopifnot(identical(crc32, crc32Output[i]))
}

sha256Input &lt;-
  c("abc",
    "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq")
sha256Output &lt;-
  c("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
    "248d6a61d20638b8e5c026930c3e6039a33ce45964ff2167f6ecedd419db06c1")

for (i in seq(along=sha256Input)) {
  sha256 &lt;- digest(sha256Input[i], algo="sha256", serialize=FALSE)
  stopifnot(identical(sha256, sha256Output[i]))
}

# SHA 512 example
sha512Input &lt;-
  c("abc",
    "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq")
sha512Output &lt;-
  c(paste("ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a",
          "2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f",
          sep=""),
    paste("204a8fc6dda82f0a0ced7beb8e08a41657c16ef468b228a8279be331a703c335",
          "96fd15c13b1b07f9aa1d3bea57789ca031ad85c7a71dd70354ec631238ca3445",
          sep=""))

for (i in seq(along=sha512Input)) {
  sha512 &lt;- digest(sha512Input[i], algo="sha512", serialize=FALSE)
  stopifnot(identical(sha512, sha512Output[i]))
}

## xxhash32 example
xxhash32Input &lt;-
    c("abc",
      "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq",
      "")
xxhash32Output &lt;-
    c("32d153ff",
      "89ea60c3",
      "02cc5d05")

for (i in seq(along=xxhash32Input)) {
    xxhash32 &lt;- digest(xxhash32Input[i], algo="xxhash32", serialize=FALSE)
    cat(xxhash32, "\n")
    stopifnot(identical(xxhash32, xxhash32Output[i]))
}

## xxhash64 example
xxhash64Input &lt;-
    c("abc",
      "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq",
      "")
xxhash64Output &lt;-
    c("44bc2cf5ad770999",
      "f06103773e8585df",
      "ef46db3751d8e999")

for (i in seq(along=xxhash64Input)) {
    xxhash64 &lt;- digest(xxhash64Input[i], algo="xxhash64", serialize=FALSE)
    cat(xxhash64, "\n")
    stopifnot(identical(xxhash64, xxhash64Output[i]))
}

## these outputs were calculated using mmh3 python package
murmur32Input &lt;-
    c("abc",
      "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq",
      "")
murmur32Output &lt;-
    c("b3dd93fa",
      "ee925b90",
      "00000000")

for (i in seq(along=murmur32Input)) {
    murmur32 &lt;- digest(murmur32Input[i], algo="murmur32", serialize=FALSE)
    cat(murmur32, "\n")
    stopifnot(identical(murmur32, murmur32Output[i]))
}

## these outputs were calculated using spooky python package
spookyInput &lt;-
    c("a",
      "abc",
      "message digest")
spookyOutput &lt;-
    c("bdc9bba09181101a922a4161f0584275",
      "67c93775f715ab8ab01178caf86713c6",
      "9630c2a55c0987a0db44434f9d67a192")

for (i in seq(along=spookyInput)) {
    # skip = 30 skips the serialization header and just hashes the strings
    spooky &lt;- digest(spookyInput[i], algo="spookyhash", skip = 30)
    cat(spooky, "\n")
    stopifnot(identical(spooky, spookyOutput[i]))
}

# example of a digest of a standard R list structure
digest(list(LETTERS, data.frame(a=letters[1:5], b=matrix(1:10,ncol=2))))

# test 'length' parameter and file input
fname &lt;- file.path(R.home(),"COPYING")
x &lt;- readChar(fname, file.info(fname)$size) # read file
for (alg in c("sha1", "md5", "crc32")) {
  # partial file
  h1 &lt;- digest(x    , length=18000, algo=alg, serialize=FALSE)
  h2 &lt;- digest(fname, length=18000, algo=alg, serialize=FALSE, file=TRUE)
  h3 &lt;- digest( substr(x,1,18000) , algo=alg, serialize=FALSE)
  stopifnot( identical(h1,h2), identical(h1,h3) )
  # whole file
  h1 &lt;- digest(x    , algo=alg, serialize=FALSE)
  h2 &lt;- digest(fname, algo=alg, serialize=FALSE, file=TRUE)
  stopifnot( identical(h1,h2) )
}

# compare md5 algorithm to other tools
library(tools)
fname &lt;- file.path(R.home(),"COPYING")
h1 &lt;- as.character(md5sum(fname))
h2 &lt;- digest(fname, algo="md5", file=TRUE)
stopifnot( identical(h1,h2) )

## digest is _designed_ to return one has summary per object to for a desired
## For vectorised output see digest::getVDigest() which provides
## better performance than base::Vectorize()

md5 &lt;- getVDigest()
v &lt;- md5(1:5)     			# digest integers 1 to 5
stopifnot(identical(v[1], digest(1L)),	# check first and third result
          identical(v[3], digest(3L)))

</pre>

<hr /><div style="text-align: center;">[Package <em>digest</em> version 0.6.25 <a href="00Index.html">Index</a>]</div>
</body></html>