Python's built-in hash() is not deterministic across runs. The output of the hash
function, in particular for str and bytes objects, is not guaranteed to be the same
across different Python versions, platforms, or executions of the same program.
Let's take a look at the following example:
$ python -c "print(hash('foo'))"
-677362727710324010
$ python -c "print(hash('foo'))"
2165398033220216763
$ python -c "print(hash('foo'))"
5782774651590270115
As you can see, the output of the hash function is different for the same input
"foo". This is not a bug, but a feature of Python 3.3 and above. Since Python 3.3,
hash randomization is enabled by default as a security feature to prevent
attackers from exploiting hash collisions for denial-of-service attacks.
Every time you start a Python program, a random value is generated and used to
salt the hash values. The salt stays fixed for the lifetime of the process, so
hash values are consistent within a single Python run, but they differ across
different Python runs.
You could disable hash randomization by setting the environment variable
PYTHONHASHSEED to 0, but this is not recommended.
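To see the effect of PYTHONHASHSEED without touching your own shell, the sketch below starts a fresh interpreter for each call and compares the results. The helper name hash_in_fresh_interpreter is made up for this illustration, and the sketch assumes Python 3.7 or newer (for capture_output).

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed=None):
    """Run hash(text) in a new Python process and return the result as an int.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    # Drop any inherited setting so the default (randomized) behaviour applies.
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed seed, every new process produces the same value.
fixed = {hash_in_fresh_interpreter("foo", seed=0) for _ in range(3)}
# Without a seed, each process picks its own random salt.
randomized = {hash_in_fresh_interpreter("foo") for _ in range(3)}

print("distinct values with PYTHONHASHSEED=0:", len(fixed))
print("distinct values with randomization:", len(randomized))
```

With the seed fixed, the first count should be 1. The second count will almost always be 3, since each process draws a new salt.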
If you want to hash arbitrary objects deterministically, you can use the
ubelt or joblib.hashing modules.
Here’s an example of using ubelt:
import ubelt as ub
print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))
Result:
$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
$ python -c "import ubelt as ub; print(ub.hash_data('foo', hasher='md5', base='abc', convert=False))"
blhtggyvbuyhspdolqxdrhoajdka
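If you would rather not add a dependency, a similar result can be approximated with the standard library alone. The following is a minimal sketch, not a replacement for ubelt or joblib: it only handles JSON-serializable data, and the function name stable_hash is an assumption made for this example.

```python
import hashlib
import json


def stable_hash(obj):
    """Return a hex digest that does not change between Python runs.

    Illustrative helper; works only for JSON-serializable objects.
    """
    # sort_keys and fixed separators produce one canonical text form,
    # so two dicts with the same items hash the same regardless of order.
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(stable_hash("foo"))
# Key order does not affect the digest.
print(stable_hash({"b": 2, "a": 1}) == stable_hash({"a": 1, "b": 2}))
```

Because hashlib digests depend only on the input bytes, the printed digest is identical on every run, unlike the built-in hash().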
References