neurotools.jobs.cache module
Functions related to disk caching (memoization)
- neurotools.jobs.cache.get_source(f)[source]
Extracts and returns the source code of a function (if it exists).
- Parameters:
f (function) – Function for which to extract source code
- Returns:
String containing the source code of the passed function
- Return type:
str
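A minimal behavioral sketch using only the standard library's inspect module (get_source_sketch is a hypothetical stand-in; the actual implementation may differ):

```python
import inspect
import textwrap

def get_source_sketch(f):
    # Return the source code of f, or None if it cannot be retrieved
    # (e.g. for built-ins or functions defined interactively).
    try:
        return inspect.getsource(f)
    except (OSError, TypeError):
        return None

# Works for any function whose source file is available:
src = get_source_sketch(textwrap.dedent)
```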
- neurotools.jobs.cache.function_hash_no_subroutines(f)[source]
See function_hash_with_subroutines. This hash value is based on the:
- Undecorated source code
- Docstring
- Function name
- Module name
- Function argument specification
This function cannot detect changes in function behavior that result from changes in subroutines, global variables, or closures over mutable objects.
- Parameters:
f (function) – Function for which to generate a hash value
- Returns:
Hash value that depends on the function. Hash is constructed such that changes in function source code and some dependencies will also generate a different hash.
- Return type:
str
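A minimal sketch of such a hash, combining the listed properties with hashlib (sha224 is an assumption here; the library may use a different digest):

```python
import hashlib
import inspect
import json  # a stdlib function is used only as a demo target

def hash_no_subroutines_sketch(f):
    # Fold together the properties listed above: undecorated source,
    # docstring, name, module, and argument specification.
    parts = (
        inspect.getsource(f),
        str(f.__doc__),
        f.__name__,
        f.__module__,
        str(inspect.signature(f)),
    )
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()

h = hash_no_subroutines_sketch(json.dumps)
```

Any edit to the source, docstring, or argspec of the hashed function changes the digest, which is what invalidates stale cache folders.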
- neurotools.jobs.cache.function_signature(f)[source]
Generates a string identifying the cache folder for function f.
We want to cache results to disk. However, these cached results become invalid if the source code changes. This is hard to detect accurately in Python. Cache entries can also become invalid if the behavior of subroutines changes. To address this, the cache folder name includes a hash that depends on the function's
- module,
- name,
- argspec,
- source, and
- file.
If any of these change, the cache folder will as well. This reduces the chances of retrieving stale or invalid cached results.
- Parameters:
f (function)
- Returns:
name+’.’+code
- Return type:
str
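The returned folder name has the form name+'.'+code. A simplified sketch (hashing only a subset of the properties listed above; truncating to 10 characters is illustrative, not the library's actual choice):

```python
import hashlib
import inspect
import json

def function_signature_sketch(f):
    # Identify the function by module, name, and argspec, and append a
    # short hash so changed definitions map to different cache folders.
    ident = "\n".join([f.__module__, f.__name__, str(inspect.signature(f))])
    code = hashlib.sha224(ident.encode("utf-8")).hexdigest()[:10]
    return f.__name__ + "." + code

folder = function_signature_sketch(json.dumps)
```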
- neurotools.jobs.cache.signature_to_file_string(f, sig, mode='repr', compressed=True, base64encode=True, truncate=True)[source]
Converts an argument signature to a string if possible.
This can be used to store cached results in a human-readable format. Alternatively, we may want to encode the value of the argument signature in a string that is compatible with most file systems.
This does not append the file extension.
Reasonable restrictions for compatibility:
- No more than 4096 characters in the path string
- No more than 255 characters in the file string
- For Windows compatibility, try to limit the total path length to 260 characters
- These characters should be avoided: \/<>:"|?*,@#={}'&`!%$ and ASCII 0..31
The easiest way to avoid problematic characters without restricting the input is to re-encode as base 64.
The following modes are supported:
- repr: Uses repr and ast.literal_eval(node_or_string) to serialize the argument signature. This is safe, but restricts the types permitted as parameters.
- json: Uses json to serialize the argument signature. Argument signatures cannot be uniquely recovered, because tuples and lists both map to lists in the json representation. Restricting the types used in the argument signature may circumvent this.
- pickle: Uses pickle to serialize the argument signature. This should uniquely store argument signatures that can be recovered, but takes more space. Use this with caution, since changes to the pickle serialization protocol between versions will make the encoded data irretrievable.
- human: Attempts a human-readable format. Experimental.
Compression is on by default. Signatures are base64 encoded by default.
- Parameters:
f (str) – Function being called
sig – Cleaned-up function arguments created by neurotools.jobs.ndecorator.argument_signature(). A tuple of:
- args: A tuple consisting of a list of (argument_name, argument_value) tuples.
- vargs: A tuple containing extra variable arguments ("varargs"), if any.
mode (str; default 'repr') – Can be 'repr', 'json', 'pickle', or 'human'.
compressed (boolean; default True) – Compress the resulting signature using zlib?
base64encode (boolean; default True) – Base-64 encode the resulting signature?
truncate (boolean; default True) – Truncate file names that are too long? This will discard data, but the truncated signature may still serve as an identifier with a low collision probability.
- Returns:
filename
- Return type:
str
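A sketch of the 'repr' mode pipeline (repr, then zlib, then filesystem-safe base64) together with its inverse; the exact encoding choices here are assumptions, not the library's implementation:

```python
import ast
import base64
import zlib

def encode_signature(sig):
    # repr -> compress -> filesystem-safe base64
    raw = repr(sig).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")

def decode_signature(name):
    # Inverse: base64 decode -> decompress -> ast.literal_eval
    raw = zlib.decompress(base64.urlsafe_b64decode(name))
    return ast.literal_eval(raw.decode("utf-8"))

sig = ((("x", 1), ("y", (2.0, "three"))), ())
name = encode_signature(sig)
```

The URL-safe base64 alphabet avoids every problematic filesystem character listed above, and ast.literal_eval makes the decode safe for untrusted file names.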
- neurotools.jobs.cache.file_string_to_signature(filename, mode='repr', compressed=True, base64encode=True)[source]
Extracts the argument key from the compressed representation in a cache filename entry. Inverse of signature_to_file_string().
The filename should be provided as a string, without the file extension.
The following modes are supported:
- repr: Uses repr and ast.literal_eval(node_or_string) to serialize the argument signature. This is safe, but restricts the types permitted as parameters.
- json: Uses json to serialize the argument signature. Argument signatures cannot be uniquely recovered, because tuples and lists both map to lists in the json representation. Restricting the types used in the argument signature may circumvent this.
- pickle: Uses pickle to serialize the argument signature. This should uniquely store argument signatures that can be recovered, but takes more space. Use this with caution, since changes to the pickle serialization protocol between versions will make the encoded data irretrievable.
- human: Attempts a human-readable format. Experimental.
Compression is on by default. Signatures are base64 encoded by default.
- Parameters:
filename (str) – Encoded filename, as a string, without the file extension
mode (str; default 'repr') – Can be 'repr', 'json', 'pickle', or 'human'.
compressed (boolean; default True) – Whether zlib was used to compress this function call signature
base64encode (boolean; default True) – Whether this function call signature was base-64 encoded.
- Returns:
sig – Function arguments created by neurotools.jobs.ndecorator.argument_signature(). A tuple of:
- args: A tuple consisting of a list of (argument_name, argument_value) tuples.
- vargs: A tuple containing extra variable arguments ("varargs"), if any.
- Return type:
nested tuple
- neurotools.jobs.cache.human_encode(sig)[source]
Formats an argument signature for saving as a file name
- Parameters:
sig (nested tuple) – Argument signature as a safe nested tuple
- Returns:
result – Human-readable argument-signature filename
- Return type:
str
- neurotools.jobs.cache.human_decode(key)[source]
Decodes a human-readable argument-signature file name back into a nested tuple
- Parameters:
key (str) – Human-readable argument-signature filename
- Returns:
sig – Argument signature as a nested tuple
- Return type:
nested tuple
- neurotools.jobs.cache.get_cache_path(cache_root, f, *args, **kwargs)[source]
Locate the directory path for function f within the __neurotools_cache__ path cache_root.
- Parameters:
cache_root (str) – Path to root of the __neurotools__ cache
f (function) – Cached function object
- Returns:
path
- Return type:
str
- neurotools.jobs.cache.locate_cached(cache_root, f, method, *args, **kwargs)[source]
Locate a specific cache entry within cache_root for function f cached with method method, and called with arguments *args and keyword arguments **kwargs.
- Parameters:
cache_root (str) – directory/path as string
f (function) – Function being cached
method (str) – Cache file extension, e.g. "npy", "mat", etc.
args (iterable) – function parameters
kwargs (dict) – function keyword arguments
- Returns:
fn (str) – File name of cache entry without extension
sig (tuple) – Tuple of (args,kwargs) info from
argument_signature()
path (str) – Directory containing cache file
filename (str) – File name with extension
location (str) – Full absolute path to cache entry
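The returned values relate to each other roughly as follows (a sketch of how the pieces compose, with a hypothetical directory layout):

```python
import os

def assemble_cache_entry(path, fn, method):
    # fn is the extensionless cache file name; method is the extension.
    filename = fn + "." + method
    location = os.path.join(path, filename)
    return filename, location

filename, location = assemble_cache_entry(
    "/tmp/cache/somefunction.abc123", "sigkey", "npy")
```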
- neurotools.jobs.cache.validate_for_matfile(x)[source]
Verify that the nested tuple x, which contains the arguments to a function call, can be safely stored in a Matlab matfile (.mat).
Numpy types (these should be compatible):
- bool: Boolean (True or False) stored as a byte
- int8: Byte (-128 to 127)
- int16: Integer (-32768 to 32767)
- int32: Integer (-2147483648 to 2147483647)
- int64: Integer (-9223372036854775808 to 9223372036854775807)
- uint8: Unsigned integer (0 to 255)
- uint16: Unsigned integer (0 to 65535)
- uint32: Unsigned integer (0 to 4294967295)
- uint64: Unsigned integer (0 to 18446744073709551615)
- float16: Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32: Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64: Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex64: Complex number, represented by two float32
- complex128: Complex number, represented by two float64
- Parameters:
x (nested tuple) – Arguments to a function
- Return type:
boolean
- neurotools.jobs.cache.validate_for_numpy(x)[source]
Check whether an array-like object can safely be stored in a numpy archive.
Numpy types (these should be compatible):
- bool: Boolean (True or False) stored as a byte
- int8: Byte (-128 to 127)
- int16: Integer (-32768 to 32767)
- int32: Integer (-2147483648 to 2147483647)
- int64: Integer (-9223372036854775808 to 9223372036854775807)
- uint8: Unsigned integer (0 to 255)
- uint16: Unsigned integer (0 to 65535)
- uint32: Unsigned integer (0 to 4294967295)
- uint64: Unsigned integer (0 to 18446744073709551615)
- float16: Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32: Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64: Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex64: Complex number, represented by two float32
- complex128: Complex number, represented by two float64
- Parameters:
x (object) – array-like object
- Returns:
True if the data in x can be safely stored in a Numpy archive
- Return type:
boolean
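A recursive validity check along these lines could look like the following sketch (the accepted-type set is an approximation of the table above, not the library's exact rule):

```python
import numpy as np

def validate_sketch(x):
    # Arrays are safe if their dtype kind is boolean, signed/unsigned
    # integer, float, or complex; containers are safe if all elements are.
    if isinstance(x, np.ndarray):
        return x.dtype.kind in "biufc"
    if isinstance(x, (tuple, list)):
        return all(validate_sketch(e) for e in x)
    return isinstance(x, (bool, int, float, complex,
                          np.bool_, np.integer, np.floating,
                          np.complexfloating))

ok = validate_sketch((np.zeros(3, dtype=np.float32), (1, 2.5)))
bad = validate_sketch(({"a": 1},))  # dicts are not representable
```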
- neurotools.jobs.cache.disk_cacher(cache_location, method='npy', write_back=True, skip_fast=False, verbose=False, allow_mutable_bindings=False, cache_identifier='__neurotools_cache__')[source]
Decorator to memoize functions to disk. A currying pattern is used here, in which cache_location creates decorators.
write_back:
- True: Default. Computed results are saved to disk.
- False: Computed results are not saved to disk. In this case of hierarchical caches mapped to the filesystem, a background rsync loop can handle asynchronous write-back.
method:
- p: Use pickle to store the cache. Can serialize all objects but is seriously slow! May not get ANY speedup due to the time costs of pickling and disk IO.
- mat: Use scipy.io.savemat and scipy.io.loadmat. Nice because it's compatible with Matlab. Unfortunately, it can only store numpy types and data that can be converted to numpy types. Data conversion may alter the types of the return arguments when retrieved from the cache.
- npy: Use built-in numpy.save functionality.
- hdf5: Not yet implemented.
- Parameters:
cache_location (str) – Path to disk cache
method (str; default 'npy') – Storage format for caches. Can be 'pickle', 'mat' or 'npy'
write_back (boolean; default True) – Whether to copy new cache values back to the disk cache. If false, then previously cached values can be read but new entries will not be created
skip_fast (boolean; default False) – Attempt to simply re-compute values which are taking too long to retrieve from the cache. Experimental, do not use.
verbose (boolean; default False) – Whether to print detailed logging information
allow_mutable_bindings (boolean; default False) – Whether to allow caching of functions that close over mutable scope. Such functions are more likely to return different results for the same arguments, leading to invalid cached values.
cache_identifier (str; default '__neurotools_cache__') – Subdirectory name for the disk cache.
- Returns:
cached – TODO
- Return type:
disk cacher object
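The currying pattern can be sketched as follows with a pickle-backed cache; the key construction and file layout here are illustrative assumptions, not the library's actual scheme:

```python
import functools
import hashlib
import os
import pickle
import tempfile

def disk_cacher_sketch(cache_location):
    # Currying: cache_location -> decorator -> disk-cached function.
    def decorator(f):
        folder = os.path.join(cache_location, f.__name__)
        os.makedirs(folder, exist_ok=True)
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = hashlib.sha224(
                repr((args, sorted(kwargs.items()))).encode()).hexdigest()
            path = os.path.join(folder, key + ".p")
            if os.path.exists(path):      # cache hit: read from disk
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = f(*args, **kwargs)   # cache miss: compute
            with open(path, "wb") as fh:  # write back to disk
                pickle.dump(result, fh)
            return result
        return wrapped
    return decorator

calls = []
root = tempfile.mkdtemp()

@disk_cacher_sketch(root)
def slow_add(a, b):
    calls.append(1)
    return a + b

r1 = slow_add(2, 3)
r2 = slow_add(2, 3)  # served from disk; the body is not re-run
```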
- neurotools.jobs.cache.hierarchical_cacher(fast_to_slow, method='npy', write_back=True, verbose=False, allow_mutable_bindings=False, cache_identifier='neurotools_cache')[source]
Construct a filesystem cache defined in terms of a hierarchy from faster to slower (fallback) caches.
- Parameters:
fast_to_slow (tuple of strings) – List of filesystem paths for disk caches, in order from the fast (default or main) cache to slower ones.
method (string; default 'npy') – cache storing method
write_back (bool; default True) – whether to automatically copy newly computed cache values to the slower caches
verbose (bool; default False) – whether to print detailed logging information to standard out when manipulating the cache
allow_mutable_bindings (bool; default False) – If true, then "unsafe" namespace bindings, for example user-defined functions, will be allowed in disk cached functions. If a cached function calls subroutines, and those subroutines change, the disk cacher cannot detect the implementation difference. Consequently, it cannot tell whether old cached values are invalid.
cache_identifier (str; default 'neurotools_cache') – (sub)folder name to store cached results
- Returns:
hierarchical – A hierarchical disk-caching decorator that can be used to memoize functions to the specified disk caching hierarchy.
- Return type:
decorator
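The fast-to-slow fallback logic can be sketched with plain dictionaries standing in for the disk caches (a simplified model of the behavior, not the library's code):

```python
def hierarchical_lookup(caches, key, compute, write_back=True):
    # Search caches from fastest to slowest; on a hit, optionally copy
    # the value into all faster caches. On a miss, compute and store.
    for i, cache in enumerate(caches):
        if key in cache:
            value = cache[key]
            if write_back:
                for faster in caches[:i]:
                    faster[key] = value
            return value
    value = compute()
    for cache in (caches if write_back else caches[:1]):
        cache[key] = value
    return value

fast, slow = {}, {"x": 42}
v = hierarchical_lookup([fast, slow], "x", lambda: 0)
```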
- neurotools.jobs.cache.scan_cachedir(cachedir, method='npy', verbose=False, **kw)[source]
Retrieve all entries in cachedir, unpacking their encoded arguments.
- Parameters:
cachedir (str) – Cache directory to scan, e.g. __neurotools_cache__/…/…/…/somefunction
method (str; default 'npy') – Can be 'npy' or 'mat'
verbose (boolean; default False)
**kw – Forwarded to file_string_to_signature(); see file_string_to_signature() for details.
- Returns:
A filename -> (args, varargs) dictionary, where args is a parameter_name -> value dictionary and varargs is a list of extra arguments, if any.
- Return type:
dict
- neurotools.jobs.cache.base64hash(obj)[source]
Retrieve a base-64 encoded hash for an object. This uses the built-in encode function to convert an object to utf-8, then calls .sha224(obj).digest() to create a hash, finally packaging the result in base-64.
- Parameters:
obj (object)
- Returns:
code
- Return type:
str
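The pipeline described above can be sketched as follows (using str() for the initial object-to-text step, which is an assumption about the implementation):

```python
import base64
import hashlib

def base64hash_sketch(obj):
    # str(obj) -> utf-8 bytes -> sha224 digest -> base-64 text
    digest = hashlib.sha224(str(obj).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

code = base64hash_sketch(("x", 1))
```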
- neurotools.jobs.cache.base64hash10bytes(obj)[source]
Retrieve the first two bytes of a base-64 encoded hash for an object.
- Parameters:
obj (object)
- Returns:
code
- Return type:
str
- neurotools.jobs.cache.function_hash_with_subroutines(f, force=False)[source]
Functions may change if their subroutines change. This function computes a hash value that is sensitive to changes in the source code, docstring, argument specification, name, module, and subroutines.
This is a recursive procedure with a fair amount of overhead. To allow for the possibility of mutual recursion, subroutines are excluded from the hash if the function has already been visited.
This does not use the built-in hash function for functions in python.
Ongoing development notes
Is memoization possible? Making memoization compatible with graceful handling of potentially complex mutually recurrent call structures is tricky. Each function generates a call tree, which does not expand a node if it is already present in the call tree structure. Therefore there are many possible hash values for an intermediate function depending on how far its call tree gets expanded, which depends on what has been expanded and encountered so far. Therefore, we cannot cache these intermediate values.
Note: the topology of a mutually recurrent call structure cannot change without changing the source code of at least one function in the call graph. So it suffices to (1) hash the subroutines, (2) expand the call graph (potentially excluding standard and system library functions), (3) grab the non-recursive hash for each of these functions, and (4) then generate the subroutine-dependent hash by combining the non-recursive hash with the hash of a data structure representing the subroutine "profile" obtained from the call graph.
We assume that any decorators wrapping the function do not modify its computation, and can safely be stripped.
Note that this function cannot detect changes in effective function behavior that result from changes in global variables or mutable scope that has been closed over.
- Parameters:
force (boolean) – force must be true; otherwise this function will fail with a warning.
- Returns:
Hash of function
- Return type:
str
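One plausible sketch of the recursive procedure, expanding the call graph through global names referenced in a function's bytecode (co_names); this is an illustration of the idea, not the library's actual traversal:

```python
import hashlib
import inspect
import types

def hash_with_subroutines(f, seen=None):
    # Recursively fold subroutine hashes into f's own hash, skipping
    # functions already visited to tolerate mutual recursion.
    seen = set() if seen is None else seen
    seen.add(f)
    parts = [str(f.__module__), f.__name__, str(f.__doc__)]
    try:
        parts.append(inspect.getsource(f))
    except (OSError, TypeError):
        pass  # source unavailable, e.g. interactive definitions
    # Expand the call graph via global names referenced in the bytecode.
    for name in f.__code__.co_names:
        g = f.__globals__.get(name)
        if isinstance(g, types.FunctionType) and g not in seen:
            parts.append(hash_with_subroutines(g, seen))
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()

def helper(x):
    return x * 2

def main(x):
    return helper(x) + 1  # main's hash now depends on helper's

h = hash_with_subroutines(main)
```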
- neurotools.jobs.cache.combine_caches(cache_root, f)[source]
Merge all cache folders for function f by copying cache files into the current cache folder.
Usually, the existence of multiple cache folders indicates that cache files were generated using versions of f with different source code. However, you may want to merge caches if you are certain that such code changes did not change the function's behavior.
- Parameters:
cache_root (str) – path to the top-level cache directory
f (function) – cached function to merge
- neurotools.jobs.cache.exists(cache_root, f, method, *args, **kwargs)[source]
Check if a cached result for f(*args,**kwargs) of type method exists in cache cache_root.
- Parameters:
cache_root (str) – directory/path as string
f (function) – Function being cached
method (str) – Cache file extension, e.g. "npy", "mat", etc.
args (iterable) – function parameters
kwargs (dict) – function keyword arguments
- Returns:
True if the cache file exists
- Return type:
boolean