neurotools.jobs.cache module
Functions related to disk caching (memoization)
- neurotools.jobs.cache.get_source(f)[source]
Extracts and returns the source code of a function (if it exists).
- Parameters:
f (function) – Function for which to extract source code
- Returns:
String containing the source code of the passed function
- Return type:
str
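A minimal behavioral sketch using only the standard library's inspect module (get_source_sketch is a hypothetical stand-in; the actual implementation may differ):

```python
import inspect
import textwrap

def get_source_sketch(f):
    # Return the source code of f, or None if it cannot be retrieved
    # (e.g. for built-ins or functions defined interactively).
    try:
        return inspect.getsource(f)
    except (OSError, TypeError):
        return None

# Works for any function whose source file is available:
src = get_source_sketch(textwrap.dedent)
```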
- neurotools.jobs.cache.function_hash_no_subroutines(f)[source]
See function_hash_with_subroutines. This hash value is based on the:
- Undecorated source code
- Docstring
- Function name
- Module name
- Function argument specification
This function cannot detect changes in function behavior that result from changes in subroutines, global variables, or closures over mutable objects.
- Parameters:
f (function) – Function for which to generate a hash value
- Returns:
Hash value that depends on the function. Hash is constructed such that changes in function source code and some dependencies will also generate a different hash.
- Return type:
str
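A minimal sketch of such a hash, combining the listed properties with hashlib (sha224 is an assumption here; the library may use a different digest):

```python
import hashlib
import inspect
import json  # a stdlib function is used only as a demo target

def hash_no_subroutines_sketch(f):
    # Fold together the properties listed above: undecorated source,
    # docstring, name, module, and argument specification.
    parts = (
        inspect.getsource(f),
        str(f.__doc__),
        f.__name__,
        f.__module__,
        str(inspect.signature(f)),
    )
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()

h = hash_no_subroutines_sketch(json.dumps)
```

Any edit to the source, docstring, or argspec of the hashed function changes the digest, which is what invalidates stale cache folders.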
- neurotools.jobs.cache.function_signature(f)[source]
Generates a string identifying the cache folder for function f.
We want to cache results to disk. However, these cached results become invalid if the source code changes. This is hard to detect accurately in Python. Cache entries can also become invalid if the behavior of subroutines changes. To address this, the cache folder name includes a hash that depends on the function's
- module,
- name,
- argspec,
- source, and
- file.
If any of these change, the cache folder will as well. This reduces the chances of retrieving stale or invalid cached results.
- Parameters:
f (function)
- Returns:
name+’.’+code
- Return type:
str
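The returned folder name has the form name+'.'+code. A simplified sketch (hashing only a subset of the properties listed above; truncating to 10 characters is illustrative, not the library's actual choice):

```python
import hashlib
import inspect
import json

def function_signature_sketch(f):
    # Identify the function by module, name, and argspec, and append a
    # short hash so changed definitions map to different cache folders.
    ident = "\n".join([f.__module__, f.__name__, str(inspect.signature(f))])
    code = hashlib.sha224(ident.encode("utf-8")).hexdigest()[:10]
    return f.__name__ + "." + code

folder = function_signature_sketch(json.dumps)
```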
- neurotools.jobs.cache.signature_to_file_string(f, sig, mode='repr', compressed=True, base64encode=True, truncate=True)[source]
Converts an argument signature to a string if possible.
This can be used to store cached results in a human-readable format. Alternatively, we may want to encode the value of the argument signature in a string that is compatible with most file systems.
This does not append the file extension.
Reasonable restrictions for compatibility:
- No more than 4096 characters in the path string
- No more than 255 characters in the file string
- For Windows compatibility, try to limit the total path length to 260 characters
- These characters should be avoided: \/<>:"|?*,@#={}'&`!%$ and ASCII 0..31
The easiest way to avoid problematic characters without restricting the input is to re-encode as base 64.
The following modes are supported:
- repr: Uses repr and ast.literal_eval(node_or_string) to serialize the argument signature. This is safe, but restricts the types permitted as parameters.
- json: Uses json to serialize the argument signature. Argument signatures cannot be uniquely recovered, because tuples and lists both map to lists in the json representation. Restricting the types used in the argument signature may circumvent this.
- pickle: Uses pickle to serialize the argument signature. This should uniquely store argument signatures that can be recovered, but takes more space. Use this with caution, since changes to the pickle serialization protocol between versions will make the encoded data irretrievable.
- human: Attempts a human-readable format. Experimental.
Compression is on by default. Signatures are base64 encoded by default.
- Parameters:
f (str) – Function being called
sig – Cleaned-up function arguments created by neurotools.jobs.ndecorator.argument_signature(). A tuple of:
- args: A tuple consisting of a list of (argument_name, argument_value) tuples.
- vargs: A tuple containing extra variable arguments ("varargs"), if any.
mode (str; default 'repr') – Can be 'repr', 'json', 'pickle', or 'human'.
compressed (boolean; default True) – Compress the resulting signature using zlib?
base64encode (boolean; default True) – Base-64 encode the resulting signature?
truncate (boolean; default True) – Truncate file names that are too long? This will discard data, but the truncated signature may still serve as an identifier with a low collision probability.
- Returns:
filename
- Return type:
str
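A sketch of the 'repr' mode pipeline (repr, then zlib, then filesystem-safe base64) together with its inverse; the exact encoding choices here are assumptions, not the library's implementation:

```python
import ast
import base64
import zlib

def encode_signature(sig):
    # repr -> compress -> filesystem-safe base64
    raw = repr(sig).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")

def decode_signature(name):
    # Inverse: base64 decode -> decompress -> ast.literal_eval
    raw = zlib.decompress(base64.urlsafe_b64decode(name))
    return ast.literal_eval(raw.decode("utf-8"))

sig = ((("x", 1), ("y", (2.0, "three"))), ())
name = encode_signature(sig)
```

The URL-safe base64 alphabet avoids every problematic filesystem character listed above, and ast.literal_eval makes the decode safe for untrusted file names.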
- neurotools.jobs.cache.file_string_to_signature(filename, mode='repr', compressed=True, base64encode=True)[source]
Extracts the argument key from the compressed representation in a cache filename entry. Inverse of signature_to_file_string().
The filename should be provided as a string, without the file extension.
The following modes are supported:
- repr: Uses repr and ast.literal_eval(node_or_string) to serialize the argument signature. This is safe, but restricts the types permitted as parameters.
- json: Uses json to serialize the argument signature. Argument signatures cannot be uniquely recovered, because tuples and lists both map to lists in the json representation. Restricting the types used in the argument signature may circumvent this.
- pickle: Uses pickle to serialize the argument signature. This should uniquely store argument signatures that can be recovered, but takes more space. Use this with caution, since changes to the pickle serialization protocol between versions will make the encoded data irretrievable.
- human: Attempts a human-readable format. Experimental.
Compression is on by default. Signatures are base64 encoded by default.
- Parameters:
filename (str) – Encoded filename, as a string, without the file extension
mode (str; default 'repr') – Can be 'repr', 'json', 'pickle', or 'human'.
compressed (boolean; default True) – Whether zlib was used to compress this function call signature
base64encode (boolean; default True) – Whether this function call signature was base-64 encoded.
- Returns:
sig – Function arguments created by neurotools.jobs.ndecorator.argument_signature(). A tuple of:
- args: A tuple consisting of a list of (argument_name, argument_value) tuples.
- vargs: A tuple containing extra variable arguments ("varargs"), if any.
- Return type:
nested tuple
- neurotools.jobs.cache.human_encode(sig)[source]
Formats an argument signature for saving as a file name
- Parameters:
sig (nested tuple) – Argument signature as a safe nested tuple
- Returns:
result – Human-readable argument-signature filename
- Return type:
str
- neurotools.jobs.cache.human_decode(key)[source]
Decodes a human-readable argument-signature file name back into a nested tuple
- Parameters:
key (str) – Human-readable argument-signature filename
- Returns:
sig – Argument signature as a nested tuple
- Return type:
nested tuple
- neurotools.jobs.cache.get_cache_path(cache_root, f, *args, **kwargs)[source]
Locate the directory path for function f within the __neurotools_cache__ path cache_root.
- Parameters:
cache_root (str) – Path to root of the __neurotools__ cache
f (function) – Cached function object
- Returns:
path
- Return type:
str
- neurotools.jobs.cache.locate_cached(cache_root, f, method, *args, **kwargs)[source]
Locate a specific cache entry within cache_root for function f cached with method method, and called with arguments *args and keyword arguments **kwargs.
- Parameters:
cache_root (str) – directory/path as string
f (function) – Function being cached
method (str) – Cache file extension, e.g. "npy", "mat", etc.
args (iterable) – function parameters
kwargs (dict) – function keyword arguments
- Returns:
fn (str) – File name of cache entry without extension
sig (tuple) – Tuple of (args,kwargs) info from
argument_signature()
path (str) – Directory containing cache file
filename (str) – File name with extension
location (str) – Full absolute path to cache entry
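The returned values relate to each other roughly as follows (a sketch of how the pieces compose, with a hypothetical directory layout):

```python
import os

def assemble_cache_entry(path, fn, method):
    # fn is the extensionless cache file name; method is the extension.
    filename = fn + "." + method
    location = os.path.join(path, filename)
    return filename, location

filename, location = assemble_cache_entry(
    "/tmp/cache/somefunction.abc123", "sigkey", "npy")
```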
- neurotools.jobs.cache.validate_for_matfile(x)[source]
Verify that the nested tuple x, which contains the arguments to a function call, can be safely stored in a Matlab matfile (.mat).
Numpy types (these should be compatible):
- bool: Boolean (True or False) stored as a byte
- int8: Byte (-128 to 127)
- int16: Integer (-32768 to 32767)
- int32: Integer (-2147483648 to 2147483647)
- int64: Integer (-9223372036854775808 to 9223372036854775807)
- uint8: Unsigned integer (0 to 255)
- uint16: Unsigned integer (0 to 65535)
- uint32: Unsigned integer (0 to 4294967295)
- uint64: Unsigned integer (0 to 18446744073709551615)
- float16: Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32: Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64: Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex64: Complex number, represented by two float32
- complex128: Complex number, represented by two float64
- Parameters:
x (nested tuple) – Arguments to a function
- Return type:
boolean
- neurotools.jobs.cache.validate_for_numpy(x)[source]
Check whether an array-like object can safely be stored in a numpy archive.
Numpy types (these should be compatible):
- bool: Boolean (True or False) stored as a byte
- int8: Byte (-128 to 127)
- int16: Integer (-32768 to 32767)
- int32: Integer (-2147483648 to 2147483647)
- int64: Integer (-9223372036854775808 to 9223372036854775807)
- uint8: Unsigned integer (0 to 255)
- uint16: Unsigned integer (0 to 65535)
- uint32: Unsigned integer (0 to 4294967295)
- uint64: Unsigned integer (0 to 18446744073709551615)
- float16: Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- float32: Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- float64: Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- complex64: Complex number, represented by two float32
- complex128: Complex number, represented by two float64
- Parameters:
x (object) – array-like object
- Returns:
True if the data in x can be safely stored in a Numpy archive
- Return type:
boolean
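A recursive validity check along these lines could look like the following sketch (the accepted-type set is an approximation of the table above, not the library's exact rule):

```python
import numpy as np

def validate_sketch(x):
    # Arrays are safe if their dtype kind is boolean, signed/unsigned
    # integer, float, or complex; containers are safe if all elements are.
    if isinstance(x, np.ndarray):
        return x.dtype.kind in "biufc"
    if isinstance(x, (tuple, list)):
        return all(validate_sketch(e) for e in x)
    return isinstance(x, (bool, int, float, complex,
                          np.bool_, np.integer, np.floating,
                          np.complexfloating))

ok = validate_sketch((np.zeros(3, dtype=np.float32), (1, 2.5)))
bad = validate_sketch(({"a": 1},))  # dicts are not representable
```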
- neurotools.jobs.cache.disk_cacher(cache_location, method='npy', write_back=True, skip_fast=False, verbose=False, allow_mutable_bindings=False, cache_identifier='__neurotools_cache__')[source]
Decorator to memoize functions to disk. A currying pattern is used here, in which cache_location creates decorators.
write_back:
- True: Default. Computed results are saved to disk.
- False: Computed results are not saved to disk. In this case of hierarchical caches mapped to the filesystem, a background rsync loop can handle asynchronous write-back.
method:
- p: Use pickle to store the cache. Can serialize all objects but is seriously slow! May not get ANY speedup due to the time costs of pickling and disk IO.
- mat: Use scipy.io.savemat and scipy.io.loadmat. Nice because it's compatible with Matlab. Unfortunately, it can only store numpy types and data that can be converted to numpy types. Data conversion may alter the types of the return arguments when retrieved from the cache.
- npy: Use built-in numpy.save functionality.
- hdf5: Not yet implemented.
- Parameters:
cache_location (str) – Path to disk cache
method (str; default 'npy') – Storage format for caches. Can be 'pickle', 'mat' or 'npy'
write_back (boolean; default True) – Whether to copy new cache values back to the disk cache. If false, then previously cached values can be read but new entries will not be created
skip_fast (boolean; default False) – Attempt to simply re-compute values which are taking too long to retrieve from the cache. Experimental, do not use.
verbose (boolean; default False) – Whether to print detailed logging information
allow_mutable_bindings (boolean; default False) – Whether to allow caching of functions that close over mutable scope. Such functions are more likely to return different results for the same arguments, leading to invalid cached values.
cache_identifier (str; default '__neurotools_cache__') – Subdirectory name for the disk cache.
- Returns:
cached – TODO
- Return type:
disk cacher object
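The currying pattern can be sketched as follows with a pickle-backed cache; the key construction and file layout here are illustrative assumptions, not the library's actual scheme:

```python
import functools
import hashlib
import os
import pickle
import tempfile

def disk_cacher_sketch(cache_location):
    # Currying: cache_location -> decorator -> disk-cached function.
    def decorator(f):
        folder = os.path.join(cache_location, f.__name__)
        os.makedirs(folder, exist_ok=True)
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = hashlib.sha224(
                repr((args, sorted(kwargs.items()))).encode()).hexdigest()
            path = os.path.join(folder, key + ".p")
            if os.path.exists(path):      # cache hit: read from disk
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = f(*args, **kwargs)   # cache miss: compute
            with open(path, "wb") as fh:  # write back to disk
                pickle.dump(result, fh)
            return result
        return wrapped
    return decorator

calls = []
root = tempfile.mkdtemp()

@disk_cacher_sketch(root)
def slow_add(a, b):
    calls.append(1)
    return a + b

r1 = slow_add(2, 3)
r2 = slow_add(2, 3)  # served from disk; the body is not re-run
```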
- neurotools.jobs.cache.hierarchical_cacher(fast_to_slow, method='npy', write_back=True, verbose=False, allow_mutable_bindings=False, cache_identifier='neurotools_cache')[source]
Construct a filesystem cache defined in terms of a hierarchy from faster to slower (fallback) caches.
- Parameters:
fast_to_slow (tuple of strings) – List of filesystem paths for disk caches, in order from the fast (default or main) cache to slower ones.
method (string; default 'npy') – cache storing method
write_back (bool; default True) – whether to automatically copy newly computed cache values to the slower caches
verbose (bool; default False) – whether to print detailed logging information to standard out when manipulating the cache
allow_mutable_bindings (bool; default False) – If true, then "unsafe" namespace bindings, for example user-defined functions, will be allowed in disk cached functions. If a cached function calls subroutines, and those subroutines change, the disk cacher cannot detect the implementation difference. Consequently, it cannot tell whether old cached values are invalid.
cache_identifier (str; default 'neurotools_cache') – (sub)folder name to store cached results
- Returns:
hierarchical – A hierarchical disk-caching decorator that can be used to memoize functions to the specified disk caching hierarchy.
- Return type:
decorator
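The fast-to-slow fallback logic can be sketched with plain dictionaries standing in for the disk caches (a simplified model of the behavior, not the library's code):

```python
def hierarchical_lookup(caches, key, compute, write_back=True):
    # Search caches from fastest to slowest; on a hit, optionally copy
    # the value into all faster caches. On a miss, compute and store.
    for i, cache in enumerate(caches):
        if key in cache:
            value = cache[key]
            if write_back:
                for faster in caches[:i]:
                    faster[key] = value
            return value
    value = compute()
    for cache in (caches if write_back else caches[:1]):
        cache[key] = value
    return value

fast, slow = {}, {"x": 42}
v = hierarchical_lookup([fast, slow], "x", lambda: 0)
```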
- neurotools.jobs.cache.scan_cachedir(cachedir, method='npy', verbose=False, **kw)[source]
Retrieve all entries in cachedir, unpacking their encoded arguments.
- Parameters:
cachedir (str) – Cache directory to scan, e.g. __neurotools_cache__/…/…/…/somefunction
method (str; default 'npy') – Can be 'npy' or 'mat'
verbose (boolean; default False)
**kw – Forwarded to file_string_to_signature(); see file_string_to_signature() for details.
- Returns:
A filename -> (args, varargs) dictionary, where args is a parameter_name -> value dictionary and varargs is a list of extra arguments, if any.
- Return type:
dict
- neurotools.jobs.cache.base64hash(obj)[source]
Retrieve a base-64 encoded hash for an object. This uses the built-in encode function to convert an object to utf-8, then calls .sha224(obj).digest() to create a hash, finally packaging the result in base-64.
- Parameters:
obj (object)
- Returns:
code
- Return type:
str
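The pipeline described above can be sketched as follows (using str() for the initial object-to-text step, which is an assumption about the implementation):

```python
import base64
import hashlib

def base64hash_sketch(obj):
    # str(obj) -> utf-8 bytes -> sha224 digest -> base-64 text
    digest = hashlib.sha224(str(obj).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

code = base64hash_sketch(("x", 1))
```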
- neurotools.jobs.cache.base64hash10bytes(obj)[source]
Retrieve the first two bytes of a base-64 encoded hash for an object.
- Parameters:
obj (object)
- Returns:
code
- Return type:
str
- neurotools.jobs.cache.function_hash_with_subroutines(f, force=False)[source]
Functions may change if their subroutines change. This function computes a hash value that is sensitive to changes in the source code, docstring, argument specification, name, module, and subroutines.
This is a recursive procedure with a fair amount of overhead. To allow for the possibility of mutual recursion, subroutines are excluded from the hash if the function has already been visited.
This does not use the built-in hash function for functions in python.
Ongoing development notes
Is memoization possible? Making memoization compatible with graceful handling of potentially complex mutually recurrent call structures is tricky. Each function generates a call tree, which does not expand a node if it is already present in the call tree structure. Therefore there are many possible hash values for an intermediate function depending on how far its call tree gets expanded, which depends on what has been expanded and encountered so far. Therefore, we cannot cache these intermediate values.
Note: the topology of a mutually recurrent call structure cannot change without changing the source code of at least one function in the call graph. So it suffices to (1) hash the subroutines, (2) expand the call graph (potentially excluding standard and system library functions), (3) grab the non-recursive hash for each of these functions, and (4) then generate the subroutine-dependent hash by combining the non-recursive hash with the hash of a data structure representing the subroutine "profile" obtained from the call graph.
We assume that any decorators wrapping the function do not modify its computation, and can safely be stripped.
Note that this function cannot detect changes in effective function behavior that result from changes in global variables or mutable scope that has been closed over.
- Parameters:
force (boolean) – force must be true; otherwise this function will fail with a warning.
- Returns:
Hash of function
- Return type:
str
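One plausible sketch of the recursive procedure, expanding the call graph through global names referenced in a function's bytecode (co_names); this is an illustration of the idea, not the library's actual traversal:

```python
import hashlib
import inspect
import types

def hash_with_subroutines(f, seen=None):
    # Recursively fold subroutine hashes into f's own hash, skipping
    # functions already visited to tolerate mutual recursion.
    seen = set() if seen is None else seen
    seen.add(f)
    parts = [str(f.__module__), f.__name__, str(f.__doc__)]
    try:
        parts.append(inspect.getsource(f))
    except (OSError, TypeError):
        pass  # source unavailable, e.g. interactive definitions
    # Expand the call graph via global names referenced in the bytecode.
    for name in f.__code__.co_names:
        g = f.__globals__.get(name)
        if isinstance(g, types.FunctionType) and g not in seen:
            parts.append(hash_with_subroutines(g, seen))
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()

def helper(x):
    return x * 2

def main(x):
    return helper(x) + 1  # main's hash now depends on helper's

h = hash_with_subroutines(main)
```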
- neurotools.jobs.cache.combine_caches(cache_root, f)[source]
Merge all cache folders for function f by copying cache files into the current cache folder.
Usually, the existence of multiple cache folders indicates that cache files were generated using versions of f with different source code. However, you may want to merge caches if you are certain that such code changes did not change the function's behavior.
- Parameters:
cache_root (str) – path to the top-level cache directory
f (function) – cached function to merge
- neurotools.jobs.cache.exists(cache_root, f, method, *args, **kwargs)[source]
Check if a cached result for f(*args,**kwargs) of type method exists in cache cache_root.
- Parameters:
cache_root (str) – directory/path as string
f (function) – Function being cached
method (str) – Cache file extension, e.g. "npy", "mat", etc.
args (iterable) – function parameters
kwargs (dict) – function keyword arguments
- Returns:
True if the cache file exists
- Return type:
boolean