Autograd Utility: PyTorch AD functions

Autograd Utility: PyTorch AD functions#

This module imports PyTorch’s own autograd functions, depending on the version.

Important! Before PyTorch 2.0.0, functorch does not work together with custom autograd functions, which we definitely require. Additionally, functorch imposes the implementation of a forward and setup_context method, i.e., the traditional way of using forward with the ctx argument does not work.

Note

functorch is shipped with PyTorch 1.13.0 and later. Earlier versions require a separate installation.

tad_mctc.autograd.internals.fjacrev(func, argnums=0, *, has_aux=False, chunk_size=None, _preallocate_and_copy=False)#

Computes the Jacobian of func with respect to the arg(s) at index argnum using reverse mode autodiff

Note

Using chunk_size=1 is equivalent to computing the jacobian row-by-row with a for-loop i.e. the constraints of vmap() are not applicable.

Args:
func (function): A Python function that takes one or more arguments,

one of which must be a Tensor, and returns one or more Tensors

argnums (int or tuple[int, …]): Optional, integer or tuple of integers,

saying which arguments to get the Jacobian with respect to. Default: 0.

has_aux (bool): Flag indicating that func returns a

(output, aux) tuple where the first element is the output of the function to be differentiated and the second element is auxiliary objects that will not be differentiated. Default: False.

chunk_size (None or int): If None (default), use the maximum chunk size

(equivalent to doing a single vmap over vjp to compute the jacobian). If 1, then compute the jacobian row-by-row with a for-loop. If not None, then compute the jacobian chunk_size rows at a time (equivalent to doing multiple vmap over vjp). If you run into memory issues computing the jacobian, please try to specify a non-None chunk_size.

Returns:

Returns a function that takes in the same inputs as func and returns the Jacobian of func with respect to the arg(s) at argnums. If has_aux is True, then the returned function instead returns a (jacobian, aux) tuple where jacobian is the Jacobian and aux is auxiliary objects returned by func.

A basic usage with a pointwise, unary operation will give a diagonal array as the Jacobian

>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>> jacobian = jacrev(torch.sin)(x)
>>> expected = torch.diag(torch.cos(x))
>>> assert torch.allclose(jacobian, expected)

If you would like to compute the output of the function as well as the jacobian of the function, use the has_aux flag to return the output as an auxiliary object:

>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>>
>>> def f(x):
>>>   return x.sin()
>>>
>>> def g(x):
>>>   result = f(x)
>>>   return result, result
>>>
>>> jacobian_f, f_x = jacrev(g, has_aux=True)(x)
>>> assert torch.allclose(f_x, f(x))

jacrev() can be composed with vmap to produce batched Jacobians:

>>> from torch.func import jacrev, vmap
>>> x = torch.randn(64, 5)
>>> jacobian = vmap(jacrev(torch.sin))(x)
>>> assert jacobian.shape == (64, 5, 5)

Additionally, jacrev() can be composed with itself to produce Hessians

>>> from torch.func import jacrev
>>> def f(x):
>>>   return x.sin().sum()
>>>
>>> x = torch.randn(5)
>>> hessian = jacrev(jacrev(f))(x)
>>> assert torch.allclose(hessian, torch.diag(-x.sin()))

By default, jacrev() computes the Jacobian with respect to the first input. However, it can compute the Jacboian with respect to a different argument by using argnums:

>>> from torch.func import jacrev
>>> def f(x, y):
>>>   return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=1)(x, y)
>>> expected = torch.diag(2 * y)
>>> assert torch.allclose(jacobian, expected)

Additionally, passing a tuple to argnums will compute the Jacobian with respect to multiple arguments

>>> from torch.func import jacrev
>>> def f(x, y):
>>>   return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=(0, 1))(x, y)
>>> expectedX = torch.diag(torch.ones_like(x))
>>> expectedY = torch.diag(2 * y)
>>> assert torch.allclose(jacobian[0], expectedX)
>>> assert torch.allclose(jacobian[1], expectedY)

Note

Using PyTorch torch.no_grad together with jacrev. Case 1: Using torch.no_grad inside a function:

>>> def f(x):
>>>     with torch.no_grad():
>>>         c = x ** 2
>>>     return x - c

In this case, jacrev(f)(x) will respect the inner torch.no_grad.

Case 2: Using jacrev inside torch.no_grad context manager:

>>> with torch.no_grad():
>>>     jacrev(f)(x)

In this case, jacrev will respect the inner torch.no_grad, but not the outer one. This is because jacrev is a “function transform”: its result should not depend on the result of a context manager outside of f.

tad_mctc.autograd.internals.fvmap(func, in_dims=0, out_dims=0, randomness='error', *, chunk_size=None)#

vmap is the vectorizing map; vmap(func) returns a new function that maps func over some dimension of the inputs. Semantically, vmap pushes the map into PyTorch operations called by func, effectively vectorizing those operations.

vmap is useful for handling batch dimensions: one can write a function func that runs on examples and then lift it to a function that can take batches of examples with vmap(func). vmap can also be used to compute batched gradients when composed with autograd.

Note

torch.vmap() is aliased to torch.func.vmap() for convenience. Use whichever one you’d like.

Args:
func (function): A Python function that takes one or more arguments.

Must return one or more Tensors.

in_dims (int or nested structure): Specifies which dimension of the

inputs should be mapped over. in_dims should have a structure like the inputs. If the in_dim for a particular input is None, then that indicates there is no map dimension. Default: 0.

out_dims (int or Tuple[int]): Specifies where the mapped dimension

should appear in the outputs. If out_dims is a Tuple, then it should have one element per output. Default: 0.

randomness (str): Specifies whether the randomness in this

vmap should be the same or different across batches. If ‘different’, the randomness for each batch will be different. If ‘same’, the randomness will be the same across batches. If ‘error’, any calls to random functions will error. Default: ‘error’. WARNING: this flag only applies to random PyTorch operations and does not apply to Python’s random module or numpy randomness.

chunk_size (None or int): If None (default), apply a single vmap over inputs.

If not None, then compute the vmap chunk_size samples at a time. Note that chunk_size=1 is equivalent to computing the vmap with a for-loop. If you run into memory issues computing the vmap, please try a non-None chunk_size.

Returns:

Returns a new “batched” function. It takes the same inputs as func, except each input has an extra dimension at the index specified by in_dims. It takes returns the same outputs as func, except each output has an extra dimension at the index specified by out_dims.

One example of using vmap() is to compute batched dot products. PyTorch doesn’t provide a batched torch.dot API; instead of unsuccessfully rummaging through docs, use vmap() to construct a new function.

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.func.vmap(torch.dot)  # [N, D], [N, D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(x, y)

vmap() can be helpful in hiding batch dimensions, leading to a simpler model authoring experience.

>>> batch_size, feature_size = 3, 5
>>> weights = torch.randn(feature_size, requires_grad=True)
>>>
>>> def model(feature_vec):
>>> # Very simple linear model with activation
>>>     return feature_vec.dot(weights).relu()
>>>
>>> examples = torch.randn(batch_size, feature_size)
>>> result = torch.vmap(model)(examples)

vmap() can also help vectorize computations that were previously difficult or impossible to batch. One example is higher-order gradient computation. The PyTorch autograd engine computes vjps (vector-Jacobian products). Computing a full Jacobian matrix for some function f: R^N -> R^N usually requires N calls to autograd.grad, one per Jacobian row. Using vmap(), we can vectorize the whole computation, computing the Jacobian in a single call to autograd.grad.

>>> # Setup
>>> N = 5
>>> f = lambda x: x**2
>>> x = torch.randn(N, requires_grad=True)
>>> y = f(x)
>>> I_N = torch.eye(N)
>>>
>>> # Sequential approach
>>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0]
>>>                  for v in I_N.unbind()]
>>> jacobian = torch.stack(jacobian_rows)
>>>
>>> # vectorized gradient computation
>>> def get_vjp(v):
>>>     return torch.autograd.grad(y, x, v)
>>> jacobian = torch.vmap(get_vjp)(I_N)

vmap() can also be nested, producing an output with multiple batched dimensions

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(
...     torch.vmap(torch.dot)
... )  # [N1, N0, D], [N1, N0, D] -> [N1, N0]
>>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5)
>>> batched_dot(x, y)  # tensor of size [2, 3]

If the inputs are not batched along the first dimension, in_dims specifies the dimension that each inputs are batched along as

>>> torch.dot  # [N], [N] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=1)  # [N, D], [N, D] -> [D]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(
...     x, y
... )  # output is [5] instead of [2] if batched along the 0th dimension

If there are multiple inputs each of which is batched along different dimensions, in_dims must be a tuple with the batch dimension for each input as

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None))  # [N, D], [D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> batched_dot(
...     x, y
... )  # second arg doesn't have a batch dim because in_dim[1] was None

If the input is a Python struct, in_dims must be a tuple containing a struct matching the shape of the input:

>>> f = lambda dict: torch.dot(dict["x"], dict["y"])
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> input = {"x": x, "y": y}
>>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},))
>>> batched_dot(input)

By default, the output is batched along the first dimension. However, it can be batched along any dimension by using out_dims

>>> f = lambda x: x**2
>>> x = torch.randn(2, 5)
>>> batched_pow = torch.vmap(f, out_dims=1)
>>> batched_pow(x)  # [5, 2]

For any function that uses kwargs, the returned function will not batch the kwargs but will accept kwargs

>>> x = torch.randn([2, 5])
>>> def fn(x, scale=4.):
>>>   return x * scale
>>>
>>> batched_pow = torch.vmap(fn)
>>> assert torch.allclose(batched_pow(x), x * 4)
>>> batched_pow(x, scale=x)  # scale is not batched, output has shape [2, 2, 5]

Note

vmap does not provide general autobatching or handle variable-length sequences out of the box.

tad_mctc.autograd.internals.jacrev(func, argnums=0, *, has_aux=False, chunk_size=None, _preallocate_and_copy=False)#

Computes the Jacobian of func with respect to the arg(s) at index argnum using reverse mode autodiff

Note

Using chunk_size=1 is equivalent to computing the jacobian row-by-row with a for-loop i.e. the constraints of vmap() are not applicable.

Args:
func (function): A Python function that takes one or more arguments,

one of which must be a Tensor, and returns one or more Tensors

argnums (int or tuple[int, …]): Optional, integer or tuple of integers,

saying which arguments to get the Jacobian with respect to. Default: 0.

has_aux (bool): Flag indicating that func returns a

(output, aux) tuple where the first element is the output of the function to be differentiated and the second element is auxiliary objects that will not be differentiated. Default: False.

chunk_size (None or int): If None (default), use the maximum chunk size

(equivalent to doing a single vmap over vjp to compute the jacobian). If 1, then compute the jacobian row-by-row with a for-loop. If not None, then compute the jacobian chunk_size rows at a time (equivalent to doing multiple vmap over vjp). If you run into memory issues computing the jacobian, please try to specify a non-None chunk_size.

Returns:

Returns a function that takes in the same inputs as func and returns the Jacobian of func with respect to the arg(s) at argnums. If has_aux is True, then the returned function instead returns a (jacobian, aux) tuple where jacobian is the Jacobian and aux is auxiliary objects returned by func.

A basic usage with a pointwise, unary operation will give a diagonal array as the Jacobian

>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>> jacobian = jacrev(torch.sin)(x)
>>> expected = torch.diag(torch.cos(x))
>>> assert torch.allclose(jacobian, expected)

If you would like to compute the output of the function as well as the jacobian of the function, use the has_aux flag to return the output as an auxiliary object:

>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>>
>>> def f(x):
>>>   return x.sin()
>>>
>>> def g(x):
>>>   result = f(x)
>>>   return result, result
>>>
>>> jacobian_f, f_x = jacrev(g, has_aux=True)(x)
>>> assert torch.allclose(f_x, f(x))

jacrev() can be composed with vmap to produce batched Jacobians:

>>> from torch.func import jacrev, vmap
>>> x = torch.randn(64, 5)
>>> jacobian = vmap(jacrev(torch.sin))(x)
>>> assert jacobian.shape == (64, 5, 5)

Additionally, jacrev() can be composed with itself to produce Hessians

>>> from torch.func import jacrev
>>> def f(x):
>>>   return x.sin().sum()
>>>
>>> x = torch.randn(5)
>>> hessian = jacrev(jacrev(f))(x)
>>> assert torch.allclose(hessian, torch.diag(-x.sin()))

By default, jacrev() computes the Jacobian with respect to the first input. However, it can compute the Jacboian with respect to a different argument by using argnums:

>>> from torch.func import jacrev
>>> def f(x, y):
>>>   return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=1)(x, y)
>>> expected = torch.diag(2 * y)
>>> assert torch.allclose(jacobian, expected)

Additionally, passing a tuple to argnums will compute the Jacobian with respect to multiple arguments

>>> from torch.func import jacrev
>>> def f(x, y):
>>>   return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=(0, 1))(x, y)
>>> expectedX = torch.diag(torch.ones_like(x))
>>> expectedY = torch.diag(2 * y)
>>> assert torch.allclose(jacobian[0], expectedX)
>>> assert torch.allclose(jacobian[1], expectedY)

Note

Using PyTorch torch.no_grad together with jacrev. Case 1: Using torch.no_grad inside a function:

>>> def f(x):
>>>     with torch.no_grad():
>>>         c = x ** 2
>>>     return x - c

In this case, jacrev(f)(x) will respect the inner torch.no_grad.

Case 2: Using jacrev inside torch.no_grad context manager:

>>> with torch.no_grad():
>>>     jacrev(f)(x)

In this case, jacrev will respect the inner torch.no_grad, but not the outer one. This is because jacrev is a “function transform”: its result should not depend on the result of a context manager outside of f.

tad_mctc.autograd.internals.vmap(func, in_dims=0, out_dims=0, randomness='error', *, chunk_size=None)#

vmap is the vectorizing map; vmap(func) returns a new function that maps func over some dimension of the inputs. Semantically, vmap pushes the map into PyTorch operations called by func, effectively vectorizing those operations.

vmap is useful for handling batch dimensions: one can write a function func that runs on examples and then lift it to a function that can take batches of examples with vmap(func). vmap can also be used to compute batched gradients when composed with autograd.

Note

torch.vmap() is aliased to torch.func.vmap() for convenience. Use whichever one you’d like.

Args:
func (function): A Python function that takes one or more arguments.

Must return one or more Tensors.

in_dims (int or nested structure): Specifies which dimension of the

inputs should be mapped over. in_dims should have a structure like the inputs. If the in_dim for a particular input is None, then that indicates there is no map dimension. Default: 0.

out_dims (int or Tuple[int]): Specifies where the mapped dimension

should appear in the outputs. If out_dims is a Tuple, then it should have one element per output. Default: 0.

randomness (str): Specifies whether the randomness in this

vmap should be the same or different across batches. If ‘different’, the randomness for each batch will be different. If ‘same’, the randomness will be the same across batches. If ‘error’, any calls to random functions will error. Default: ‘error’. WARNING: this flag only applies to random PyTorch operations and does not apply to Python’s random module or numpy randomness.

chunk_size (None or int): If None (default), apply a single vmap over inputs.

If not None, then compute the vmap chunk_size samples at a time. Note that chunk_size=1 is equivalent to computing the vmap with a for-loop. If you run into memory issues computing the vmap, please try a non-None chunk_size.

Returns:

Returns a new “batched” function. It takes the same inputs as func, except each input has an extra dimension at the index specified by in_dims. It takes returns the same outputs as func, except each output has an extra dimension at the index specified by out_dims.

One example of using vmap() is to compute batched dot products. PyTorch doesn’t provide a batched torch.dot API; instead of unsuccessfully rummaging through docs, use vmap() to construct a new function.

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.func.vmap(torch.dot)  # [N, D], [N, D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(x, y)

vmap() can be helpful in hiding batch dimensions, leading to a simpler model authoring experience.

>>> batch_size, feature_size = 3, 5
>>> weights = torch.randn(feature_size, requires_grad=True)
>>>
>>> def model(feature_vec):
>>> # Very simple linear model with activation
>>>     return feature_vec.dot(weights).relu()
>>>
>>> examples = torch.randn(batch_size, feature_size)
>>> result = torch.vmap(model)(examples)

vmap() can also help vectorize computations that were previously difficult or impossible to batch. One example is higher-order gradient computation. The PyTorch autograd engine computes vjps (vector-Jacobian products). Computing a full Jacobian matrix for some function f: R^N -> R^N usually requires N calls to autograd.grad, one per Jacobian row. Using vmap(), we can vectorize the whole computation, computing the Jacobian in a single call to autograd.grad.

>>> # Setup
>>> N = 5
>>> f = lambda x: x**2
>>> x = torch.randn(N, requires_grad=True)
>>> y = f(x)
>>> I_N = torch.eye(N)
>>>
>>> # Sequential approach
>>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0]
>>>                  for v in I_N.unbind()]
>>> jacobian = torch.stack(jacobian_rows)
>>>
>>> # vectorized gradient computation
>>> def get_vjp(v):
>>>     return torch.autograd.grad(y, x, v)
>>> jacobian = torch.vmap(get_vjp)(I_N)

vmap() can also be nested, producing an output with multiple batched dimensions

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(
...     torch.vmap(torch.dot)
... )  # [N1, N0, D], [N1, N0, D] -> [N1, N0]
>>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5)
>>> batched_dot(x, y)  # tensor of size [2, 3]

If the inputs are not batched along the first dimension, in_dims specifies the dimension that each inputs are batched along as

>>> torch.dot  # [N], [N] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=1)  # [N, D], [N, D] -> [D]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(
...     x, y
... )  # output is [5] instead of [2] if batched along the 0th dimension

If there are multiple inputs each of which is batched along different dimensions, in_dims must be a tuple with the batch dimension for each input as

>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None))  # [N, D], [D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> batched_dot(
...     x, y
... )  # second arg doesn't have a batch dim because in_dim[1] was None

If the input is a Python struct, in_dims must be a tuple containing a struct matching the shape of the input:

>>> f = lambda dict: torch.dot(dict["x"], dict["y"])
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> input = {"x": x, "y": y}
>>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},))
>>> batched_dot(input)

By default, the output is batched along the first dimension. However, it can be batched along any dimension by using out_dims

>>> f = lambda x: x**2
>>> x = torch.randn(2, 5)
>>> batched_pow = torch.vmap(f, out_dims=1)
>>> batched_pow(x)  # [5, 2]

For any function that uses kwargs, the returned function will not batch the kwargs but will accept kwargs

>>> x = torch.randn([2, 5])
>>> def fn(x, scale=4.):
>>>   return x * scale
>>>
>>> batched_pow = torch.vmap(fn)
>>> assert torch.allclose(batched_pow(x), x * 4)
>>> batched_pow(x, scale=x)  # scale is not batched, output has shape [2, 2, 5]

Note

vmap does not provide general autobatching or handle variable-length sequences out of the box.