Autograd Utility: PyTorch AD functions
This module imports PyTorch’s own autograd functions, depending on the version.
Important! Before PyTorch 2.0.0, functorch does not work with custom autograd functions, which we require. In addition, functorch requires custom autograd functions to implement separate forward and setup_context methods; the traditional style, in which forward takes the ctx argument, does not work.
Note
functorch is shipped with PyTorch 1.13.0 and later. Earlier versions require a separate installation.
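The forward/setup_context split mentioned above can be sketched as follows. This is a minimal illustration, assuming PyTorch 2.0 or later and using torch.func.grad directly; the Square class and the loss function are hypothetical names made up for this example and are not part of tad_mctc.

```python
import torch
from torch.func import grad


class Square(torch.autograd.Function):
    """Hypothetical example: y = x**2 with a hand-written backward."""

    @staticmethod
    def forward(x):
        # No ctx argument: forward only computes the output.
        return x**2

    @staticmethod
    def setup_context(ctx, inputs, output):
        # Runs after forward and stores whatever backward will need.
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output


def loss(x):
    # Scalar-valued wrapper so that grad() applies.
    return Square.apply(x).sum()


x = torch.randn(4)
g = grad(loss)(x)  # gradient of sum(x**2) is 2 * x
```

Writing forward with a ctx argument instead would be the traditional style that the function transforms reject.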
- tad_mctc.autograd.internals.fjacrev(func, argnums=0, *, has_aux=False, chunk_size=None, _preallocate_and_copy=False)
Computes the Jacobian of func with respect to the arg(s) at index argnums using reverse-mode autodiff.
Note
Using chunk_size=1 is equivalent to computing the Jacobian row-by-row with a for-loop, i.e., the constraints of vmap() are not applicable.
- Args:
- func (function): A Python function that takes one or more arguments, one of which must be a Tensor, and returns one or more Tensors.
- argnums (int or tuple[int, …]): Optional, integer or tuple of integers saying which arguments to get the Jacobian with respect to. Default: 0.
- has_aux (bool): Flag indicating that func returns a (output, aux) tuple where the first element is the output of the function to be differentiated and the second element is auxiliary objects that will not be differentiated. Default: False.
- chunk_size (None or int): If None (default), use the maximum chunk size (equivalent to doing a single vmap over vjp to compute the Jacobian). If 1, compute the Jacobian row-by-row with a for-loop. Otherwise, compute the Jacobian chunk_size rows at a time (equivalent to doing multiple vmap over vjp). If you run into memory issues computing the Jacobian, try specifying a non-None chunk_size.
- Returns:
Returns a function that takes the same inputs as func and returns the Jacobian of func with respect to the arg(s) at argnums. If has_aux is True, the returned function instead returns a (jacobian, aux) tuple, where jacobian is the Jacobian and aux is the auxiliary objects returned by func.
A basic usage with a pointwise, unary operation gives a diagonal array as the Jacobian:
>>> import torch
>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>> jacobian = jacrev(torch.sin)(x)
>>> expected = torch.diag(torch.cos(x))
>>> assert torch.allclose(jacobian, expected)
If you would like to compute the output of the function as well as the Jacobian of the function, use the has_aux flag to return the output as an auxiliary object:
>>> import torch
>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>>
>>> def f(x):
>>>     return x.sin()
>>>
>>> def g(x):
>>>     result = f(x)
>>>     return result, result
>>>
>>> jacobian_f, f_x = jacrev(g, has_aux=True)(x)
>>> assert torch.allclose(f_x, f(x))
jacrev() can be composed with vmap to produce batched Jacobians:
>>> import torch
>>> from torch.func import jacrev, vmap
>>> x = torch.randn(64, 5)
>>> jacobian = vmap(jacrev(torch.sin))(x)
>>> assert jacobian.shape == (64, 5, 5)
Additionally, jacrev() can be composed with itself to produce Hessians:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x):
>>>     return x.sin().sum()
>>>
>>> x = torch.randn(5)
>>> hessian = jacrev(jacrev(f))(x)
>>> assert torch.allclose(hessian, torch.diag(-x.sin()))
By default, jacrev() computes the Jacobian with respect to the first input. However, it can compute the Jacobian with respect to a different argument by using argnums:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x, y):
>>>     return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=1)(x, y)
>>> expected = torch.diag(2 * y)
>>> assert torch.allclose(jacobian, expected)
Additionally, passing a tuple to argnums will compute the Jacobian with respect to multiple arguments:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x, y):
>>>     return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=(0, 1))(x, y)
>>> expectedX = torch.diag(torch.ones_like(x))
>>> expectedY = torch.diag(2 * y)
>>> assert torch.allclose(jacobian[0], expectedX)
>>> assert torch.allclose(jacobian[1], expectedY)
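The chunk_size argument from the Args list has no example of its own, so here is a small sketch. It assumes torch.func.jacrev from PyTorch 2.0 or later, the same import the examples above use; the function f and the chosen sizes are illustrative only. All three settings should yield the same Jacobian and differ only in how many rows are computed per vmap call.

```python
import torch
from torch.func import jacrev


def f(x):
    # Map R^6 -> R^6 whose Jacobian is lower triangular rather than diagonal.
    return torch.cumsum(x.sin(), dim=0)


x = torch.randn(6)

jac_full = jacrev(f)(x)  # chunk_size=None: one vmap over all rows
jac_chunked = jacrev(f, chunk_size=2)(x)  # two rows at a time
jac_loop = jacrev(f, chunk_size=1)(x)  # row-by-row for-loop

# Analytic reference: J[i, j] = cos(x[j]) for j <= i, else 0.
expected = torch.tril(torch.ones(6, 6)) * torch.cos(x)
```

A smaller chunk_size trades speed for lower peak memory, which is the situation the Args description points to.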
Note
Using PyTorch torch.no_grad together with jacrev.
Case 1: Using torch.no_grad inside a function:
>>> def f(x):
>>>     with torch.no_grad():
>>>         c = x ** 2
>>>     return x - c
In this case, jacrev(f)(x) will respect the inner torch.no_grad.
Case 2: Using jacrev inside torch.no_grad context manager:
>>> with torch.no_grad():
>>>     jacrev(f)(x)
In this case, jacrev will respect the inner torch.no_grad, but not the outer one. This is because jacrev is a “function transform”: its result should not depend on the result of a context manager outside of f.
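Both cases from the note can be checked numerically. The following is a minimal sketch, assuming torch.func.jacrev from PyTorch 2.0 or later. Because c is computed under the inner torch.no_grad, it is treated as a constant, so the Jacobian of x - c should be the identity whether or not an outer torch.no_grad is active.

```python
import torch
from torch.func import jacrev


def f(x):
    with torch.no_grad():
        c = x**2  # treated as a constant by jacrev
    return x - c


x = torch.randn(3)

# Case 1: the inner no_grad is respected, so only d(x)/dx contributes.
jac_inner = jacrev(f)(x)

# Case 2: the outer no_grad does not change the result of the transform.
with torch.no_grad():
    jac_outer = jacrev(f)(x)
```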
- tad_mctc.autograd.internals.fvmap(func, in_dims=0, out_dims=0, randomness='error', *, chunk_size=None)
vmap is the vectorizing map; vmap(func) returns a new function that maps func over some dimension of the inputs. Semantically, vmap pushes the map into PyTorch operations called by func, effectively vectorizing those operations.
vmap is useful for handling batch dimensions: one can write a function func that runs on examples and then lift it to a function that can take batches of examples with vmap(func). vmap can also be used to compute batched gradients when composed with autograd.
Note
torch.vmap() is aliased to torch.func.vmap() for convenience. Use whichever one you’d like.
- Args:
- func (function): A Python function that takes one or more arguments. Must return one or more Tensors.
- in_dims (int or nested structure): Specifies which dimension of the inputs should be mapped over. in_dims should have a structure like the inputs. If the in_dim for a particular input is None, then that indicates there is no map dimension. Default: 0.
- out_dims (int or Tuple[int]): Specifies where the mapped dimension should appear in the outputs. If out_dims is a Tuple, then it should have one element per output. Default: 0.
- randomness (str): Specifies whether the randomness in this vmap should be the same or different across batches. If ‘different’, the randomness for each batch will be different. If ‘same’, the randomness will be the same across batches. If ‘error’, any calls to random functions will error. Default: ‘error’. WARNING: this flag only applies to random PyTorch operations and does not apply to Python’s random module or numpy randomness.
- chunk_size (None or int): If None (default), apply a single vmap over inputs. If not None, then compute the vmap chunk_size samples at a time. Note that chunk_size=1 is equivalent to computing the vmap with a for-loop. If you run into memory issues computing the vmap, please try a non-None chunk_size.
- Returns:
Returns a new “batched” function. It takes the same inputs as func, except each input has an extra dimension at the index specified by in_dims. It returns the same outputs as func, except each output has an extra dimension at the index specified by out_dims.
One example of using vmap() is to compute batched dot products. PyTorch doesn’t provide a batched torch.dot API; instead of unsuccessfully rummaging through docs, use vmap() to construct a new function.
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.func.vmap(torch.dot)  # [N, D], [N, D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(x, y)
vmap() can be helpful in hiding batch dimensions, leading to a simpler model authoring experience.
>>> import torch
>>> batch_size, feature_size = 3, 5
>>> weights = torch.randn(feature_size, requires_grad=True)
>>>
>>> def model(feature_vec):
>>>     # Very simple linear model with activation
>>>     return feature_vec.dot(weights).relu()
>>>
>>> examples = torch.randn(batch_size, feature_size)
>>> result = torch.vmap(model)(examples)
vmap() can also help vectorize computations that were previously difficult or impossible to batch. One example is higher-order gradient computation. The PyTorch autograd engine computes vjps (vector-Jacobian products). Computing a full Jacobian matrix for some function f: R^N -> R^N usually requires N calls to autograd.grad, one per Jacobian row. Using vmap(), we can vectorize the whole computation, computing the Jacobian in a single call to autograd.grad.
>>> import torch
>>> # Setup
>>> N = 5
>>> f = lambda x: x**2
>>> x = torch.randn(N, requires_grad=True)
>>> y = f(x)
>>> I_N = torch.eye(N)
>>>
>>> # Sequential approach
>>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0]
>>>                  for v in I_N.unbind()]
>>> jacobian = torch.stack(jacobian_rows)
>>>
>>> # vectorized gradient computation
>>> def get_vjp(v):
>>>     # autograd.grad returns a tuple; take its only element
>>>     return torch.autograd.grad(y, x, v)[0]
>>> jacobian = torch.vmap(get_vjp)(I_N)
vmap() can also be nested, producing an output with multiple batched dimensions:
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(
...     torch.vmap(torch.dot)
... )  # [N1, N0, D], [N1, N0, D] -> [N1, N0]
>>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5)
>>> batched_dot(x, y)  # tensor of size [2, 3]
If the inputs are not batched along the first dimension, in_dims specifies the dimension along which each input is batched:
>>> import torch
>>> torch.dot  # [N], [N] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=1)  # [N, D], [N, D] -> [D]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(
...     x, y
... )  # output is [5] instead of [2] if batched along the 0th dimension
If there are multiple inputs, each of which is batched along a different dimension, in_dims must be a tuple with the batch dimension for each input:
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None))  # [N, D], [D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> batched_dot(
...     x, y
... )  # second arg doesn't have a batch dim because in_dims[1] was None
If the input is a Python struct, in_dims must be a tuple containing a struct matching the shape of the input:
>>> import torch
>>> f = lambda dict: torch.dot(dict["x"], dict["y"])
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> input = {"x": x, "y": y}
>>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},))
>>> batched_dot(input)
By default, the output is batched along the first dimension. However, it can be batched along any dimension by using out_dims:
>>> import torch
>>> f = lambda x: x**2
>>> x = torch.randn(2, 5)
>>> batched_pow = torch.vmap(f, out_dims=1)
>>> batched_pow(x)  # [5, 2]
For any function that uses kwargs, the returned function will not batch the kwargs but will accept kwargs:
>>> import torch
>>> x = torch.randn([2, 5])
>>> def fn(x, scale=4.):
>>>     return x * scale
>>>
>>> batched_pow = torch.vmap(fn)
>>> assert torch.allclose(batched_pow(x), x * 4)
>>> batched_pow(x, scale=x)  # scale is not batched, output has shape [2, 2, 5]
Note
vmap does not provide general autobatching or handle variable-length sequences out of the box.
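The randomness flag from the Args list is not exercised by the examples above, so here is a minimal sketch. It assumes torch.func.vmap from PyTorch 2.0 or later; the add_noise helper is a hypothetical name used only here. With ‘same’, every batch entry should receive the same random draw; with ‘different’, each entry gets its own draw.

```python
import torch
from torch.func import vmap


def add_noise(x):
    # torch.rand is a random PyTorch operation, so the randomness flag applies.
    return x + torch.rand(3)


x = torch.zeros(4, 3)

# One draw shared by all four batch entries.
same = vmap(add_noise, randomness="same")(x)

# An independent draw for each batch entry.
different = vmap(add_noise, randomness="different")(x)
```

With the default randomness='error', the call to torch.rand inside add_noise is expected to raise instead of returning a result.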
- tad_mctc.autograd.internals.jacrev(func, argnums=0, *, has_aux=False, chunk_size=None, _preallocate_and_copy=False)
Computes the Jacobian of func with respect to the arg(s) at index argnums using reverse-mode autodiff.
Note
Using chunk_size=1 is equivalent to computing the Jacobian row-by-row with a for-loop, i.e., the constraints of vmap() are not applicable.
- Args:
- func (function): A Python function that takes one or more arguments, one of which must be a Tensor, and returns one or more Tensors.
- argnums (int or tuple[int, …]): Optional, integer or tuple of integers saying which arguments to get the Jacobian with respect to. Default: 0.
- has_aux (bool): Flag indicating that func returns a (output, aux) tuple where the first element is the output of the function to be differentiated and the second element is auxiliary objects that will not be differentiated. Default: False.
- chunk_size (None or int): If None (default), use the maximum chunk size (equivalent to doing a single vmap over vjp to compute the Jacobian). If 1, compute the Jacobian row-by-row with a for-loop. Otherwise, compute the Jacobian chunk_size rows at a time (equivalent to doing multiple vmap over vjp). If you run into memory issues computing the Jacobian, try specifying a non-None chunk_size.
- Returns:
Returns a function that takes the same inputs as func and returns the Jacobian of func with respect to the arg(s) at argnums. If has_aux is True, the returned function instead returns a (jacobian, aux) tuple, where jacobian is the Jacobian and aux is the auxiliary objects returned by func.
A basic usage with a pointwise, unary operation gives a diagonal array as the Jacobian:
>>> import torch
>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>> jacobian = jacrev(torch.sin)(x)
>>> expected = torch.diag(torch.cos(x))
>>> assert torch.allclose(jacobian, expected)
If you would like to compute the output of the function as well as the Jacobian of the function, use the has_aux flag to return the output as an auxiliary object:
>>> import torch
>>> from torch.func import jacrev
>>> x = torch.randn(5)
>>>
>>> def f(x):
>>>     return x.sin()
>>>
>>> def g(x):
>>>     result = f(x)
>>>     return result, result
>>>
>>> jacobian_f, f_x = jacrev(g, has_aux=True)(x)
>>> assert torch.allclose(f_x, f(x))
jacrev() can be composed with vmap to produce batched Jacobians:
>>> import torch
>>> from torch.func import jacrev, vmap
>>> x = torch.randn(64, 5)
>>> jacobian = vmap(jacrev(torch.sin))(x)
>>> assert jacobian.shape == (64, 5, 5)
Additionally, jacrev() can be composed with itself to produce Hessians:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x):
>>>     return x.sin().sum()
>>>
>>> x = torch.randn(5)
>>> hessian = jacrev(jacrev(f))(x)
>>> assert torch.allclose(hessian, torch.diag(-x.sin()))
By default, jacrev() computes the Jacobian with respect to the first input. However, it can compute the Jacobian with respect to a different argument by using argnums:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x, y):
>>>     return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=1)(x, y)
>>> expected = torch.diag(2 * y)
>>> assert torch.allclose(jacobian, expected)
Additionally, passing a tuple to argnums will compute the Jacobian with respect to multiple arguments:
>>> import torch
>>> from torch.func import jacrev
>>> def f(x, y):
>>>     return x + y ** 2
>>>
>>> x, y = torch.randn(5), torch.randn(5)
>>> jacobian = jacrev(f, argnums=(0, 1))(x, y)
>>> expectedX = torch.diag(torch.ones_like(x))
>>> expectedY = torch.diag(2 * y)
>>> assert torch.allclose(jacobian[0], expectedX)
>>> assert torch.allclose(jacobian[1], expectedY)
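The examples above all produce square, diagonal Jacobians, which hides how the output is laid out. The sketch below uses a non-square linear map to make the layout visible: the output dimensions come first, followed by the input dimensions. It assumes torch.func.jacrev from PyTorch 2.0 or later; the matrix A and the function f are illustrative names.

```python
import torch
from torch.func import jacrev

A = torch.randn(3, 5)


def f(x):
    # Linear map R^5 -> R^3; its Jacobian is the matrix A itself.
    return A @ x


x = torch.randn(5)
jac = jacrev(f)(x)  # shape (3, 5): output dimension first, then input dimension
```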
Note
Using PyTorch torch.no_grad together with jacrev.
Case 1: Using torch.no_grad inside a function:
>>> def f(x):
>>>     with torch.no_grad():
>>>         c = x ** 2
>>>     return x - c
In this case, jacrev(f)(x) will respect the inner torch.no_grad.
Case 2: Using jacrev inside torch.no_grad context manager:
>>> with torch.no_grad():
>>>     jacrev(f)(x)
In this case, jacrev will respect the inner torch.no_grad, but not the outer one. This is because jacrev is a “function transform”: its result should not depend on the result of a context manager outside of f.
- tad_mctc.autograd.internals.vmap(func, in_dims=0, out_dims=0, randomness='error', *, chunk_size=None)
vmap is the vectorizing map; vmap(func) returns a new function that maps func over some dimension of the inputs. Semantically, vmap pushes the map into PyTorch operations called by func, effectively vectorizing those operations.
vmap is useful for handling batch dimensions: one can write a function func that runs on examples and then lift it to a function that can take batches of examples with vmap(func). vmap can also be used to compute batched gradients when composed with autograd.
Note
torch.vmap() is aliased to torch.func.vmap() for convenience. Use whichever one you’d like.
- Args:
- func (function): A Python function that takes one or more arguments. Must return one or more Tensors.
- in_dims (int or nested structure): Specifies which dimension of the inputs should be mapped over. in_dims should have a structure like the inputs. If the in_dim for a particular input is None, then that indicates there is no map dimension. Default: 0.
- out_dims (int or Tuple[int]): Specifies where the mapped dimension should appear in the outputs. If out_dims is a Tuple, then it should have one element per output. Default: 0.
- randomness (str): Specifies whether the randomness in this vmap should be the same or different across batches. If ‘different’, the randomness for each batch will be different. If ‘same’, the randomness will be the same across batches. If ‘error’, any calls to random functions will error. Default: ‘error’. WARNING: this flag only applies to random PyTorch operations and does not apply to Python’s random module or numpy randomness.
- chunk_size (None or int): If None (default), apply a single vmap over inputs. If not None, then compute the vmap chunk_size samples at a time. Note that chunk_size=1 is equivalent to computing the vmap with a for-loop. If you run into memory issues computing the vmap, please try a non-None chunk_size.
- Returns:
Returns a new “batched” function. It takes the same inputs as func, except each input has an extra dimension at the index specified by in_dims. It returns the same outputs as func, except each output has an extra dimension at the index specified by out_dims.
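The chunk_size argument from the Args list can be tried out as in the sketch below. It assumes torch.func.vmap from PyTorch 2.0 or later; the function f and the sizes are illustrative. Chunking should only change how many samples are processed per pass, not the result.

```python
import torch
from torch.func import vmap


def f(x):
    # Per-example function: [5] -> []
    return (x**2).sum()


x = torch.randn(8, 5)

full = vmap(f)(x)  # single vmap over all 8 samples
chunked = vmap(f, chunk_size=4)(x)  # two passes of 4 samples each
looped = vmap(f, chunk_size=1)(x)  # equivalent to a for-loop over samples
```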
One example of using vmap() is to compute batched dot products. PyTorch doesn’t provide a batched torch.dot API; instead of unsuccessfully rummaging through docs, use vmap() to construct a new function.
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.func.vmap(torch.dot)  # [N, D], [N, D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(x, y)
vmap() can be helpful in hiding batch dimensions, leading to a simpler model authoring experience.
>>> import torch
>>> batch_size, feature_size = 3, 5
>>> weights = torch.randn(feature_size, requires_grad=True)
>>>
>>> def model(feature_vec):
>>>     # Very simple linear model with activation
>>>     return feature_vec.dot(weights).relu()
>>>
>>> examples = torch.randn(batch_size, feature_size)
>>> result = torch.vmap(model)(examples)
vmap() can also help vectorize computations that were previously difficult or impossible to batch. One example is higher-order gradient computation. The PyTorch autograd engine computes vjps (vector-Jacobian products). Computing a full Jacobian matrix for some function f: R^N -> R^N usually requires N calls to autograd.grad, one per Jacobian row. Using vmap(), we can vectorize the whole computation, computing the Jacobian in a single call to autograd.grad.
>>> import torch
>>> # Setup
>>> N = 5
>>> f = lambda x: x**2
>>> x = torch.randn(N, requires_grad=True)
>>> y = f(x)
>>> I_N = torch.eye(N)
>>>
>>> # Sequential approach
>>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0]
>>>                  for v in I_N.unbind()]
>>> jacobian = torch.stack(jacobian_rows)
>>>
>>> # vectorized gradient computation
>>> def get_vjp(v):
>>>     # autograd.grad returns a tuple; take its only element
>>>     return torch.autograd.grad(y, x, v)[0]
>>> jacobian = torch.vmap(get_vjp)(I_N)
vmap() can also be nested, producing an output with multiple batched dimensions:
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(
...     torch.vmap(torch.dot)
... )  # [N1, N0, D], [N1, N0, D] -> [N1, N0]
>>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5)
>>> batched_dot(x, y)  # tensor of size [2, 3]
If the inputs are not batched along the first dimension, in_dims specifies the dimension along which each input is batched:
>>> import torch
>>> torch.dot  # [N], [N] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=1)  # [N, D], [N, D] -> [D]
>>> x, y = torch.randn(2, 5), torch.randn(2, 5)
>>> batched_dot(
...     x, y
... )  # output is [5] instead of [2] if batched along the 0th dimension
If there are multiple inputs, each of which is batched along a different dimension, in_dims must be a tuple with the batch dimension for each input:
>>> import torch
>>> torch.dot  # [D], [D] -> []
>>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None))  # [N, D], [D] -> [N]
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> batched_dot(
...     x, y
... )  # second arg doesn't have a batch dim because in_dims[1] was None
If the input is a Python struct, in_dims must be a tuple containing a struct matching the shape of the input:
>>> import torch
>>> f = lambda dict: torch.dot(dict["x"], dict["y"])
>>> x, y = torch.randn(2, 5), torch.randn(5)
>>> input = {"x": x, "y": y}
>>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},))
>>> batched_dot(input)
By default, the output is batched along the first dimension. However, it can be batched along any dimension by using out_dims:
>>> import torch
>>> f = lambda x: x**2
>>> x = torch.randn(2, 5)
>>> batched_pow = torch.vmap(f, out_dims=1)
>>> batched_pow(x)  # [5, 2]
For any function that uses kwargs, the returned function will not batch the kwargs but will accept kwargs:
>>> import torch
>>> x = torch.randn([2, 5])
>>> def fn(x, scale=4.):
>>>     return x * scale
>>>
>>> batched_pow = torch.vmap(fn)
>>> assert torch.allclose(batched_pow(x), x * 4)
>>> batched_pow(x, scale=x)  # scale is not batched, output has shape [2, 2, 5]
Note
vmap does not provide general autobatching or handle variable-length sequences out of the box.
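The description above mentions that vmap can compute batched gradients when composed with autograd. One way this might look is per-sample gradients of a small loss, sketched here with torch.func.grad and torch.func.vmap from PyTorch 2.0 or later. The loss function and the tensor names are hypothetical and chosen only for illustration.

```python
import torch
from torch.func import grad, vmap


def loss(weights, feature_vec, target):
    # Squared error of a linear model for a single example.
    pred = feature_vec.dot(weights)
    return (pred - target) ** 2


weights = torch.randn(5)
examples = torch.randn(3, 5)
targets = torch.randn(3)

# grad differentiates with respect to the first argument (weights);
# vmap maps over examples and targets while sharing the weights.
per_sample_grads = vmap(grad(loss), in_dims=(None, 0, 0))(weights, examples, targets)

# Analytic reference: 2 * (pred - target) * feature_vec for each example.
residual = examples @ weights - targets
expected = 2 * residual[:, None] * examples
```

Passing None in in_dims for the weights keeps them unbatched, matching the in_dims=(0, None) pattern shown earlier.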