On Derivative Notation, Part 1: The Problems

This is Part 1 of a five-part series on derivative notation. Part 2 explains the origin of these problems and why they persist. Parts 3 and 4 examine two approaches mathematicians have taken to resolve them. Part 5 discusses the challenge of notational redesign. The remaining parts will be published over the coming weeks.


In my recent post, I made a footnote that expressed some of my complaints about the current state of our notation for differentiation. I want to make that a little more precise here. See, I’ve always had a bit of trouble working with derivatives symbolically.1 For the longest time, I had trouble understanding what the $dx$ means in notation like $\frac{df}{dx}$. Is it a variable? An argument name? A warning label? (It might as well be.) I also had trouble keeping track of types, since many of the resources I used seemed to use “the derivative” loosely in a way that was inconsistent with its formal definition. The situation becomes even worse when one tries to write out higher-order derivatives of compositions of matrix- or tensor-valued functions, at which point the notation stops having any illusion of being helpful.

In this post, I want to present a selection of issues I’ve encountered. I will intentionally rely on your pre-existing knowledge to interpret notation like $\frac{df}{dx}$ and $\frac{\partial f}{\partial x}$ and so on, since part of the issue is how mechanically and ritualistically we treat these symbols. Depending on your background, some of these issues may seem more or less problematic. The goal here is simply to confront you with all of the confusion at once, since it’s unlikely that you can reconcile all of these problems with your internal model of the derivative unless your notational abuse Elo is above maybe 1600, or you’ve thought carefully about the issue.

The problems

Problems of interpretation

Is $\frac{df}{dx}$ a fraction? Let’s get this one out of the way. Everyone’s heard it, and most of you probably know the origin story for this notation and why it looks like a fraction. The answer is, well, kind of. It’s not really a fraction, but you can get uncomfortably far treating it as such, until you run into a “paradox” like the following.

Suppose we have three variables $x$, $y$, and $z$ subject to a constraint, so that only two of them can vary independently. For concreteness, say $x + y + z = 0$. If we treat the partial derivatives as fractions, then we might expect to find an identity like

$$\frac{\partial x}{\partial y} \frac{\partial y}{\partial z} \frac{\partial z}{\partial x} = 1.$$

You could also “derive” this via the chain rule. It’s true that $\frac{\partial x}{\partial x} = 1$. And naively applying the chain rule gives $\frac{\partial x}{\partial y} \frac{\partial y}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial x}{\partial x}$. So, just put these identities together and you arrive at the same identity.

But if you actually compute these partials, you get $-1$ for each of them. From $x = -y - z$, we have $\frac{\partial x}{\partial y} = -1$. From $y = -x - z$, we have $\frac{\partial y}{\partial z} = -1$. From $z = -x - y$, we have $\frac{\partial z}{\partial x} = -1$. The product is $(-1)^3 = -1$, not $1$. What went wrong?
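
If you want to watch the sign flip with your own eyes, here is a quick SymPy check (my own illustration, using the constraint above):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# Constraint x + y + z = 0: solve for each variable in terms of the
# other two, then differentiate as the "fraction" reading suggests.
dx_dy = sp.diff(sp.solve(x + y + z, x)[0], y)  # x = -y - z  ->  -1
dy_dz = sp.diff(sp.solve(x + y + z, y)[0], z)  # y = -x - z  ->  -1
dz_dx = sp.diff(sp.solve(x + y + z, z)[0], x)  # z = -x - y  ->  -1

print(dx_dy * dy_dz * dz_dx)  # -1, not 1
```

(This product-equals-$-1$ phenomenon is the triple product rule, and it holds for any suitable constraint, not just this linear one.)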

What is the scope of a partial derivative? Suppose we are given a function $f : \mathbb{R}^2 \to \mathbb{R}$, which we annotate as $f(x, y)$. Now tell me: what is $\frac{\partial}{\partial x} f(y, x)$? I have asked some version of this question in multiple rooms, and it has never produced fewer than three answers.2 Depending on whom you ask, you might hear that it’s the derivative of $f$ with respect to its first argument, evaluated at the point $(y, x)$. Or you will hear that it’s the gradient of the map $x \mapsto f(y, x)$, evaluated at the point $x$. Or you will hear that the expression is ambiguous and can’t be definitively answered without context.

Stanford’s CS 229 has some linear algebra review notes3,4 that point out another issue that can arise from this ambiguity.6 If $f : \mathbb{R}^m \to \mathbb{R}$ is given by $f(z) = z^\top z$, and $A \in \mathbb{R}^{m \times n}$ is a fixed matrix, then what is $\nabla f(Ax)$? Is it $\nabla_z f(z)$ evaluated at $z = Ax$, namely $2Ax \in \mathbb{R}^m$? Or is it $\nabla_x f(Ax) = 2A^\top A x \in \mathbb{R}^n$? Depending on your interpretation, the answer can have not just a different value, but a completely different type!
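
To make the type clash concrete, here is a small NumPy sketch (mine, using $f(z) = z^\top z$ as above) that computes both readings:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# f(z) = z.T @ z, so (grad f)(z) = 2z.
reading_1 = 2 * A @ x        # (grad f)(Ax): lives in R^m
reading_2 = 2 * A.T @ A @ x  # grad_x of the composition x |-> f(Ax): lives in R^n

print(reading_1.shape, reading_2.shape)  # (3,) (5,)
```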

What does $x$ mean in $\frac{d}{dx}$? Jason Howald posted a MathOverflow question that identifies a related problem. If we think of $\frac{d}{dx}$ as the abstract operator that takes a univariate function to its derivative, then $\frac{d}{dx}$ and $\frac{d}{dt}$ should be the same operator, since the symbol we decide to use to denote the argument of a function ought not matter. If $\frac{d}{dx}$ and $\frac{d}{dt}$ are the same operator, then we should be able to substitute one for the other freely. But if we annotate a function as $f(x)$, then it seems wrong to write $\frac{d}{dt} f(x)$. And, of course, $\frac{d}{dt} x^3 = 3x^2$ is wrong, while $\frac{d}{dx} x^3 = 3x^2$ is right.

I can hear some of you mumbling something about lambda expressions. You are probably imagining that $\frac{d}{dx} x^3$ is simply shorthand for $D(x \mapsto x^3)$, i.e., the derivative of the map which cubes its argument. So $\frac{d}{dx} x^3 = 3x^2$ expresses that $D(x \mapsto x^3) = (x \mapsto 3x^2)$. This helps, but if all $\frac{d}{dx}$ does is turn into $D$ after all variable expressions are turned into lambda functions, then surely you could write $\frac{d}{dt} x^3 = 3x^2$ as well. Perhaps worse, if you are suggesting that expressions with free variables are shorthand for the function object, then you’d have to accept that $x^3 = t^3$, since $x \mapsto x^3$ and $t \mapsto t^3$ are the same function.

What’s that? The $\frac{d}{dx}$ symbol binds the variable $x$? Well, Joel David Hamkins notes that this problem isn’t fully resolved by suggesting $\frac{d}{dx}$ binds the name $x$ like a quantifier, since in $\frac{d}{dx} x^2 = 2x$, one would like to interpret $x$ as bound by $\frac{d}{dx}$ on the left-hand side, but as free on the right-hand side. That creates a bit of a problem. As Michael Bächtold points out, one is then tempted to conclude from $\frac{d}{dx} x^2 = 2x$, $\frac{d}{dt} t^2 = 2t$, and $\frac{d}{dx} x^2 = \frac{d}{dt} t^2$ that $2x = 2t$, so that a priori distinct free variables are all actually the same (which would be nice in a way; a whole lot of mathematics would become simpler).7

The answer clearly isn’t to suppose that $x$ is free in the entire expression. Try making a substitution, say $x = 1$, in the identity $\frac{d}{dx} x^2 = 2x$. You get $\frac{d}{d1} 1^2 = 2$. What the hell does that mean? So we have to go the other way and try to expand the bound scope of $x$ beyond just the immediate expression to keep $x$ as bound in some consistent way. How far should this go? To both sides of an equation involving the derivative? To all identities in its context? What would be the nature of this binding? The trouble is that if we want to interpret $\frac{d}{dx}$ as binding $x$ in some way, then the nature of this binding depends on the context in a way that is difficult (or undesirable) to formalize. Sometimes $\frac{df}{dx}$ appears on its own and is, indeed, shorthand for the function $f'$. Other times it is intended to be a standalone free expression which is the evaluation of $f'$ at a free variable $x$. In an identity such as $\frac{df}{dx} = 2x$, we might mean the sentence “for all $x$, $f'(x) = 2x$”, or perhaps we mean simply the expression $f'(x)$, or perhaps we have a specific $x$ in mind. The notation is systematically ambiguous about whether and how $x$ is bound, and no single semantic interpretation captures all actual usages.

Now I’m sympathetic to the viewpoint here that we should just shut up and calculate and treat $\frac{d}{dx}$ as purely a rewrite rule for symbolic expressions. After all, that does seem to be the way we most often use $\frac{d}{dx}$ and its kin. Alas, I am not too sure that this quite works. On this view, $\frac{d}{dx} x^2 = 2x$ is perfectly coherent as a syntactic derivation. But should we attempt to interpret this identity to extract mathematical meaning, we run into the same issues as just described, so we’ll have to accept that either interpretation is to be done relatively ad hoc, or that we’ll have phantom expressions haunting our whiteboards and worksheets that on their own represent no specific mathematical content. What’s more, any given function still has no inherent variable names, so a purely syntactic operator would still have no way to distinguish $\frac{df}{dx}$ and $\frac{df}{dt}$ for an abstract $f$.
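
Computer algebra systems have to confront exactly this. SymPy, for one, treats $\frac{d}{dx}$ as a rewrite rule but freezes substitutions that would otherwise produce nonsense like $\frac{d}{d1}$; a small experiment (my own, not from the post's references):

```python
import sympy as sp

x = sp.symbols('x')
expr = sp.Derivative(x**2, x)  # unevaluated d/dx applied to x**2

print(expr.doit())      # 2*x: the rewrite rule fires
print(expr.subs(x, 1))  # Subs(Derivative(x**2, x), x, 1): the substitution
                        # is frozen rather than becoming d(1**2)/d1
print(expr.subs(x, 1).doit())  # 2: rewrite first, then substitute
```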

What is the difference between $\frac{\partial f}{\partial x}$ and $\frac{df}{dx}$? This question may seem absurd at first, but consider that for a function annotated as $f(x, y)$, the partial derivative with respect to $x$ at a point $(x_0, y_0)$ is usually defined to be

$$\frac{\partial f}{\partial x}(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + h, y_0) - f(x_0, y_0)}{h}.$$

And what is $\frac{df}{dx}$? Well, you might have heard this called the total derivative of $f$ with respect to $x$, given at $(x_0, y_0)$ by

$$\frac{df}{dx} = \frac{\partial f}{\partial x}(x_0, y_0) + \frac{\partial f}{\partial y}(x_0, y_0) \, \frac{dy}{dx}.$$

But what the hell is $\frac{dy}{dx}$? What does it mean to differentiate a variable with respect to another variable? We can’t apply the limit definition to “the second positional argument of $f$”. If we demand an answer, then it should seem the only sensible one is $\frac{dy}{dx} = 0$, and we find $\frac{df}{dx} = \frac{\partial f}{\partial x}$ after all.
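
This is precisely the convention a computer algebra system forces you to choose. In SymPy (my own sketch), a bare symbol $y$ is constant under $\frac{d}{dx}$, and the $\frac{dy}{dx}$ term only appears once you declare $y$ to be a function of $x$:

```python
import sympy as sp

x, y = sp.symbols('x y')

# y as an independent symbol: dy/dx is silently taken to be 0.
print(sp.diff(x**2 * y, x))      # 2*x*y

# Declare y = y(x): the total-derivative term appears.
yx = sp.Function('y')(x)
print(sp.diff(x**2 * yx, x))     # x**2*Derivative(y(x), x) + 2*x*y(x)
```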

Implicit differentiation produces different expressions. Start with the relationship between Cartesian coordinates $(x, y)$ and polar coordinates $(r, \theta)$, namely, $x = r \cos\theta$ and $y = r \sin\theta$. We want to treat $r$ as a function of the other variables, so that we can compute $\frac{\partial r}{\partial x}$. However, there is some ambiguity, first explicitly stated by Carl Gustav Jacob Jacobi in 1841 (though not with this example). If you “treat $y$ as constant”, then from $r = \sqrt{x^2 + y^2}$ you would compute

$$\frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}} = \cos\theta.$$

On the other hand, if you “treat $\theta$ as constant”, then from $r = x / \cos\theta$ you compute

$$\frac{\partial r}{\partial x} = \frac{1}{\cos\theta} = \sec\theta.$$

Obviously, $\frac{\partial r}{\partial x}$ cannot therefore refer to just one thing. This is why one often sees, particularly in physics, notation like $\left(\frac{\partial r}{\partial x}\right)_y$ and $\left(\frac{\partial r}{\partial x}\right)_\theta$, where the subscript distinguishes which variables are treated as constant. What exactly does this mean? Syntactically it is clear, since we understand how to handle concrete numbers like “2” when differentiating, so we just follow the same rules for the symbols “held constant” instead. But what is the formal meaning of such manipulations?
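
The two answers are easy to reproduce, and they genuinely disagree at the same point. A SymPy check (mine):

```python
import sympy as sp

x, y, theta = sp.symbols('x y theta')

# Hold y constant: r = sqrt(x**2 + y**2).
dr_dx_hold_y = sp.diff(sp.sqrt(x**2 + y**2), x)   # x/sqrt(x**2 + y**2)

# Hold theta constant: r = x/cos(theta).
dr_dx_hold_theta = sp.diff(x / sp.cos(theta), x)  # 1/cos(theta)

# Evaluate both at the same point, (r, theta) = (1, pi/3).
point = {x: sp.cos(sp.pi / 3), y: sp.sin(sp.pi / 3)}
print(dr_dx_hold_y.subs(point))                 # 1/2, i.e., cos(theta)
print(dr_dx_hold_theta.subs(theta, sp.pi / 3))  # 2,   i.e., sec(theta)
```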

Problems of representation

Why is the gradient a (column) vector? The gradient of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $x$ is traditionally defined as the vector

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) & \cdots & \frac{\partial f}{\partial x_n}(x) \end{bmatrix}^\top.$$

Let’s now take a look at the units involved. Suppose that the input to $f$ is a vector whose entries bear units of meters, and the output of $f$ has units of degrees Celsius. Then a quantity such as $\frac{\partial f}{\partial x_1}(x)$ will have units of degrees Celsius per meter, so the vector $\nabla f(x)$ collects together these temperature gradients. This issue comes in when we next consider doing gradient descent on $f$, which has iterates

$$x_{k+1} = x_k - \eta \nabla f(x_k)$$

for a step size $\eta > 0$. If $\eta$ is unitless, then this makes no sense. You can’t add a vector with entrywise units of meters to a vector with entrywise units of degrees Celsius per meter. Are we really to interpret $\eta$ then as having units of square meters per degree Celsius? Well, perhaps it is better to think of $\eta \nabla f(x_k)$ as a time-varying step size $\eta_k = \eta \lVert \nabla f(x_k) \rVert$, where now $\eta_k$ has units of meters and it scales a unitless direction vector $\nabla f(x_k) / \lVert \nabla f(x_k) \rVert$. (Still, the iterates as described above are standard.)
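
In code, this reading just factors the update into a unitless direction and a step length carrying the units. A toy NumPy sketch (mine, with a made-up objective):

```python
import numpy as np

def grad_f(x):   # gradient of f(x) = 0.5 * ||x||^2;
    return x     # units: degrees Celsius per meter

x = np.array([3.0, 4.0])  # meters
eta_k = 0.5               # step length, in meters

g = grad_f(x)
direction = g / np.linalg.norm(g)  # unitless
x_next = x - eta_k * direction     # meters minus meters: dimensionally fine
print(x_next)                      # [2.7 3.6]
```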

That still leaves a more technical issue. (Don’t worry if this is hard to follow right now; I explain in greater detail in Part 3.) More abstractly, take $f : V \to \mathbb{R}$ to be a function over a generic Banach space $V$. Then $\nabla f(x)$ is supposed to be a representative vector in $V$ that encodes the total derivative $Df(x)$, which is an element of the dual space $V^*$. (In particular, don’t confuse this with the total derivative $\frac{df}{dx}$ from earlier. I’ll explain how these relate in Part 4.) But even in the finite-dimensional case, there is not a canonical such representative unless we impose some additional structure, such as an inner product. Yet we can nevertheless make sense of gradient descent in some of these spaces, and indeed we can do so by picking elements of $V^*$. So what is “the gradient” of a function really? Does it depend on the context?

There is another issue here, which is that we optimization people like to use the notation $\nabla^2 f(x)$ to denote the Hessian of $f$ at $x$. (Physicists would object that $\nabla^2 f(x)$ is the Laplacian of $f$ at $x$.) This should be illegal though, since $\nabla f$ as defined is a map $\mathbb{R}^n \to \mathbb{R}^n$, so it makes no sense to apply the gradient operator to it. And of course $\nabla f(x)$ is just a vector, so it’s also not a candidate for differentiation in this way. We apparently have to be sloppy and only now treat $\nabla f$ as a map into linear functionals (in other words, transpose it), and we can then take its gradient. (This is, for example, what the CS 229 review notes suggest to do.) And then what if we want a third-order or even higher-order derivative? Is $\nabla^3 f(x)$ a thing? How would we make sense of that notation?
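
If we do accept the “transpose, then differentiate each component” recipe, the Hessian becomes the Jacobian of the gradient map, which is at least mechanically checkable. A SymPy sketch (mine):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
X = sp.Matrix([x1, x2])
f = x1**2 * x2 + sp.sin(x2)

grad = sp.Matrix([f]).jacobian(X).T  # column gradient (transpose of 1 x 2 Jacobian)
hess = grad.jacobian(X)              # Jacobian of the gradient map

print(hess)                             # [[2*x2, 2*x1], [2*x1, -sin(x2)]]
print(hess == sp.hessian(f, (x1, x2)))  # True
```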

Layout conventions. When dealing with a map $f : \mathbb{R}^n \to \mathbb{R}^m$, or a map $g : \mathbb{R} \to \mathbb{R}^{m \times n}$, or a map $h : \mathbb{R}^{m \times n} \to \mathbb{R}$ (whose arguments we’ll label $x$, $t$, and $X$, respectively), there are different conventions for what kind of thing $\frac{\partial f}{\partial x}$, $\frac{\partial g}{\partial t}$, and $\frac{\partial h}{\partial X}$ are. In the so-called numerator layout or Jacobian formulation, we are told

$$\frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}, \qquad \frac{\partial g}{\partial t} \in \mathbb{R}^{m \times n}, \qquad \frac{\partial h}{\partial X} \in \mathbb{R}^{n \times m}.$$

In the so-called denominator layout or Hessian formulation, all of these are transposed. And holy buckets are there problems here. Let’s just focus on the numerator layout for now. Each of the three partials is a matrix, yes, but they encode their respective total derivatives (see Part 3) in entirely different ways.

  • The Jacobian $\frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n}$ encodes the linear map $Df(x) : \mathbb{R}^n \to \mathbb{R}^m$ via the matrix-vector multiplication $v \mapsto \frac{\partial f}{\partial x} v$.
  • The matrix $\frac{\partial g}{\partial t} \in \mathbb{R}^{m \times n}$ encodes the linear map $Dg(t) : \mathbb{R} \to \mathbb{R}^{m \times n}$ via the scalar multiplication $s \mapsto s \, \frac{\partial g}{\partial t}$.
  • The matrix $\frac{\partial h}{\partial X} \in \mathbb{R}^{n \times m}$ encodes the linear map $Dh(X) : \mathbb{R}^{m \times n} \to \mathbb{R}$ via the trace inner product $V \mapsto \operatorname{tr}\left(\frac{\partial h}{\partial X} V\right)$.

This is a problem because the chain rule for the total derivative, $D(a \circ b)(x) = Da(b(x)) \circ Db(x)$, involves composition of linear maps, which corresponds to matrix multiplication only when all the derivatives are encoded consistently (as with the Jacobian encoding for $f$). Consider the composition

$$g \circ h : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}.$$

The chain rule says $D(g \circ h)(X) = Dg(h(X)) \circ Dh(X)$. Well, for a perturbation $V \in \mathbb{R}^{m \times n}$, $Dh(X)$ first produces the scalar $\operatorname{tr}\left(\frac{\partial h}{\partial X} V\right)$, then $Dg(h(X))$ scales $\frac{\partial g}{\partial t}$ by this scalar. The result is

$$D(g \circ h)(X)[V] = \operatorname{tr}\left(\frac{\partial h}{\partial X} V\right) \frac{\partial g}{\partial t},$$

an outer-product structure that is emphatically not matrix multiplication. If you naively write “$\frac{\partial (g \circ h)}{\partial X} = \frac{\partial g}{\partial t} \frac{\partial h}{\partial X}$” and interpret it as matrix multiplication, then you get nonsense (under the common same-shape-as-$X$ convention for $\frac{\partial h}{\partial X}$, the product is not even defined unless $m = n$, and even then it’s wrong).
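
The outer-product formula itself is easy to verify numerically. Here is a finite-difference check (my own, with made-up $g$ and $h$); note that with the gradient of $h$ stored in the same shape as $X$, the trace pairing $\operatorname{tr}(G^\top V)$ is just an elementwise sum:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3

# Made-up smooth maps: h: R^{m x n} -> R and g: R -> R^{m x n}.
C = rng.standard_normal((m, n))
def h(X): return np.sum(C * X) ** 2
def g(t): return np.array([[np.sin(t), t, t**2],
                           [np.cos(t), 1.0, t**3]])

X = rng.standard_normal((m, n))
V = rng.standard_normal((m, n))  # perturbation direction
eps = 1e-5

# Directional derivative of g ∘ h at X along V, by central differences.
fd = (g(h(X + eps * V)) - g(h(X - eps * V))) / (2 * eps)

# Outer-product formula: D(g∘h)(X)[V] = tr((dh/dX) V) * dg/dt at t = h(X).
G = 2 * np.sum(C * X) * C                             # gradient of h, m x n
gprime = (g(h(X) + eps) - g(h(X) - eps)) / (2 * eps)  # dg/dt at h(X), m x n
print(np.allclose(fd, np.sum(G * V) * gprime, atol=1e-3))  # True
```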

OK, so let’s now focus on the case of $f : \mathbb{R}^n \to \mathbb{R}^m$, since that seems to correspond most clearly with the total derivative. It is common in multivariable calculus and analysis to rename $\frac{\partial f}{\partial x}$ in the numerator layout as the Jacobian $J_f$. It satisfies the familiar chain rule

$$J_{f \circ g}(x) = J_f(g(x)) \, J_g(x).$$

Unfortunately, this nice structure is butchered when you use the numerator layout and gradients, since you’re implicitly mixing conventions. Strictly adhering to the numerator layout would suggest you should take the gradient to be a row vector, not a column vector. To see how quickly things become unwieldy, imagine a toy computation that is a composition of maps, alternating between vector-valued and scalar-valued,

$$F = f_4 \circ f_3 \circ f_2 \circ f_1, \qquad f_1 : \mathbb{R}^n \to \mathbb{R}^p, \quad f_2 : \mathbb{R}^p \to \mathbb{R}, \quad f_3 : \mathbb{R} \to \mathbb{R}^q, \quad f_4 : \mathbb{R}^q \to \mathbb{R},$$

and suppose we want the gradient of $F$ with respect to $x$, expecting an $n$-dimensional column vector. In numerator layout, the Jacobians at a point are $J_{f_1} \in \mathbb{R}^{p \times n}$, $J_{f_2} \in \mathbb{R}^{1 \times p}$, $J_{f_3} \in \mathbb{R}^{q \times 1}$, and $J_{f_4} \in \mathbb{R}^{1 \times q}$. Using column gradients for the scalar-valued maps, we have $\nabla f_2 = J_{f_2}^\top$ and $\nabla f_4 = J_{f_4}^\top$. Working out the chain rule while respecting these conventions, we obtain (suppressing the evaluation points)

$$\nabla F = J_{f_1}^\top \, \nabla f_2 \left[ J_{f_3}^\top \, \nabla f_4 \right],$$

where the bracketed factor is a scalar. Good luck intuiting that this is the correct order for the objects to appear when applying the chain rule manually. And this is just with functions whose derivatives are readily written as matrices. Only God8 can help you if you want to do any higher-order derivatives, or derivatives of maps between spaces of matrices or higher-order tensors.9
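
If you don't trust the ordering above, it can at least be checked numerically. A NumPy sketch (mine, with made-up maps of the stated shapes) that assembles $\nabla F$ in exactly that order and compares it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 4, 3, 2

A = rng.standard_normal((p, n))
b = rng.standard_normal(q)

f1 = lambda x: np.tanh(A @ x)               # R^n -> R^p
f2 = lambda u: np.sum(u**2)                 # R^p -> R
f3 = lambda s: np.array([np.sin(s), s**2])  # R   -> R^q
f4 = lambda v: v @ b                        # R^q -> R
F  = lambda x: f4(f3(f2(f1(x))))

x = rng.standard_normal(n)
u = f1(x)
s = f2(u)

J1 = (1 - np.tanh(A @ x)**2)[:, None] * A  # p x n Jacobian of f1
g2 = 2 * u                                 # column gradient of f2
J3 = np.array([np.cos(s), 2 * s])          # q x 1 Jacobian of f3, stored flat
g4 = b                                     # column gradient of f4

# grad F = J1^T g2 [J3^T g4], with the bracketed factor a scalar.
grad = J1.T @ g2 * (J3 @ g4)

eps = 1e-6
fd = np.array([(F(x + eps * e) - F(x)) / eps for e in np.eye(n)])
print(np.allclose(grad, fd, atol=1e-4))    # True
```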


These examples are meant to illustrate the sheer variety of ways our derivative notation has to wear away our will to live. In the next post, I’ll explain how we ended up in this mess.


  1. In general, I struggle working with notation that obscures or conflates types, has inconsistent usage, or omits important dependencies. Expectation is another one of those places where I run into a lot of friction in understanding computations, since the measure is often implicit. Strangely, I actually quite like the notation Olav Kallenberg uses for measure theory, e.g. writing $\mu f$ instead of $\int f \, d\mu$. I am not as much of a fan of the notation he adopts for kernels, but I could get used to it.

  2. Ignore the fact that I was alone in all of these rooms and supplied all three answers myself.

  3. I'd link them here but they require Stanford credentials to view.

  4. I noticed some errors while I was searching through them for this example. So if someone is reading this who can address these, then please do! The worst of the errors is that on page 19 it is claimed that "The rank of $A$ is equal to the number of non-zero eigenvalues of $A$", which is false unless $A$ is diagonalizable. For instance,

    $$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$$

    has rank 1 but both its eigenvalues are 0. There is also a typo on page 26 which suggests to take a matrix $A \in \mathbb{R}^n$, which is I imagine supposed to be $A \in \mathbb{R}^{n \times n}$. Finally, there is an unaddressed subtlety in the derivation of the derivative of $\log|A|$. The domain is $\mathbb{S}^n_{++}$, not $\mathbb{R}^{n \times n}$, so technically the tangent space in this case is $\mathbb{S}^n$. That is, the perturbations are restricted to be symmetric. Under the trace inner product, orthogonal projection onto $\mathbb{S}^n$ is just $V \mapsto \frac{1}{2}(V + V^\top)$, so we can compute the derivatives here by extending to an open subset of $\mathbb{R}^{n \times n}$,5 computing the gradient of the extension, and then symmetrizing the result. Happily, the "forgetful" derivative in this case is already symmetric, so it works out. But a priori this is not guaranteed.

  5. For instance, take $U = \{ X \in \mathbb{R}^{n \times n} : \det X > 0 \}$, and define the extension as $\tilde{f}(X) = \log \det X$ for $X \in U$.

  6. I've changed the example slightly to label the argument to $f$ as $z$ so I could subscript $\nabla$ with $z$, because I actually don't think there should be any ambiguity at all with $\nabla f(Ax)$. On its own, $\nabla$ operates on functions and ought to be coordinate-free, so I don't personally see how $\nabla f(Ax)$ can be interpreted to mean anything other than $(\nabla f)(Ax)$.

  7. This puzzle is reminiscent of "the antinomy of the variable", which is about how variables in logic appear to be both the same (freely renamable placeholders) and different (we distinguish $x$ from $y$ when they appear jointly). For instance, variables $x$ and $y$ must have different meanings because substituting one for the other can change the meaning of a formula, e.g., $x < y$ is different from $x < x$. Yet they must have the same meaning because sentences that differ only by alphabetic variation are synonymous, e.g., $\exists x \, (x < 7)$ is the same as $\exists y \, (y < 7)$. See The Antinomy of the Variable: A Tarskian Resolution by Bryan Pickel and Brian Rabern, which provides a more thorough discussion of the problem and examines a proposed resolution using Tarskian semantics.

  8. Also known as Wolfram Mathematica, as of Version 14.1.

  9. I suffered great anguish because of this issue. In Spring 2023 I was two points off of a perfect score on my CS 229 final exam because, in a question that asked us to manually do backpropagation on a composite function involving matrices, I had my matrix derivative transposed relative to what was expected.10 If only we were permitted to use matrixcalculus.org during the exam... well, actually, it probably wouldn't have helped me, since I did not realize that the course defaulted to the denominator layout, but matrixcalculus.org uses a mixed layout.

  10. Well, this was only a one-point deduction. Even if I had gotten this question correct, I missed a second point because I misread one of the True/False questions.