Questions

Week 13:

• Q: why is the d* which maps 1-forms to 0-forms in the discrete called the divergence?
A: The adjoint d* produces from a 1-form F a 0-form d*F. The confusion about the terminology comes from how things were used traditionally. In the divergence theorem, the divergence technically produces from a 2-form a 3-form. This would be the correct way to see the situation. But since 2-forms can be written as vector fields, which are in turn identified with 1-forms, the divergence can also be seen as a map which produces from a vector field a scalar field. Now we can combine the divergence with the gradient and build div grad f = div [f_x, f_y, f_z] = f_xx + f_yy + f_zz, which is the Laplacian. If one pairs 0-forms with 3-forms and 1-forms with 2-forms in the right way (similarly to how one can pair column vectors with row vectors; this can be done by introducing a dot product on p-forms), then the divergence can also be seen as the adjoint of d and so be written as d*. This is the more precise point of view, as d* maps a 1-form to a 0-form. In the discrete, one sees this very well. If the graph has say 4 vertices and 5 edges, then the gradient maps a function of 4 variables to a function of 5 variables. This is a 5x4 matrix d. The transpose matrix d^T, which one writes as d*, maps a function on the edges to a function on the vertices. This is a 4x5 matrix. If one now combines div grad = d* d, one gets a 4x4 matrix, which can be seen as a map on functions on the vertices.
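The discrete picture can be sketched in a few lines; the graph below (4 vertices, 5 edges, with arbitrarily chosen orientations) is a hypothetical example:

```python
# a hypothetical graph with 4 vertices and 5 oriented edges (orientations chosen arbitrarily)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

# gradient d: maps a function f on the 4 vertices to a function on the 5 edges,
# i.e. a 5x4 matrix acting on f
def grad(f):
    return [f[b] - f[a] for (a, b) in edges]   # (df)(edge) = f(head) - f(tail)

# divergence d* = d^T: maps a function F on the 5 edges back to the 4 vertices
def div(F):
    g = [0.0] * 4
    for k, (a, b) in enumerate(edges):
        g[a] -= F[k]
        g[b] += F[k]
    return g

# the Laplacian d* d is then a 4x4 matrix acting on vertex functions:
# it kills constants, and its diagonal entries are the vertex degrees
assert div(grad([1.0, 1.0, 1.0, 1.0])) == [0.0] * 4
assert div(grad([1.0, 0.0, 0.0, 0.0]))[0] == 2.0
```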
• Q: How to get started with 36.3-5?
• Q: How come, that anti-symmetry is so important?
Both in the discrete and in the continuum, anti-symmetry is important for differential forms. In the continuum, we deal with multi-linear functions = tensors rather than vectors, and with anti-symmetry we get forms. Also in the discrete, we deal with functions, but functions on complete subgraphs of a graph. Anti-symmetry like dxdy = -dydx also assures dxdx=0, which manifests in mathematics in the nice property that d d F = 0. If we differentiate twice, we get zero. (This means curl(grad(f))=0 and div(curl(F))=0 for example.) In physics, anti-symmetry manifests in the Pauli exclusion principle. It is not possible for two identical Fermions to occupy the same quantum state. Mathematically, the wave function for two identical particles is anti-symmetric. The consequences are essential for us. Without that principle, electrons in an atom would all occupy the lowest energy state. It is the Pauli exclusion principle which forces the electrons to occupy also higher energy states. This allows the build-up of an interesting periodic table of elements, enabling life to exist. (Life would be difficult to imagine in a hydrogen-only world.) The anti-symmetry is also mathematically pleasant. We have not yet talked much about determinants, but determinants are examples of anti-symmetric multi-linear functions. Now, if we look at the symmetric analogue of determinants (called permanents), one faces a much harder computational task. Nobody knows how to compute permanents quickly. Try it out in Mathematica; the time to compute it doubles each time we increase the number of rows and columns by 1:
```
F[k_] := Permanent[Table[Random[], {k}, {k}]]; F[16] (* permanent of a random k x k matrix *)
A = Table[First[Timing[F[k]]], {k, 16}]              (* timings for k = 1 .. 16 *)
ListPlot[Log[A]]                                     (* the log of the times grows linearly *)
Fit[Log[A], {1, t}, t]                               (* the slope is close to Log[2] *)
```
For a 16x16 matrix, we already need more than 7 seconds to compute the permanent. For a 20x20 matrix, it is already more than 3 minutes. There is no way we could, say, compute the permanent of a random 100x100 matrix. But for determinants, this is easy. To compute the determinant of a 1000x1000 matrix, the computer needs only 0.3 seconds, where most of the time is spent building up the matrix.
```
k = 1000; Timing[Det[Table[Random[], {k}, {k}]]] (* determinant of a random 1000x1000 matrix *)
```
• Q: when does one write dx ∧ dy and when dxdy?
Both are used. The father of differential forms, Élie Cartan, wrote dxdy and avoided the wedge symbol. It does not matter, but leaving away the wedge reduces clutter in the notation. Once we know that dxdy = -dydx, we don't have to remind ourselves every time again. But there is a reason for the notation dx ∧ dy: one can define an algebra of differential forms, where one has a multiplication, and this multiplication is usually written as ∧. Now, dx is a 1-form, which is the linear map dx(u) = u_1, and dy is a 1-form, which is the linear map dy(u) = u_2, and then dx ∧ dy is a 2-form, which is the map dx ∧ dy (u,v) = u_1 v_2 - u_2 v_1. (This is a bilinear anti-symmetric function, assigning a number to two vectors u,v in R^n.) But one can then also abbreviate this to dxdy. Just remember that dxdy is not the same as dydx.
• Q: In which order do we write the entries of a differential 2-form in R^4?
There is no specific order. We usually write A dxdy + B dxdz + C dxdw + D dydz + E dydw + F dzdw for the 6 components. It is not necessary to rewrite this as a vector, but if you want, you can write it as [A,B,C,D,E,F]. An example of a 2-form in physics is the electromagnetic field F which has 6 components, three electric and three magnetic components.
• Q: How do we integrate in 4D?
In Problem 35.5 you have to integrate over the interior of a 4-dimensional ball which has a 3-sphere as a boundary. That integral can be done in Cartesian coordinates using a computer, but with the Hopf parametrization r(ρ,φ,θ_1,θ_2) = [ρ cos(φ) cos(θ_1), ρ cos(φ) sin(θ_1), ρ sin(φ) cos(θ_2), ρ sin(φ) sin(θ_2)], it is also possible by hand. You have already computed the integration factor ρ^3 sin(2φ)/2 in a previous homework. Still, you might want to use Mathematica or Wolfram Alpha to compute the integral.
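As a sanity check of that integration factor (a numerical sketch, not part of the homework): integrating ρ^3 sin(2φ)/2 over the full parameter ranges must give the volume π^2/2 of the unit 4-ball:

```python
import math
# midpoint-rule sanity check: integrating the factor rho^3 sin(2 phi)/2 over
# rho in [0,1], phi in [0, pi/2] and both angles theta1, theta2 in [0, 2 pi]
# must reproduce the volume pi^2/2 of the unit ball in R^4
n = 1000
drho, dphi = 1.0 / n, (math.pi / 2) / n
s_rho = sum(((i + 0.5) * drho) ** 3 for i in range(n)) * drho             # = 1/4
s_phi = sum(math.sin(2 * (j + 0.5) * dphi) / 2 for j in range(n)) * dphi  # = 1/2
volume = s_rho * s_phi * (2 * math.pi) ** 2
assert abs(volume - math.pi ** 2 / 2) < 1e-3
```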

Week 12:

• Q: How much of the applications do we have to know?
We have seen a few motivations this week, don't worry about monopoles, complex integration or fluid dynamics. But they are all great motivators.

Week 11:

• Q: will we see the general Stokes theorem?
We saw that the proof of Stokes theorem was quite easy when seeing it just as a transplanted Green theorem. The parametrization of a surface S by a two dimensional region G imports the curl. This is the content of the important formula, which is by the way proven on the AI page. It turns out that the ``important formula" generalizes to general differential forms. We will look on Thursday after Thanksgiving at the formalism, define differential forms and see the general theorem. The idea is the same. We will see that the divergence theorem essentially paves the way. The divergence theorem in n dimensions deals with (n-1)-forms F. The exterior derivative dF is then an n-form. In two dimensions, this is essentially Green (a turned version), in three dimensions it is the Gauss divergence theorem, and it generalizes to higher dimensions. If one uses a parametrization to push this theorem to a ``surface" in R^n, we get the general Stokes theorem. But we will define differential forms and see that the general theorem is not much more difficult than Green. It is just that some language has to be built. The book of Henri Cartan (who shaped the modern language of differential forms) is still the gold standard.
• Q: who invented the discrete case?
It is mostly due to Kirchhoff in the 1850s. It has also been standard in topology for more than 100 years to do computations in topological spaces with some sort of discretization like simplicial complexes. The language of graph theory is especially suited, as graphs are very intuitive. And even differential geometry becomes simpler. Many notions in multivariable calculus can be translated to notions in the discrete. It is possible to treat things in very short expositions in the language of graph theory. Mathematically a bit more elegant are simplicial complexes, as these are just finite sets of non-empty sets which are closed under the operation of taking non-empty subsets. It is an amazing world. One can also see it as quantum calculus.

Week 10:

• Q: is this course curved?
A: yes, we like innovation and are obsessed with originality; in this course we will, for the first time in history, use a random curve. Joke aside, it is probably the most frequently asked question in schools, especially around midterm times, but also one of the silliest. Why? First of all, one must have a clear definition of what "curved" means. Any academic course needs to evaluate performance, and this is usually done by assigning in a predefined way a numerical value 0-100. This then has to be translated into a letter grade. It is here that the "curve" comes in. The curve is a graph of a function from a numerical value [0,100] to a letter value F-A. Now, this function is usually monotone. [Thinking out of the box, we could actually try for once a non-monotone one like 95-100 for A, 90-95 for C, 75-85 for B and the rest are all D's. It would certainly be innovative.] Every course therefore has to use a curve like that. But this is not what "is this course curved?" means. What the question usually means is whether there is a predefined curve which is applied rigidly, like 5 percent A, 30 percent B, 40 percent C, 15 percent D's and 10 percent F, or whether there is a predefined quantization scheme. Now, this is rarely done at the college level; it is rarely done even in high schools, as it can be rather unfair, especially in small classes. It would mean for example in a class of 6 that one student will get an A, and one will fail, no matter what. Or that, if the curve has been determined without failure, every student will pass, no matter what. Such rigid curving is also not good because, in a GPA-obsessed environment, it encourages taking only easy courses, which is a shame, as one learns most if one is challenged in a healthy way.
I myself have never taken nor seen a course "curved" in that sense, and if it is, then the grade structure is built rather softly and in a flexible way with various, usually opaque, "extra credit" options which come with their own problems: a strong student can be beaten by somebody pulling some clever "extra credit stunt" at the end. So, what translation curve from [0,100] to [F-A] is used here? The answer is that this will be decided at the very end, when the distribution and the difficulties of the exams are known. Most classes determine the curve after the distribution is known and not before. So, it is possible in principle that everybody in the course gets an A (very unlikely); it is also possible in principle that everybody fails the course (also unlikely, but maybe a bit more likely than all A's). What is the upshot? Just work hard!

Week 9:

• Q: How does one compute the surface area of the intersection of three cylinders elegantly? The integral is a bit complicated.
A: Yes, the integrals in the case of three cylinders are indeed a bit trickier. Let's first look at the volumes. In both the intersection of two cylinders and of three cylinders we have a graph f(x,y) = (1-x^2)^(1/2) (which comes from the cylinder x^2+z^2=1). In the intersection of two cylinders, we integrate over a triangular region R bounded by x=y, x=-y and y=1. Eight times the integral ∫_R f(x,y) dx dy = 2/3 gave the volume 16/3. For the intersection of three cylinders, the region R is intersected with x^2+y^2 < 1, leading to a sector G. When integrating, the result was 16-8*2^(1/2), as computed in section 25.9 in the text (or below with Mathematica). Now for the surface areas: we do not integrate f(x,y) now, but |r_x x r_y| with r(x,y) = [x,y,f(x,y)]^T. This simplifies to (1-x^2)^(-1/2). In the case of the intersection of two cylinders, we integrate this over the triangle R and get 2. The total area is 16. It is an observation of Archimedes that this is the surface area of the square cylinder wrapped around this solid. It is the same principle which shows that the surface area of the sphere is 4π, the surface area of the cylinder wrapped around it. Now, when integrating |r_x x r_y| over G and multiplying by 8, one gets 32-16*2^(1/2), which is twice the volume. The integrals can of course also be found computer assisted:
```
Area3Cylinders   = 8 Integrate[r/Sqrt[1-r^2 Cos[t]^2], {r,0,1}, {t,-Pi/4,Pi/4}]
Volume3Cylinders = 8 Integrate[r Sqrt[1-r^2 Cos[t]^2], {r,0,1}, {t,-Pi/4,Pi/4}]
```
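The volume integral can also be checked numerically without Mathematica; here is a minimal midpoint-rule sketch (the exact value is 16 - 8*2^(1/2)):

```python
import math
# midpoint-rule check of 8 * Integral of r*sqrt(1 - r^2 cos^2 t) dr dt
# with r in [0,1] and t in [-pi/4, pi/4]; the exact value is 16 - 8*sqrt(2)
n = 400
dr, dt = 1.0 / n, (math.pi / 2) / n
total = 0.0
for i in range(n):
    r = (i + 0.5) * dr
    for j in range(n):
        t = -math.pi / 4 + (j + 0.5) * dt
        total += r * math.sqrt(1 - r ** 2 * math.cos(t) ** 2)
volume = 8 * total * dr * dt
assert abs(volume - (16 - 8 * math.sqrt(2))) < 1e-2
```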
• Q: Why does the volume of unit spheres peak at about dimension 6?
A: The recursion V_n = (2π/n) V_{n-2} multiplies the volume by 2π/n with every increase of the dimension by two, where n is the dimension. If n is smaller than 2π ≈ 6.28, we expect the volume to increase with the dimension; if n is larger than 2π, we expect the volume to decrease. The actual maximum among integer dimensions happens at n=5.
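A quick sketch of the recursion (starting from V_1 = 2, the interval, and V_2 = π, the disc):

```python
import math
# volumes of unit n-balls from the recursion V_n = (2*pi/n) * V_{n-2},
# starting with V_1 = 2 (interval) and V_2 = pi (disc)
V = {1: 2.0, 2: math.pi}
for n in range(3, 13):
    V[n] = (2 * math.pi / n) * V[n - 2]
# the factor 2*pi/n exceeds 1 only while n < 2*pi ~ 6.28,
# and the maximum over integer dimensions happens at n = 5
assert max(V, key=V.get) == 5
assert math.isclose(V[5], 8 * math.pi ** 2 / 15)
```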
• Q: in HW 25.3, to compute the volume of the remaining part, is it possible to compute things without integration?
A: Yes, look up the ``inclusion-exclusion formula". Start with the cube, subtract the volumes of the cylinders, add back the volumes of the pairwise intersections of two cylinders, and subtract the volume of the intersection of all three cylinders. (By the way, the intersection of three cylinders is a computation which is given in the text.)
• Q: In HW 25.5, can we compute the integral without using the value of the one-dimensional integral of e^(-x^2), which we can call the unknown I?
A: Yes. First see that the triple integral is I^3. Then see (integrating by parts in the variable ρ) that the right hand side is 4π I/4 = π I. From I^3 = π I we can compute I = π^(1/2).
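A sketch of that computation in formulas (assuming the integrand is the Gaussian e^{-x^2-y^2-z^2}):

```latex
I = \int_{-\infty}^{\infty} e^{-x^2}\,dx, \qquad
I^3 = \iiint_{\mathbb{R}^3} e^{-x^2-y^2-z^2}\,dV
    = 4\pi \int_0^{\infty} \rho^2 e^{-\rho^2}\,d\rho
    = 4\pi \cdot \frac{I}{4} = \pi I
```

so that I^2 = π and I = π^(1/2). (The step ∫_0^∞ ρ^2 e^{-ρ^2} dρ = I/4 is the integration by parts, with u = ρ and dv = ρ e^{-ρ^2} dρ.)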

Week 8:

• Q: When we see a volume computation, is this not a triple integral?
A. Yes, it is, and that would also be the proper way to look at it. However, we have seen that the double integral ∫∫_R f(x,y) dx dy can be interpreted as the volume under the graph of f(x,y) and above the xy-plane (if f is negative somewhere, that part counts negatively). We will look at triple integrals next week.
• Q: How does the change of order of integration work?
A. It is important to make a picture. So, if you see an integral, where your integration techniques do not work, also try to change the order of integration. There are two important steps:
1. First translate the integral into a picture. It is important to take that seriously and make it large enough. All the lines and curves which matter should be drawn in. It is also helpful to label each line. That will help in the next step.
2. Write down the integral again using the picture, but with the order changed. This should give an integral where the inner part can be solved, and which, once solved, simplifies so that the outer part can be solved too.
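A classical example (a hypothetical illustration, not one of the homework problems): the inner integral of e^(x^2) has no elementary antiderivative, but after swapping the order on the region {0 < y < x < 1}, everything becomes elementary:

```python
import math
# Integral over {0 < y < x < 1} of exp(x^2): in the order dy dx the inner
# integral in y is trivial and leaves the elementary integral
# of x*exp(x^2) over [0,1], whose value is (e-1)/2; checked with a midpoint rule
n = 100000
dx = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * dx
    total += x * math.exp(x * x)
total *= dx
assert abs(total - (math.e - 1) / 2) < 1e-6
```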

Week 7:

• Q: In Lagrange case with one constraint g=c, one also has to look at the critical points of g. What happens in the case of two constraints?
A: If the gradients of g and h are both non-zero at the critical point, the Fermat principle tells us that the gradient of f has to be in the plane spanned by the gradients of g and h. If the gradient of one function, like g, is zero and the gradient of h is not, then the gradients of f and h have to be parallel. In that case we look at ∇f = λ ∇h, ∇g = 0, g=c, h=d. We also have to look at the cases where both g and h are critical. In that case we look at ∇g = 0, ∇h = 0, g=c, h=d. Note however that in Morse situations, ∇g = 0 alone produces only finitely many solutions.
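Here is a small numerical sketch of the generic case with two constraints; the functions f, g, h below are hypothetical choices, not from the homework:

```python
import math
# hypothetical example: extremize f = x+y+z under g = x^2+y^2+z^2 = 1 and h = z = 0.
# In the generic case both gradients are nonzero and grad f lies in the span of
# grad g and grad h:  1 = 2*lam*x, 1 = 2*lam*y, 1 = 2*lam*z + mu, g = 1, h = 0
x = y = 1 / math.sqrt(2)
z = 0.0
lam = 1 / (2 * x)
mu = 1 - 2 * lam * z
assert math.isclose(2 * lam * x, 1) and math.isclose(2 * lam * y, 1)
assert math.isclose(2 * lam * z + mu, 1)
assert math.isclose(x * x + y * y + z * z, 1) and z == 0
# the maximal value of f on the constraint set is sqrt(2)
assert math.isclose(x + y + z, math.sqrt(2))
```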
• Q: How much do we have to know about Morse functions?
Definitely the definition and the Morse lemma. If you want to see a bit more detailed proof, here [PDF] is the page from THE book (by John Milnor) about Morse theory. Essentially every other text covering Morse theory follows this book, sometimes pretty closely. The lemma tells that Morse functions are nice also near critical points. In general, functions can be nasty, so catastrophically nasty that there is an entire branch of mathematics dedicated to them (catastrophe theory). The lemma is needed already for understanding what happens for a function of two variables with a critical point for which D is non-zero. We have proven that the point is then either a maximum, a minimum, or a point which is neither a maximum nor a minimum. It is not a priori clear from that whether the critical point could be some kind of monkey saddle. By the way: we love the monkey saddle: you find here an exhibit from 2007, here a monkey saddle problem from 2014; we printed out the monkey saddle in 2016 (clip youtube) or then placed it in google earth (clip youtube). What happens at non-Morse points is that the function can be crazy already in one dimension. Critical points can accumulate, like with the function f(x) = sin(1/x) x^4, or then we can have infinitely many critical points. The function f(x)=exp(-1/x) for x > 0 and f(x)=0 for x ≤ 0, which we have seen before, is a function where all points on the negative axis are critical points. In higher dimensions, f(x,y)=x^2 is an example where the entire y-axis consists of critical points. In the case f(x,y) = x sin(1/x) + y sin(1/y), seen here, we have infinitely many critical points near the origin. In Morse situations, all these pathologies go away and we can prove nice things.

Week 6:

• Q: What is the meaning of the gradient?
A: We carefully distinguished between df and ∇f = df^T. The first is a row vector, the second is a column vector. Technically, the first is a differential form, the second is a vector field. Distinguishing these things can be important in physics but also in any more advanced mathematics like differential geometry or topology. Most calculus courses don't distinguish, and it does not matter much here. In physics, one distinguishes row vectors from column vectors using notation. Einstein would write v^i for a column vector and v_i for a row vector. Dirac would write |ψ> for a column vector and <ψ| for a row vector. The dot product combines them: Einstein would write v_i w^i and mean the matrix product of the row vector with the column vector; Dirac would write <φ|ψ>, combining a bra with a ket to get a braket. Why do physicists distinguish these things? It actually matters a lot if you make coordinate changes. And one of the most important rules in physics is that we should have the same physics when changing coordinate systems.
• Q: What should we read about the chain rule?
A: Best advice: don't read much about it, just use it. If you can solve Homework question 16.1, you are fine. Make sure you really understand that problem. It is difficult to write about the chain rule. And all the super stars of textbook writers like Stewart fail miserably. Why? Because it is difficult to capture the schizophrenia you get into when dealing with the chain rule. This is already the case in one dimension: to differentiate sin(cos(x)) you have mentally to think about cos(x) as a variable u, then differentiate sin(u), and then remember that u was a function, so that the derivative is cos(cos(x)) (-sin(x)). It is best to see the chain rule in higher dimensions as the same rule as in one dimension, just noting that instead of f', we write df, which is the Jacobian of the function, which is a matrix. The usual multiplication becomes matrix multiplication. This general form however is better remembered if one just focuses on one variable in the domain, calls it t, and then looks at one coordinate value in the range. This produces the chain rule d/dt f(r(t)) = ∇f(r(t)) . r'(t). This is the version of the chain rule which really matters. If you like to associate a picture with it, think of f as potential energy, ∇f as a force and r'(t) as a velocity. The rule tells that the rate of change of work is power, which is force times velocity.
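The identity d/dt f(r(t)) = ∇f(r(t)) . r'(t) can be checked numerically; f and r below are hypothetical choices:

```python
import math
# check d/dt f(r(t)) = grad f(r(t)) . r'(t) for the hypothetical choices
# f(x,y) = x^2 * y and r(t) = (cos t, sin t), at t = 0.7
def f(x, y):
    return x * x * y

t, h = 0.7, 1e-6
# left hand side: central difference of t -> f(r(t))
lhs = (f(math.cos(t + h), math.sin(t + h)) - f(math.cos(t - h), math.sin(t - h))) / (2 * h)
x, y = math.cos(t), math.sin(t)
grad = (2 * x * y, x * x)            # (f_x, f_y) at r(t)
rdot = (-math.sin(t), math.cos(t))   # r'(t)
rhs = grad[0] * rdot[0] + grad[1] * rdot[1]
assert abs(lhs - rhs) < 1e-8
```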
• Q: How much Taylor does one need to know?
A: Definitely the single variable Taylor theorem. It is a milestone and essential to understand linear and quadratic approximation. It is also a prototype result showing how one can deal with complicated functions by approximating them with polynomials. In the proof part, Homework 18.1, you see that one needs more than smoothness for a Taylor expansion. There is a whole spectrum of functions: we have the continuous functions C^0, the differentiable functions C^1, the smooth functions C^∞. And then there are the real analytic functions C^ω, which are the functions for which one can make a Taylor approximation. One can really only appreciate this class C^ω when looking at calculus over the complex plane. An important example where one needs complex values is the Riemann zeta function. You ponder the enigma how one can assign a value ζ(-1)=-1/12 in a homework. The answer is that one has to look at the function in the complex plane and that one can extend the scope of a function by ``analytic continuation", which uses the Taylor series. It is one of the most important ideas of Riemann that a function has a life much beyond the domain in which the Taylor series exists. The function f(z) = 1/(1-z), for example, has a Taylor series 1+z+z^2+... which converges for |z|<1, but the function f(z)=1/(1-z) also makes sense for z=2. Its value there is -1. The series 1+2+4+8+16+... however does not converge. Still, one can say with confidence that -1 is the right value attached to this divergent series.
In any case, the Taylor series is very useful: let's look for example at the function f(x) = sinc(x) = sin(x)/x, the sinc function. It is one of the most important functions in applications. But we can not express its antiderivative Si(x) = ∫_0^x sinc(t) dt, the Sine integral, in elementary terms. If we look at the functions as Taylor series, we can work effectively however: since sin(x)/x = 1 - x^2/3! + x^4/5! - ..., the antiderivative is Si(x) = x - x^3/(3*3!) + x^5/(5*5!) - ....
Back to the question about Taylor: after some back and forth, I decided to treat the higher dimensional case using the directional derivative. This renders the multi-dimensional version one dimensional. The usual Taylor series f(x+t) = exp(D t) f(x) becomes f(x+tv) = exp(D_v t) f(x), where D_v is the directional derivative in the direction v. What we do here is to look at a line r(t) = x + t v through the point and restrict the function to that line. This gives f(r(t)) = f(x+tv), which is now a function of one variable, for which we can use the single variable Taylor theorem. The operator D_v maps a function from R^m to R^n to a new function of the same form. Note that the usual Jacobian derivative produces objects which become larger. For example, if you have a function f from R^3 to R, then the Jacobian df is a row vector [f_x,f_y,f_z], which has as transpose the gradient ∇f(x,y,z); this is a vector field, a map from R^3 to R^3. The derivative of this map is the Jacobian d df, which is the Hessian matrix of f. Now, the third derivative is a tensor which encodes all 27 third derivatives like f_xyz or f_yyz. It is no longer a matrix. If you wanted to write it down physically, it would have to be written as a 3-dimensional 3x3x3 array of functions. It is an example of what one calls a tensor. In the notes, I wrote down this Taylor series again, but it is more difficult to parse what is going on. That's why we restrict it to the linear and quadratic approximation. So, to answer the original question: you have to know well the linear and quadratic approximation of a function. We call these functions L(x) and Q(x). These functions are pivotal. The quadratic approximation is especially useful when we study extrema. The result which we want to understand there is the ``second derivative test".
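For a concrete feel for L and Q, here is a numerical sketch with a hypothetical function f(x,y) = e^x cos(y) at (0,0), where L(x,y) = 1 + x and Q(x,y) = 1 + x + x^2/2 - y^2/2:

```python
import math
# hypothetical example f(x,y) = e^x cos(y) at (0,0):
# L(x,y) = 1 + x   and   Q(x,y) = 1 + x + x^2/2 - y^2/2
def f(x, y):
    return math.exp(x) * math.cos(y)

def Q(x, y):
    return 1 + x + x * x / 2 - y * y / 2

# the error of the quadratic approximation decays like the cube of the distance
e1 = abs(f(0.1, 0.1) - Q(0.1, 0.1))
e2 = abs(f(0.01, 0.01) - Q(0.01, 0.01))
assert e1 < 1e-2 and e2 < 1e-5
```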

Week 5:

• Q: What other examples of proof by contradiction are there?
A: Here are two examples:
• Example 1: Prove that for any positive integer n, if n^3 is odd then n is odd. To prove this, we assume that n is even, that is n=2k. Then n^3 = (2k)^3 = 8k^3 is even. This contradicts the assumption that n^3 is odd.
• Example 2: Prove that if x^2 is irrational, then x is irrational. To prove this, assume that x is rational. Then x=p/q. But then x^2 = p^2/q^2 is rational too, contradicting that x^2 is irrational.
• Q: What other examples of proof by deformation are there?
A: We will look at more examples in week 7.

Week 4:

• Q: What other proof techniques are there?
We will learn a couple more. For now, we have the proof technique of doing a computation, the proof technique of applying an already established theorem or inequality, and the proof technique of induction. The method of induction covers more than you might think. Here is another, somewhat heavier example. Theorem of Fermat: for every integer a and every prime p, the number a^p - a is divisible by p. Proof: we use induction with respect to a. (i) Induction foundation: the statement is true for a=1. (ii) Induction step: assume the statement is true for a; prove that it is true for a+1. We have to show that (a+1)^p - (a+1) is divisible by p. We split this up and write it as a^p - a + [B(p,1) a^(p-1) + B(p,2) a^(p-2) + ... + B(p,p-1) a], where B(p,k) is the binomial coefficient B(p,k) = p!/(k! (p-k)!). This step required the binomial formula (a+1)^p = a^p + B(p,1) a^(p-1) + ... + 1. The induction assumption assures that a^p - a is divisible by p. The claim of the theorem follows from the fact that for a prime p, B(p,k) = p!/(k! (p-k)!) is divisible by p if k is positive and smaller than p. But this follows from the fact that neither k! nor (p-k)! contains the prime factor p and that B(p,k) is an integer.
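The statement of the theorem is easy to check numerically for small cases:

```python
# check Fermat: a^p - a is divisible by p, for a few primes p and many integers a
for p in [2, 3, 5, 7, 11, 13]:
    for a in range(1, 100):
        assert (a ** p - a) % p == 0
```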
• Q: what is the Mandelbrot set used for?
A: First of all, it is just gorgeous. See the Zoom. But the picture is also an icon of both chaos theory and fractal geometry. One just looks at the world differently after having seen such pictures. There is also engineering benefit. Ideas from fractal analysis have made it possible to build smaller cell phone antennas.
• Q: how does one get started with Homework 10.5?
A: Let's assume |c| = 2 + a, where a is some positive number. You have to show that if you start with 0, then the orbit goes to infinity. Let's look:
T(0) = c is in distance 2+a from the origin. We have T(T(0)) = T(c) = c^2+c. How big is that? Use the triangle inequality to see |c^2+c| ≥ |c|^2 - |c| = (2+a)^2 - (2+a) = 2+3a+a^2. We have gone further away.
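You can watch this escape numerically; the value a = 0.1 (so c = -2.1) below is a hypothetical choice:

```python
# follow the orbit of T(z) = z^2 + c starting at 0, for a real c with
# |c| = 2 + a and the hypothetical choice a = 0.1, so c = -2.1
c = -2.1
z = 0.0
radii = []
for _ in range(10):
    z = z * z + c
    radii.append(abs(z))
# the distance to the origin increases monotonically: the orbit escapes
assert all(r2 > r1 for r1, r2 in zip(radii, radii[1:]))
assert radii[-1] > 1e10
```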
• Q: how does one prove the Taylor formula?
A: You can easily do it for polynomials. Check that if f(x) = a + b x + c x^2 + d x^3 + ..., then f(0)=a, f'(0) = b, f''(0)/2 = c, f'''(0)/6 = d, etc.
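For a cubic, this can be verified with finite differences (the coefficients below are hypothetical):

```python
# cubic with hypothetical coefficients; finite differences recover them
a, b, c, d = 2.0, -1.0, 3.0, 5.0
def f(x):
    return a + b * x + c * x ** 2 + d * x ** 3

h = 1e-2
d1 = (f(h) - f(-h)) / (2 * h)                                  # = b + d*h^2
d2 = (f(h) - 2 * f(0) + f(-h)) / h ** 2                        # = 2c, exact for cubics
d3 = (f(2*h) - 2*f(h) + 2*f(-h) - f(-2*h)) / (2 * h ** 3)      # = 6d, exact for cubics
assert f(0) == a
assert abs(d1 - b) < 1e-3
assert abs(d2 / 2 - c) < 1e-9
assert abs(d3 / 6 - d) < 1e-6
```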
• Q: something about last week: what is a uniformly continuous function which is not Lipschitz?
A: Take f(x) = sqrt(x) = x^(1/2). Then this is continuous on [a,b]=[0,1] but not Lipschitz continuous: the slope of the function becomes arbitrarily large near 0. But it is uniformly continuous: if |x-y| ≤ 1/n, then |f(x)-f(y)| ≤ M_n = (1/n)^(1/2). Lipschitz would require M_n = C/n. Just as a reiteration, one of the main points of that lecture is that on a finite closed interval, continuity and uniform continuity are the same. This is very nice because uniform continuity allows us to avoid the epsilon-delta definition of continuity and still have all the power for proofs at hand.

Week 3:

• Q: if we say differentiable, does this mean continuously differentiable?
A: In the homework about the mean value theorem, we indeed said differentiable but meant continuously differentiable. There is a difference. The function f(x) = x^2 sin(1/x), f(0)=0, is differentiable everywhere but not continuously differentiable at x=0: the derivative is not continuous at 0, but [f(h)-f(0)]/h = f(h)/h = h sin(1/h) converges to 0. In the text, if we say r is in C^1, we mean that the derivative is continuous.
• Q: what is the limsup?
A: If we have a sequence of numbers x_n in a bounded interval, then this sequence does not necessarily converge, but it will have an accumulation point. The limsup is the largest accumulation point. It can be defined formally as limsup x_n = inf_{m ≥ 0} sup_{n ≥ m} x_n. Example: the sequence x_n = (-1)^n does not converge, but it has limsup x_n = 1. A sequence converges if and only if the limsup and liminf are the same.
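For x_n = (-1)^n, every tail supremum equals 1 and every tail infimum equals -1, which illustrates the definition:

```python
# the sup of every tail {x_n : n >= m} of x_n = (-1)^n equals 1,
# so limsup x_n = inf over m of the tail sups = 1, while liminf x_n = -1
xs = [(-1) ** n for n in range(100)]
tail_sups = [max(xs[m:]) for m in range(50)]
tail_infs = [min(xs[m:]) for m in range(50)]
assert all(s == 1 for s in tail_sups)
assert all(s == -1 for s in tail_infs)
```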
• Q: What is the insight behind curvature?
A: There are various angles to it. A straight line has zero curvature. Important is to see that a circle of radius r has curvature 1/r. If you approximate a curve with a snug circle, the curvature of that circle and the curvature of the curve agree. The formula |r'(t) x r''(t)|/|r'(t)|^3 also gives some insight: if the acceleration is always parallel to the velocity, then we don't change direction and the curvature is zero. Indeed, the cross product is zero then. In general, what is important is that the curvature is independent of the parametrization. One can "see the curvature" while one "feels the acceleration". If we drive along a curve with unit speed |r'(t)|=1, then r'' is perpendicular to r' [differentiate the identity r'(t).r'(t) = 1 to get with the product rule r''(t).r'(t) + r'(t).r''(t) = 2 r'(t).r''(t) = 0], and so |r'(t) x r''(t)|/|r'(t)|^3 = |r'(t)| |r''(t)| sin(π/2)/|r'(t)|^3 = |r''(t)|, because |r'(t)| = 1.
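The statement about the circle can be checked numerically; the radius R = 2.5 and the time t = 0.9 are hypothetical choices:

```python
import math
# the curvature |r' x r''| / |r'|^3 of a circle of radius R is 1/R
R, t = 2.5, 0.9
rp = (-R * math.sin(t), R * math.cos(t), 0.0)    # r'(t)
rpp = (-R * math.cos(t), -R * math.sin(t), 0.0)  # r''(t)
cross = (rp[1] * rpp[2] - rp[2] * rpp[1],
         rp[2] * rpp[0] - rp[0] * rpp[2],
         rp[0] * rpp[1] - rp[1] * rpp[0])
speed = math.sqrt(sum(c * c for c in rp))
kappa = math.sqrt(sum(c * c for c in cross)) / speed ** 3
assert math.isclose(kappa, 1 / R)
```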

Week 2:

• Q: Why can we restrict to diagonal matrices B when looking at quadratic manifolds f(x,y,z) = x.B.x + A.x = d?
A: One of the major results proven in linear algebra is the spectral theorem. It assures that there are coordinates in which the symmetric matrix B is diagonal. We have seen in class that we can without loss of generality assume that B is symmetric. The fact that one can diagonalize a symmetric matrix will be a highlight in the linear algebra part of the course in the spring. To the question why we can assume B to be symmetric: let us look at the two-dimensional case where B={{4,5},{7,9}} is a 2x2 matrix. The corresponding polynomial X.(B.X) is 4x^2 + 5xy + 7yx + 9y^2. But this is equivalent to 4x^2 + 6xy + 6yx + 9y^2, which belongs to the matrix B={{4,6},{6,9}}. We were able to balance things out so that B is symmetric. This is important, as we can diagonalize a symmetric matrix but not a general matrix. It also has practical reasons: in order to represent a quadratic manifold in the computer, we only need 6+3+1=10 numbers and not 9+3+1=13.
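The balancing act is easy to check in code; below we take the 2x2 matrix with off-diagonal entries 5 and 7, matching the polynomial with the terms 5xy + 7yx:

```python
# quadratic form of B = {{4,5},{7,9}} versus its symmetrization (B + B^T)/2 = {{4,6},{6,9}}
B = [[4, 5], [7, 9]]
Bs = [[(B[i][j] + B[j][i]) / 2 for j in range(2)] for i in range(2)]
assert Bs == [[4.0, 6.0], [6.0, 9.0]]

def q(M, x, y):
    v = (x, y)
    return sum(v[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

# both matrices define the same quadratic polynomial 4x^2 + 12xy + 9y^2
for (x, y) in [(1, 0), (0, 1), (1, 1), (2, -3)]:
    assert q(B, x, y) == q(Bs, x, y)
```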
• Q: Why does the cross product [a,b]^T x [c,d]^T = ad-bc in two dimensions give the area?
A: One possibility is to write the vectors in three dimensions and look at [a,b,0] x [c,d,0] = [0,0,ad-bc] and then look at the third component ad-bc. Since we have proven that in three dimensions the length of the cross product is the area, we see this now also in two dimensions. Another possibility would be to repeat the proof of the formula |v x w| = |v| |w| sin(α) also in two dimensions. The cross product in two dimensions is obtained by taking the determinant of the matrix in which the two vectors are the column vectors. A bit more sophisticated is the point of view that we look at the tensor product [a,b] ⊗ [c,d] = [[ac, ad],[bc, bd]], which is a matrix product (remember that a (2 x 1) matrix times a (1 x 2) matrix gives a 2 x 2 matrix). Now make this anti-symmetric: [a,b] ∧ [c,d] = [[0, ad-bc],[bc-ad, 0]] (see the next question about the reason; it is a necessity for matter to allow more complex systems like atoms or stars). An anti-symmetric 2x2 matrix has only one component which counts, which is ad-bc. We will later see that this is a determinant.
• Q: Is there more to the cross product in higher dimensions?
A: Yes, there is much more. It is the start of multi-linear algebra. The basic underlying construct is what one calls the tensor product v ⊗ w of vectors. Unlike the Cartesian product of two vectors, where one takes [v1,v2,v3] ⊕ [w1,w2,w3], which is the 6-dimensional vector v ⊕ w = [v1,v2,v3,w1,w2,w3], the tensor product of two vectors produces an object in 9 dimensions. (For the Cartesian product, the dimensions add up; for the tensor product, the dimensions multiply.) In some sense the tensor product is dual to the dot product. We wrote v^T w for the dot product, which is the matrix product of a (1 x n) and a (n x 1) matrix, leading to a scalar. The tensor product is v w^T, which is the matrix product of a (n x 1) and a (1 x n) matrix, which is a n x n matrix. We essentially get v ⊗ w = [v1 w1, v2 w1, v3 w1, v1 w2, v2 w2, v3 w2, v1 w3, v2 w3, v3 w3]. It turns out that nature likes to anti-commute. In physics, this is the Pauli exclusion principle (essential for the stability of matter, without which matter would collapse). It is also essential for calculus, as we will see at the end of the course. So, what is the cross product generalization? One has to define v ∧ w = v ⊗ w - w ⊗ v. Now, the 9-dimensional tensor (3 x 3 matrix) becomes an anti-symmetric 3 x 3 matrix, which only has three components! (You verify in the homework that the cross product algebra is isomorphic to the algebra of skew symmetric 3x3 matrices with the anti-commutation as product.) The magic is that in three dimensions, the anti-symmetrized tensor product between two vectors gives again something of the same dimension. If one associates a vector with a Fermion particle, then the wedge product is a 2-Fermion particle system. An example is a meson. A combination of two Fermions is a Boson. The meson is a Boson. Back to geometry: if one associates a vector with a one dimensional space in geometry, then the wedge product represents a two-dimensional plane.
In the case of a three dimensional system, where we have 3 fundamental coordinate planes, one can identify planes with their normal vector.
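One can see the identification of v ∧ w with the cross product concretely. Here is a minimal sketch in plain Python (the function names `tensor`, `wedge`, `cross` are ours, chosen for illustration): the anti-symmetric matrix v ⊗ w - w ⊗ v has exactly three independent entries, and they are the components of v × w.

```python
def tensor(v, w):
    """Tensor product v (x) w as a 3x3 matrix v w^T with entries v[i]*w[j]."""
    return [[vi * wj for wj in w] for vi in v]

def wedge(v, w):
    """Anti-symmetrized tensor product v ^ w = v(x)w - w(x)v."""
    T, S = tensor(v, w), tensor(w, v)
    return [[T[i][j] - S[i][j] for j in range(3)] for i in range(3)]

def cross(v, w):
    """Classical cross product in R^3."""
    return [v[1]*w[2] - v[2]*w[1],
            v[2]*w[0] - v[0]*w[2],
            v[0]*w[1] - v[1]*w[0]]

v, w = [1, 2, 3], [4, 5, 6]
W = wedge(v, w)
# W is anti-symmetric: W[i][j] == -W[j][i], so the diagonal vanishes
# and only three entries are independent; they reproduce the cross product:
print(cross(v, w))                          # [-3, 6, -3]
print([W[1][2], W[2][0], W[0][1]])          # [-3, 6, -3]
```

The reading-off pattern W[1][2], W[2][0], W[0][1] is exactly the identification of a plane (the anti-symmetric matrix) with its normal vector.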

Week 1:

• Q: In Unit 1, we defined the angle as arccos(v.w/(|v| |w|)). Why is this the geometric angle?
A: We have defined the angle also where we cannot ``see" it, like in dimension 11, and we will later in the course even apply it in infinite dimensions, where we take an integral as a dot product. This is not just for the sake of generality: in statistics, vectors are random variables and the cosine of the angle is the correlation between the data. One can then define the angle also between continuous random variables, which is an infinite dimensional case. But to see the relation with geometry: take v=[1,0] and w=[a,b]. Now cos(α) = v.w/(|v| |w|) = a/(a^2 + b^2)^(1/2). But that is the definition of the cosine (SOH-CAH-TOA). Note, however, that this motivation already uses Pythagoras, since we interpreted |w| as (a^2 + b^2)^(1/2). As we wanted to prove the Pythagoras theorem, it would have been circular to refer to this geometric connection.
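Both points can be checked numerically. A small sketch in plain Python (the `angle` helper is ours): for v=[1,0] and w=[a,b] the formula reproduces the SOH-CAH-TOA cosine, and for centered data vectors the cosine of the angle is the correlation coefficient.

```python
import math

def angle(v, w):
    """Angle arccos(v.w / (|v| |w|)) between two vectors."""
    d  = sum(a*b for a, b in zip(v, w))
    nv = math.sqrt(sum(a*a for a in v))
    nw = math.sqrt(sum(a*a for a in w))
    return math.acos(d / (nv * nw))

# geometric check with v = [1,0], w = [a,b]:
a, b = 3.0, 4.0
alpha = angle([1, 0], [a, b])
print(math.cos(alpha), a / math.sqrt(a*a + b*b))   # both 0.6

# statistics: after centering the data, cos(angle) is the correlation
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
xc = [xi - sum(x)/len(x) for xi in x]
yc = [yi - sum(y)/len(y) for yi in y]
print(math.cos(angle(xc, yc)))                     # Pearson correlation of x, y
```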
• Q: In Unit 1, we have looked at row vectors and column vectors. What is the difference?
A: Usually, in vector calculus, one does not distinguish, but in linear algebra or physics one does. When the objects transform, one has ``covariant" (row vector) and ``contravariant" (column vector) behavior. (CO-BELOW-ROW mnemonic.) Albert Einstein would write v_i for row vectors and v^i for column vectors and v_i w^i (Einstein summation convention) for the dot product. Paul Dirac would use the notation <φ| for Bras and |ψ> for Kets. The dot product is then <φ|ψ>, the Bracket. In differential geometry (which is also the framework for the general theory of relativity), one looks at column vectors as elements in the tangent space, while row vectors are elements in the cotangent space. In linear algebra (which is the language for quantum mechanics), we distinguish the two objects as the space M(n,1) for column vectors and M(1,n) for row vectors. A superb synthetic understanding of vector calculus and linear algebra would allow us to understand quantum gravity, but such an understanding is still missing.
• Q: When doing row reduction. Can we do two steps at once?
A: In principle, but we have to avoid circular steps (remember CATCH 22!), like subtracting a row from itself, or subtracting row 1 from row 2 and simultaneously subtracting row 2 from row 1. Both are CATCH 22. Good advice is to go slowly and do one step at a time.
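One can see why the simultaneous step is circular. A minimal sketch in plain Python (the example rows are made up): if both subtractions use the *old* rows, the two new rows are negatives of each other, so one row of information is destroyed; done one at a time, the second step uses the already-updated row and no information is lost.

```python
R1, R2 = [1, 2, 3], [4, 5, 6]

# CATCH 22: both steps "at once", each using the original rows
new_R1 = [a - b for a, b in zip(R1, R2)]    # R1 - R2
new_R2 = [b - a for a, b in zip(R1, R2)]    # R2 - R1
print(new_R1, new_R2)  # [-3, -3, -3] and [3, 3, 3]: negatives, rank is lost

# one step at a time: the second subtraction sees the updated row
R2 = [b - a for a, b in zip(R1, R2)]        # R2 <- R2 - R1
R1 = [a - b for a, b in zip(R1, R2)]        # R1 <- R1 - (new) R2
print(R1, R2)                                # still two independent rows
```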
• Q: Why don't we just solve Ax=b by inverting the matrix A and get x = A^(-1) b?
A: We will do that later, but this works only if A is an invertible square matrix. It turns out that in order to find the inverse, we also have to row reduce. It goes as follows: [A|1] row reduces to [1|A^(-1)], where 1 is the identity matrix which has 1 in the diagonal and 0 everywhere else. We will come to that at a later time.
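The [A|1] → [1|A^(-1)] procedure can be sketched in a few lines of plain Python (Gauss-Jordan elimination with partial pivoting; the function name `inverse` and the tolerance are our choices, not part of the course notes):

```python
def inverse(A):
    """Invert a square matrix by row reducing [A | 1] to [1 | A^(-1)]."""
    n = len(A)
    # build the augmented matrix [A | identity]
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        # pick the pivot row (largest entry in column c, for stability)
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is not invertible")
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]          # scale pivot row to 1
        for r in range(n):                      # clear the rest of column c
            if r != c and M[r][c] != 0.0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]               # right half is A^(-1)

print(inverse([[2.0, 1.0], [5.0, 3.0]]))  # [[3, -1], [-5, 2]] up to rounding
```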
• Q: How can one get insight into the proof of Cauchy-Schwarz?
A: There is, but it would have been confusing, as it uses Pythagoras and the goal was to give a proof without Pythagoras (using Pythagoras would have been a circular CATCH 22!). Geometrically, what we have done is look at the scalar projection a = v.w (assuming w is a unit vector). Now, u = v - a w is a vector perpendicular to w. What we have computed is the length of this vector, and it is positive. The difference between v.v and a^2 is now, by Pythagoras, u.u. In some sense, the Cauchy-Schwarz inequality is Pythagoras. It would be circular however to use Pythagoras, as we work in an arbitrary space with a dot product and want to derive Pythagoras.
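This geometric picture is easy to check numerically. A small sketch in plain Python (our own example vectors, with w a unit vector): u = v - (v.w) w is perpendicular to w, v.v splits as a^2 + u.u, and since u.u ≥ 0 we get (v.w)^2 ≤ |v|^2, which is Cauchy-Schwarz for |w| = 1.

```python
def dot(x, y):
    """Euclidean dot product."""
    return sum(a*b for a, b in zip(x, y))

v = [3.0, 4.0, 0.0]
w = [1.0, 0.0, 0.0]                    # unit vector
a = dot(v, w)                          # scalar projection of v onto w
u = [vi - a*wi for vi, wi in zip(v, w)]

print(dot(u, w))                       # 0: u is perpendicular to w
print(dot(v, v), a*a + dot(u, u))      # equal: this is Pythagoras
```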
• Q: Is there more insight in the last part of the proof of uniqueness of rref?
A: It is the hardest part of the proof, no doubt. It took me the longest to find it (actually almost a day of thinking, on and off, while refusing to look the proof up). The reason why it is a difficult part is that it needs an idea, and finding an idea needs tossing around lots of ideas until the right one is found. This can take forever. In this particular proof, there are two key insights: one is the idea to use induction with respect to the number of columns. The other is the last part of the proof, where one looks at the matrix A as an augmented matrix of a system of equations. The induction step now essentially boils down to the statement that if we have a solution set of a system of equations Ax = b, then the vector b is determined uniquely. Let us assume we have this system already row reduced. Take for example x + 3y = 4, z + 5w = 6. Here we have m=4 variables x,y,z,w and n=2 equations. The variables y and w are free variables and parametrize the solution set. The induction step claims that if we have two systems of equations Ax = b, Ax = c, for which [A b], [A c] are row reduced, then if the two systems have the same solution set (row reduction preserves the solution set), then b = c. But that is obtained by subtracting the two systems from each other. If you look at Thomas Yuster's proof, this is formulated even shorter than in our notes. Most of the time, a short proof is clearer than a long one, but there is a bit of thinking involved in the last part, which makes the proof not easy to digest.
• Q: How do we get ideas?
A: There are techniques, and we will learn some. Getting new ideas is hard. If somebody tells you a problem was easy, then either the person was lucky, or the person knew the answer already, or the person has looked it up somewhere, or the person got the answer from somebody else. The reason why it is hard is that getting an idea requires search and trying out lots of things, and this requires time (which is sometimes difficult to acquire in a time of so many distractions). But knowing something can also mean knowing the technique. And that can be learned.