The derivation follows Prof. Johnson's fantastic notes you can find here: math.mit.edu/~stevenj/18.336/adjoint.pdf
Thanks for your lectures, they're really useful for me as an ocean physics grad student! I wonder if you could cite other materials as well that you used to make each of your lecture, like this one. It will help us when we go deep after your videos.
@@yongsuna5683 Hi, thanks so much for the kind words! :)
Unfortunately, the literature offering good intuition and derivations for adjoint methods is not too vast. I will publicly upload my Master's thesis soon, which contains many detailed derivations of sensitivity methods for various problems. I will announce it here on the channel and on social media (Twitter & LinkedIn).
Some other cool material is Chris Rackauckas' course: mitmath.github.io/18337/
Also check out the amazing NeurIPS tutorial by Matthew Johnson, David Duvenaud, and Zico Kolter: implicit-layers-tutorial.org/ (I also learned a lot from Matt Johnson's issue answers over on the JAX GitHub repo; he is a great educator.)
I think the JaxOpt paper can be of interest as well: arxiv.org/abs/2105.15183
Also check out this nice talk at JuliaCon 2022: ruclips.net/video/TkVDcujVNJ4/видео.html
Amazing! I was confused while reading the Neural ODEs paper about how exactly they are doing the adjoint method and your two videos on this helped a lot in understanding it better. Totally appreciated! I wonder if it is possible to derive an explicit system of ODEs that can model a given simple ResNet or feedforward network.
Glad it was helpful! You're very welcome :)
This is great! Working on a package that uses the adjoint method for implicit differentiation and this video explained things really well :D
You're welcome :)
Excellent lecture!
Thanks for the kind words
Thank you so much for providing such a wonderful series, which is really insightful! Would you be so kind as to elaborate a bit on two aspects?

1. The original computational issue with dx/dθ (starting 14:20). It is understandable that the naive calculation of dx/dθ is much more complicated than solving x=A^{-1}b, as dx/dθ involves matrix-matrix multiplication while the latter is just matrix-vector multiplication. But what is a bit puzzling is how to view it as a "batch linear system of equations". In my naive thinking, since the inverse A^{-1}(θ) only needs to be calculated once, it is typically O(N^3) and does not involve P at all. The rest is then a matrix multiplication of N×N with N×P (the bracket), and there θ's dimensionality does start to affect the complexity, but this does not seem to explain where the O(PN^3) (15:55) comes from.

2. It seems the key insight is to simply switch the order of computation, calculating the vector-Jacobian product first, which mitigates the complexity of the original Jacobian-vector-product formulation. If so, in this specific system, what is the purpose of defining the additional "adjoint term"? It seems all it introduces is a reminder to calculate a vector-Jacobian product instead of a Jacobian-vector product, which could originally be achieved with a careful ordering of the matrix multiplications.
Hi,
Thanks for the kind comment and the interesting question! 😊
(1) Regarding the complexity: you are right, O(P * N^3) is not the true cost for a dense system. You would do the LU decomposition only once, which is O(N^3), and then each solve (a forward and a backward substitution) is O(N^2). Hence, loosely speaking, the cost is O(N^3 + P * N^2), which still becomes problematic for larger P.
(2) Indeed, the critical insight is the associativity of matrix-matrix multiplication. The reason I like to explicitly introduce an "adjoint term" is to remind us that the inverse matrix should (for most practical cases) never be materialized. This auxiliary term shows that we need exactly one additional linear solve.
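To make point (2) concrete, here is a minimal NumPy sketch (the toy matrices are invented for illustration, not taken from the video). Both orderings give the same gradient, but the vector-Jacobian ordering needs only one transposed solve instead of a batched solve with P right-hand sides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 6, 4
A = rng.normal(size=(N, N)) + N * np.eye(N)   # well-conditioned toy system matrix
g = rng.normal(size=N)                        # stands in for dJ/dx
B = rng.normal(size=(N, P))                   # stands in for (db/dtheta - dA/dtheta @ x)

# Jacobian-vector ordering: a batched solve with P right-hand sides first ...
grad_jvp = g @ np.linalg.solve(A, B)

# ... vs. vector-Jacobian ordering: one adjoint solve with A^T, then cheap matvecs.
# The "adjoint term" lam is obtained without ever materializing A^{-1}.
lam = np.linalg.solve(A.T, g)
grad_vjp = lam @ B

assert np.allclose(grad_jvp, grad_vjp)
```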
Very underrated video!
Thanks a lot :).
Feel free to share it (or the channel) with colleagues and peers. :)
Cool! Is there an example or code for discrete-adjoint CFD shape optimization using automatic differentiation, like SU2, but an instructional version?
Thanks for the comment 😊
Off the top of my head I do not have anything right away, but I remember that SU2 has quite extensive documentation, so there could be educational examples as well.
You could also check resources on FEniCS and OpenFOAM.
Best of luck 🤞
The video is excellent. Can you suggest how to use adjoint methods for static FEA problems when the stiffness matrix depends on a large number of parameters? Are any Matlab codes available for adjoint methods for linear systems?
Thanks for the nice feedback :)
The next video will be a code example for the adjoint of a linear system of equations, but with a right-hand side that depends on parameters. Therefore, I do not yet have code for the adjoint when the stiffness matrix is parameter-dependent. I also expect it to be a little trickier to make it work efficiently, since you have to work with 3D objects (rank-3 tensors), but I could be wrong.
What is your specific application, if I may ask?
For the equation dx/du = (A^-1)*(db/dTheta - dA/dTheta*x), I had a question. Ideally, dx/du should be a function of u alone (with no remaining x-terms). However, the additional x coefficient of dA/dTheta does not cancel out, leaving dx/du in terms of both x and u. How do we deal with this? Are we supposed to use observed values (xObs) in place of a purely symbolic x here? If the x remains in the solution, it is not very useful. Thanks.
Hi,
thanks for the great question :).
First, I assume you mean (in the notation used in the video) dx/dTheta? If so: the x multiplying dA/dTheta comes from the primal pass of the calculation. In other words, it is the x from the original problem Ax=b. Whenever we do automatic differentiation (or adjoint methods), we need, next to the adjoint pass, also a primal pass.
In other words: you have a concrete Theta for which you want the gradient dJ/dTheta. You would first assemble the system matrix and RHS, then solve the linear system and (importantly) save its result x. Then you can compute your loss J. This is the primal pass. Afterwards, you obtain the loss derivative delJ/delX and use it as the RHS of the adjoint linear system. Finally, you evaluate equation (3) from the end of the video; here you use the stored x value.
There will be a new video coming out on Friday on the pullback/vJp primitive operation of linear system solving :). Personally, I prefer this newer derivation, as it allows for tighter integration into AD engines. It also showcases potential computational savings (e.g., using the adjoint form of a factorization in the backward pass, or re-using the adjoint form of a preconditioner).
Hope that helped 😊 Let me know if it was unclear.
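As a hedged sketch of this primal-then-adjoint procedure (the parameterization and loss below are invented toy examples; the last gradient line plays the role of equation (3) from the video):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 4, 3

# Hypothetical parameterization: A(theta) = A0 + sum_p theta_p dA_p,
#                                b(theta) = b0 + sum_p theta_p db_p
A0 = rng.normal(size=(N, N)) + N * np.eye(N)
dA = rng.normal(size=(P, N, N))
b0 = rng.normal(size=N)
db = rng.normal(size=(P, N))
theta = rng.normal(size=P)

# --- Primal pass: assemble, solve, and (importantly) store x ---
A = A0 + np.einsum("p,pij->ij", theta, dA)
b = b0 + theta @ db
x = np.linalg.solve(A, b)
J = 0.5 * np.sum((x - 1.0) ** 2)        # toy loss against a target of ones

# --- Adjoint pass: dJ/dx becomes the RHS of the transposed system ---
lam = np.linalg.solve(A.T, x - 1.0)

# --- Gradient evaluation, re-using the stored primal solution x ---
dJ_dtheta = np.array([lam @ (db[p] - dA[p] @ x) for p in range(P)])

# Sanity check against a finite difference in the first parameter
eps = 1e-6
th = theta.copy(); th[0] += eps
x_eps = np.linalg.solve(A0 + np.einsum("p,pij->ij", th, dA), b0 + th @ db)
J_eps = 0.5 * np.sum((x_eps - 1.0) ** 2)
assert np.isclose((J_eps - J) / eps, dJ_dtheta[0], rtol=1e-3, atol=1e-4)
```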
@@MachineLearningSimulation Thanks so much. Yes, I did mean dx/dTheta. I kind of understand your response. For something like a neural network, the (first) primal pass would be performing a forward sweep using the initialized system parameters (theta)? This would yield updated states? Then, during backpropagation, we would use those already-defined states (x) to update theta from dError/dTheta=(dError/dx)(dx/dTheta)?
I guess my confusion was for the case where dError/dTheta is applied multiple times between primal passes. We know that x is a function of Theta. So, updates to Theta should require an update of x. But, I guess we could simply calculate that from x = inv(A(Theta))*b(Theta).
Does that sound remotely correct?
Second question: I am having a bit of trouble understanding what the core advantage of the adjoint method is. I know it should be more efficient, and I understand how grouping and multiplying dError/dx and inv(A) first is advantageous, before multiplying that product by (db/dTheta-dA/dTheta*x). But beyond that I tend to get lost.
I feel like I cannot see the forest through the trees. Sorry for so many questions. I will also check out your other video.
Hi. I am having a question regarding the use of Leibniz rule. When you formulate the problem as a constrained optimization with ODE constraint, you take the total derivative of the Lagrangian with respect to the parameter \theta. By the Leibniz rule, you brought the derivative d/d\theta inside the integral, but when you brought it inside, it should become a partial derivative wrt \theta, is it not?
Hello, thanks for your video. I still have one question: could you please give an example of how to get ∂J/∂θ? J is a function of θ, right?
Hi, thanks for the comment :)
The dJ/dtheta is the gradient we are after, ultimately.
Imagine we wanted to fit (constitutive) parameters of a physics model; an example would be Young's modulus (E) or other quantities. We could then use this gradient in a gradient-based optimization scheme to iteratively find better values of theta.
Hope that helps:)
Do you have any video about adjoint in second or arbitrary-order?
Not yet; or at least not directly ;). I can recommend the autodiff perspective on differentiating over linear solves (ruclips.net/video/y8_4t9E-lxw/видео.html). Once you have the propagation rule down, you can get higher-order derivatives by, e.g., performing forward mode over reverse mode: jax.readthedocs.io/en/latest/_autosummary/jax.hessian.html
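A small JAX sketch of this "forward over reverse" idea on a toy objective containing a linear solve (the 2×2 system here is invented purely for illustration):

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Toy parameter-dependent system; JAX knows the pullback rule for solve
    A = jnp.array([[2.0 + theta[0], 1.0],
                   [1.0, 3.0 + theta[1]]])
    b = jnp.array([1.0, 2.0])
    x = jnp.linalg.solve(A, b)
    return 0.5 * jnp.sum(x ** 2)

theta = jnp.array([0.1, 0.2])
grad = jax.grad(loss)(theta)                 # first order: reverse mode
hess = jax.jacfwd(jax.grad(loss))(theta)     # second order: forward over reverse

# jax.hessian is a convenience wrapper for essentially this composition
assert jnp.allclose(hess, jax.hessian(loss)(theta), atol=1e-6)
```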
For a FEM simulation, can the parameters be the mesh nodes or some matrices built from the mesh nodes?
Interesting question. I assume that by mesh nodes you refer to the spatial position of the discretization points. I am unsure what a potential use case would be, but the adjoint method, especially as formulated for the linear systems of equations (as in this video), is rather generic. Once you are on the level of a linear system of equations, all other complexity (like mesh, underlying PDE, constitutive law etc.) is already absorbed.
It is most often used in the context of design optimization (i.e., structural or topology optimization), for which you have the geometry parameterized and the meshing is most often done automatically. You also find it in parameter fits for constitutive laws. The involved parameters can be classical ones like Young's modulus etc., or the weights of certain machine-learning models (like neural networks) that more closely model the constitutive behavior.
@@MachineLearningSimulation For plugging in at the end: if you need the derivative of a matrix A with respect to theta (say, mesh matrices), how would that work? Is there an easier way to get it? I have a function that creates the matrix given the mesh nodes (as an array), but it seems I need to calculate the derivative of that matrix with respect to the mesh matrix to get dJ/dtheta. Thank you for your videos!
bumping this..
@@kapilkhanal9678 Indeed, that's a tricky one as it would give a rank-3 tensor, and it is most often sparse. It's a problem that also bugged me for a long time. One can derive such a quantity by hand, but it is rather tedious.
What actually helps is to view the problem a bit differently and have a differentiable version of the matrix assembly; then build the system-matrix cotangent and pull the information back to the parameter vector. That sounds abstract, but take a look at my new playlist on AD primitive rules: ruclips.net/video/PwSaD50jTv8/видео.html
I will soon have a video on linear systems (and maybe I will try to accompany it with a coding example).
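A hedged JAX sketch of that cotangent-pullback idea (the 1D assembly function and loss below are made up for illustration): never form the rank-3 tensor dA/dnodes; instead do a primal solve, an adjoint solve, build the system-matrix cotangent, and pull it back through the differentiable assembly:

```python
import jax
import jax.numpy as jnp

# Hypothetical differentiable assembly: node positions -> system matrix
# (a 1D stiffness-like operator plus a reaction term to keep it invertible)
def assemble(nodes):
    inv_h = 1.0 / jnp.diff(nodes)
    main = jnp.concatenate([inv_h, jnp.zeros(1)]) + jnp.concatenate([jnp.zeros(1), inv_h])
    return jnp.diag(main + 1.0) - jnp.diag(inv_h, k=1) - jnp.diag(inv_h, k=-1)

nodes = jnp.array([0.0, 0.3, 0.7, 1.0])
b = jnp.ones(4)

A, vjp_assemble = jax.vjp(assemble, nodes)  # differentiable assembly
x = jnp.linalg.solve(A, b)                  # primal solve
lam = jnp.linalg.solve(A.T, x)              # adjoint solve, RHS = dJ/dx for J = 0.5 x^T x
A_bar = -jnp.outer(lam, x)                  # system-matrix cotangent
(nodes_bar,) = vjp_assemble(A_bar)          # pull back to the mesh nodes

# Cross-check against plain reverse-mode AD through the whole pipeline
def full_loss(n):
    return 0.5 * jnp.sum(jnp.linalg.solve(assemble(n), b) ** 2)
assert jnp.allclose(nodes_bar, jax.grad(full_loss)(nodes), atol=1e-4)
```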
@@MachineLearningSimulation thank you for the video. I was wondering if there is a way to contact you for some of the issues I have understanding diff. programming and implementing it on Jax. Again, thank you for your help. I have been stuck on this problem for a while now.
Can you point me to resources on this?
You can check out the notes by Prof. Steven Johnson : math.mit.edu/~stevenj/18.336/adjoint.pdf
Or the book "Engineering Design Optimization" by Martins & Ning, chapter 6.7.
I still don't understand the relationship between θ and A. Could you please provide a simple example? Can I take, say, A(θ) = [[cos(θ), 0, sin(θ)], [0, 1, 0], [-sin(θ), 0, cos(θ)]]?
The example you brought up is a valid one. It also is a special case, because you only have one parameter. In other words, your parameter vector is only one dimensional. That greatly simplifies things, because your dA/dtheta is now not a rank-3 tensor (one tensor rank above a matrix), but just a matrix (or the additional axis is one-dimensional).
More generally speaking, the way A depends on theta is highly problem-dependent. For instance, if you discretize the 1D heat equation with Dirichlet BCs using the BTCS scheme, and the parameter to optimize is the diffusivity, then your matrix is tridiagonal and theta appears in all three diagonals.
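A small sketch of that BTCS example (grid sizes and values are made up): the diffusivity theta shows up in all three diagonals of the system matrix, and since A is linear in theta here, dA/dtheta is exact and itself tridiagonal:

```python
import numpy as np

def btcs_matrix(theta, n_interior=5, dx=0.1, dt=0.01):
    # BTCS for u_t = theta * u_xx with Dirichlet BCs:
    # (I - theta*dt/dx^2 * Laplacian) u_new = u_old
    r = theta * dt / dx**2
    main = (1.0 + 2.0 * r) * np.ones(n_interior)      # theta on the main diagonal
    off = -r * np.ones(n_interior - 1)                # ... and on both off-diagonals
    return np.diag(main) + np.diag(off, k=1) + np.diag(off, k=-1)

A = btcs_matrix(theta=1.5)
u_old = np.sin(np.pi * np.linspace(0.1, 0.9, 5))
u_new = np.linalg.solve(A, u_old)                     # one implicit time step

# With a single scalar parameter, dA/dtheta is just a matrix (one tensor rank
# below the general case), and exact since A is linear in theta:
dA_dtheta = btcs_matrix(2.0) - btcs_matrix(1.0)
```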
I have a question regarding your statement that lambda needs to be calculated once. Lambda depends on A, which depends on theta. Usually, theta is found iteratively in optimization problems, which means theta changes in each iteration, which also causes A to change, and hence lambda. Of course, if A doesn't depend on theta, then we don't have this problem. What do you think?
Hi, thanks for the comment 😊
Can you give a timestamp to the point in the video you are referring to? It's been some time since I uploaded the video, so I can't exactly recall what I said when in which context.
Typically, during each optimization iteration, you would solve one primal/forward problem to obtain a loss estimate and one adjoint problem to obtain a gradient estimate. For the adjoint problem, you take the same system matrix (just transposed) from the primal pass. Hence, the adjoint variable lambda is independent of whether the system matrix depends on theta or not.
@@MachineLearningSimulation Thanks for the reply. To make myself clearer: what I'm talking about is optimizing until the solution is found. For example, consider gradient descent, where we make a guess for theta and then iteratively update it. If I understand correctly, the method you described in this video is just for one step/iteration of such a process. Am I correct? If so, then I was baffled when you said we need to calculate lambda once, which in my mind meant we get to use the same lambda for all iterations until convergence is reached.
Thanks for the clarification 👍.
Yes, in the case of optimization (with something like gradient descent etc.) you would solve one adjoint problem per iteration. In other words, you would solve two linear systems per iteration: one for the primal/forward evaluation and one for the adjoint problem. As such, your lambda (adjoint variable) is different in each iteration. Now thinking about it: one could say that the adjoint variable depends on theta. Why I struggled with this at first is that usually, in an implementation, you would never find the functional form of how lambda depends on theta; all operations are always done at fixed theta values. But technically, you are right, there is a dependency.
Hope that helps 😊 Let me know if sth is unclear.
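A short sketch of that loop (toy problem invented for illustration, with A fixed and b linear in theta to keep it compact): two linear solves per iteration, with lambda recomputed every time:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 4, 2
A = rng.normal(size=(N, N)) + N * np.eye(N)   # fixed toy system matrix
db = rng.normal(size=(P, N))                  # db/dtheta_p (b linear in theta)
x_target = rng.normal(size=N)
theta = np.zeros(P)

history = []
for it in range(200):
    # Solve 1: primal problem at the current theta
    x = np.linalg.solve(A, theta @ db)
    history.append(0.5 * np.sum((x - x_target) ** 2))
    # Solve 2: adjoint problem -- lambda changes in every iteration
    lam = np.linalg.solve(A.T, x - x_target)
    grad = db @ lam                           # dJ/dtheta_p = lam^T (db/dtheta_p)
    theta -= 0.05 * grad                      # plain gradient descent step
```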
@@MachineLearningSimulation Thanks again. One last point. What exactly did you mean by "you calculate theta only once"? Is there any reason why it should be calculated more than once? I think this is the part that confused me the most.
Great, you're welcome 😊
That's the point where I'd need a time stamp. I can't recall in which context I said this.
Maybe I meant the favorable scaling properties of the adjoint method, i.e., that it scales constantly in the number of parameters. In other words, it does not matter how many entries your parameter vector has, the adjoint problem will always be as hard as the primal/forward problem.
Solving a batch of linear equations does not take O(PN^3). You do the LU decomposition once, which is O(2/3 N^3), and then for each of the P columns we do a forward and a backward substitution, which is O(PN^2) in total.
@MachineLearningSimulation Am I missing something? Why would solving a batch be so much more complex?
Hi, yes, you are right. Solving a dense system batched is (in loose complexity notation) O(N^3 + P*N^2) because, as you correctly mention, the LU decomposition only has to be done once. Still, the argument holds that the direct computation scales linearly in the number of parameters, which becomes problematic if P is large (think, for instance, of neural nets, where you can easily have P ≈ N).
So clear!
Thanks for the comment 😊
I used auto translate (which didn't turn out too well). I hope you liked the video 👍