Adjoint Sensitivities of a Non-Linear System of Equations | Full Derivation

  • Published: 12 Sep 2024

Comments • 16

  • @MachineLearningSimulation
    @MachineLearningSimulation  3 years ago +1

    My speaking is a little fast in this video. Sorry for that. Will be better again in the next videos :)
    You can also set the playback speed to 0.9 if that helps in understanding. Let me know if you had difficulties.

  • @heyjianjing
    @heyjianjing 3 years ago +3

    Thank you very much for this video. I'm a bit confused regarding the complexity difference between the adjoint method and the naive forward method.
    My understanding is that the difference lies in the complexity of solving
    1) lambda in (df/dx)^T * lambda = -(dJ/dx)^T ([N x N] * [N x 1] = [N x 1]) and
    2) (dx/dtheta) in (df/dx) * (dx/dtheta) = -(df/dtheta) ([N x N] * [N x P] = [N x P])
    Once solved, the remaining part does not differ in complexity.
    The point I would like to confirm is this:
    As you mentioned, once x is obtained through one forward simulation, we can evaluate df/dx at the point x, solve Eq. 1), and obtain lambda. But isn't it true that theta is also known at this point (otherwise we would not know x)? Which means all dx/dtheta and df/dtheta in Eq. 2) can be evaluated as well? So is the complexity difference really just that solving Eq. 2) requires P times more operations than solving Eq. 1), and the complexity has nothing to do with the number of forward simulations?
    The reason I'm asking is that from some other videos, such as this one ruclips.net/video/Yiz92Ekn7vUs/видео.html, I got the impression that the adjoint method needs only one forward simulation while the naive method needs P simulations.
    Thanks in advance for your answer!

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      Thank you for your feedback :)
      I can understand your confusion. Assuming I got your question right:
      When we are in the context of a non-linear system:
      The original non-linear system f(x, theta) = 0 has to be solved for both the forward and the backward approach. It requires, for instance, a Newton-Raphson algorithm, which is also different from the algorithms we would need for the forward and backward sensitivity systems. Therefore, let's ignore it for now.
      You correctly noted the systems that have to be solved for either the backward (= adjoint) approach (corresponds to (1)) or the forward approach (corresponds to (2)). They are both linear systems and share the same system matrix, the df/dx Jacobian (for (1) it is just transposed). The solution requires an algorithm for a linear system. Let's assume the systems are large and sparse, and you use a (preconditioned) conjugate gradient. The number of iterations (and hence the computational cost) of this algorithm depends on the condition number of the system matrix. The condition number of a matrix does not change if you transpose it. Hence, both systems are identically conditioned, and one solution of either system takes an identical number of iterations to reach the prescribed tolerance. However, and this is the big point: for the forward approach you not only have to solve one system, but P systems. This becomes infeasible if P is high and the solution of the linear system takes long.
      While writing this comment, I had the feeling that there is probably some confusion about what a "forward solve of a system" means. Let's differentiate the following for this context:
      1) The solution of the non-linear original system (let's call it a forward solve)
      2) The solution of the linear adjoint system (let's call it adjoint/backward sensitivity solve) - your system (1)
      3) The solution of the linear forward sensitivity systems - mind the plural (let's call it forward sensitivity solve) - your system (2)
      Now you want to do the following:
      a) A classical solution without sensitivities: This takes only (1)
      b) Sensitivities using the adjoint method: This takes (1) and (2) -> you get the classical solution for free and in total you have to solve one non-linear and one linear system
      c) Sensitivities using the forward method: This takes (1) and (3) -> you get the classical solution for free and in total you have to solve one non-linear and P linear systems
      d) Sensitivities using Finite Differences: This takes (1) and P times (1) -> you get the classical solution for free and in total you have to solve P+1 non-linear systems (and the derivatives are not exact)
      I hope this sheds some more light on it :) The short sketch below contrasts b) and c) in code.
      It might be a bit confusing, especially with the mentioned algorithms. Please feel free to ask follow-up questions.
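      A minimal NumPy sketch (the Jacobians and dimensions are made up for illustration, not taken from the video) of the point above: after the non-linear forward solve, the adjoint approach b) needs one linear solve, while the forward-sensitivity approach c) needs P of them. Here J is assumed to depend on theta only through x.

```python
import numpy as np

N, P = 5, 3                       # state dimension and number of parameters
rng = np.random.default_rng(0)

# Pretend these Jacobians were evaluated at the converged solution x*:
df_dx = rng.normal(size=(N, N)) + N * np.eye(N)   # df/dx, N x N (kept well-conditioned)
df_dtheta = rng.normal(size=(N, P))               # df/dtheta, N x P
dJ_dx = rng.normal(size=N)                        # dJ/dx, as a length-N vector

# b) Adjoint: ONE linear solve with the transposed Jacobian
lam = np.linalg.solve(df_dx.T, -dJ_dx)            # (df/dx)^T lambda = -(dJ/dx)^T
dJ_dtheta_adjoint = lam @ df_dtheta               # length-P gradient

# c) Forward sensitivities: P linear solves (one per column of df/dtheta)
dx_dtheta = np.linalg.solve(df_dx, -df_dtheta)    # N x P, i.e. P right-hand sides
dJ_dtheta_forward = dJ_dx @ dx_dtheta             # length-P gradient

print(np.allclose(dJ_dtheta_adjoint, dJ_dtheta_forward))  # True: same gradient, different cost
```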

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      And some nice resource to also take a look at, when trying to understand (the complexity of) adjoint methods: developer.nvidia.com/siggraph/2019/video/sig903

    • @heyjianjing
      @heyjianjing 3 years ago

      @@MachineLearningSimulation Hi, thanks for your quick reply.
      The list of b) and c) makes the difference in complexity between solving for lambda and solving for (dx/dtheta) clear to me.
      I also watched the video by Jos Stam (thanks for the link!), and he mentions similar insights at around 19:00, where computing del_x_out takes (mnk) operations and computing p_in takes (nm) operations.
      We still need to multiply del_x_out by p_out, and p_in by del_x_in.
      Here, p_out is 1 x m, so the total operations are (mk + mnk).
      del_x_in is n x k, so the total operations are (nk + nm).
      So when k, which is the dimension of theta, becomes dominant, using p_out * del_x_out is slower.
      I hope I got it right!
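      A small NumPy sketch of this ordering argument (the shapes and the intermediate Jacobian J_mid are my own illustration, not taken from the talk): contracting with the row vector p_out first avoids ever forming the large m x k product when k = dim(theta) dominates.

```python
import numpy as np

m, n, k = 200, 100, 5000            # k = dim(theta), deliberately dominant
rng = np.random.default_rng(0)
p_out = rng.normal(size=(1, m))     # 1 x m row vector
J_mid = rng.normal(size=(m, n))     # hypothetical m x n Jacobian of the outer stage
del_x_in = rng.normal(size=(n, k))  # n x k Jacobian of the inner stage

# Forward order: build del_x_out = J_mid @ del_x_in first (~ m*n*k ops),
# then contract with p_out (~ m*k ops)  -> total ~ mnk + mk
del_x_out = J_mid @ del_x_in
grad_forward = p_out @ del_x_out

# Backward order: p_in = p_out @ J_mid first (~ n*m ops),
# then p_in @ del_x_in (~ n*k ops)      -> total ~ nm + nk
p_in = p_out @ J_mid
grad_backward = p_in @ del_x_in

print(np.allclose(grad_forward, grad_backward))  # True: same result, far cheaper backward
```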

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      ​@@heyjianjing I just figured out, I sent you the wrong link. There is the actual talk of Jos Stam I meant: developer.nvidia.com/siggraph/2020/video/sigg06
      But I am pretty sure that they are quite similar. I will check the point in the video that you were referencing soon and will then reply :)

  • @chrtzstn
    @chrtzstn 2 years ago +1

    Hi there. Enjoy your videos. Just a question: why does u0 depend on theta? Is it just for generality? Many thanks.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hey, thanks a lot for the feedback :)
      Do you have a timestamp you are referring to? In case you mean the loss function J: its dependency on theta is just for generality; e.g., a super simple quadratic loss function would not involve any parameters.

  • @user-ic1so6bb1p
    @user-ic1so6bb1p 11 months ago

    \partial f / \partial theta seemed very confusing to me. To obtain this term, wouldn't one have to build a tangent linear model?

    • @MachineLearningSimulation
      @MachineLearningSimulation  11 months ago

      Hi, thanks for the comment.
      For both tangent-linear and adjoint approaches you need the Jacobians of the root function ("f" in this video) with respect to x and with respect to theta (though for tangent-linear mode JVPs are fine, whereas for adjoint mode VJPs are fine).
      Take a look at line 3 of the summary table I created for autodiff rules: fkoehler.site/implicit-autodiff-table/
      (It's still work in progress, let me know if you find any mistakes.)
      Regarding obtaining the Jacobian with respect to theta: yes, that can be confusing. If your model is implemented in an autodiff framework (like JAX), it should be straightforward. If that is not the case, one probably has to derive it by hand.
      Hope that helps 😊
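      A minimal JAX sketch of that last point (the residual f and all numbers are made-up toy values, not from the video): df/dtheta can be obtained either as a full Jacobian or, as the adjoint formula lambda^T (df/dtheta) actually needs it, as a VJP.

```python
import jax
import jax.numpy as jnp

def f(x, theta):
    # hypothetical residual of a non-linear system, f(x, theta) = 0
    return jnp.array([
        x[0] ** 2 + theta[0] * x[1] - 1.0,
        theta[1] * x[0] + jnp.sin(x[1]),
    ])

x = jnp.array([0.8, 0.3])      # pretend this is the converged solution
theta = jnp.array([1.5, 0.7])
lam = jnp.array([0.2, -0.4])   # pretend this is the adjoint variable

# Full Jacobian df/dtheta (fine for small P):
df_dtheta = jax.jacobian(f, argnums=1)(x, theta)      # shape (N, P)

# VJP: lambda^T @ df/dtheta without ever materialising the Jacobian:
_, vjp_fn = jax.vjp(lambda th: f(x, th), theta)
(lam_df_dtheta,) = vjp_fn(lam)

print(jnp.allclose(lam @ df_dtheta, lam_df_dtheta))   # True
```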

  • @mariusa.6579
    @mariusa.6579 2 years ago +1

    Can you maybe also upload a video where you implement this strategy efficiently with AD (Julia or Python would be great)? And maybe also for sensitivities in ODEs; this would be really helpful, at least for me. I have already watched the Python implementation for the simple linear case, which was excellent; however, it was too simple to understand the generalization to efficiently implement harder real-world cases. An extension on how to feed a parameter optimization algorithm with the derivatives would also be really interesting.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      Hey,
      that's indeed one of the next videos I am going to upload. I really want to show the connection between AD and the adjoint methods. For the AD I plan to do it in JAX, but I might also do a version using Julia's Zygote/ForwardDiff.
      I can totally understand that an implementation in Python/Julia is an immense help in understanding the concepts. It's the same for me :D
      Regarding the feeding to optimization algorithms, that's a great idea! I will add it to my To-Do list. I also plan to have a series on various optimization algorithms, and that could provide a great link between the playlists.

    • @mariusa.6579
      @mariusa.6579 2 years ago

      @@MachineLearningSimulation Hi,
      sorry for the late response! Doing it in Julia would be amazing, also because there is a lot going on there at the moment. I am really looking forward to seeing all the videos in the near future. Keep up the good work!!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      Hey, no problem.
      I think then you might enjoy the next video, which will be a fluid simulation in Julia. :)
      But I will note that down to also have a Julia version in the Adjoint Playlist.

    • @mariusa.6579
      @mariusa.6579 2 years ago

      @@MachineLearningSimulation Awesome! Really looking forward to it :)

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hey,
      I just noticed that I couldn't keep my promise. The playlist on the adjoint methods got pushed into the background a bit since I'm writing my thesis at the moment. There will definitely be more videos (also coding tutorials) after the thesis is done.
      Until then, I decided to upload the scripts I initially wrote in preparation for the videos. Maybe you find them helpful 😊 They are not too well documented, and I will replace them with the proper versions once the videos are uploaded.
      Here for the Adjoint of a non-linear system (actually just one equation not a system) : github.com/Ceyron/machine-learning-and-simulation/blob/main/english/adjoints_sensitivities_automatic_differentiation/adjoint_univariate_nonlinear_system.py
      The same, but with JAX for the jacobians: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/adjoints_sensitivities_automatic_differentiation/adjoint_univariate_nonlinear_system_using_jax.py
      And an example for the Adjoint of an ode: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/adjoints_sensitivities_automatic_differentiation/adjoint_ode.py
      Let me know what you think 😊
      And sorry again for not delivering the videos in time as promised.
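      A minimal sketch in the spirit of the first script above (my own toy residual and loss, not the actual repository code): the adjoint sensitivity of a scalar non-linear equation f(x, theta) = 0, checked against finite differences (approach d) from the earlier reply).

```python
import numpy as np
from scipy.optimize import newton

def f(x, theta):
    return x + np.exp(-x) - theta        # hypothetical scalar residual

def solve(theta):
    # non-linear forward solve via Newton/secant iteration
    return newton(lambda x: f(x, theta), x0=1.0)

theta = 2.0
x_star = solve(theta)

# Loss J(x) = x**2 (no explicit theta dependence); adjoint relation:
# (df/dx) * lambda = -(dJ/dx)  ->  dJ/dtheta = lambda * (df/dtheta)
df_dx = 1.0 - np.exp(-x_star)
df_dtheta = -1.0
dJ_dx = 2.0 * x_star
lam = -dJ_dx / df_dx
dJ_dtheta_adjoint = lam * df_dtheta

# Finite-difference check (needs two extra non-linear solves here)
eps = 1e-6
dJ_dtheta_fd = (solve(theta + eps) ** 2 - solve(theta - eps) ** 2) / (2.0 * eps)
print(dJ_dtheta_adjoint, dJ_dtheta_fd)   # should agree to several digits
```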