Thanks for the great video. In the graph at 6:26, for the incrementally scaled networks, is the “training cost” just considering the incremental cost relative to the previous iteration? Or the cumulative training cost inclusive of the compute expended on the preceding increments?
Thank you for the feedback! The training cost reported for the Tokenformer versions is cumulative, including both the compute spent on the preceding increments and the initial training of the 124M model, while the Transformer cost is reported for each version individually.
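To make the accounting concrete, here is a minimal sketch of how the two curves are tallied. The numbers are made-up placeholders for illustration, not the paper's actual FLOP counts:

```python
# Hypothetical per-stage training costs (arbitrary units, not real values).
initial_124m_cost = 1.0            # training the initial 124M Tokenformer
increment_costs = [0.4, 0.6, 0.9]  # each subsequent scaling increment

# Tokenformer: reported cost at each version is cumulative,
# i.e. initial training plus all preceding increments.
cumulative = []
running_total = initial_124m_cost
for c in increment_costs:
    running_total += c
    cumulative.append(running_total)
print(cumulative)  # [1.4, 2.0, 2.9]

# Transformer: reported cost at each version is just that model
# trained from scratch, independent of the other versions.
transformer_from_scratch = [1.8, 2.5, 3.4]  # hypothetical values
```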
Can you also explain what the potential downsides of this new architecture are?
A potential downside is that, for now, Tokenformer has been tested on a relatively small scale, so its effectiveness for large models is still unproven.
Great channel!
A Tokenformer forms tokens