The paradox mentioned in the video might be alleviated with a few strategies. For instance, if a SimpleLogic example requires more reasoning steps than BERT has the capacity to carry out, BERT should not be able to answer it correctly; if it does, it is probably relying on statistical features rather than on reasoning. We could therefore add an auxiliary loss on examples that need more steps than BERT can handle, penalizing confident correct predictions there so that BERT does not succeed once its capacity is exceeded.
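One possible reading of this extra loss, as a minimal sketch rather than a tested recipe: examples within the model's capacity get the usual cross-entropy, while examples that need more steps are pushed toward a chance-level prediction. The names `capacity_aware_loss`, `max_depth`, and `aux_weight` are hypothetical, and the sketch assumes the number of reasoning steps per example is known and that the model's step capacity can be expressed as a single number, neither of which is specified above.

```python
import math


def binary_cross_entropy(p, target, eps=1e-7):
    """Cross-entropy between a predicted probability and a (possibly soft) target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def capacity_aware_loss(probs, labels, depths, max_depth, aux_weight=1.0):
    """Mean loss over a batch of binary "is the query provable?" predictions.

    probs      : predicted probability of the positive label, per example
    labels     : gold labels (0 or 1)
    depths     : reasoning steps each example requires (assumed known)
    max_depth  : hypothetical number of steps the model can carry out
    aux_weight : hypothetical weight on the beyond-capacity penalty
    """
    total = 0.0
    for p, y, d in zip(probs, labels, depths):
        if d <= max_depth:
            # Within capacity: ordinary supervised loss against the gold label.
            total += binary_cross_entropy(p, float(y))
        else:
            # Beyond capacity: penalize any departure from chance (p = 0.5).
            # Subtracting log(2) makes the penalty exactly zero at p = 0.5.
            total += aux_weight * (binary_cross_entropy(p, 0.5) - math.log(2.0))
    return total / len(probs)


# Usage with made-up numbers; max_depth=6 is an arbitrary illustrative value.
confident = capacity_aware_loss([0.99], [1], [10], max_depth=6)
at_chance = capacity_aware_loss([0.50], [1], [10], max_depth=6)
print(confident > at_chance)
```

With this choice, a confident answer on a too-deep example costs more than a chance-level one, even when the confident answer happens to be correct, which is the behavior the comment asks for. Whether training this way actually removes the reliance on statistical features is an open question that the sketch does not settle.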