Explanation of why KNN is problematic in high dimensions starts at 8:53. Super helpful, surprised this doesn't have more views - thanks for putting it up!
Agreed.
Great video! I understand the curse of dimensionality way better now.
I'm surprised that he could deliver a 16-minute explanation without a single full stop in his sentences. Loads of information packed in here.
Another way of thinking about this is the central limit theorem. If you use Euclidean distance, you essentially sum up contributions from each dimension, so you can imagine it (only conceptually) as taking a kind of mean across all dimensions. The problem is that the distances then become more and more Gaussian/normally distributed as the number of dimensions grows, so they all look more or less equidistant and the notion of "nearest" loses meaning. I don't think this covers the whole hypersphere/hypercube effect, but I thought it might bring up a new perspective.
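You can see this concentration effect numerically. Here's a small NumPy sketch (my own illustration, not from the video): it draws random points in a unit cube and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. In low dimensions the spread is huge; in high dimensions it collapses toward zero, which is exactly the "everything looks equidistant" problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n_points=500):
    """Relative spread (max - min) / min of Euclidean distances
    from one random query point to n_points random data points."""
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  spread={distance_spread(dim):.2f}")
```

The printed spread shrinks steadily as the dimension grows, so the "nearest" neighbour is barely nearer than anything else.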
Which sorting algorithm can sort in sublinear time (< O(n))?
None - you can't even read the input in sublinear time. But the binary search tree data structure has cheap inserts (O(log n) each) and you can search it efficiently. I think the idea is to preprocess the data into a useful structure first, and then find nearest neighbours quickly at query time.
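A minimal sketch of that "preprocess first, then query" idea in one dimension (my own illustration, using a sorted list plus binary search rather than an explicit tree): sorting costs O(n log n) once, and afterwards each nearest-neighbour query is an O(log n) binary search via the stdlib `bisect` module.

```python
import bisect

def nearest(sorted_vals, x):
    """Nearest neighbour of x in a pre-sorted list, via binary search."""
    i = bisect.bisect_left(sorted_vals, x)
    # The answer must be one of the values flanking the insertion point.
    candidates = sorted_vals[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - x))

data = sorted([7, 1, 4, 9, 3])   # one-off O(n log n) preprocessing
print(nearest(data, 5))          # → 4
```

The same pattern generalises to higher dimensions with structures like k-d trees or ball trees, though (as the video discusses) their advantage fades as the dimensionality grows.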