Data set | Euclidean | SNN distances |
Artificial data series, at d=10, d=160, d=640 | ||
All-Relevant | All-Relevant | 50 80 100 125 200 500 |
10-Relevant | 10-Relevant | 50 80 100 125 200 500 |
Cyc-Relevant | Cyc-Relevant | 50 80 100 125 200 500 |
Half-Relevant | Half-Relevant | 50 80 100 125 200 500 |
All-Dependent | All-Dependent | 50 80 100 125 200 500 |
10-Dependent | 10-Dependent | 50 80 100 125 200 500 |
Real data, at native dimension of feature vector | ||
ALOI | ALOI | 5 8 10 12 15 20 50 100 150 200 500 1000 |
Multiple Features (All) | Multifeat-all | 20 50 80 100 125 200 250 300 500 1000 |
Multiple Features (Pixel only) | Multifeat-pixel | 20 50 80 100 125 200 250 300 500 1000 |
Optical Digits | optdigits.pdf | 20 50 80 100 125 200 250 300 500 1000 |
All results in this series were computed using Euclidean distance or an SNN distance based on Euclidean distance.
For the artificial data sets, distances were scaled by 1/sqrt(d), since the diagonal of the unit cube in Euclidean distance grows with sqrt(d). This way, multiple dimensionalities can be compared in the same plot.
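The effect of this scaling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the points are drawn independently and uniformly from the unit cube, which is not necessarily how the artificial data sets above were generated, and the function name `scaled_distances` and its parameters are hypothetical.

```python
import numpy as np

def scaled_distances(d, n_pairs=2000, seed=0):
    """Euclidean distances between random point pairs in [0, 1]^d, divided by sqrt(d).

    Assumption: coordinates are i.i.d. uniform; this is a stand-in for the
    artificial data sets, not their actual generator.
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_pairs, d))
    y = rng.random((n_pairs, d))
    dist = np.linalg.norm(x - y, axis=1)  # raw Euclidean distance per pair
    return dist / np.sqrt(d)              # the unit-cube diagonal sqrt(d) maps to 1

for d in (10, 160, 640):
    s = scaled_distances(d)
    print(f"d={d:4d}  mean={s.mean():.3f}  max={s.max():.3f}")
```

Under the uniform assumption, the scaled distances never exceed 1 and their mean stays roughly near sqrt(1/6) for every d, which is what makes histograms for different dimensionalities comparable on one axis.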
With Euclidean distance, the distances are clearly approximately Gaussian distributed, even in the 10-dimensional data sets and in all of the real-world data sets. One explanation is the Central Limit Theorem: the squared Euclidean distance is a sum of per-axis squared differences, and a sum of many such terms tends toward a normal distribution. This argument does not apply to the SNN setup, since SNN distances are not based on a sum of axis components.
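A rough numerical check of the Gaussian shape is to compute the skewness and excess kurtosis of a distance sample; both are 0 for an exact normal distribution. The sketch below again assumes i.i.d. uniform coordinates as a stand-in for the real generator, and `distance_shape` is a hypothetical helper name.

```python
import numpy as np

def distance_shape(d, n_pairs=5000, seed=0):
    """Return (skewness, excess kurtosis) of Euclidean distances in [0, 1]^d.

    Assumption: i.i.d. uniform coordinates, used purely for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_pairs, d))
    y = rng.random((n_pairs, d))
    # The squared distance is a sum of d per-axis terms, the setting of the CLT.
    dist = np.sqrt(((x - y) ** 2).sum(axis=1))
    z = (dist - dist.mean()) / dist.std()   # standardize the sample
    skewness = float((z ** 3).mean())
    excess_kurtosis = float((z ** 4).mean() - 3.0)
    return skewness, excess_kurtosis

for d in (10, 160, 640):
    skew, kurt = distance_shape(d)
    print(f"d={d:4d}  skewness={skew:+.3f}  excess kurtosis={kurt:+.3f}")
```

Values close to zero at every dimensionality are consistent with the approximately Gaussian histograms described above; this is a sanity check, not a formal normality test.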
Even for the correlated data sets, the distances are approximately normally distributed. However, the normalization applied when plotting the graphs fails for these data sets, so the curves do not overlap.