A Data Science Central Community
I am trying to solve a customer segmentation case study with the help of SAS STAT. I want to know if I use proc distance on a large data set, won't the number of columns proliferate to 100,000 if there are 100,000 rows for a dataset? Does it not imply proc distance cannot be used in the case of large datasets
a standard example from SAS website to illustrate my point:
|Albania 10.1 1||0||.||.||.||.||.|
|Belgium 13.5 9||2.60925||0||.||.||.||.|
|Denmark 10.6 1||3.45989||1.47093||5.26767||0||.||.|
|Finland 9.5 4.||4.25721||3.09614||5.36375||2.36661||0||.|
|Greece 10.2 3.||2.8391||3.19842||6.21726||4.2896||5.0896||0|
Depending on your system specifications running a hierarchichal clustering method like PROC CLUSTER for a 100,000 dataset might not be viable. You can use PROC FASTCLUS for a k-means optimization clustering method which can handle pretty large datasets.
The answer to your question is Yes, the number of columns would "proliferate" to 100,000 because PROC DISTANCE writes a lower triangular matrix or a square matrix to an output SAS data set. This would make the situation you describe as infeasible for analysis. Even if PROC DISTANCE wrote these pairwise distances between observations with only three variables [ID for the first observation, ID for the second observation, and the distance between these two observations], the number of pairwise distances for N observations would equal 0.5*N*(N-1).
The question then becomes, why would you want to calculate about 5,000,000,000 pairwise distances for the 100,000 observations? I doubt whether you could examine any but a small fraction of them. To reduce the number of distances calculated, Mr. Martinez has provided one solution. Another solution would be multiple random samples of say, 1,000 observations each. A third solution, if your data is amenable, would be to sort your data by variables that have low cardinality values [few distinct values] and use PROC DISTANCE's BY-variable statement to calculate distances between observations in the same BY-variable groups. A fourth solution would be to use another SAS procedure or a DATA step to calculate these distances.
Thanks a lot!