# AnalyticBridge

A Data Science Central Community

Hi,

I am trying to solve a customer segmentation case study with the help of SAS STAT. I want to know if I use proc distance on a large data set, won't the number of columns proliferate to 100,000 if there are 100,000 rows for a dataset? Does it not imply proc distance cannot be used in the case of large datasets

Views: 3087

### Replies to This Discussion

a standard example from SAS website to illustrate my point:

 Country Albania_10_1_1 Belgium_13_5_9 Czechoslovakia Denmark_10_6_1 Finland_9_5_4_ Greece_10_2_3_ Albania 10.1 1 0 . . . . . Belgium 13.5 9 2.60925 0 . . . . Czechoslovakia 5.584 5.06665 0 . . . Denmark 10.6 1 3.45989 1.47093 5.26767 0 . . Finland 9.5 4. 4.25721 3.09614 5.36375 2.36661 0 . Greece 10.2 3. 2.8391 3.19842 6.21726 4.2896 5.0896 0

Depending on your system specifications running a hierarchichal clustering method like PROC CLUSTER for a 100,000 dataset might not be viable. You can use PROC FASTCLUS  for a k-means optimization clustering method which can handle pretty large datasets.

The answer to your question is Yes, the number of columns would "proliferate" to 100,000 because PROC DISTANCE writes a lower triangular matrix or a square matrix to an output SAS data set.  This would make the situation you describe as infeasible for analysis.  Even if PROC DISTANCE wrote these pairwise distances between observations with only three variables [ID for the first observation, ID for the second observation, and the distance between these two observations], the number of pairwise distances for N observations would equal 0.5*N*(N-1).

The question then becomes, why would you want to calculate about 5,000,000,000 pairwise distances for the 100,000 observations?  I doubt whether you could examine any but a small fraction of them.  To reduce the number of distances calculated, Mr. Martinez has provided one solution.  Another solution would be multiple random samples of say, 1,000 observations each.  A third solution, if your data is amenable, would be to sort your data by variables that have low cardinality values [few distinct values] and use PROC DISTANCE's BY-variable statement to calculate distances between observations in the same BY-variable groups.  A fourth solution would be to use another SAS procedure or a DATA step to calculate these distances.

Thanks a lot!