Proc distance and Proc cluster in Large datasets - AnalyticBridge 2019-05-21T04:48:40Z https://www.analyticbridge.datasciencecentral.com/forum/topics/proc-distance-and-proc-cluster-in-large-datasets?feed=yes&amp%3Bxn_auth=no Thanks a lot! tag:www.analyticbridge.datasciencecentral.com,2013-03-12:2004291:Comment:236972 2013-03-12T00:52:49.264Z ratheen chaturvedi https://www.analyticbridge.datasciencecentral.com/profile/ratheenchaturvedi <p>Thanks a lot!</p> The answer to your question i… tag:www.analyticbridge.datasciencecentral.com,2013-03-11:2004291:Comment:237134 2013-03-11T12:24:53.856Z Matthew Zack https://www.analyticbridge.datasciencecentral.com/profile/MatthewZack <p>The answer to your question is yes: the number of columns would "proliferate" to 100,000, because PROC DISTANCE writes a lower triangular matrix or a square matrix to an output SAS data set. This would make the situation you describe infeasible for analysis. Even if PROC DISTANCE wrote these pairwise distances with only three variables [ID for the first observation, ID for the second observation, and the distance between these two observations], the number of pairwise distances for N observations would equal 0.5*N*(N-1).</p> <p>The question then becomes: why would you want to calculate about 5,000,000,000 pairwise distances for the 100,000 observations? I doubt whether you could examine any but a small fraction of them.  
To reduce the number of distances calculated, Mr. Martinez has provided one solution. Another solution would be to draw multiple random samples of, say, 1,000 observations each. A third solution, if your data are amenable, would be to sort your data by variables with low cardinality [few distinct values] and use PROC DISTANCE's BY statement to calculate distances only between observations in the same BY group. A fourth solution would be to use another SAS procedure or a DATA step to calculate these distances.</p> Depending on your system spec… tag:www.analyticbridge.datasciencecentral.com,2013-03-08:2004291:Comment:236499 2013-03-08T20:28:28.599Z Guillermo Martinez https://www.analyticbridge.datasciencecentral.com/profile/GuillermoMartinez <p>Depending on your system specifications, running a hierarchical clustering method like PROC CLUSTER on a 100,000-observation dataset might not be viable. 
You can use PROC FASTCLUS, which performs k-means clustering and can handle quite large datasets.</p> a standard example from SAS w… tag:www.analyticbridge.datasciencecentral.com,2013-03-06:2004291:Comment:234997 2013-03-06T11:48:00.686Z ratheen chaturvedi https://www.analyticbridge.datasciencecentral.com/profile/ratheenchaturvedi <p>A standard example from the SAS website to illustrate my point:</p> <table border="0" cellspacing="0" width="853" summary="Page Layout"> <colgroup><col width="214"></col><col width="104"></col><col width="108"></col><col width="102"></col><col width="113"></col><col width="103"></col><col width="109"></col></colgroup><tbody><tr><td height="20" class="xl64" width="214">Country</td> <td class="xl65" width="104">Albania_10_1_1</td> <td class="xl65" width="108">Belgium_13_5_9</td> <td class="xl65" width="102">Czechoslovakia</td> <td class="xl65" width="113">Denmark_10_6_1</td> <td class="xl65" width="103">Finland_9_5_4_</td> <td class="xl65" width="109">Greece_10_2_3_</td> </tr> <tr><td height="20" class="xl66" width="214">Albania 10.1 1</td> <td class="xl63" align="right" width="104">0</td> <td class="xl63" width="108">.</td> <td class="xl63" width="102">.</td> <td class="xl63" width="113">.</td> <td class="xl63" width="103">.</td> <td class="xl63" width="109">.</td> </tr> <tr><td height="20" class="xl66" width="214">Belgium 13.5 9</td> <td class="xl63" align="right" width="104">2.60925</td> <td class="xl63" align="right" 
width="108">0</td> <td class="xl63" width="102">.</td> <td class="xl63" width="113">.</td> <td class="xl63" width="103">.</td> <td class="xl63" width="109">.</td> </tr> <tr><td height="20" class="xl66" width="214">Czechoslovakia</td> <td class="xl63" align="right" width="104">5.584</td> <td class="xl63" align="right" width="108">5.06665</td> <td class="xl63" align="right" width="102">0</td> <td class="xl63" width="113">.</td> <td class="xl63" width="103">.</td> <td class="xl63" width="109">.</td> </tr> <tr><td height="20" class="xl66" width="214">Denmark 10.6 1</td> <td class="xl63" align="right" width="104">3.45989</td> <td class="xl63" align="right" width="108">1.47093</td> <td class="xl63" align="right" width="102">5.26767</td> <td class="xl63" align="right" width="113">0</td> <td class="xl63" width="103">.</td> <td class="xl63" width="109">.</td> </tr> <tr><td height="20" class="xl66" width="214">Finland 9.5 4.</td> <td class="xl63" align="right" width="104">4.25721</td> <td class="xl63" align="right" width="108">3.09614</td> <td class="xl63" align="right" width="102">5.36375</td> <td class="xl63" align="right" width="113">2.36661</td> <td class="xl63" align="right" width="103">0</td> <td class="xl63" width="109">.</td> </tr> <tr><td height="20" class="xl66" width="214">Greece 10.2 3.</td> <td class="xl63" align="right" width="104">2.8391</td> <td class="xl63" align="right" width="108">3.19842</td> <td class="xl63" align="right" width="102">6.21726</td> <td class="xl63" align="right" width="113">4.2896</td> <td class="xl63" align="right" width="103">5.0896</td> <td class="xl63" align="right" width="109">0</td> </tr> </tbody> </table> <p></p>
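<p>To make the arithmetic in the replies above concrete, here is a minimal sketch in Python (not SAS, and not from the original posts) that evaluates the 0.5*N*(N-1) pair count for the six-country example, for a 1,000-observation sample, and for the full 100,000 observations. The helper names and the 8-bytes-per-value storage figure are assumptions made for illustration; actual SAS data set sizes will differ.</p>

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n observations: 0.5 * n * (n - 1)."""
    return n * (n - 1) // 2


def lower_triangle_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Rough storage for a lower-triangular matrix, diagonal included.

    Assumption: each stored distance takes bytes_per_value bytes;
    this ignores any per-row or per-file overhead.
    """
    return (n * (n + 1) // 2) * bytes_per_value


if __name__ == "__main__":
    # 6 matches the country table above; 1,000 is the suggested sample size;
    # 100,000 is the full data set discussed in the thread.
    for n in (6, 1_000, 100_000):
        pairs = pairwise_count(n)
        gigabytes = lower_triangle_bytes(n) / 1e9
        print(f"N={n:>7,}: {pairs:>13,} pairs, about {gigabytes:,.3f} GB")
```

<p>The pair count grows with the square of N, which is why sampling or BY-group processing shrinks the problem so sharply: halving N cuts the number of distances to roughly a quarter.</p>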