A Data Science Central Community
I recently came across a trade-press article with the headline “Mining the Cloud.” The cynic in me immediately issued a silent scoff: How is that different from “crawling the Web”? Are we just pouring old wine into shinier new bottles? Or is there something different here?
But, seeing as I too like to proliferate discussions of mining this or that information type, I was willing to cut the reporter some slack. The article was from Redmond Developer, and concerned “Project Dallas” under Microsoft’s Azure cloud initiative. Essentially, “Project Dallas” (still in beta) supports discovery, manipulation, visualization, and analysis of data retrieved from multiple public, commercial, and private data sources via the Azure cloud. “Dallas” allows enterprises to provide users (via REST, Excel PowerPivot, and/or Visual Basic applications) with online access to aggregated feeds via Azure, which in effect operates as an online information marketplace. Also, “Dallas” allows customers to have Azure host their data for them, or simply continue to host it on their own premises while the cloud service connects securely to it.
That’s all cool, and the screenshots are compelling, but I don’t see any actual data mining, in the strictest sense of that term. In other words, “Dallas” has one data mining feature—interactive information discovery—in spades, but appears to lack some other essential features, such as clustering, classification, regression, and predictive modeling. It’s not as if Microsoft lacks those technologies. After all, the vendor provides decent predictive modeling and data mining through add-ons to SQL Server and Excel, but those features don’t seem to be integrated into this so-called “mining the cloud” service. In a very real sense, this “Dallas” beta is traditional BI in the Azure cloud, with a strong visualization layer. As such, it bodes well for any future plans that Microsoft might develop to make Azure a full-blown BI cloud in its own right.
Rather than quibble any further on this point, I’d like to call attention to another “Dallas” feature that I find very interesting. “Dallas” incorporates an information syndication and licensing model, which frees users from having to separately set up and manage diverse subscriptions. Though you might say, “so what, that’s a standard component of any online content aggregation service,” I’d argue that it’s at the heart of any future service that promises to let you “mine the cloud” (however you define “mine,” but with specific emphasis on public clouds and federated public/private clouds). Considering that a cloud is a highly distributed information environment, and that many public clouds will federate with and among private enterprise clouds, it’s absolutely essential to have federated content syndication and licensing. Essentially, federation provides a web of trust to ensure that you’re only given access to data sets for which you have permissions, and that you’re prevented from accessing any that you’re not allowed to (perhaps because you didn’t pay or because you don’t have valid credentials).
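To make that concrete, here is a minimal sketch of the kind of entitlement check a federated marketplace would presumably run before handing over a data set. It is purely illustrative: the `Dataset` and `User` classes, the `can_access` and `visible_datasets` functions, and the sample catalog are hypothetical names of my own, not part of “Dallas” or any Azure API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A data set offered through the (hypothetical) marketplace."""
    name: str
    provider: str
    requires_subscription: bool


@dataclass
class User:
    """A marketplace user as seen by the permissioning layer (hypothetical)."""
    user_id: str
    credentials_valid: bool
    subscriptions: set = field(default_factory=set)  # names of paid data sets


def can_access(user, dataset):
    """Return (allowed, reason) for one user and one data set."""
    # Rule 1: no valid credentials, no access to anything.
    if not user.credentials_valid:
        return False, "invalid credentials"
    # Rule 2: paid data sets require a matching subscription.
    if dataset.requires_subscription and dataset.name not in user.subscriptions:
        return False, "no paid subscription"
    return True, "ok"


def visible_datasets(user, catalog):
    """List the names of the data sets in the catalog this user may see."""
    return [d.name for d in catalog if can_access(user, d)[0]]


if __name__ == "__main__":
    # Made-up catalog and users, for illustration only.
    catalog = [
        Dataset("weather", "provider_a", requires_subscription=False),
        Dataset("market_prices", "provider_b", requires_subscription=True),
    ]
    subscriber = User("alice", credentials_valid=True,
                      subscriptions={"market_prices"})
    visitor = User("bob", credentials_valid=True)

    print(visible_datasets(subscriber, catalog))
    print(visible_datasets(visitor, catalog))
```

In a real federated environment the credential and subscription facts would come from separate identity and licensing providers rather than from fields on a local object. The point of the sketch is only that both checks have to pass before any data is released.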
That, of course, also points to the need for federated identity, authorization, digital rights management, and permissioning among content clouds. In a Microsoft context, I would expect to see the company leverage its federated identity technologies, especially CardSpace, WS-Federation, and the Identity Metasystem, in any federated cloud content permissioning environment. In a broader industry context, I would expect a role for such federation standards as SAML. Inasmuch as more and more clouds will involve peer-to-peer information provisioning through social networks and RSS feeds, I’d expect to see a role for user-centric federation standards such as OpenID. And, just to complete this thought, I’d expect to see all of this converged with BI-oriented data federation approaches, perhaps leveraging RDF-based ontology/taxonomy specifications for semantic interoperability.
But I’m not seeing that, at least not in Microsoft’s “Project Dallas,” nor, to be honest, in any industry initiatives aimed at making clouds a truly standards-based federated information environment. In terms of “mining” clouds, there are, of course, the Hadoop and MapReduce communities, which have developed a powerful framework for running analytics, predictive modeling included, against complex, distributed information sources. I don’t yet see any clear commitment by Microsoft toward incorporating these technologies into Azure generally, or “Dallas” specifically.
I expect that Microsoft and others will evolve toward this comprehensive vision over time, but it’s not obvious right now. At least not based on what I’ve seen and heard.