
How big your data is depends on the quantity of information that it contains (measured using entropy metrics), rather than the number of terabytes. Huge data that is sparse or shallow is indeed not huge - and can be compressed very efficiently. What do you think?
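To make the entropy point concrete, here is a small Python sketch (my own illustration, with a hypothetical file name) that compares a file's raw size with its compressed size. The compressed size is a rough upper bound on the information the file actually carries; for sparse or repetitive "huge" data it is often a small fraction of the raw byte count.

```python
# Rough sketch: the compressed size of a file is an upper bound on its
# information content. A "huge" but sparse or repetitive file compresses
# to a small fraction of its raw size.
import zlib

def effective_size(path, chunk_size=1 << 20):
    """Return (raw_bytes, compressed_bytes) for a file, compressing in chunks."""
    raw, compressed = 0, 0
    compressor = zlib.compressobj(9)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            raw += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    return raw, compressed

# Hypothetical usage:
# raw, comp = effective_size("transactions.csv")
# print(f"raw: {raw / 1e9:.2f} GB, compressed: {comp / 1e9:.2f} GB "
#       f"({100 * comp / raw:.1f}% of raw size)")
```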


Replies to This Discussion

If I may cross-post the following from our blog at www.conradyscience.com, which speaks to the same point:

Learning = Data Compression
"It has long been understood that even when confronted with a ten-gigabyte file containing data to be statistically analyzed, the actual information-theoretic amount of information in the file might be much less, perhaps merely a few hundred megabytes. This insight is currently most commonly used by data analysts to take high-dimensional real-valued datasets and reduce their dimensionality using principal components analysis, with little loss of meaningful information. This can turn an apparently intractably large data mining problem into an easy problem." [1]
As an alternative to dimension reduction, we can exploit existing regularities in the data to create a more compact and thus more tractable representation with Bayesian networks. "In context of Bayesian network learning, we describe the data using DAGs [Directed Acyclic Graphs] that represent dependencies between attributes. A Bayesian network with the least MDL [Minimum Description Length] score (highly compressed) is said to model the underlying distribution in the best possible way. Thus the problem of learning Bayesian networks using MDL score becomes an optimization problem." [2] Consequently, learning Bayesian networks is inherently a form of data compression.
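To make the dimension-reduction and MDL points a bit more tangible, two small Python sketches follow. Both are my own illustrations; the data, thresholds, and function names are invented for the example and are not taken from the papers cited below.

```python
# Sketch of the PCA idea: a dataset that only *looks* high-dimensional
# collapses to the handful of components that actually carry variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 10,000 rows that appear 200-dimensional but really live on a 10-dim subspace
latent = rng.normal(size=(10_000, 10))
mixing = rng.normal(size=(10, 200))
X = latent @ mixing + 0.01 * rng.normal(size=(10_000, 200))

pca = PCA(n_components=0.99)           # keep enough components for 99% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (10000, 200) -> roughly (10000, 10)
```

And here is a deliberately simplified MDL score for a candidate DAG over discrete data (lower score = more compact description = better-fitting structure). Real structure-learning libraries implement this scoring far more carefully; this is only meant to show the penalty-plus-likelihood trade-off.

```python
# Minimal MDL score for a candidate Bayesian-network structure over
# discrete data: parameter description length minus log-likelihood (bits).
import math
import pandas as pd

def mdl_score(df: pd.DataFrame, parents: dict) -> float:
    """MDL score of a DAG given as {node: [parent columns]} over discrete columns."""
    n = len(df)
    score = 0.0
    for node, pa in parents.items():
        r = df[node].nunique()                                    # states of the node
        q = int(df[pa].drop_duplicates().shape[0]) if pa else 1   # parent configurations
        score += 0.5 * math.log2(n) * q * (r - 1)                 # parameter description length
        if pa:
            joint = df.groupby(pa + [node]).size()
            parent_counts = df.groupby(pa).size()
            for idx, n_ijk in joint.items():
                n_ij = parent_counts[idx[:-1] if len(pa) > 1 else idx[0]]
                score -= n_ijk * math.log2(n_ijk / n_ij)          # conditional log-likelihood
        else:
            for n_k in df[node].value_counts():
                score -= n_k * math.log2(n_k / n)                 # marginal log-likelihood
    return score

# Hypothetical usage: compare an empty graph with one that adds A -> B
# df = pd.read_csv("discrete_data.csv")
# print(mdl_score(df, {"A": [], "B": []}),
#       mdl_score(df, {"A": [], "B": ["A"]}))
```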

References:
[1]  Davies, S., and A. Moore. “Bayesian networks for lossless dataset compression.” In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 391, 1999.
[2] Hamine, Vikas. “Learning Optimal Augmented Bayes Networks” (n.d.). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.6100.

 

Thanks Stefan. Great answer.

Big data is generally used to describe the massive amounts of unstructured data that cost a lot of time and money to analyze. "Large data" and "huge data" may not carry that special meaning; they just refer to volume.

I didn't know that LinkedIn wouldn't reverse post...So, here it is again... ;-)

First, I agree with all the comments that suggest "It Depends."  The arbitrary aspect can be reduced if you apply spatial dimensionality (i.e., length, width, and depth, or xi, yi, zi) with respect to the Data Processing Appliance (xi, yi), Business Intelligence Platform (z1), and Software (z2) being used to turn the data into a useful asset. These vary by data owner, which is what creates the arbitrary responses.

Seeing that these three data-handling items directly target the ability to handle data in relationship to time, the definition of "Big Data" then heads down the path of space-time, bringing the time element into play (i.e., xi, yi, zi, ti). However, I would take this method one step further and add a monetary value to the data (i.e., xi, yi, zi, ti, mi).  The monetary calculation would be based on the cost of data acquisition, storage, and relevance.

Therefore, Vince, in order to define "Big Data" for each evaluation of an independent data owner's current installation, one must first define their current data-environment baseline by calculating the current continuum + monetary index.  Doing so would require that a subject process (or a sample of subject processes, test/control) be chosen, capturing the current five dimensions to be measured.

If structured properly, the resulting mathematical computation can then be used to determine big data with respect to the data owner's relative configuration and monetary data-asset value, thus removing the arbitrary aspect of "It Depends."  An additional side benefit of this exercise would be the ability to determine the break-even price point (ROI) on the purchase of new xi, yi, or zi.
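For what it's worth, here is a very rough Python sketch of how the "continuum + monetary index" baseline described above might be captured. Every field name and formula is a placeholder of my own; the point is only that once the five dimensions (xi, yi, zi, ti, mi) are recorded as numbers, the baseline and the break-even comparison stop being arbitrary.

```python
# Placeholder sketch of a "continuum + monetary index" baseline for one
# subject process. All fields and formulas are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SubjectProcess:
    appliance_throughput_gb_per_h: float  # xi, yi: what the appliance can move per hour
    platform_capacity_gb: float           # z1: what the BI platform can handle
    software_capacity_gb: float           # z2: what the analysis software can handle
    hours_available: float                # ti: time window for the workload
    acquisition_cost: float               # mi: cost to acquire the data
    storage_cost: float                   # mi: cost to store the data
    relevance_value: float                # mi: estimated business value of the data

def baseline_index(p: SubjectProcess) -> float:
    """Continuum (capacity constrained by time) plus monetary index."""
    continuum = min(p.platform_capacity_gb,
                    p.software_capacity_gb,
                    p.appliance_throughput_gb_per_h * p.hours_available)
    monetary = p.relevance_value - (p.acquisition_cost + p.storage_cost)
    return continuum + monetary

def max_upgrade_price(current: SubjectProcess, upgraded: SubjectProcess) -> float:
    """Break-even price for new xi, yi, or zi: spend no more than the index gain."""
    return baseline_index(upgraded) - baseline_index(current)
```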
