# AnalyticBridge

A Data Science Central Community

I am attempting a POC for big data regression modeling. Since actual data is hard to come by, can I use a smaller data set and replicate it in some way to get a large data set? What's the best way to do that?


### Replies to This Discussion

Yes, through simulation. You first have to compute the means of all the variables as well as the correlation or covariance matrix. What you do next depends on your software. For example, if you assume a multivariate normal distribution, you can use the R mvrnorm function to generate as many samples as you like.

http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/mvrnorm.html
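For readers not working in R, the same idea can be sketched in plain Python: factor the covariance matrix with a Cholesky decomposition and transform independent standard normal draws. This is only a minimal illustration, not a replacement for mvrnorm; the two "sensor" variables, their means, and their covariance matrix are made-up values, and `cholesky` and `sample_mvn` are hypothetical helper names.

```python
import math
import random

def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov.

    cov must be a symmetric positive definite matrix (list of lists).
    """
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_mvn(means, cov, n_rows, seed=0):
    """Draw n_rows samples from a multivariate normal with the given means and covariance."""
    rng = random.Random(seed)
    L = cholesky(cov)
    d = len(means)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent N(0, 1) draws
        # x = mean + L z has covariance L L^T = cov
        rows.append([means[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                     for i in range(d)])
    return rows

# Hypothetical summary statistics for two sensors (assumed values, not from real data)
means = [70.0, 1.2]
cov = [[25.0, 0.6],
       [0.6, 0.04]]

rows = sample_mvn(means, cov, n_rows=50_000)
```

The number of rows is arbitrary, so the same summary statistics can drive a data set of whatever size the POC needs.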

Another way to do it would be to add a new variable as a weighting variable that represents the number of occurrences of each sample observation. Most stat packages can handle this.
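To illustrate the weighting idea, here is a small Python sketch showing that a simple regression fitted with frequency weights gives the same slope and intercept as fitting the physically replicated rows. The data values and the `weighted_ols` helper are invented for the example.

```python
def weighted_ols(x, y, w):
    """Slope and intercept of a simple linear regression with frequency weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical small sample, with a made-up number of occurrences per observation
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
freq = [1000, 2500, 500, 4000]

# Fit on 4 rows using the weights
slope_w, intercept_w = weighted_ols(x, y, freq)

# Fit on the 8000 replicated rows with unit weights
x_rep = [xi for xi, f in zip(x, freq) for _ in range(f)]
y_rep = [yi for yi, f in zip(y, freq) for _ in range(f)]
slope_r, intercept_r = weighted_ols(x_rep, y_rep, [1] * len(x_rep))
```

The point estimates agree, but how standard errors are computed depends on how a given package interprets the weights, so that is worth checking in its documentation.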

But since you are framing this as a "big data" problem, it sounds like using simulation to generate the actual raw data may be the better way to go.

Thanks for your reply, Ralph. Just a correction: I know the ranges of the predictors but do not have a small data set. This is, by the way, a predictive maintenance problem. I can use mvrnorm for the predictors (sensors) as you suggested. But how do I assign the target variable (1/0) once I have the normal distribution for the predictors? Any ideas? Or should I take the higher ranges of sensor values, randomly generate 1 or 0 there, and assign 0 to the rest?

One possible way to simulate values for the dependent variable is to use a conditional distribution estimated from the small data you have. This extends Ralph's recommended method of using a suitable joint distribution to simulate values for the predictors.

Once you have a model built on the small data and a set of simulated values for the independent variables, predict values/probabilities of the dependent variable and add an error term (which may be drawn from i.i.d. normal(0, 0.1)). This is a method I have used to create datasets for POC/R&D/training projects involving many different types of Generalized Linear Models.
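A rough Python sketch of that step for a continuous response might look like the following. The coefficients stand in for a model fitted on the small data set and are invented here, the predictor is drawn from an assumed uniform range, and the 0.1 is read as a standard deviation, which is an interpretation rather than something stated above.

```python
import random

rng = random.Random(42)

# Hypothetical coefficients, standing in for a model fitted on the small data set
intercept, slope = 1.5, 0.8

def predict(x):
    """Prediction from the (assumed) fitted linear model."""
    return intercept + slope * x

# Simulated values of the independent variable (assumed range 0 to 10)
x_sim = [rng.uniform(0.0, 10.0) for _ in range(10_000)]

# Assumption: the 0.1 in normal(0, 0.1) is treated as a standard deviation
noise_sd = 0.1
y_sim = [predict(x) + rng.gauss(0.0, noise_sd) for x in x_sim]
```

For a binary response, one option would be to compute the predicted probability instead and draw a 0/1 outcome from it.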

Tejamoy, sorry for the misleading opening statement. I do not have the small data set, only the ranges of the predictors. Now I have to add a 0/1 target and simulate a predictive maintenance problem so that I can use a classification method such as logistic regression or a decision tree. Any ideas on how to generate the fault (0/1) column of my data set?

In that case, what you can do is:

Create a linear combination of the variables (predictors), say, LC = a+b1*x1+...+bN*xN

a, b1,...bN being known numbers (as opposed to parameters to be estimated). For example:

LC = 12.64+0.32*x1+...-0.987*xN

Create ELC = exp(LC)/[1+exp(LC)]

Then create the binary dependent as

if ELC < 0.4 then Y = 0

Else if ELC > 0.6 then Y = 1

Else if a random variable (drawn from a Uniform distribution) > 0.05 then Y = 1

Else Y = 0

(Use an appropriate random number generator function depending on which software you are using)

Now with simulated values of x1,...,xN you can have a dataset of any size.
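The recipe above could be sketched in Python roughly as follows. The sensor names, their ranges, and the coefficient values are made up for illustration, the predictors are drawn uniformly within their ranges, the uniform draw is assumed to be on [0, 1), and `logistic` and `simulate_fault` are hypothetical helper names.

```python
import math
import random

def logistic(lc):
    """ELC = exp(LC) / (1 + exp(LC)), written to avoid overflow for large |LC|."""
    if lc >= 0:
        return 1.0 / (1.0 + math.exp(-lc))
    e = math.exp(lc)
    return e / (1.0 + e)

def simulate_fault(x, intercept, coefs, rng):
    """Assign the binary target Y using the thresholds from the recipe."""
    lc = intercept + sum(b * xi for b, xi in zip(coefs, x))
    elc = logistic(lc)
    if elc < 0.4:
        return 0
    if elc > 0.6:
        return 1
    # Ambiguous band 0.4 <= ELC <= 0.6: Y = 1 unless the uniform draw is <= 0.05
    return 1 if rng.random() > 0.05 else 0

rng = random.Random(7)

# Hypothetical predictor ranges: temperature 50-100, vibration 0-5
ranges = [(50.0, 100.0), (0.0, 5.0)]
# Hypothetical known coefficients: LC = -9.0 + 0.08 * temperature + 0.6 * vibration
intercept, coefs = -9.0, [0.08, 0.6]

n_rows = 20_000
X = [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(n_rows)]
Y = [simulate_fault(x, intercept, coefs, rng) for x in X]
```

Changing the intercept shifts the share of faults in the data, which is one way to mimic how rare failures are in a given setting.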

Does this make sense?

It does! Thanks a ton

Ratheen, you raised an interesting topic for discussion.

Tejamoy, is this proposed solution usable for other problems, not necessarily predictive maintenance, e.g. in insurance? I am thinking especially of rare events (natural hazard modeling, where the frequency might be, say, 1 to 5 storms during a certain period) and the severity of losses (given a distribution of loss sizes from history).

Similarly, what about health care risk assessment for certain diseases, as in the "Framingham Heart Study", a logistic regression application giving odds ratios for coronary problems based on various risk factors (age, family history, etc.)? Thanks.