Theo: One idea is that you must purchase a number of transactions before using the paid service, and add dollars regularly. A transaction is a call to the API.

The service is accessed via an HTTP call that looks like
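The original example call was not preserved in the thread. A hypothetical shape for it, with made-up endpoint and parameter names (`client_id`, `data_url`, `service`, `mode` are all assumptions, not the actual DataShaping API), might be:

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- NOT the actual DataShaping URL.
base = "https://api.example.com/v1/transaction"
params = {
    "client_id": "12345",   # identifies the client account (credits)
    "data_url": "yyy",      # where the script fetches the source data
    "service": "zzz",       # e.g. a predictive scoring algorithm
    "mode": "training",     # "training" or "scoring"
}
request_url = base + "?" + urlencode(params)
print(request_url)
```

The `mode` parameter corresponds to the training/scoring switch described below.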

When the request is executed:

  • First, the script checks whether the client has enough credits (dollars)
  • If so, it fetches the source data from the client's web server: the URL for the source data is yyy
  • Then the script checks whether the source data is valid, or whether the client server is unreachable
  • Then it executes the service zzz, typically a predictive scoring algorithm
  • The parameter field tells whether you are training your predictor (data = training set) or using it for actual predictive scoring (data outside the training set)
  • Then it processes the data very fast (a few seconds for 1MM observations in the training step)
  • Then it emails the client when done, with the location (on the DataShaping server) of the results; the location can be specified in the API call as an additional field, with a mechanism in place to prevent file collisions
  • Then it updates the client's budget
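The steps above can be sketched as a single request handler. Everything here is a stub, with assumed names and prices, standing in for the real implementation:

```python
# Sketch of the transaction flow described above; every function is a stub.

budgets = {"client-1": 10.0}      # remaining credits, in dollars
COST_PER_OBSERVATION = 0.0025     # price of scoring one observation (see below)

def fetch_source_data(data_url):
    # Fetch the data from the client's web server (stubbed payload here).
    return [{"x": 1.0}, {"x": 2.0}]

def is_valid(data):
    # Reject empty or malformed data.
    return bool(data)

def run_service(data, mode):
    # Train on the data, or score observations outside the training set.
    if mode == "training":
        return {"model": "trained"}
    return [{"score": 0.5} for _ in data]

def handle_request(client_id, data_url, mode):
    # 1. Check that the client has enough credits.
    if budgets.get(client_id, 0.0) < COST_PER_OBSERVATION:
        return "insufficient credits"
    # 2-3. Fetch and validate the source data.
    data = fetch_source_data(data_url)
    if not is_valid(data):
        return "invalid data or client server unreachable"
    # 4-6. Execute the service.
    result = run_service(data, mode)
    # 7. Email the client with the result location (omitted in this sketch).
    # 8. Update the client's budget.
    budgets[client_id] -= COST_PER_OBSERVATION * len(data)
    return result

result = handle_request("client-1", "yyy", "scoring")
print(result, budgets)
```

In practice the email step and the result-file naming (with collision prevention) would sit between scoring and the budget update, as the list describes.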

Note that all of this can be performed without any human interaction. Retrieving the scored data can be done with a web robot and then integrated into the client's database, again automatically. Training the scores would be charged much more than scoring one observation outside the training set. Scoring one observation is one transaction, and could be charged as little as $0.0025.
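At $0.0025 per scored observation, a 1MM batch works out to $2,500 for the client. A quick check of that arithmetic:

```python
price_per_observation = 0.0025   # dollars, the figure quoted above
batch_size = 1_000_000           # one large batch, as described below
batch_cost = price_per_observation * batch_size
print(f"${batch_cost:,.2f}")     # cost of scoring the whole batch
```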

This architecture is intended for daily or hourly processing, but it could be used in real time if the parameter is not set to "training". However, when designing the architecture, my idea was to process large batches of transactions, maybe 1MM at a time.

Replies to This Discussion


That is pretty much the architecture we are implementing for Stillwater, although we are focused on big data and deep computes. One quick note: due to the inertia of big data, the analytics need to move to the data, instead of the data to the analytics. This is also the trend in commercial databases. Both Oracle and IBM are educating the IT universe that it is better to keep the data secure in their database and move the analytics in-database. Cloudera and the Hadoop universe tell a similar story, but theirs is based on the distributed nature of their data container: Hadoop's basic architecture dispatches mappers and reducers towards the data. This implies that the service needs to handle both small data and big data, and dispatch accordingly. Let's bring Mike Zeller in: his PMML service is running on Amazon and he and his team have solved similar problems.
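The "dispatch accordingly" point could be sketched as a routing decision on data size: small jobs ship the data to the analytics service, big jobs ship the analytic (e.g. a map/reduce job) to where the data lives. The threshold and names here are illustrative, not Stillwater's or Hadoop's actual API:

```python
# Illustrative cutoff between "small data" and "big data" -- an assumption.
BIG_DATA_THRESHOLD_BYTES = 10 * 1024**3   # 10 GB

def dispatch(job_size_bytes):
    # Small data: pull the data to the analytics service.
    # Big data: push the analytic to the data (Hadoop-style).
    if job_size_bytes < BIG_DATA_THRESHOLD_BYTES:
        return "move data to analytics"
    return "move analytics to data"

print(dispatch(5 * 1024**2))    # a small job
print(dispatch(50 * 1024**3))   # a big job
```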

On moving data in-database (from the client to Oracle or IBM): these large Oracle/IBM databases will be attacked by hackers much more often than "local" (out-)databases. However, in-databases will have better protection systems. There are advantages and disadvantages to in-databases; for clients with very strong forensics expertise, in-database is probably not worth the cost.

Sounds promising, but I think the training step should not be automatic. There are many issues with both data quality and model selection, so human judgment (expert judgment, in fact) is a very important component of the whole modeling process. Using the trained model to make predictions online is great, I agree, but I feel that an out-of-the-box training algorithm run on a set of "raw" data would produce a model of poor quality. Of course, clients would still be able to upload their data over the Net for the experts to model.

We now have some experience from the market, and it has vindicated Vincent's original architecture. We found that the market is most interested in rapid consumption of a best-known method (BKM) for a particular business-process problem. For consumption through a service (the acronym for Analytics as a Service just isn't right), heavy customization capability is not the pain point; easy consumption of a BKM is.

(P.S. Typing this from the road in Beijing where we are testing these ideas....)

good example


On Data Science Central

© 2020 Data Science Central LLC