Subscribe to DSC Newsletter

Help Needed! Getting a user-friendly tool (or system) to prepare large data sets for analysis...

I have a 5-20K USD budget to get a tool (w/wo dedicated hardware), including training for 2 people, to prepare (clean, merge, etc...) and summarize (samples, grouping/aggregations, etc) large data sets (over 5GBs, over 4 million rows).

I need an efficient solution (fast data processing and fast to learn), and definitely easy to use. I would have done it with excel if excel could handle such large data files.

After preparing and summarizing the data, I'll use excel pivot tables for the analysis. I have no programming experience. The datasets are usually in csv or txt formats.

Any advice?

Views: 1918

Comment

You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge

Comment by Siva Karthikeyan on February 23, 2014 at 5:53pm

SAP Lumira Desktop Edition can be also of good help.

Comment by Ahmed on December 24, 2013 at 6:56pm

@ aliyu Dauda and Thomas J Bate:

I spent the last 50 days exploring excel PowerPivot, PowerQuery, Visokio Omniscope, Amazon EC2 Virtual Machines, etc..

PowerPivot and PowerQuery are must-have tools, I never thought I could fit 66 Million rows in an Excel Pivot Table! PowerQuery and PowerPivot add-ins come with Office 365 ProPlus, which will cost you only $12 a month (after a free one month trial), you can download them for free on Office 2010 but with limited functionality. They are very easy to use, but I struggled at the end with PowerQuery trying to sort my data, it took endless time! So I had to try another other solution (Omniscope).

Omniscope is exactly what I needed, a very powerful data management tool with great functionality and like Thomas said, same level of ease of use and accessibility as Excel. I tried it on my 4GB RAM laptop and it worked great for all files under 500MBs. For bigger files, I had to rent a powerful virtual machine from Amazon. The tool will consume all the available memory that you have; I rented a 244 GB RAM machine running on Windows Server 2012, and the tool consumed more than 95% of the available memory, even though the file I loaded was only 800MBs, but the tool is very efficient.

I also tried to run PowerQuery on the same 244 GB RAM machine, thinking that it will be faster than my 4GB laptop, but I was surprised that there was almost no difference (both running 64-bit office), I still couldn't sort my file! and it took less than 4 minutes to sort the same file on Omniscope! Now I prepare my data on Omniscope on AWS, and export the output to Excel for use in PowerPivot and (Pivots when working on samples).

It was a very exciting journey of trial and error, and quite nice discoveries (AWS, PowerPivot, Omniscope, etc..) Things I learned in 2013! Thank you!

Comment by aliyu Dauda on November 6, 2013 at 3:55am

Hi Ahmed. Apologies for the late response. PowerPivot for excel is free, you can download it as an add-in on microsoft's website. The limitation is as follows: for pwerpivot for excel 2010, the file size cannot be more than 2G for the 32bit version and 4G for the 64bit version. But if you already have excel 2013 running on your machines, then there is no limitation to the file size. The speed now depends on the configuration of your machines.

As for deploying powerpivot on AWS, I am not sure that can be done. If you can install excel on AWS virtual machine then you should be able to install the powerpivot add-in. You spend nothing extending the power of excel using powerpivot. Please see the links below:

Powerpivot for Excel 2013

PowerPivot Download

DAX Resources

You do not need a dedicated hardware, this should run pretty well on your PC. But if you want topnotch performance you may check out this hardware:

HP BI Decision Machine

Comment by Paulo Santos on October 29, 2013 at 9:35am

You can use pentaho data integration.

I have used it with 13 million records and done all the summaries needed for large data sets.

It can do most of the steps of excel pivot tables integrated in the process.

The programming is graphic in nature most of the times and you just drag and drop boxes that represent agrregations, joins and so on.

Comment by Paul Hiller on October 25, 2013 at 8:22am

User friendly? Say no more, Ahmed! I work for StatSoft, so, of course, I recommend the tool rated "#1 in User Satisfaction" by 2013 Rexer Data Miner Survey: that tool is STATISTICA. Yes, we have free trialware, but since you are looking for training in addition to software, you should contact our U.K. office in London and they can help you: http://www.statsoft.co.uk/.

Comment by Blaise F Egan on October 25, 2013 at 7:58am

I would recommend MariaDB, the replacement for MySQL. There are lots of good books on SQL and MySQL that you can learn from. The software is free.

Comment by Thomas J Bate on October 25, 2013 at 5:29am

You install Omniscope on a rented AWS Desktop the same as you would on your home machine, just download and install. Omniscope will do most of what PowerPivot does, but with no other (M$oft) software needed, just Omniscope. Omniscope functionally replaces MS Access, Excel and PowerPoint in one integrated in-memory visualisation and analytic application, and you do not need a relational database like SQL Server for your sources, so you can probably do the whole job for just the cost of renting the cloud AWS desktop for a month....peanuts.

Comment by Ahmed on October 25, 2013 at 5:20am

Thanks Vincent for the leaving the comment here.

@ Thomas, seems like a great idea. I spent sometime trying to learn more about AWS and Omniscope. (I don't have any experience in programming or databases), but how do I install Omniscope on the could (AWS)?

@aliyu great idea as well, but do I need buy dedicated hardware to use powerpivot?

Or combining aliyu and Thomas' ideas, can I use Powerpivot on AWS' virtual machines? If yes, how do I setup this?

Comment by aliyu Dauda on October 24, 2013 at 2:20pm
millions of rows easily using the in-memory data analytics engine Microsoft calls xVelocity. You can extend its capability with DAX; Data analysis expression language which is very easy to learn. You can create pivot tables on the fly while you clean you data at the same time. One caveat though, I don't know how structured or unstructured your data is. If it's unstructured and less amenable to tabular models, you may need to clean and structure it before moving it to powerpivot. Having worked on a project with data as large as yours using powerpivot, I can assure you it would be quite adequate for you.
Comment by aliyu Dauda on October 24, 2013 at 2:07pm
I think you can use excel powerpivot. It's a simple but powerful tool. It can handle

On Data Science Central

© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service