
Hello fellow Miners

I am currently reading "Getting Things Done" by David Allen and I am still impressed by his simple but powerful system. Now I wonder whether there is a similar system (i.e. best practices) for daily data analysis.

In such an analysis a huge number of files is created: code snippets, more extensive parameterizable scripts, reports, model files, remarks etc. On the one hand you want to be as flexible as possible, e.g. trying out parameters of complex models to get a feeling for the data; on the other hand you have to focus on reproducible results. Nothing is more embarrassing than one of those moments where you remember a remarkable result but cannot reproduce it because you have changed a file it depends on (e.g. the SQL script which loads the data initially).

I am glad to be using RapidMiner, which supports complete process descriptions in XML consisting of closed-code functions. By closed code I mean that you cannot change the behaviour of functions on the fly, as opposed to performing the analysis with a bunch of loosely coupled Python scripts (RapidMiner itself is of course open source). But this is not enough, because I still produce a lot of files.

David Allen says that the key is to trick oneself into using defined and reliable procedures. What is your trick?


Replies to This Discussion

The GTD system is, as far as I can tell (without having read the book), about prioritising your work with lists. I use the Prince2 "product breakdown structure" system to establish my clients' analytical requirements, then prepare a list of all the data processing, calculations and visualisation that are or may be needed. In terms of documented and repeatable procedures, my file system is littered with fragments of "how to" files that remind me exactly what scripts and code I ran to get the results I wanted.
Thanks Robin for your reply

GTD also talks about how to organize your stuff, and this is what I am interested in. As you said, your system is also "littered". If it works, that is fine (I work in a similar way), but I wonder if there is an easier approach that takes less time to regain an overview of the last analysis process.

Maybe a tool that creates a view on top of your file system, so you can (at least) tag the files?
The product breakdown structure provides the overview (see the penultimate image on this post about mind-mapping). This sort of documentation helps clients to understand what I'm providing (as it relates their requirements to my activities) and helps me to remember what I've promised to do!

I've not yet felt the need to plunge into file system tagging. The unix search tools (find, locate, grep etc) and the file naming system (e.g. "heatmap.r.howto") that I use mean that I don't have too much difficulty retrieving these fragments. I do see the merit in having more meta-data.
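For readers who prefer a script to the shell tools, here is a minimal Python sketch of the same retrieval idea: look up notes by the naming convention and optionally by a keyword in their text. The function name `find_howtos` and the default `.howto` suffix are assumptions for illustration, modelled on the "heatmap.r.howto" example, and are not taken from Robin's actual setup.

```python
from pathlib import Path


def find_howtos(root, keyword=None, suffix=".howto"):
    """Return sorted paths under root whose names end in suffix.

    If keyword is given, keep only files whose text contains it
    (case-insensitive), roughly like piping find into grep -il.
    """
    matches = []
    for path in Path(root).rglob(f"*{suffix}"):
        if not path.is_file():
            continue
        if keyword is not None:
            # Ignore undecodable bytes so odd files do not abort the search.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if keyword.lower() not in text.lower():
                continue
        matches.append(path)
    return sorted(matches)
```

Calling `find_howtos("projects", keyword="heatmap")` would list every note under a (hypothetical) `projects` folder that mentions heatmaps.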

An interesting aspect of my work is the extent to which methods from past projects are recycled in future ones. In almost all cases the overall method is different each time, with only the individual tasks being replicated from project to project. I've stopped organising project folders according to the client's requirements (e.g. folder per "deliverable output") and started organising according to the software package or method. It's easier to find and remember analytical methods when your work is organised from the perspective of techniques (which are consistent) rather than results (which vary each time).
Thanks for your insights and the link. I really like the idea of building your own method-dependent library.

However, I am not yet convinced by the file-naming strategy. A standard cross-validation produces one dataset per iteration (for score analysis), the process description itself, the performance table and the log file. If I now test various preprocessing steps in order to study their impact, rather than keeping only the best result, this number of files is multiplied by x.

But I guess you are a disciplined and hard-working documentation writer ... maybe it is just me.

kind regards,

Two factors help to motivate me in this regard: a) I'm the main (sole) audience for the documentation and b) I don't want to waste time re-solving problems or having to RTFM again!

I don't document everything. Often the files speak for themselves. A single summary explanation will often suffice for many iterations (and thus for many files).

I would recommend that you keep copies of everything. Your time is far more scarce than hard disk space (and data tables tend to compress quite well). If I'm doing lots of iterations I keep all associated files in one folder, then duplicate this folder each time. The documentation then keeps track of the differences between each iteration (if required).
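The duplicate-the-folder habit described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `snapshot_iteration`, the `iter_001` numbering scheme and the `CHANGES.txt` note file are made up for illustration, and the archive folder is assumed to live outside the working folder.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_iteration(work_dir, archive_dir, note):
    """Copy work_dir into a new numbered folder under archive_dir.

    The note says what changed in this iteration and is written to
    CHANGES.txt inside the copy. All names here are hypothetical.
    """
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    # Number the copies iter_001, iter_002, ... in creation order.
    existing = [p for p in archive.iterdir()
                if p.is_dir() and p.name.startswith("iter_")]
    target = archive / f"iter_{len(existing) + 1:03d}"
    shutil.copytree(work_dir, target)
    stamp = datetime.now().isoformat(timespec="seconds")
    (target / "CHANGES.txt").write_text(f"{stamp}\n{note}\n", encoding="utf-8")
    return target
```

A call such as `snapshot_iteration("work", "archive", "switched to z-score normalisation")` freezes the current state before the next experiment, and the one-line note is often all the documentation an iteration needs.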

ps. You've made me curious about Rapid Miner now!
Thanks again Robin for your helpful recommendations. I will definitely integrate them into my concept.

Regarding RapidMiner: It can be downloaded from its website, where a helpful tutorial is also available. If you have problems getting started, visit the forum or feel free to send me a PM. I am glad to help.
To the rest of this community: What are your tricks for keeping an overview and making sure results are always reproducible?
Just like in any software project, it is best to "version" both your code and the data you are using. The code should be no problem; you can always maintain multiple versions of the code. It can get tricky if you are maintaining multiple versions of databases, especially when they are large and you may be restricted in how much data you can store. So at the very least your code comments need to reflect the changes needed to allow for reproducible results. You can also run into problems if you extract the data from a database multiple times and the data has changed in between extracts. Those are among the worst kinds of data problems.

-Ralph Winters
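One lightweight way to notice the "data changed between extracts" problem Ralph describes is to store a checksum of each extract next to the code version that consumed it. The sketch below assumes the extract is saved as a file; the function names, the manifest layout and the `code_version` string (for example a source-control tag) are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_extract(extract_path, manifest_path, code_version):
    """Store the extract's hash and the code version that used it."""
    entry = {
        "extract": Path(extract_path).name,
        "sha256": file_sha256(extract_path),
        "code_version": code_version,
    }
    Path(manifest_path).write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return entry


def extract_unchanged(extract_path, manifest_path):
    """True if the extract still matches the hash in the manifest."""
    entry = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return entry["sha256"] == file_sha256(extract_path)
```

If `extract_unchanged` returns False after a fresh extract, the underlying data has moved since the recorded result was produced, which is worth knowing before trying to reproduce it.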
Pretty funny picture, I hope it's not your office!

I have lots of tricks, but lists are death for me. I spend lots of time making them, and then never refer to them again. Whiteboards and chalkboards work (somewhat) for me, but paper and electronic versions always fail.

One trick for me is making sure that someone is depending on a result, and that the result has a due date.
Thanks both of you for your remarks.

@Ralph: Yes, source control is essential. Tag the code and (please) accept the policy not to change tagged code and (*gasp*) save it back to the tag. I still wonder why some systems (at least Subversion) allow this. Regarding the data problem: I am happy that I have not had this problem yet. The data I am currently dealing with has various time-related columns, which allow me to recover states from the past.
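As an illustration of recovering a past state from time-related columns, here is a small "as of" filter. The thread does not say how the columns are laid out, so the `valid_from` and `valid_to` keys and the sample rows are purely hypothetical.

```python
from datetime import date


def as_of(rows, when):
    """Return the rows that were valid on the date `when`.

    Each row is a dict with hypothetical keys 'valid_from' and
    'valid_to'; valid_to is None for the current version of a record.
    The interval is half-open: valid_from <= when < valid_to.
    """
    return [
        row for row in rows
        if row["valid_from"] <= when
        and (row["valid_to"] is None or when < row["valid_to"])
    ]


# Made-up history of one record that changed on 2009-06-01.
history = [
    {"customer": "A", "segment": "basic",
     "valid_from": date(2009, 1, 1), "valid_to": date(2009, 6, 1)},
    {"customer": "A", "segment": "premium",
     "valid_from": date(2009, 6, 1), "valid_to": None},
]
```

With rows shaped like this, `as_of(history, date(2009, 3, 15))` yields the version that an analysis run in March would have seen, regardless of later changes.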

@Gene: No, that is not my office :). Your last statement reminds me of one of my favorite quotes:
I do not need time, what I need is a deadline (Duke Ellington)

However: if someone with extraterrestrial powers offered to create software in no time that solves the described problems exactly as I need, I still could not describe the whole picture. It is getting clearer, but it is still fuzzy.
That sounds like the opposite of Inbox Zero... which would make it 'Inbox Infinity'!
What is the advantage of this system compared to a desktop search engine like Google Desktop?

OK, you get a kind of backup service (but there are other services for that) and you can access the documents from anywhere. But then, you have to send all these mails.


On Data Science Central

© 2021 TechTarget, Inc.
