
Time issue in creating a huge data frame from a MongoDB collection

I am using R to build a predictive model. My data is stored in MongoDB collections, so I connected to MongoDB from R, converted the BSON values to lists, and built a data frame by iterating a cursor over about 64,567 records, which took about 15 minutes.

Creating the data frame takes longer every day. Has anyone here worked in a similar situation? How did you solve the time issue?


Tags: connecting to MongoDB using R, predictive modeling



Comment by rupesh p on July 30, 2015 at 2:45pm

Thank you for commenting, Luca.

I was using the code below to load the data frame via the cursor:

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "campaign", "Daytona Speedway - Search")
query <- mongo.bson.from.buffer(buf)

cursor <- mongo.find(mongo, ns, query)
res <- list()
while (mongo.cursor.next(cursor)) {
     res[[length(res) + 1]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
}
df <- do.call(rbind.data.frame, c(res, stringsAsFactors = F))
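Since the total record count is known up front (64,567 here), pre-sizing the list avoids repeated reallocation as it grows. A minimal base-R sketch of that idea, with synthetic rows standing in for the BSON documents (the field names are illustrative, not the real schema):

```r
# Synthetic stand-ins for documents pulled off a MongoDB cursor.
docs <- lapply(1:1000, function(i) list(id = i, campaign = "Daytona Speedway - Search"))

# Pre-size the list once, then fill it slot by slot...
rows <- vector("list", length(docs))
for (i in seq_along(docs)) {
     rows[[i]] <- data.frame(docs[[i]], stringsAsFactors = FALSE)
}
# ...and bind a single time at the end, instead of rbind-ing inside the loop.
df <- do.call(rbind, rows)
```

Binding once at the end matters because growing a data frame row by row copies the whole object on every iteration.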

However, I was running my R script on a machine with very little RAM, which may have limited the RAM available to RStudio. Running the script on a server with plenty of RAM improved my execution time considerably!

I am using a list here too, but I would like to know how I can create a separate environment for the list in the above code; I am new to R!

Comment by Luca on July 2, 2015 at 4:07am

Hi Rupesh, I was in much the same situation a few weeks ago: I had to query a collection yielding a final dataset of nearly 3 million records. I had two main problems: memory, to hold both the MongoDB result (~20 GB) and the final dataset at the same time, and the linear cost of each append to a list in R (quadratic overall). The first version of the code was stopped after 12 hours, only 20% done. After a lot of googling and several different approaches, I ended up storing intermediate results in a MySQL table and using the data.table package, and the whole job ran in under two hours. But the main trick was to create a separate environment for the list built from MongoDB; I really want to thank Ferdinand for that answer:
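The capacity-doubling accumulator described above can be sketched in base R with its state kept in its own environment (all names here are illustrative, not the ones from the actual script):

```r
# Create an accumulator whose state lives in a dedicated environment,
# so add() can update it without copying on each call.
new_accumulator <- function() {
     env <- new.env()
     env$data  <- vector("list", 1)  # storage, grown by doubling
     env$count <- 0                  # number of items actually stored
     env
}

add <- function(env, item) {
     # When full, double the capacity (amortized O(1) per append).
     if (env$count == length(env$data)) length(env$data) <- 2 * length(env$data)
     env$count <- env$count + 1
     env$data[[env$count]] <- item
}

acc <- new_accumulator()
for (i in 1:10) add(acc, i)
items <- acc$data[seq_len(acc$count)]  # drop the unused tail
```

Because environments are modified in place rather than copied, each append stays cheap even as the list grows.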

This is the code I eventually used, with names obfuscated for confidentiality:

ADD <- function(item){
     if( .GlobalEnv$CNT == .GlobalEnv$SZ ) length(.GlobalEnv$DT) <- .GlobalEnv$SZ <- .GlobalEnv$SZ * 2
     .GlobalEnv$CNT <- .GlobalEnv$CNT + 1
     .GlobalEnv$DT[[.GlobalEnv$CNT]] <- item
}
libs <- c('data.table', 'rmongodb', 'RMySQL')
wrng <- lapply(libs, require, character.only = TRUE)

print('Querying mongodb...')
mng_conn <- mongo.create(username=my.user.mng, password=my.pwd.mng, db=my.coll)
t <- mongo.find.all(mng_conn, my.ns,
     query = '{ "my.date.field" : {"$gte":"yyyy-mm-dd"} }',
     fields = '{ "my.fields" : 1 }')
wrng <- mongo.disconnect(mng_conn)

db_conn <- dbConnect(MySQL(), username=my.user.sql, password=my.pwd.sql, dbname=my.db)
wrng <- dbSendQuery(db_conn, "TRUNCATE TABLE my.table;")
CNT <- 0
SZ <- 1
DT <- list(NULL)
print('Building data frames...')
for(idx in 1:length(t)){
     if( (idx%/%5e4)*5e4 == idx ){
          print('Saving results in MySQL...')
          DT <- rbindlist(DT)
          wrng <- dbWriteTable(db_conn, my.table, unique(DT[order(my.fields)]),
               row.names = FALSE, append = TRUE, overwrite = FALSE)
          CNT <- 0
          SZ <- 1
          DT <- list(NULL)
          print('Building data frames...')
     }
     if( length(t[[idx]]$xxx1$xxx2b) > 0 ){
          for( sidx in 1:length( t[[idx]]$xxx1$xxx2b ) ){
               if( t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3a %in% 1:4){
                    ADD(list(
                         code = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3a,
                         description = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3b,
                         level = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3c,
                         parent = ifelse(sidx == 1, 0, t[[idx]]$xxx1$xxx2b[[sidx-1]]$xxx3d)
                    ))
               }
          }
     }
     if((idx%/%1000)*1000 == idx) print( paste('Records processed:', idx ) )
}
DT <- rbindlist(DT)
wrng <- dbWriteTable(db_conn, my.table, unique(DT[order(my.fields)]), row.names = FALSE, append = TRUE, overwrite = FALSE)
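The other big win in the script is data.table::rbindlist, which binds a list of records in one pass instead of copying on every append. A small self-contained illustration with toy records (not the obfuscated schema above):

```r
library(data.table)

# Toy records shaped like the lists the script passes to ADD().
rows <- lapply(1:5, function(i) list(code = i,
                                     description = paste0("item", i),
                                     level = i %% 2))

# One pass over the whole list; NULL elements would simply be skipped,
# which is why the unused tail of the doubled list is harmless.
DT <- rbindlist(rows)
setorder(DT, code)  # analogous to DT[order(my.fields)] in the script
```

rbindlist also ignores NULL entries, so the pre-grown list can be passed to it directly without trimming the empty slots first.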


© 2021 TechTarget, Inc.
