
Time issue when creating a huge data frame from a MongoDB collection

I am using R to build a predictive model. My data is stored in MongoDB collections, so I connected to MongoDB from R, converted the BSON values to lists, and built a data frame of about 64,567 records using a cursor, which took about 15 minutes.


It is taking more and more time to create the data frame every day. Has anyone here worked in a similar situation? How did you solve this time issue?


Tags: connecting to MongoDB using R, predictive modeling


Comment by rupesh p on July 30, 2015 at 2:45pm

Thank you for commenting, Luca.

I was using the code below to load the data frame via the cursor:

# query: only documents for the "Daytona Speedway - Search" campaign
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "campaign", "Daytona Speedway - Search")
b <- mongo.bson.from.buffer(buf)

# projection: drop _id, __v, createdAt and updatedAt from the result
muf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(muf, "_id", 0L)
mongo.bson.buffer.append(muf, "__v", 0L)
mongo.bson.buffer.append(muf, "createdAt", 0L)
mongo.bson.buffer.append(muf, "updatedAt", 0L)
m <- mongo.bson.from.buffer(muf)

library(plyr)

cursor <- mongo.find(mongo, ns = DBNS, query = b, fields = m)
adwordsd <- data.frame()   # start from an empty data frame
while (mongo.cursor.next(cursor)) {
     tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
     tmp.df <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
     adwordsd <- rbind.fill(adwordsd, tmp.df)   # one rbind per record
}
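A likely reason this gets slower as the collection grows is that rbind.fill copies the whole accumulated data frame on every pass through the loop, so the total cost grows roughly quadratically with the number of records. Below is a minimal sketch of an alternative, assuming the same mongo connection, DBNS, b and m objects as above: collect one small data frame per document in a list and bind them once at the end.

library(plyr)

cursor <- mongo.find(mongo, ns = DBNS, query = b, fields = m)

rows <- list()   # one small data frame per document
i <- 1
while (mongo.cursor.next(cursor)) {
     tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
     rows[[i]] <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
     i <- i + 1
}
mongo.cursor.destroy(cursor)

# a single bind at the end instead of one rbind per record
adwordsd <- do.call(rbind.fill, rows)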

However, I was running my R script on a machine with very little RAM, which probably left RStudio with little memory to work with. Running the script on a server with much more RAM improved the execution time considerably.

I am using a list here too, but I would like to know how to create a separate environment for the list in the code above; I am new to R.
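For what it's worth, one way to keep the growing list out of .GlobalEnv is to hold it, together with its counters, in a dedicated environment created with new.env(). This is only a sketch of the idea behind the answer Luca links below; the names acc and add_row are illustrative and not from either script in this thread.

acc <- new.env()               # dedicated environment for the accumulator
acc$DT  <- vector("list", 1)   # pre-allocated list, doubled when full
acc$CNT <- 0                   # number of items stored so far
acc$SZ  <- 1                   # current allocated length of DT

add_row <- function(item, env = acc) {
     # environments have reference semantics, so no <<- is needed
     if (env$CNT == env$SZ) {
          env$SZ <- env$SZ * 2
          length(env$DT) <- env$SZ
     }
     env$CNT <- env$CNT + 1
     env$DT[[env$CNT]] <- item
}

# after the loop, keep only the filled slots and bind them once,
# e.g. with data.table::rbindlist(acc$DT[seq_len(acc$CNT)])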

Comment by Luca on July 2, 2015 at 4:07am

Hi Rupesh, I was in much the same situation a few weeks ago: I had to query a collection whose final dataset came to nearly 3 million records. I had two main problems: memory, since both the MongoDB result (~20 GB) and the final dataset had to be held at the same time, and the cost of appending elements one at a time to a list in R. The first version of the code I wrote was still at 20% after 12 hours, so I stopped it. After a lot of googling and trying different approaches, I ended up storing intermediate results in a MySQL table and using the data.table package, and the whole run now takes less than 2 hours. But the main trick was to create a separate environment for the list built from the MongoDB result; I really want to thank Ferdinand for this answer:

http://stackoverflow.com/questions/17046336/here-we-go-again-append...

This is the code I eventually used, with identifiers obfuscated for confidentiality:

ADD <- function(item){
     # double the list size whenever it is full, then append in place
     if( .GlobalEnv$CNT == .GlobalEnv$SZ ){
          length(.GlobalEnv$DT) <- .GlobalEnv$SZ <- .GlobalEnv$SZ * 2
     }
     .GlobalEnv$CNT <- .GlobalEnv$CNT + 1
     .GlobalEnv$DT[[.GlobalEnv$CNT]] <- item
}
libs <- c('data.table', 'rmongodb', 'RMySQL')
wrng <- lapply(libs, require, character.only = TRUE)
rm(libs)

print('Querying mongodb...')
mng_conn <- mongo.create(host=my.host.mng, username=my.user.mng, password=my.pwd.mng, db=my.coll)
t <- mongo.find.all(
     mng_conn,
     'my.coll.xxx0',
     query = '{
          "xxx0.date" : {"$gte":"yyyy-mm-dd"}
     }',
     fields = '{
          "xxx1.xxx2a.myID":1,
          "xxx1.xxx2b.xxx3a":1,
          "xxx1.xxx2b.xxx3b":1,
          "xxx1.xxx2b.xxx3c":1,
          "xxx1.xxx2b.xxx3d":1,
          "_id":0
     }'
)
wrng <- mongo.disconnect(mng_conn)

db_conn <- dbConnect(MySQL(), host=my.host.sql, username=my.user.sql, password=my.pwd.sql, dbname=my.db)
wrng <- dbSendQuery(db_conn, "TRUNCATE TABLE my.table;")
CNT <- 0          # number of items currently stored in DT
SZ <- 1           # allocated length of DT (doubled by ADD when full)
DT <- list(NULL)  # pre-allocated accumulator list
print('Building data frames...')
for(idx in 1:length(t)){
     # every 50,000 records, bind the accumulated rows and flush them to MySQL
     if( (idx%/%5e4)*5e4 == idx ){
          print('Saving results in MySQL...')
          DT <- rbindlist(DT)
          wrng <- dbWriteTable(db_conn, my.table, unique(DT[order(my.fields)]),
                  row.names = FALSE, append = TRUE, overwrite = FALSE)
          CNT <- 0
          SZ <- 1
          DT <- list(NULL)
          print('Building data frames...')
     }
     if( length(t[[idx]]$xxx1$xxx2b) > 0 ){
          for( sidx in 1:length( t[[idx]]$xxx1$xxx2b ) ){
               if( t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3a %in% 1:4){
                    ADD(list(
                         code = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3a,
                         description = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3b,
                         level = t[[idx]]$xxx1$xxx2b[[sidx]]$xxx3c,
                         parent = ifelse(sidx == 1, 0, t[[idx]]$xxx1$xxx2b[[sidx-1]]$xxx3d)
                         )
                    )
               }
          }
     }
     if((idx%/%1000)*1000 == idx) print( paste('Records processed:', idx ) )
}
# bind and write whatever is left after the loop
DT <- rbindlist(DT)
wrng <- dbWriteTable(db_conn, my.table, unique(DT[order(my.fields)]), row.names = FALSE, append = TRUE, overwrite = FALSE)
