Big Data Analysis using ff and ffbase

Native R stores everything in RAM. (For details, see R memory management.) Depending on the hardware configuration, R objects can typically take up only 2-4 GB of memory; beyond that, R returns "Error: cannot allocate vector of size ……" and leaves us unable to work with big data in R.
[Figure: Data storage with standard R objects]
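To see the limit in action, ask R for more memory than the machine has; a minimal sketch (the exact size reported in the message will vary with your hardware):
> x <- numeric(2e10)   # ~149 GB of doubles, far beyond typical RAM
Error: cannot allocate vector of size 149.0 Gb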
Thanks to the open-source nature of R, a community of scholars continuously strives to create packages that help us work effectively with big data.
The ff package, developed by Daniel Adler, Christian Gläser, Oleg Nenadic, Jens Oehlschlägel, and Walter Zucchini and maintained by Jens Oehlschlägel, is designed to overcome this limitation. It stores data as native binary flat files on other media such as a hard disk, CD, or DVD rather than in memory, and it allows you to work on very large data files simultaneously. It reads the data file in chunks and writes each chunk to the external drive.
[Figure: Data storage with ff]
Reading a CSV file using the ff package
> options(fftempdir = "[provide the path where you want to store the binary files]")
> file_chunks <- read.csv.ffdf(file = "big_data.csv", header = TRUE, sep = ",", VERBOSE = TRUE, next.rows = 500000, colClasses = NA)
This reads the big_data CSV file chunk by chunk, with the chunk size given by next.rows. Each chunk is written as a binary file to the external media, and only a pointer to that file is stored in RAM. This step is repeated until no chunks of the CSV file are left.
[Figure: Functioning of the ff package]
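A quick sanity check on the result (a sketch; the column name Col1 is illustrative): the object behaves much like a data frame, but only pointers live in RAM until you subscript it.
> class(file_chunks)    # "ffdf" - a disk-backed data frame of pointers
> dim(file_chunks)      # rows and columns, as with a data.frame
> file_chunks[1:5, ]    # subscripting pulls just these rows into RAM as a data.frame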
In the same way, we can write CSV and other flat files in chunks: ff reads chunk by chunk from the HDD (or any other external media) and writes the data out to CSV or another supported format.
> write.csv.ffdf(file_chunks, "file_name.csv")
Together with the ffbase package, ff gives us the facilities to implement all sorts of operations such as joins, aggregations, and slicing and dicing.
> Merged_data <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), trace = TRUE)
The merge function for ff objects in the ffbase package works much like merge for data frames, but it supports inner and left joins only.
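For a left join, pass all.x = TRUE, as with base merge; a sketch reusing the illustrative objects above:
> Left_joined <- merge(ffobject1, ffobject2, by.x = c("Col1", "Col2"), by.y = c("Col1", "Col2"), all.x = TRUE, trace = TRUE)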
> library("ffbase")
> library("doBy")
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$Col1), FUN = function(x) summaryBy(Col3 + Col4 + Col5 ~ Col1, data = x, FUN = sum))
To perform the aggregation, I used the summaryBy function from the doBy package. In the ffdfdply call above, we split the data on the basis of a key column. If the key is a combination of two or more fields, we can generate a key column using the ikey function:
> ffobject$KeyColumn <- ikey(ffobject[c("Col1", "Col2", "Col3")])
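The generated key can then drive the split in ffdfdply; a sketch continuing the illustrative column names above, aggregating the remaining measure columns over the composite key:
> AggregatedData <- ffdfdply(ffobject, split = as.character(ffobject$KeyColumn), FUN = function(x) summaryBy(Col4 + Col5 ~ Col1 + Col2 + Col3, data = x, FUN = sum))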
With all its advantages, such as working with big data and little dependency on RAM, ff has a few limitations:
1. Sometimes we have to compromise on speed when performing complex operations on huge data sets.
2. Development is not as easy with ff as with plain data frames.
3. We need to take care of the flat files that ff stores on disk, otherwise the HDD or external media is left with little or no space (see the cleanup sketch below).
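One way to reclaim that disk space once an ffdf is no longer needed (a sketch: delete() removes the backing binary files, then rm() drops the pointer object):
> delete(file_chunks)   # remove the binary flat files from fftempdir
> rm(file_chunks)       # remove the pointer object from RAM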

#####################

Opening Large CSV Files in R

Before heading home for the holidays, I had a large data set (1.6 GB with over 1.25 million rows) with columns of text and integers ripped out of the company (Kwelia) database and put into a .csv file, since I was going to be offline a lot over the break. I tried opening the CSV file in the usual way:
all <- read.csv("file.csv")
However, it never finished even after letting it go all night. I also tried reading it into a SQLite database first and reading it out of that, but the file was so messy it kept coming back with errors. I finally got it read into R by using the ff package and the following code:
library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)
Because the file was so messy, I had to turn off column classes (colClasses=NA) so that the read would not try to assign each column a class based on the first 10,000 rows. After reading the first 10,000 rows, the script then reads in chunks of 50,000 so as not to completely overload the RAM in my laptop. I also turned on VERBOSE because it would drive me nuts not to be able to follow the progress.
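As a follow-up (a sketch, not part of the original workflow): once the ffdf is in, ffsave() can archive the binary files plus metadata so that a later session can ffload() the object instead of re-parsing the CSV.
ffsave(x, file="file_ffdf")   # writes file_ffdf.ffData and file_ffdf.RData
# in a later session:
ffload("file_ffdf")           # restores x without re-reading the CSV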
