A Data Science Central Community

Java doesn’t support set operations directly, so nested loops have to be used to realize the operations of intersection, union, complement and etc. between text files. If there are many text files, or the file to be computed is too big to be loaded into the memory, or it is required to perform set operations according to multiple fields, the code will become even more complicated. However, with the assistance of esProc, which supports set operations directly, Java can realize these operations more easily. Through the following example, you will learn in detail how esProc works.

There are two small files: *f1.txt* and *f2.txt*, with their first rows being the column names. You are required to get intersection of the *Name* fields of the two files. Part of the data is as follows:

File *f1.txt*:

File *f2.txt*:

esProc code:

Import the two files into the memory using *import* function, with *tab* being the separator by default. Here function option *@t* means importing the first row as the column names, thus the subsequent computation can use *Name* and *Dept* directly to reference the corresponding columns. If the first row is not the column names, then the default column names such as *_1* and *_2* should be used to make the reference.

After computing, values of A1 and B1 are respectively as follows:

*import* function can import specified columns. In this case, for instance, only *Name* field will be computed, so you can merely import this column. The code for importing it is file("E:\\f1.txt")[email protected](Name).

A2: *isect* function performs intersection operation. A1.(Name) means drawing out A1’s Name field to create a set, B1.(Name) means drawing out B1’s Name field. The final result of this example is as follows:

A3: The code result A2 means outputting the computed result of A2 to JDBC. You can also get the final result by combining A2 and A3 into one step, with the code being written as result [A1.(Name),B1.(Name)].isect().

The above is the process of intersection operation. For union operation, you just change the function in the code, that is, [A1.(Name),B1.(Name)].union(). The computed result is as follows:

The code for complement operation is [A1.(Name),B1.(Name)].diff(). The computed result is as follows:

The concatenation operation, which is equal to union operation with duplicates retained, is special. The code for it is [A1.(Name),B1.(Name)].conj(). The computed result is as follows:

You can also replace the functions with operators to make the code simpler. Thus the above code for operations of intersection, union, complement and concatenation can be rewritten as:

A1.(Name) ^ B1.(Name)

A1.(Name) & B1.(Name)

A1.(Name) \ B1.(Name)

A1.(Name) | B1.(Name)

Or you can perform set operations on multiple files. For instance, the corresponding variables of *f1.txt*, *f2.txt* and *f3.txt* after they are imported into the memory are A1, B1 and C1 respectively. The code for getting their intersection is as follows:

A1.(Name) ^ B1.(Name) ^C1.(Name) 或 [A1.(Name),B1.(Name),C1.(Name)].isect()

Sometimes the file is so big that it would degrade performance. In this occasion, *sort* function can be used to sort data first, and then perform set operations using *merge* function. In this way, the performance will be improved significantly. Function options *@i*, *@u* and *@d* need to be used respectively in performing operations of intersection, union and complement. Their corresponding code is respectively as follows:

=[A1.(Name).sort(),B1.(Name).sort()][email protected]()

=[A1.(Name).sort(),B1.(Name).sort()][email protected]()

=[A1.(Name).sort(),B1.(Name).sort()][email protected]()

*merge* function can also be used in performing set operations on multiple fields. Suppose different *Dept* corresponds to the same *Name*, now it is required to perform intersection operation by taking the two fields as a whole. The code will be written as: [A1.sort(Dept,Name),B1.sort(Dept,Name)][email protected](Dept,Name)

The computed result is as follows:

For files that are too big to be loaded into the memory, *cursor* function can be used to import them and *merge* function to perform set operations on them. For example, the code for intersection operation is as follows:

A1=file("e:\\f1.txt").cursor()

=B1=file("e:\\f2.txt").cursor()

=A2=[ A1.sortx(Name),B1.sortx(Name)][email protected](Name)

Note: *cursor* function won’t import all the data into the memory at once. It opens a file in the form of cursors (or flow). esProc engine will allocate appropriate buffer for each imported batch of data. The operation thus runs until it is completed.

Different from in-memory computing, the use of cursor requires *cursor* functions. For instance, *sortx* function should be used for sorting. Here *merge* function uses two function options: *@i*, which means performing intersection operation, and *@x*, which means the object involved in the computation is cursor instead of in-memory data. Besides, functions like *union* can only perform set operations on in-memory data; hence they cannot be used to deal with big files.

The above esProc script has completed all data processing. The rest of the work is to integrate the script with Java program through JDBC. Java code is as follows:

// create a connection using esProc JDBC

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

// call esProc script, whose name is *test*

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");

st.execute();//execute esProc stored procedure

ResultSet set = st.getResultSet();//get the result

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge