Optimizing ETL Processes for Working with Relational Consumers
Whatever tasks an analytics solution tackles, it needs data in the first place to turn into valuable insights. Prognoz Platform enables you to tap into available external data sources as well as data stored in the platform's internal structures. In this article, I'll show you the best ways to load data into internal structures and export data to external relational sources using the embedded ETL tool, the ETL Task.
This object enables you to perform a wide variety of tasks, including:
- Import data from a wide spectrum of sources such as Excel, Access, and text files, as well as from any data source for which a third-party OLEDB provider exists
- Transform data using a range of standard tools to filter, group, join, split, and edit data
- Export data to Excel, Access, and other files, as well as to relational storage structures
Nowadays, data analysis often involves gigabytes of data coming from various sources, and these data frequently need to be consolidated into a single destination. Let's see the best way to achieve that. First, decide which consumer you are going to use and why. To load data into one destination, the ETL Task offers two consumers: the repository consumer and the OLEDB consumer. Let's take a quick look at the pros and cons of each.
The undeniable advantage of the repository consumer is its use of optimized SQL queries for data loading. This speeds up loading by tens or even hundreds of times. The repository consumer also supports updating existing table records.
Its drawback is the need for extra repository objects (a table object or a linked table) in addition to the ETL Task itself.
As for the pros of the OLEDB consumer, it supports a wider range of DBMSs: it can work with almost any DBMS that has an adapted OLEDB provider. Moreover, it requires no additional repository objects; all connection settings are specified directly in the consumer within the ETL Task.
However, the OLEDB consumer has apparent cons, namely its use of an open connection string. This means the login and password for database access are stored in plain text. Being a universal object, the OLEDB consumer also cannot optimize the loading process with methods specific to each particular DBMS. Finally, it lacks the record update feature, since not all OLEDB providers support it.
Let’s consider a simple example of data loading. We have a small file containing 1,000 records. Note how speed differs. It took less than a second to perform calculations and load data into the repository.
The same operation with the OLEDB consumer took about one and a half seconds.
So, we have a half-second difference per 1,000 records. At first, the difference might seem insignificant, but it scales with the data: at a million records, that half second per thousand adds up to roughly 500 seconds, or more than eight minutes.
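The gap largely comes down to how rows reach the database. As a rough analogy (this is not how Prognoz Platform is implemented internally), the following sqlite3 sketch contrasts a generic record-at-a-time insert with a batched, prepared insert, which is the kind of optimization the repository consumer relies on:

```python
import sqlite3

ROWS = [(i, i * 1.5) for i in range(1000)]

def load_row_by_row(conn):
    # Generic approach: one INSERT statement executed per record.
    cur = conn.cursor()
    for r in ROWS:
        cur.execute("INSERT INTO facts (id, value) VALUES (?, ?)", r)
    conn.commit()

def load_batched(conn):
    # Optimized approach: one prepared statement run over the whole batch.
    conn.cursor().executemany("INSERT INTO facts (id, value) VALUES (?, ?)", ROWS)
    conn.commit()

counts = []
for loader in (load_row_by_row, load_batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, value REAL)")
    loader(conn)
    counts.append(conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0])
print(counts)  # [1000, 1000]
```

Both loaders produce identical tables; the batched variant simply spends far less time on per-statement overhead, and that overhead is what multiplies as record counts grow.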
In this example, we've learned how to load data into an empty table. Now let's consider a situation when the table is not empty and, instead of overwriting the data, we need to add new records and update existing ones.
Here, the repository consumer will help us out. Open it for editing. Go to the Data Consumer page and uncheck the Clear Consumer Before Loading Data checkbox. Skip the Error Handling page and go down to the Update Records page. On this page, specify the fields that uniquely identify records, such as the table's primary or unique keys.
During data loading, new records will be added and the COUNTRYOWNER and VALUE fields will be updated for the existing records.
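In relational terms this is an upsert: insert on a new key, update on an existing one. The sketch below illustrates the behavior with SQLite's `ON CONFLICT DO UPDATE` (SQLite 3.24+); the `country` key column is a hypothetical name chosen for the example, while `COUNTRYOWNER` and `VALUE` are the fields mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE indicators (
    country TEXT PRIMARY KEY,   -- hypothetical unique key field
    countryowner TEXT,
    value REAL)""")

# Records already present in the consumer table.
conn.executemany("INSERT INTO indicators VALUES (?, ?, ?)",
                 [("FR", "Europe", 10.0), ("US", "America", 20.0)])

# Incoming load: "US" already exists and must be updated, "JP" is new.
incoming = [("US", "North America", 25.0), ("JP", "Asia", 30.0)]
conn.executemany("""
    INSERT INTO indicators (country, countryowner, value) VALUES (?, ?, ?)
    ON CONFLICT(country) DO UPDATE SET
        countryowner = excluded.countryowner,
        value = excluded.value""", incoming)

rows = {c: (o, v) for c, o, v in conn.execute("SELECT * FROM indicators")}
print(rows["US"])  # ('North America', 25.0)
print(len(rows))   # 3
```

The existing "US" row is updated in place while "JP" is appended, which mirrors what the Update Records page configures in the repository consumer.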
Now, I'll show how to configure the consumers in the ETL Task when a source contains incorrect data, for example, when a string appears among the integers in a table column. The Error Handling page of the ETL Task is designed specifically for such cases. Let's see how it works using the repository consumer as an example.
As you remember, our file contains a string instead of an integer. The exception handling strategy offers three options: stop the task calculation, skip only incorrect records, or skip all records in the transaction where the exception occurred. Let's dwell on each option.
- The first option is most often used when critical data is loaded. It stops task performance on exception occurrence.
- The second one inserts records one by one, skipping those in which exceptions occur. This option fits best when you need to load the maximum amount of data.
Note that skipping only incorrect records takes about as much time as loading data with the OLEDB consumer, because the quick insert is disabled in this mode.
- The third option skips all records in one transaction when an exception occurs. You can manage the number of records in a transaction with the Number of Records Processed in One Transaction parameter on the Error Handling page. For example, let's set the number of records in a transaction to 10 and run the task. As a result, we'll get info on 10 skipped records.
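The three strategies can be sketched in a few lines. This is a simplified model, not the ETL Task's actual code: a record is "incorrect" here if its value cannot be parsed as an integer, matching the example above.

```python
def load(records, strategy, batch_size=10):
    """Model of the three exception-handling strategies:
    'stop'        - abort the whole task on the first bad record,
    'skip_record' - insert one by one, dropping only bad records,
    'skip_batch'  - roll back the whole transaction (batch) on error."""
    loaded, skipped = [], []
    if strategy == "skip_batch":
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            try:
                converted = [int(v) for v in batch]  # one transaction
            except ValueError:
                skipped.extend(batch)                # whole batch rolled back
            else:
                loaded.extend(converted)
        return loaded, skipped
    for v in records:
        try:
            loaded.append(int(v))
        except ValueError:
            if strategy == "stop":
                raise                                # stop task calculation
            skipped.append(v)                        # skip just this record
    return loaded, skipped

data = [str(i) for i in range(25)]
data[13] = "oops"  # a string among integers, as in the article's example

ok, bad = load(data, "skip_record")
print(len(ok), len(bad))  # 24 1

ok, bad = load(data, "skip_batch", batch_size=10)
print(len(ok), len(bad))  # 15 10
```

With a transaction size of 10, the single bad record drags down its entire batch of 10, which matches the "10 missed records" outcome described above.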
Another way of optimizing the ETL task in Prognoz Platform is batch processing. Based on a client model, the ETL task first reads all source data of a thread into memory and then passes it through the execution thread. With gigabytes of data, this leads to high memory usage on the computer running the ETL task. Batch processing overcomes this problem: the source data is read in small batches and passed through the execution thread in small portions. Batch processing lets the ETL Task handle big data even on a slow x86-based computer with limited RAM. To enable it, select the Task item in the main menu of the ETL Task, then select Properties in the drop-down menu. You can find this option on the Task Options page.
The number of records in one batch is specified in the Batch Size box. To autodetect the batch size, enter '-1' in the Batch Size box; otherwise, adjust the batch size at your discretion. Note that batch processing may make the ETL task take longer, for example when updating records in the repository consumer, so enable it only when it is really needed. Batch processing is ignored when transformations such as Sort or Group are used in the execution thread, since they require the complete data set.
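The idea behind batch processing is simply to stream the source instead of materializing it. A minimal generator sketch (not Prognoz code; the fallback size for '-1' is an assumption made for the example):

```python
def read_source(rows):
    # Stand-in for a data source: yields records one at a time,
    # so the full data set never has to sit in memory.
    yield from rows

def batches(records, batch_size):
    """Group a record stream into fixed-size batches for the
    execution thread. batch_size == -1 mimics the ETL Task's
    autodetected size; here we just assume 1000 for the sketch."""
    if batch_size == -1:
        batch_size = 1000
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly short, batch
        yield batch

sizes = [len(b) for b in batches(read_source(range(2500)), 1000)]
print(sizes)  # [1000, 1000, 500]
```

Each batch flows through downstream steps and is then discarded, which is why peak memory stays bounded by the batch size rather than by the source size.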
In this article, we've learned how to select a suitable consumer for loading data and how to handle errors. There is no single consumer or set of settings that suits every case. However, I hope the guidelines provided in this article will help you choose the way to load data that best fits your particular needs.