Hello from Prognoz!

Filtering strategies

04.07.16 Alexey Selikhin
1358 0
colored dices

Whatever business you’re in, there are always people who use the first set of results without giving a thought about how these results are produced and whether they can be better. Fortunately, there are people who will readily dive into the details of the process to make the most of it. This article will be of interest to people who tend to scrutinize the process that they are dealing with and milk it for all it’s worth.

If you are one of these people, one of us, then let’s dive into the Prognoz Platform!

You might wonder whether working with cubes in the Prognoz Platform is always efficient enough, so that fine-tuning the process for specific data isn’t necessary. In other fields, say health or astronautics, human intervention is still needed depending on the particular situation, since machines just aren’t capable of making a wide range of decisions.

The answer is that our platform provides a variety of capabilities that give you greater flexibility to configure data sources. This time, we’ll talk about one of these capabilities—a filtering strategy.

What is a filtering strategy? It’s a set of parameters that you can use to create queries to retrieve data as you build a cube. Managing these parameters, while taking into account the specifics of the data, speeds up the data retrieval process.

Sounds great! Want to try? Go ahead; settings are before us!

Everything is pretty simple, with nothing in excess. On the Dimensions page of the Cube Edit wizard, you can find only those settings that are really needed. They are divided into three parts:

1

  • Cube dimensions. Here you can select particular dimensions to which you want to apply your filtering strategy.
  • Filtering method. Here you can specify the desired filtering method and its options.
  • Temporary table. Here you can enable or disable the use of temporary tables.

Selecting cube dimensions is clear; while filtering methods and temporary tables require further clarifications.

What are they for and how to select the right options? It’s simpler than it seems.

In the Filtering Options dialog, selecting the right filtering method is most challenging. The selected filtering method defines how a part of query that is responsible for filtering elements by dimensions will be created.

You are offered a choice of methods. Your choice must depend on the particular problems that you solve and how you expect to use data. We’ll use the same input data. Let’s select a filtering method based on the expected data use. Our data set will include a small cube of the Territory dimension with several hundred elements.

2

 

If you know that you’re going to use data for large countries only, then it’s better to select the Filter by IN Condition method, which is offered by default. Since the size of your selection will always be small and probably scattered, an optimal query will look like this:

select a.* from (select a.DATE1,a.DATE2,a.COUNTRY,a.COUNTRYOWNER,a.INDICATOR,a.VALUE from dbo.OBJ20021359DATA a) a where ((a.DATE1 = 1)) and ((a.DATE2 in ({d'2010-01-01'},{d'2011-01-01'},{d'2012-01-01'},{d'2013-01-01'},{d'2014-01-01'},{d'2015-01-01'}))) and ((a.COUNTRY in (1,80, 138, 164))) and ((a.COUNTRYOWNER = 1)) and ((a.INDICATOR = 1)).

Excellent. Of course, this method is less effective as your dictionary gets longer, with small selection size causing the query to grow with all-entailing overheads.

Also, remember that when you use a default value instead of selecting the IN Condition method, the BETWEEN operator will be used, for example, for the Calendar dimension—specifically, if you use the default value to filter elements of different dimensions for building the Calendar dimension using a common primary key (with values for all levels stored in a single table) and to filter elements of daily frequency.

If at any time you plan to use the complete territory dictionary, apply Extract All Records Despite the Selection. This method extracts all records despite the selection and then filters them on the client side. The query will contain no operators associated with filtering of dimension elements, as you can see in this example:

select a.* from (select a.DATE1,a.DATE2,a.COUNTRY,a.COUNTRYOWNER,a.INDICATOR,a.VALUE from dbo.OBJ20021359DATA a) a where ((a.DATE1 = 1)) and ((a.DATE2 in ({d'2010-01-01'},{d'2011-01-01'},{d'2012-01-01'},{d'2013-01-01'},{d'2014-01-01'},{d'2015-01-01'}))) and ((a.COUNTRYOWNER = 1)) and ((a.INDICATOR = 1)).

It works well for dictionaries that are relatively small; for more cumbersome dictionaries, complete sampling makes no sense and requires some kind of a compromise.

So, if we know that, most of the time, the dictionary selection size will be neither minimal nor complete, then it’s better to use either Filter Subject According to the Number of Selected, or Filter Subject According to the Selection Size. In fact, the underlying method is the same. The only difference is in the way of setting a bound to change from a query with the IN operator to a query with the BETWEEN operator. In the first case, as a bound, you specify the exact number of selected elements; in the second case, you specify it as a percentage of the total number of selected elements.

The query is then generated according to well-known rules. When the selection size is smaller than the specified bound, the query is generated using the IN operator. When it is equal to the bound, the BETWEEN operator is used relative to the first and last selected elements. When the selection size exceeds the specified bound, then the BETWEEN operator is used relative to all dimension elements.
Let’s illustrate how to set the desired bound using a number, say 10. When less than 10 elements are selected, you get something like this:

select a.* from (select a.DATE1,a.DATE2,a.COUNTRY,a.COUNTRYOWNER,a.INDICATOR,a.VALUE from dbo.OBJ20021359DATA a) a where ((a.DATE1 = 1)) and ((a.DATE2 in ({d'2010-01-01'},{d'2011-01-01'},{d'2012-01-01'},{d'2013-01-01'},{d'2014-01-01'},{d'2015-01-01'}))) and ((a.COUNTRY in (1,2,3,4,5,6,7))) and ((a.COUNTRYOWNER = 1)) and ((a.INDICATOR = 1)),

For more than 10 elements, you get this instead:

select a.* from (select a.DATE1,a.DATE2,a.COUNTRY,a.COUNTRYOWNER,a.INDICATOR,a.VALUE from dbo.OBJ20021359DATA a) a where ((a.DATE1 = 1)) and ((a.DATE2 in ({d'2010-01-01'},{d'2011-01-01'},{d'2012-01-01'},{d'2013-01-01'},{d'2014-01-01'},{d'2015-01-01'}))) and ((a.COUNTRY between 1 and 16)) and ((a.COUNTRYOWNER = 1)) and ((a.INDICATOR = 1)).

You might ask how to select the right bound. There is no unambiguous answer. You need to run tests, try bounds using different magnitudes, model which selected elements will be used more or less often, and then, based on these grounds, make your final choice.

The last question to answer is whether to use a temporary table or not.

There’s no a straightforward answer here as well. Whether to use a temporary table or not is decided in each particular case. This checkbox is enabled when you have selected filtering methods such as Filter Subject According to the Selection Size, and Filter Subject According to the Number of Selected. By default, this checkbox is unchecked. When checked, it enables you to use a temporary table to transfer your selection. This speeds up the process of the IN operator generation. At the same time, the use of a temporary table might slow down overall performance.

It’s as simple as that! What’s next?

Next, take a bow for a job well done! Thanks to the work that you put in up front, the analysists who come after you will spend less time and effort fine-tuning the data.

Comment
0