|
Optimizing Mondrian Performance
By Sherman Wood and Julian Hyde; last updated November, 2007. Contents
IntroductionAs with any data warehouse project, dealing with volumes is always the make or break issue. Mondrian has its own issues, based on its architecture and goals of being cross platform. Here are some experiences and comments. From the Mondrian developer's mailing list in February, 2005 - an example of unoptimized performance:
Since this time, Mondrian has been extended to use aggregate tables and materialized views, which have a lot of performance benefits that address the above issue. From Julian:
From Sherman:
A generalized tuning process for MondrianThe process for addressing performance of Mondrian is a combination of design, hardware, database and other configuration tuning. For really large cubes, the performance issues are driven more by the hardware, operating system and database tuning than anything Mondrian can do.
Recommendations for database tuningAs part of database tuning process, enable SQL tracing and tail the log file. Run some representative MDX queries and watch which SQL statements take a long time. Tune the database to fix those statements and rerun.
Mondrian currently uses 'count(distinct ...)' queries to determine
the cardinality of dimensions and levels as it starts, and for your
measures that are counts, that is,
Aggregate Tables, Materialized Views and MondrianThe best way to increase the performance of Mondrian is to build a set of aggregate (summary) tables that coexist with the base fact table. These aggregate tables contain pre-aggregated measures build from the fact table. Some databases, particularly Oracle, can automatically create these aggregations through materialized views, which are tables created and synchronized from views. Otherwise, you will have to maintain the aggregation tables through your data warehouse load processes, usually by clearing them and rerunning aggregating INSERTs. Aggregate tables are introduced in the Schema Guide, and described in more detail in their own document, Aggregate Tables. Choosing aggregate tablesIt isn't easy to choose the right aggregate tables. For one thing, there are so many to choose from: even a modest cube with six dimensions each with three levels has 64 = 1296 possible aggregate tables! And aggregate tables interfere with each other. If you add a new aggregate table, Mondrian may use an existing aggregate table less frequently. Missing aggregate tables may not even be the problem. Choosing aggregate tables is part of a wider performance tuning process, where finding the problem is more than half of the battle. The real cause may be a missing index on your fact table, your cache isn't large enough, or (if you're running Oracle) the fact that you forgot to compute statistics. (See recommendations, above.) Performance tuning is an iterative process. The steps are something like this:
AggGen
Optimizing Calculations with the Expression CacheMondrian may have performance issues if your schema makes intensive use of calculations. Mondrian executes calculations very efficiently, so usually the time spent calculating expressions is insignificant compared to the time spent executing SQL, but if you have many layers of calculated members and sets, in particular set-oriented constructs like the Aggregate function, it is possible that many thousands of calculations will be required for each cell. To see whether calculations are causing your performance problem, turn on SQL tracing and measure what proportion of the time is spent executing SQL. If SQL is less than 50% of the time, it is possible that excessive calculations are responsible for the rest. (If the result set is very large, and if you are using JPivot or XML/A, the cost of generating HTML or XML is also worth investigating.) It caches cell values retrieved from the database, but it does not
generally cache the results of calculations. (The sole case where mondrian
caches expression results automatically is for the second argument of the Since calculations are very efficient, this is generally the best policy: it is better for mondrian to use the available memory to cache values retrieved from the database, which are much slower to re-create. The expression cache only caches expression results for the duration of a single statement. The results are not available for other statements. The expression cache also takes into account the evaluation context, and the known dependencies of particular functions and operators. For example, the expression
depends on all dimensions besides [Store] and [Time], because the expression overrides the value of the [Store] and [Time] dimensions inherited from the context, but the implicit evaluation of a cell pulls in all other dimensions. If the expression result has been cached for the contexts ([Store].[USA], [Time].[1997].[Q2], [Gender].[M]), the cache knows that it will return the same value for ([Store].[USA].[CA], [Time].[1997].[Q3], [Gender].[M]); however, ([Store].[USA], [Time].[1997].[Q2], [Gender].[F]) will require a new cache value, because the dependent dimension [Gender] has a different value. However, if your application is very calculation intensive, you can use the
Author: Sherman Wood & Julian Hyde; last updated November, 2007. |