Pentaho 5.0 blends right in!

September 12, 2013

Dear Pentaho friends,

Ever since a number of projects joined forces under the Pentaho umbrella (over 7 years ago) we have been looking for ways to create more synergy across this complete software stack.  That is why today I’m exceptionally happy to be able to announce, not just version 5.0 of Pentaho Data Integration but a new way to integrate Data Integration, Reporting, Analyses, Dashboarding and Data Mining through one single interface called Data Blending, available in Pentaho Business Analytics 5.0 Commercial Edition

Data Blending allows a data integration user to create a transformation capable of delivering data directly to our other Pentaho Business Analytics tools (and even non-Pentaho tools).  Traditionally data is delivered to these tools through a relational database. However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can’t wait until the database tables are updated.  This for example leads to a new kind of big data architecture with many moving parts:

Evolving Big Data Architectures

Evolving Big Data Architectures

From what we can see in use at major deployments with our customers, mixing Big Data, NoSQL and classical RDBS technologies is more the rule than the exception.

So, how did we solve this puzzle?

The main problem we faced early on was that the default language used under the covers, in just about any business intelligence user facing tool, is SQL.  At first glance it seems that the worlds of data integration and SQL are not compatible.  In DI we read from a multitude of data sources, such as databases, spreadsheets, NoSQL and Big Data sources, XML and JSON files, web services and much more.  However, SQL itself is a mini-ETL environment on its own as it selects, filters, counts and aggregates data.  So we figured that it might be easiest if we would translate the SQL used by the various BI tools into Pentaho Data Integration transformations. This way, Pentaho Data Integration is doing what it does best, not directed by manually designed transformations but by SQL.  This is at the heart of the Pentaho Data Blending solution.


The internals of Data Blending

In other words: we made it possible for you to create a virtual “database” with “tables” where the data actually comes from a transformation step.

To ensure that the “automatic” part of the data chain doesn’t become an impossible to figure out “black box”, we made once more good use of existing PDI technologies.  We’re logging all executed queries on the Data Integration server (or Carte server) so you have a full view of all the work being done:

Data Blending Transparency

Data Blending Transparency

In addition to this, the statistics from the queries can be logged and viewed in the operations data mart giving you insights into which data is queried and how often.

We sincerely hope that you like these new powerful options for Pentaho Business Analytics 5.0!



–If you want to learn more about the new features in this 5.0 release, Pentaho is hosting a webinar and demonstration on September 24th – Two options to register:  EMEA & North America time zones.

Matt Casters
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)

Big Kettle News

January 30, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.

The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!


*This blog was originally appeared on matt casters on data integration


Get every new post delivered to your Inbox.

Join 12,434 other followers