Pentaho and DataStax

February 28, 2012

We announced a strategic partnership with DataStax today.

DataStax provides products and services for Apache Cassandra, the popular NoSQL database. We are releasing our first round of Cassandra integration in our next major release, and you can download it today (see below).

Our Cassandra integration includes open source data integration steps to read from and write to Cassandra. You can integrate Cassandra into your data architecture using Pentaho Data Integration/Kettle and avoid creating a Big Silo – all with a nice drag-and-drop graphical UI. Since our tools are integrated, you can create desktop and web-based reports directly on top of Cassandra. You can also use our tools to extract and aggregate data into a datamart for interactive exploration and analysis. We are demoing these capabilities at the Strata conference in Santa Clara this week.


James Dixon

This post originally appeared on James Dixon’s Blog

150,000 installations year-to-date for Pentaho

November 15, 2010

Our most recent figures show that 156,000 copies of Pentaho software have been installed so far this year. These are not download numbers, but counts of installed software that has actually been used. This includes Pentaho servers and some Pentaho client tools. These numbers do not represent only long-term installations, but neither do they represent all of Pentaho’s software distributions or installations. Since these numbers are not absolute, they are best read as indicative.

An analysis by country of these numbers shows interesting results.

The Long Tail

This chart shows the number of new installations year-to-date for each country. Our data shows new Pentaho installations in 176 countries so far this year. That’s out of a total of 229 countries.

This is clearly a classic long tail. In fact, after the first 20 or 30 countries it is difficult to read values from the chart. The second chart uses a log scale instead. The line on this chart is almost perfectly linear, showing that the distribution by country is pretty much logarithmic.
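The log-scale effect is easy to see in a few lines of Python. The numbers below are synthetic – a hypothetical long tail, not Pentaho's real per-country figures – but they show why a distribution that decays steadily across ranks plots as a straight line on a log scale:

```python
import math

# Hypothetical per-country installation counts that decay steadily by rank.
# (Synthetic numbers -- not Pentaho's actual data.)
counts = [40000 * 0.7 ** rank for rank in range(40)]

# On a log scale, steady proportional decay becomes a straight line:
log_counts = [math.log10(c) for c in counts]

# The successive differences of the log values are constant, which is
# exactly what makes the log-scale chart look linear.
diffs = [log_counts[i] - log_counts[i + 1] for i in range(len(log_counts) - 1)]
spread = max(diffs) - min(diffs)

print(round(diffs[0], 4))  # -> 0.1549 (the constant per-rank drop)
print(spread < 1e-9)       # -> True (the "line" really is straight)
```

A real dataset would be noisier, of course, but a near-linear fit on the log chart is the signature of this kind of tail.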

Read more on James Dixon’s Blog

  • Geographic spread of the installations
  • New Installations Per $Billion GDP
  • New Installations Per 100k Labor Force
  • New Installations Per 100k Internet Users
  • About This Analysis

Pentaho, Hadoop, and Data Lakes

October 15, 2010

Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.

As part of the initial research into the Hadoop arena I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:

  • Only a subset of the attributes is examined, so only pre-determined questions can be answered.
  • The data is aggregated, so visibility into the lowest levels is lost.

Based on the requirements above and the problems of the traditional solutions we have created a concept called the Data Lake to describe an optimal solution.

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
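The contrast can be sketched in a few lines of Python. The event records here are hypothetical stand-ins, not any particular Pentaho pipeline: the pre-aggregated datamart answers the one known question, but only the raw "lake" can answer a question nobody anticipated at design time:

```python
# Hypothetical raw events (the "lake"): every attribute, lowest level of detail.
events = [
    {"day": "2010-10-01", "user": "alice", "action": "view",  "ms": 120},
    {"day": "2010-10-01", "user": "bob",   "action": "click", "ms": 45},
    {"day": "2010-10-02", "user": "alice", "action": "click", "ms": 300},
]

# The datamart: pre-aggregated for the known question, "how many events per day?"
datamart = {}
for e in events:
    datamart[e["day"]] = datamart.get(e["day"], 0) + 1

print(datamart)  # -> {'2010-10-01': 2, '2010-10-02': 1}

# A new question arrives later: "which clicks took over 100 ms, and by whom?"
# The aggregation threw those attributes away; only the raw events can answer.
slow_clicks = [(e["user"], e["ms"]) for e in events
               if e["action"] == "click" and e["ms"] > 100]
print(slow_clicks)  # -> [('alice', 300)]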

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

James Dixon
Chief Geek
Pentaho Corporation

Originally posted on James Dixon’s blog

Improving product quality the open source way

July 7, 2010

If we look at the differences between closed and open source software development processes, we can identify aspects that can be generalized and applied to other industries and domains.

Open source development—that combination of transparency, iterative development with early-and-often releases, and open participation—leads to higher quality products. When we’re talking about software, people tend to think of quality in terms of bugs. But this is only part of the story of open development.

Defects can be anywhere within the requirements, design, implementation, or delivery processes, and we need to consider all of these areas to assess the full quality picture.

  • Requirements defects result in a product that does not meet the needs of the market or misses significant opportunities.
  • Design defects result in a product that tries, but fails, to meet the users’ needs.
  • Implementation defects result in a product that has lower customer satisfaction.
  • Delivery defects result in a product that no one hears about or can get hold of.

The earlier these defects arise in the process, and the longer they are unresolved, the more they cost to fix. When you compound defects in requirements, design, and implementation, the result is an expensive mess. (Windows Vista, anyone?)

A closer look at how this works inside the world of software development will yield larger principles to be applied to any project that aspires to use an open development model.

Under the closed model

Sales reps and account reps relay requirements to product managers, who then describe the required features to software engineers, who then design and implement the features and pass them to quality engineers, who try to test the features and report bugs that need fixing. After all this has happened, customers eventually receive the software. The lack of transparency means defects in the requirements and design don’t get spotted until after the product has been implemented and delivered. Another major problem is that, typically, the quality engineers don’t have any personal use for the software, so it is hard for them to imagine the different use cases that real users will have.

The final product suffers from the lack of connection between the software engineers and the software users.

Under the open model

A transparent requirements process includes consumers adding their own requirements and perhaps open voting to determine the most popular features. An open design process means consumers can ask questions about the design to validate it against their use case. Early-and-often releases during implementation mean that consumers can help find defects, which can be fixed early on. Fixing these defects during early development means features built later are not layered upon resolved defects from the earlier development.

Moving beyond software

So how do we apply these open principles outside of the software industry? Following are some good examples (and one bad one).

Open requirements

Some companies manage to meet unanticipated needs by enabling consumers to create new products for them to sell.

Amazon: Amazon lets independent authors sell their own books through its service. My mother wrote a book about British birth and death certificates. She uses a print shop in her village, and through Amazon UK she sells to a global market. Amazon sends her the customers’ addresses to mail her books to, and a check to cash, with Amazon’s commission already deducted.

Cafe Press: Create a cool slogan or logo, then upload it to Cafe Press and sell it on a wide array of items. The designer needs almost no investment other than time and talent. Cafe Press gets a huge product set – over 250 million unique products – and a portion of each sale.

There are services with similar models for bands, musicians, photographers…

Open design

Lego: Using a free CAD design tool, Digital Designer, Lego customers can design new models, then order the kit from Lego. The creator can also upload the design to Lego’s Design by Me, so that other people can build it. The creator gets satisfaction and kudos, while Lego gets all the money. This builds community and revenue.

Nike: Nike ID lets you customize your own sports shoes. By allowing customization of the product’s appearance, consumers can create a unique-looking shoe that very few, if any, other people have. The Air Jordan basketball shoe has so many colors and customizable parts that even if every person on Earth bought five pairs, every pair could still be unique. Nike could take this further by letting people name their designs and vote for the best.

Local Motors: An open car company, Local Motors holds competitions for the concept and the design of their cars with open voting. Then they hold more competitions for the interior design, parts designs, exterior skins, and accessories. Then they put the vehicle into production. Their first is the Rally Fighter. They also encourage owners to participate in the manufacturing of their own cars. Their vision is to have small manufacturing facilities in most cities, hence their name. The effort put in by the contributors is stunning. The designs are awesome and it’s a highly supportive community.

Open delivery

Transparency and participation can also be used to help spread a message or engage consumers.

T-Mobile, UK: T-Mobile UK started with a successful advert in which they staged a flash-mob dance in London’s Liverpool Street Station, an idea they must have borrowed from a Belgian radio station. Then they broadcast an open invitation to be part of their next event. Over 13,000 people showed up to find that the event was mass karaoke. The result is really quite touching if you watch it all. It’s not often you can say that about a commercial.

Mountain Dew: Mountain Dew’s Dewmocracy campaign was an open voting system for its next flavor. On their website you can see how the voting went, down to the county level.

Kraft, Australia: An example of how to do it badly. When coming out with a new variant of their popular Vegemite spread, they had a naming competition. Fifty thousand people submitted entries. Unfortunately the winner was picked by a closed panel of “experts.” They selected “iSnack 2.0” as the name, thinking it was edgy and cool. Public reaction was swift and very uncool. Within days Kraft announced they were revoking the name and opened a new poll to allow the public to choose the new name. The selected name was “Vegemite Cheesybite.”

Both the T-Mobile and Kraft campaigns involved large numbers of people participating of their own free will. The difference is that everyone participating in the T-Mobile event was part of the final product; if only 10 people showed up the result would have been very lame. In the Kraft case the closed selection panel proved to be the flawed element.

In all of these examples, there are similarities and differences. Some cases require a very flexible manufacturing process, while in others the inventory is electronic. Sometimes the individual contributors do their own manufacturing. In some cases the participants are highly skilled; while for others, little or no skills are required. But in all these cases (well, except the unfortunate Aussie Kraft example) the companies provide more choices, better products, or a better message by enabling open participation of individuals or communities.

Six reasons why Pentaho’s support of Apache Hadoop is great news for ‘big data’

May 19, 2010

Earlier today Pentaho announced support for Apache Hadoop – read about it here.

There are many reasons we are doing this:

  1. Hadoop lacks graphical design tools – Pentaho provides pluggable design tools.
  2. Hadoop is Java – Pentaho’s technologies are Java.
  3. Hadoop needs embedded ETL – Pentaho Data Integration is easy to embed.
  4. Pentaho’s open source model enables us to provide technology with great price/performance.
  5. Hadoop lacks visualization tools – Pentaho has those.
  6. Pentaho provides a full suite of ETL, Reporting, Dashboards, Slice ‘n’ Dice Analysis, and Predictive Analytics/Machine Learning.

Taken in combination, these points make Pentaho the only technology that satisfies all of them.

You can see a few of the upcoming integration points in the demo video (above). The ones shown in the video are only a few of the many integration points we are going to deliver.

Most recently I’ve been working on integrating the Pentaho suite with the Hive database. This enables desktop and web-based reporting, integration with the Pentaho BI platform components, and integration with Pentaho Data Integration. Between these use cases, hundreds of different components and transformation steps can be combined in thousands of different ways with Hive data. I had to make some modifications to the Hive JDBC driver, and we’ll be working with the Hive community to get these changes contributed. These changes are the minimal ones required to get some of the Pentaho technologies working with Hive. Currently the changes are in a local branch of the Hive codebase. More specifically, they are a ‘Short-term Rapid-Iteration Minimal Patch’ fork – a SHRIMP Fork.

Technically, I think the most interesting Hive-related feature so far is the ability to call an ETL process within a SQL statement (as a Hive UDF). This enables all kinds of complex processing and data manipulation within a Hive SQL statement.
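Running the real thing requires Hive and Kettle, but the general idea – calling out to external processing from inside a SQL statement via a user-defined function – can be sketched with Python's built-in sqlite3 module, which supports UDFs in the same spirit. The `normalize` function here is a hypothetical stand-in for an ETL transformation, not Pentaho's actual integration:

```python
import sqlite3

# A stand-in "ETL step": any external processing we want to invoke per row.
def normalize(name):
    return name.strip().lower()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("  Alice ",), ("BOB",)])

# Register the function so it is callable from inside a SQL statement,
# analogous to wrapping a transformation as a Hive UDF.
conn.create_function("normalize", 1, normalize)

rows = conn.execute("SELECT normalize(name) FROM customers").fetchall()
print(rows)  # -> [('alice',), ('bob',)]
```

The point is the same in both cases: the query engine handles the scan, while arbitrary transformation logic runs row by row inside the statement itself.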

There are many more Hadoop-related ETL and BI features and tools to come from Pentaho.  It’s gonna be a big summer.

James Dixon
Chief Geek
Pentaho Corporation

Learn more - watch the demo

