7 Things to Ask When Looking for Self-Service Analytics for Big Data Stores

December 18, 2012

Self-service analytics, the ability for non-technical business users to intuitively perform ad hoc reporting and analysis on business data residing in corporate databases and spreadsheets, has been a staple of BI and business analytics tools for years.  However, these traditional tools simply don’t work against the new breed of “big data” platforms such as Hadoop and NoSQL databases, which have rejected the traditional relational SQL interface in exchange for massive scalability and the flexibility to store unstructured data.

Meanwhile, some new specialized but limited big data analytics tools have been released to the market that are designed specifically to work with these big data platforms.

So what are the questions you should be asking when looking to provide your data analysts and business users easy self-service analytics for big data platforms such as Hadoop, MongoDB, Cassandra or HBase?

1. Do you have more than one kind of big data store, for example Hadoop as well as HBase, MongoDB or Cassandra?

A:   Chances are you do, or will in the future, to take advantage of the relative strengths of these big data platforms. Consider the fact that most new “big data analytics” tools are capable of self-service analytics against a single big data platform, most often just Hadoop.

2. Would you prefer to use the same tool for big data stores in addition to your traditional relational data stores?

A:   Most new big data analytics tools can only access the big data platform they were designed for, and force you to load data from traditional stores into the big data store. For example, they force you to move data from a low-latency relational database into high-latency (but of course massively scalable) Hadoop, or hard-to-query Cassandra or MongoDB. This makes no sense.

3. Are you ok waiting minutes or even hours to access your big data?

A:   Many traditional BI tools have taken the lowest common denominator type of approach to integrating with big data platforms, for example using Hive with Hadoop. These “batch oriented” interfaces make it impossible to perform speed-of-thought analysis – it’s likely you’ve forgotten the question you asked by the time the data comes back.

4. Are you ok using a spreadsheet-like interface to access and analyze your data?

A: This, plus maybe basic dashboards, is all that most of the new breed of big data analytics tools offer.  Most business users are far more comfortable with an intuitive drag & drop graphical interface for interacting with and visualizing their data across different dimensions and measures. For example, many users are stumped when it comes to typing in arcane spreadsheet formulae to work with their data.

5. Do you need complete BI capabilities, including reporting, interactive visualization, and predictive analytics?

A: Most new big data analytics tools offer just a basic subset of these capabilities – for example a spreadsheet-like interface and some lightweight dashboard visualizations.  They don’t let you build highly formatted reports, drag & drop data items in a graphical data visualization interface, or make predictions based on prior history.

6. Do you need to enrich your big data with data from outside of the big data platform?

A: Most big data platforms simply leave it up to you to do this manually, or at best force you to inefficiently load copies of the enrichment data, for example customer demographic attributes such as age, income and location, into your big data platform.

7. Is the big data you want to analyze bigger than the amount of memory you have available?

A: Some new big data analytics tools resolve the big data access latency issue by copying data into an in-memory data store. This works well when the data volumes are small, but blows up when your data is bigger than your memory. Does the big data analytics tool provide alternatives to in-memory, such as switching to a more scalable and high-performance MPP or columnar analytic database as the speed-of-thought data cache?
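To make that last question concrete, here is a minimal sketch of the decision involved: compare the size of the data with the memory you can spare, and fall back to an analytic database when it doesn't fit. The function name, the tier labels and the 50% headroom figure are assumptions made purely for illustration, not a description of how any particular product works.

```python
def choose_cache_tier(data_bytes: int, memory_bytes: int, headroom: float = 0.5) -> str:
    """Pick a cache tier for interactive analysis.

    headroom is the (assumed) fraction of memory the cache is allowed to use.
    """
    if data_bytes < 0 or memory_bytes <= 0:
        raise ValueError("data_bytes must be >= 0 and memory_bytes must be > 0")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")

    usable_bytes = memory_bytes * headroom
    if data_bytes <= usable_bytes:
        return "in-memory"
    # Data does not fit: use an MPP or columnar analytic database instead.
    return "analytic-database"


GB = 1024 ** 3
print(choose_cache_tier(8 * GB, 64 * GB))    # small data set, fits in memory
print(choose_cache_tier(500 * GB, 64 * GB))  # larger than memory, falls back
```

The point is simply that the fallback has to exist: a tool that only offers the first branch stops working the day your data outgrows your memory.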

Here at Pentaho, we are striving to provide the industry’s most mature and comprehensive big data analytics product, and we think you’ll like our answers to every one of the questions listed above.

Let me know what you think. Leave a comment below or @ian_fyfe

Ian Fyfe
Big Data Product Marketing


Impala – A New Era for BI on Hadoop

November 30, 2012

With the recent announcement of Impala, also known as Cloudera Enterprise RTQ (Real Time Query), I expect the interest in and adoption of Hadoop to go from merely intense to crazy.  We applaud Cloudera’s investment in creating Impala, as it is a huge step forward in making Hadoop accessible using existing BI tools.

What is Impala?  Simply put, it enables all of the SQL-based BI and business analytics tools that have been built over the past couple of decades to now work directly on top of Hadoop, providing interactive response times not previously attainable with Hadoop, and many times faster than Hive, the existing SQL-like alternative. And Impala provides pretty complete SQL support, including join and aggregate functions – must-have functions for analytics.
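For readers who want to picture the kind of join-and-aggregate query a BI tool generates, here is a small sketch. It runs against Python's built-in sqlite3 module purely as a stand-in so that it is self-contained; it does not connect to Impala, and the table names, column names and sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 25.0), (13, 1, 75.0);
""")

# Join orders to customers, then aggregate per region.
query = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
for region, order_count, revenue in rows:
    print(region, order_count, revenue)
conn.close()
```

Queries of this shape, a join followed by a GROUP BY with aggregates, sit behind most dashboards and ad hoc reports, which is why support for them matters so much for analytics.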

For enterprises this analytic query speed and expressiveness is huge – it means they are now much less likely to need to extract data out of Hadoop and load it into a data mart or warehouse for interactive visualization.  Instead they can use their favorite business analytics tool directly against Hadoop. But of course only Pentaho provides the integrated end-to-end data integration and business analytics capability for both ingesting and processing data inside of Hadoop, as well as interactively visualizing and analyzing Hadoop data.

Over the past few months Cloudera and Pentaho have been partnering closely at all levels including marketing, sales and engineering.  We are proud of the role we played in assisting Cloudera with validating and testing Impala against realistic BI workloads and use cases.  Based on the extremely strong interest we’ve seen, as evidenced by the lines at our booth at the recent Strata big data conference in New York City, the combination of Pentaho’s visual development and interactive visualization for Hadoop with the break-through performance of Cloudera Impala is very compelling for a huge number of enterprises.

– Ian Fyfe, Chief Technology Evangelist, Pentaho


Is your big data protected?

October 17, 2012

This morning I participated in a panel discussing the topic of big data privacy at the San Francisco ISACA Fall Conference. The Information Systems Audit and Control Association (ISACA) is a professional association of individuals interested in information systems audit, control and security, with over 50,000 members in 141 countries.  Other representatives on the panel were from eBay, PricewaterhouseCoopers, and CipherCloud.

Researching this topic and today’s discussion raised some interesting questions about the intersection of personal privacy and big data in this new age where it is becoming technically and economically viable to store and analyze enormous data volumes, such as every click on a website and every commerce transaction. All this big data, from both internal systems and externally sourced enrichment data, can now be streamed into giant “data lakes” using open source based big data management platforms such as Hadoop and NoSQL databases. Using visual big data development tools and end-user visualization technology such as Pentaho makes it easier and easier for organizations to ingest, prepare and analyze this data, resulting in previously unattainable insights that can be used to optimize revenue streams and reduce costs.

However, how can we ensure this “big data” is protected and never used in ways that intrude on individual privacy? Mature and integrated big data analytics platforms such as Pentaho can enforce data access controls as well as audit data usage, but today there is no industry standard for tagging how specific data elements may be used on a permanent basis. This has the potential to lead to risks down the road that data collected with individual personal consent may ultimately be used for purposes beyond the scope of the original consent policy. Is it time for government and industry standards bodies to tackle this issue with new technical standards that enforce data usage and aging policies on an ongoing basis? I have in mind something like an “XBRL” for data privacy, a standard taxonomy and semantics that enforces data usage policies regardless of source and platform.
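To give a feel for what such tagging could look like, here is a purely hypothetical sketch: each data element carries a policy listing the purposes it was consented for and a date after which it may no longer be used. No such standard exists today, so the class, field and purpose names below are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical usage tag attached to a data element."""
    allowed_purposes: frozenset  # purposes covered by the original consent
    expires_on: date             # last day the data may be used (aging policy)


def may_use(policy: UsagePolicy, purpose: str, today: date) -> bool:
    """Allow use only for a consented purpose and only before expiry."""
    return purpose in policy.allowed_purposes and today <= policy.expires_on


email_policy = UsagePolicy(frozenset({"order-confirmation"}), date(2014, 12, 31))

print(may_use(email_policy, "order-confirmation", date(2013, 1, 15)))  # consented use
print(may_use(email_policy, "marketing", date(2013, 1, 15)))           # outside consent
print(may_use(email_policy, "order-confirmation", date(2015, 1, 1)))   # past expiry
```

The hard part, of course, is not the check itself but agreeing on a shared vocabulary of purposes and making the tag travel with the data across sources and platforms.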

Let me know what you think. Leave a comment below or @ian_fyfe

Ian Fyfe
Chief Technology Evangelist

How fast is lightning fast?

February 15, 2011

A huge congratulations to our partners at Ingres, who today announced that their lightning fast database VectorWise has set a new record for the Transaction Processing Performance Council’s TPC-H benchmark at scale factor 100. Not only did VectorWise set a new standard, but it blew the previous record holder out of the water, delivering 340% of the previous record.

Equally outstanding about this news is the fact that VectorWise has not only changed the game in terms of performance, but the database also comes in at a fraction of the price of its competitors. Forward-thinking innovation, high performance, and low cost… sound familiar? It should.

What does this mean to Pentaho users?

Pentaho and Ingres established a partnership last October, with the goal of combining enterprise-class business intelligence with the speed and performance of the fastest analytical database on the market. With over 250,000 QphH (Queries per hour) for 100 GB of data, VectorWise is the epitome of agility at the database level. This means lightning fast query response times, more iterative cycles, and in essence, even more agile business intelligence.

For more

Thoughts on last week’s Strata big data conference

February 8, 2011

Last week I attended O’Reilly’s Strata Conference in Santa Clara, California, where Pentaho was an exhibitor. I gave a 5-minute lightning talk during the preceding Big Data Camp “un-conference” on the topic “The importance of the hybrid data model for Hadoop driven analytics,” focusing on the importance of combining big data analytic results with the data elements already in a firm’s existing systems to give business units the answers to questions that were previously not possible or economic to answer (something that of course Pentaho now makes possible). I also sat down for an interview with Mac Slocum, Online Managing Editor at O’Reilly. You can see the video below, where we discuss what kinds of businesses can benefit from big data technologies such as Hadoop, and what the tipping point is for adopting big data technologies.

The high quality of attendees and activity at this sell-out conference, I think, further confirms that although development work on solutions for big data has been happening for a few years, this area is undergoing a quantum leap in adoption at businesses both large and small. Simply put, this technology allows them to glean “information” from enormous quantities of often unstructured or semi-structured data, something that in the past was simply not possible, or was eye-wateringly expensive to achieve using conventional relational database technologies.

I found that the maturity of “Big Data” understanding among attendees was quite varied. Questions spanned the entire spectrum, from a few people asking things like “What is Hadoop?” to many along the lines of “Exactly how does Pentaho integrate with Hadoop’s MapReduce framework, HDFS, and Hive?” Some attendees were clearly still in the discovery and learning phase, but many were confidently moving forward with the idea of leveraging big data, and were looking for solutions that make it easier to work with big data technologies such as Hadoop to deliver new information and insights to their businesses. In fact, it is clear that a new type of database professional, the data scientist, is rapidly becoming mainstream. This person combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.

Ian Fyfe
Chief Technology Evangelist
Pentaho Corporation

Here are some in-action photos of our booth at the Strata Conference

Former Jaspersoft exec joins Pentaho

January 13, 2011

I am very excited to welcome Ian Fyfe to the Pentaho team as our Chief Technology Evangelist. Ian recently joined us after spending six years at Jaspersoft as their senior director of product management and product marketing. However, I know him best from working with him several years ago at an executive information systems start-up that he co-founded called Intelligent Office Company (IOC, Inc). Ian’s knowledge of the industry, proven leadership skills, and existing relationships with the executive team are a home run for Pentaho. I like to welcome and introduce all new team members on the blog by asking the same question, “What brought you to Pentaho?” Welcome Ian!

Guest Blogger: Ian Fyfe

There are a number of things that brought me to Pentaho, including great people and great products, but perhaps one of the most compelling was the opportunity to work with the founding principals of the company again (Richard Daley-CEO, James Dixon-CTO, Marc Batchelor- Chief Engineer, and Doug Moran-VP Community), having worked with them several years ago at an earlier software startup.  Coming to Pentaho in some ways brings me full-circle, back to working with this group of extremely talented, innovative, and motivated individuals and helping take the extraordinary company they have built to the next level through a continuation of its high growth in customers, community, and capabilities. Even better, Pentaho has emerged from the swamps of Florida to establish their sales headquarters in downtown San Francisco, the northern anchor of California’s Silicon Valley, and my longtime hometown.

Through my long career in Business Intelligence and related technologies, I’ve been fortunate to work at a number of venture-backed startups such as Jaspersoft and Epiphany, as well as larger, more mature enterprise software businesses such as Business Objects, Informix, and PeopleSoft.  I’ve worked closely with sales, marketing, and engineering, and have become fluent at real-time, bi-directional “translation” between these constituencies, enabling me to bridge the gap between understanding market requirements and customer problems, deriving the products and features to meet those requirements, and marketing and selling the resulting products.  From these experiences I’ve learned well that the ingredients critical to success are a modern and well architected product, smart and motivated people, and a high growth or rapidly changing market with unmet requirements.  Pentaho truly has it all, and sits at the confluence of the BI market, which continues year after year to be at or near the top of the list of CIO spending priorities, and the spectacular growth and mainstream acceptance of commercial open source technology.

I’m excited by my new position at Pentaho as the Chief Technology Evangelist, which builds solidly on my prior experiences, while also introducing the welcome challenge of a new role.  As one of Pentaho’s primary spokespeople I’m looking forward to communicating and interacting with you through events such as talks, webinars, articles, demonstrations, briefings, and social media.


Ian Fyfe
Chief Product Evangelist
Pentaho Corporation

