Pentaho Labs has been busy in the lab, and we are excited to reveal its latest innovation: Pentaho integration for Apache Spark.
James Dixon, Pentaho CTO and Lab Scientist, shares his thoughts on Apache Spark on his blog: Pentaho Labs Apache Spark Integration. Overall he thinks Spark has promise, but it needs some work. Read more about:
• What problems he and Pentaho think Spark solves
• Some promising use cases
• What Pentaho supports today
Today at Pentaho we announced our first official support for Spark. The integration we are launching enables Spark jobs to be orchestrated using Pentaho Data Integration, so that Spark can be coordinated with the rest of your data architecture. The original prototypes for a number of different Spark integrations were built in Pentaho Labs, where we view Spark as an evolving technology. The fact that Spark is an evolving technology is important.
Let’s look at a different evolving technology that might be more familiar – Hadoop. Yahoo created Hadoop for a single purpose: indexing the Internet. To accomplish this goal, Yahoo needed a distributed file system (HDFS) and a parallel execution engine to build the index (MapReduce). The concept of MapReduce was taken from Google’s white paper titled MapReduce: Simplified Data Processing on Large Clusters. In that paper, Dean and Ghemawat describe examples of problems…
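To make the MapReduce idea concrete, the classic illustration is word counting: the map phase emits a (word, 1) pair for every word it sees, and the reduce phase sums the counts for each word. Below is a minimal single-process sketch of those two phases in Python – purely illustrative, with no distribution or shuffle machinery, and the function names are our own, not from any framework:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    """Reduce step: group pairs by word and sum the counts.

    In a real cluster, the framework's shuffle would do the grouping
    across machines; here a dict stands in for it."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["spark and hadoop", "hadoop indexes the web"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)  # "hadoop" appears twice across the documents
```

In a real Hadoop or Spark job, the map and reduce functions run in parallel across the cluster, with the framework handling data partitioning and fault tolerance – that orchestration is exactly what the engines provide on top of this simple programming model.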