Cloudera Impala: Open-Source Real-Time Queries on Hadoop

News from Strata Conference + Hadoop World, the big data technology conference in New York: Cloudera has released the Impala 1.0 beta, an open-source real-time query project. Cloudera says it is 3 to 90 times faster than Hive SQL queries, which run on MapReduce (for details, see the article "How much faster are Impala queries than Hive ones, really?"), and that it is more flexible and easier to use. The project is named after the impala, an antelope found mainly in East Africa.
At the same time, the project will be added to the CDH distribution under the name Cloudera Enterprise RTQ (Real-Time Query). A version ready for production deployment is expected in the first quarter of 2013. According to reports from ComputerWorld and MarketWatch, however, Capgemini Financial Services, Karmasphere, MicroStrategy, Pentaho, QlikView, and Tableau have already spent several months testing Impala with their actual products.
As is well known, Hadoop MapReduce, HBase, and HDFS were inspired by three Google papers, on MapReduce, BigTable, and GFS. Google's infrastructure has gone through a new wave of change in recent years, and the media now speak of the "troika" of the post-Hadoop era: Caffeine, Pregel, and Dremel. This grouping mixes up generations, though, and is not very rigorous.
Pregel is a graph processing system, said to handle the roughly 20% of data processing tasks that fall outside MapReduce; it is not a successor to any of the three earlier papers. Grzegorz Malewicz, the project's founder, came to Beijing last year as a keynote speaker at Hadoop in China, and joined Facebook this year. A few days ago I asked him over GTalk how things were going, and he said he is working on an open-source version of Pregel. In fact, Caffeine is to some extent an evolution of MapReduce, and Spanner, the hit of this year's OSDI, can be regarded as an evolution of BigTable, while Dremel is something new.
In any case, when something good appears, the open-source community soon follows: the Apache Drill project, modeled on Dremel, has been under way for some time. Cloudera's official blog, which calls Impala "a revolutionary technology for every Hadoop user," also explicitly acknowledges that Impala was inspired by Dremel. In other words, Impala no longer relies on slow Hive+MapReduce batch processing. Instead it uses a distributed query engine similar to those in commercial parallel relational databases, made up of three parts: Query Planner, Query Coordinator, and Query Exec Engine. With it, SELECT, JOIN, and aggregate functions can query data directly from HDFS or HBase, which greatly reduces latency. Its architecture is shown below.

The architecture of Impala (from ZDNet)
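
The three-part split of planner, coordinator, and exec engine can be illustrated with a toy scatter-gather sketch in Python. This is not Impala's code: the function names (plan_query, exec_fragment, coordinate), the in-memory partitions standing in for data stored on separate nodes, and the sample rows are all assumptions made for illustration. The planner turns a grouped SUM into one fragment per partition, each executor computes a partial aggregate over its local rows, and the coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows stored on one node.
PARTITIONS = [
    [("us", 10), ("de", 5), ("us", 7)],
    [("de", 1), ("fr", 4)],
    [("us", 3), ("fr", 6), ("fr", 2)],
]


def plan_query(group_col, sum_col):
    """Planner: describe the work per node (one fragment per partition)."""
    return [
        {"partition": i, "group_col": group_col, "sum_col": sum_col}
        for i in range(len(PARTITIONS))
    ]


def exec_fragment(fragment):
    """Exec engine: compute a partial SUM ... GROUP BY over local rows only."""
    partial = Counter()
    for row in PARTITIONS[fragment["partition"]]:
        partial[row[fragment["group_col"]]] += row[fragment["sum_col"]]
    return partial


def coordinate(fragments):
    """Coordinator: run the fragments in parallel and merge their partials."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(exec_fragment, fragments):
            total.update(partial)  # Counter.update adds values per key
    return dict(total)


# Roughly: SELECT country, SUM(amount) FROM t GROUP BY country
result = coordinate(plan_query(group_col=0, sum_col=1))
print(result)
```

Only the small partial results travel to the coordinator, not the raw rows, and no intermediate job has to be written to disk between stages. That is the general reason this style of engine can answer faster than a chain of batch jobs.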
Impala uses the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, so CDH users get a unified platform for batch and real-time queries. The file formats currently supported are text files and SequenceFiles (which can be compressed with Snappy, GZIP, or BZIP; Snappy gives the best performance). Other formats, such as Avro, RCFile, LZO text, and Doug Cutting's Trevni, will be supported in the production version.
The blog post also compares Impala with Dremel. The paper says:
Dremel achieves interactive response times on the data by using two techniques: a new columnar storage format for nested relational data, and distributed, scalable aggregation algorithms that compute query results in parallel on thousands of machines.
The latter is borrowed from parallel relational databases. Whereas the Dremel described in the 2010 paper can only handle single-table queries, Impala already supports full JOIN operations. Moreover, besides the Trevni columnar storage format, Impala also supports other widely used formats. In other words:
Impala+Trevni matches the query performance that Dremel reports in the paper, and goes beyond it in SQL functionality.
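
The first of the two techniques, columnar storage, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Trevni or Dremel format: the records are flat (the real formats also encode nested and repeated fields), and the field names and helper functions (to_columns, average) are hypothetical.

```python
# Hypothetical flat records. Dremel's real format also encodes nested and
# repeated fields, which this sketch deliberately leaves out.
ROWS = [
    {"url": "/a", "country": "us", "latency_ms": 120},
    {"url": "/b", "country": "de", "latency_ms": 80},
    {"url": "/a", "country": "fr", "latency_ms": 100},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column.

    Assumes every record has the same fields.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def average(columns, name):
    """Scan a single column; the other columns are never read."""
    values = columns[name]
    return sum(values) / len(values)


COLUMNS = to_columns(ROWS)
avg_latency = average(COLUMNS, "latency_ms")
print(avg_latency)
```

A query such as an average over one field reads only that field's values. On disk this means less I/O than scanning whole rows, and values of a single type stored together tend to compress well.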
The article also emphasizes that Impala will not replace traditional data warehouses or MapReduce+Hive. Data warehouses remain better suited to complex analysis over a limited set of structured data, and MapReduce still has a role in long-running data transformation workloads.
Interestingly, one of the authors of the official blog post is Impala architect Marcel Kornacker, who before joining Cloudera was a lead developer of the query engine for Google's F1. The task of the F1 project was to move the storage for AdWords from MySQL to Spanner.

Main resources for Impala
The source code download:
Mailing list: mailto:

This article is from the ChinaUnix news channel.

Started by Alina at December 11, 2016 - 12:43 AM

Understanding of the,

Posted by Ferdinand at December 25, 2016 - 12:51 AM

Development is too slow and the range of applications is narrow. With competing products of the same kind on the market, it lacks competitiveness.

Posted by Silvester at December 28, 2016 - 1:14 AM