Big Data Technology, Analysis and Case Study 

magnify
magnify
Home 101 Series 101 Series #2 – SORTING in Hive Query
formats

101 Series #2 – SORTING in Hive Query

As you already know the basic concepts of Hive from 101 Series #1, you should know by now that the way Hive query works are fundamentally different from query in RDBMS. Each query is actually transformed to one or many MapReduce jobs behind the scene then submit to the Hadoop cluster; therefore, the sorting behavior is certainly different.

Hive introduce “SORT BY” clause. The syntax and usage are similar to “ORDER BY” clause; however, the “ORDER BY” will pass all the results to a single reducer to do the query result sorting. This may take a long time for a very large resultset; where “SORT BY” ordering data only within each reducer. This may raise an issue for correct result if multiple reducers are involved.

This is where “DISTRIBUTE BY” clause comes to rescue. It makes the generated MR job send the same key to the same reducer to make the sorting result correct. Consider the following example.

Original:
SELECT key, column-1, column2
FROM tableA
ORDER BY key ASC, column-1 DESC;

Modified:
SELECT key, column-1, column2
FROM tableA
DISTRIBUTE BY key
SORT BY key ASC, column-1 DESC;

 
 Share on Facebook Share on Twitter Share on Reddit Share on LinkedIn
No Comments  comments