Breakdown of Heirarchial queries and building these using different techniques
Source : https://www.linkedin.com/pulse/step-by-step-guide-creating-sql-hierarchical-queries-bibhas-mitra

As you can see, the results from the CTE query are exactly the same as the combined results of the Anchor and Recursive queries above.
1. Introduction
Hierarchical query is a type of SQL query that is commonly leveraged to produce meaningful results from hierarchical data. Hierarchical data is defined as a set of data items that are related to each other by hierarchical relationships. Hierarchical relationships exist where one item of data is the parent of another item. The real-life examples of the hierarchical data that are commonly stored in databases include the following:
- Employee hierarchy (employee-manager relationship)
- Organizational hierarchy
- Graph of links between web pages
- Connected social networking graph
- A set of tasks in a project
- File system
Unless your database is a hierarchical database like IBM Information Management System (IMS), you will need to write hierarchical SQL to retrieve relational data in hierarchical format that establishes parent-child relationships between entities for better end-user understanding.
There are three types of hierarchical SQL query syntaxes that could be leveraged on different commercially-used databases:
- CONNECT BY: widely available on Oracle database SQL
- Common Table Expression (CTE): commonly used with Microsoft SQL Server
- ANSI SQL
Now let us explore the features and examples for each of the types.
2. CONNECT BY
This method is supported by Oracle, EnterpriseDB (PostgreSQL), CUBRID, IBM Informix and DB2. The standard syntax for the CONNECT BY query is like the one below. As you can see, SELECT, FROM and CONNECT BY are the only mandatory clauses in the statement.
SELECT select_list
FROM table_expression
[ WHERE … ] [ START WITH start_expression ] CONNECT BY [NOCYCLE] { PRIOR child_expr = parent_expr | parent_expr = PRIOR child_expr }
[ ORDER SIBLINGS BY column1 [ ASC | DESC ] [, column2 [ ASC | DESC ] ] …
[ GROUP BY … ] [ HAVING … ]
FROM table_expression
[ WHERE … ] [ START WITH start_expression ] CONNECT BY [NOCYCLE] { PRIOR child_expr = parent_expr | parent_expr = PRIOR child_expr }
[ ORDER SIBLINGS BY column1 [ ASC | DESC ] [, column2 [ ASC | DESC ] ] …
[ GROUP BY … ] [ HAVING … ]
There are a few important points to note about the syntax:
- START WITH specifies the root or parent row(s) of the hierarchy.
- CONNECT BY specifies the relationship between the parent rows and child rows of the hierarchy.
- The NOCYCLE parameter instructs the database to return rows from a query even if a cyclic relation (CONNECT BY LOOP) exists in the data. You can use this parameter along with the CONNECT_BY_ISCYCLE pseudo column to see which rows contain the loop.
- In a hierarchical query, one expression in condition must be qualified with the PRIOR operator to refer to the parent row. For example,
… PRIOR expr = expr
or
… expr = PRIOR expr
or
… expr = PRIOR expr
2.1 CONNECT BY Working Mechanism
A database evaluates hierarchical queries in the following order:
- A join, if present, is evaluated first, whether the join is specified in the FROM clause or with WHERE clause predicates
- The CONNECT BY condition is evaluated
- Any remaining WHERE clause predicates are evaluated
The database engine then uses the information from these evaluations to form the hierarchy using the following steps (referred from Oracle 10g documentation):
- It selects the root row(s) of the hierarchy, meaning those rows that satisfy the START WITH condition.
- It selects the child rows of each root row. Each child row must satisfy the condition of the CONNECT BY condition with respect to one of the root rows.
- It selects successive generations of child rows. The database engine first selects the children of the rows returned in step 2, and then the children of those children, and so on. The database always selects children by evaluating the CONNECT BY condition with respect to a current parent row.
- If the query contains a WHERE clause without a join, then the database eliminates all rows from the hierarchy that do not satisfy the condition of the WHERE clause. The database evaluates this condition for each row individually, rather than removing all the children of a row that does not satisfy the condition.
- The database returns the rows in the order shown in Figure 1 below. In the diagram, children appear below their parents.
Figure 1: Hierarchical Query Rows Return Order (from Oracle 10g documentation)

CONNECT BY format of hierarchical query has operators (especially on Oracle), pseudo columns and functions to help retrieve and present hierarchical data in a user-friendly format:
- LEVEL: The position in the hierarchy of the current row in relation to the root node.
- CONNECT_BY_ROOT: Returns the root node(s) associated with the current row.
- SYS_CONNECT_BY_PATH: Returns a delimited breadcrumb from root to the current row.
- CONNECT_BY_ISLEAF: Indicates if the current row is a leaf node.
- ORDER SIBLINGS BY: Applies an order to siblings, without altering the basic hierarchical structure of the data returned by the query.
2.2 CONNECT BY & LEVEL Example
To find the children of a parent row, the database evaluates the PRIOR expression of the CONNECT BY condition for the parent row and the other expression for each row in the table. Rows for which the condition is true are the children of the parent. The CONNECT BY condition can contain other conditions to further filter the rows selected by the query, however, it cannot contain a subquery.
We will leverage familiar employee-manager hierarchical data structure from the EMP table for the examples. Here is the example of hierarchical SQL query with CONNECT BY and LEVEL at its simplest form.
SELECT EMPNO, ENAME, MGR, LEVELFROM EMP
CONNECT BY MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;
CONNECT BY MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;
As you notice, there is no START WITH clause; as a result, we do not get well-organized, meaningful data for effective reporting. So let us introduce START WITH into the query and see the results.
2.3 START WITH Example
Use A START WITH clause to specify a root row for the hierarchy and an ORDER BY clause using the SIBLINGS keyword to preserve ordering within the hierarchy. In a hierarchical query, use ORDER (or GROUP BY) SIBLINGS BY to sort rows of siblings of the same parent; this will retain the hierarchical order of the CONNECT BY results.
SELECT EMPNO, ENAME, MGR, LEVEL
FROM EMP
START WITH MGR IS NULL
CONNECT BY MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;
FROM EMP
START WITH MGR IS NULL
CONNECT BY MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;

2.4 NOCYCLE Example
It is possible for a hierarchical data set to be cyclical, which would pose a problem when querying the data. However, NOCYCLE clause with CONNECT BY instructs the database not to traverse cyclical data hierarchies. In this case, CONNECT_BY_ISCYCLE function shows the record that is responsible for the cycle. Note that CONNECT_BY_ISCYCLE can be used only when there is NOCYCLE clause specified.
SELECT EMPNO, ENAME, MGR, LEVEL, CONNECT_BY_ISCYCLE AS Cycle
FROM EMP
START WITH MGR IS NULL
CONNECT BY NOCYCLE MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;
FROM EMP
START WITH MGR IS NULL
CONNECT BY NOCYCLE MGR = PRIOR EMPNO
ORDER SIBLINGS BY MGR;
2.5 CONNECT_BY_ROOT Example
This clause returns the root node(s) associated with the current row. The following example returns the name of each employee in department 20, each manager above that employee in the hierarchy, the number of levels between manager and employee, and the path between the two:


If you need to calculate total salary for every manager and all the employees under him/her, use the CONNECT_BY_ROOT clause. Here is a simple example for department 20 employees from EMP table:

2.6 Additional Hierarchical Query Example
If you need to add tabs to visually represent hierarchical data, you can leverage padding functions as shown in the example below. The values in the TREE field have padding to visually show the reporting hierarchy (linkage) of employees. In addition, PATH gives the linkage between employees in delimited format.


3. Common Table Expression (CTE)
A CTE can be thought of as a temporary result set and are similar to a derived or subquery table in that it is not stored as an object and lasts only for the duration of the query. However, a CTE is more powerful than a derived table as it can be self-referencing or even referenced multiple times in the same query.
The CTE syntax is supported by Teradata, DB2, Firebird, Microsoft SQL Server, Oracle, PostgreSQL, SQLite, HyperSQL, H2 databases. CTE is called “recursive subquery factoring” on the Oracle database.
3.1 Building Recursive CTE
A recursive CTE must have four elements in order to function properly:
- Anchor query – runs once and the results seed the recursive query
- Recursive query – runs multiple times and is the criteria for the remaining results
- UNION ALL – statement to bind the Anchor and Recursive queries together
- INNER JOIN – statement to bind the Recursive query to the results of the CTE
Below is a complete CTE query that generates hierarchical data from the EMP table. Let us identify each of the above four elements of CTE query.
WITH OURCTE (EMPNO, ENAME, MGR, EMPLEVEL)
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
3.1.1 Anchor Query
The employees who do not have a manager (MGR IS NULL), for example the CEO of a company, are the top level employees of the organization. (The CEO likely reports to the Chairman of Board of Directors unless he or she holds dual positions, but let’s not worry about that for now). This group represents the Level 1 employees. The Anchor query identifies these employees and seeds that data to the Recursive query that follows.
SELECT EMPNO, ENAME, MGR –Initial Subquery
FROM EMP
WHERE MGR IS NULL;
FROM EMP
WHERE MGR IS NULL;

3.1.2 Recursive Query
The job of the Recursive query is to find all of the employees that are at Level 2 or below and have a manager defined (MGR IS NOT NULL). The below query is our Recursive query that will eventually have INNER JOIN to refer to the results of Anchor query recursively.
SELECT E.EMPNO, E.ENAME, E.MGR –Recursive Subquery
FROM EMP E
WHERE E.MGR IS NOT NULL;
FROM EMP E
WHERE E.MGR IS NOT NULL;

3.1.3 UNION ALL
We can transform Anchor and Recursive queries by placing a UNION ALL statement between them, putting parentheses around them, adding the declaration WITH OURCTE AS with columns aliases and finally adding SELECT * FROM OURCTE after the closing parentheses.
WITH OURCTE (EMPNO, ENAME, MGR)
AS (SELECT EMPNO, ENAME, MGR –Initial Subquery
FROM EMP
WHERE MGR IS NULL
UNION ALL –Union All combining Anchor and Recursive queries
SELECT E.EMPNO, E.ENAME, E.MGR –Recursive Subquery
FROM EMP E
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE;
AS (SELECT EMPNO, ENAME, MGR –Initial Subquery
FROM EMP
WHERE MGR IS NULL
UNION ALL –Union All combining Anchor and Recursive queries
SELECT E.EMPNO, E.ENAME, E.MGR –Recursive Subquery
FROM EMP E
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE;

As you can see, the results from the CTE query are exactly the same as the combined results of the Anchor and Recursive queries above.
The Anchor query inside the CTE represents everyone at Level 1 and the Recursive query represents everyone at Levels 2 and above. In order to visualize each level in a result set, you will need to add an expression field to each query as below:
WITH OURCTE (EMPNO, ENAME, MGR, EMPLEVEL)
AS (SELECT EMPNO, ENAME, MGR, ‘1’ EMPLEVEL –Expression to track level
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, ‘2 or above’ EMPLEVEL –Expression to track level (just a placeholder here)
FROM EMP E
–INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
AS (SELECT EMPNO, ENAME, MGR, ‘1’ EMPLEVEL –Expression to track level
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, ‘2 or above’ EMPLEVEL –Expression to track level (just a placeholder here)
FROM EMP E
–INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
3.1.4 INNER JOIN
As you can see, I had to put a hardcoded value for EMPLEVEL field in the Recursive section of the query above. This was because I did not join the Recursive section of the query with our recursive CTE. Now, in order to make the entire hierarchical query work with actual hierarchical level of records assigned, we need to complete the INNER JOIN as in the below query. Notice the difference in value of EMPLEVEL field. This is the complete hierarchical query in CTE format. Please note that ORDER BY is optional here.
WITH OURCTE (EMPNO, ENAME, MGR, EMPLEVEL)
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO –Inner Join
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO –Inner Join
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL;
If you need to display the manager name next to the manager ID, refer to the following query:
SELECT CTE.EMPNO, CTE.ENAME, CTE.MGR, E.ENAME “MGR NAME”, CTE.EMPLEVEL
FROM (
WITH OURCTE (EMPNO, ENAME, MGR, EMPLEVEL)
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL) CTE
LEFT OUTER JOIN EMP E ON CTE.MGR = E.EMPNO
ORDER BY CTE.EMPLEVEL,E.ENAME;
FROM (
WITH OURCTE (EMPNO, ENAME, MGR, EMPLEVEL)
AS (SELECT EMPNO, ENAME, MGR, 1 EMPLEVEL –Initial Subquery
FROM Emp
WHERE MGR IS NULL
UNION ALL
SELECT E.EMPNO, E.ENAME, E.MGR, CTE.EMPLEVEL + 1 –Recursive Subquery
FROM EMP E
INNER JOIN OURCTE CTE ON E.MGR = CTE.EMPNO
WHERE E.MGR IS NOT NULL)
SELECT *
FROM OURCTE
ORDER BY EMPLEVEL) CTE
LEFT OUTER JOIN EMP E ON CTE.MGR = E.EMPNO
ORDER BY CTE.EMPLEVEL,E.ENAME;

4. ANSI SQL
CONNECT_BY and Common Table Expression (CTE) are the two most widely used methods to build hierarchical results from relational data. However, the features of CONNECT_BY that are discussed above are predominantly available on Oracle database SQL and they are not all ANSI SQL compatible. On the other hand, although CTE syntax follows the ANSI SQL 99 standard, it is not very flexible nor fully compatible with every database, such as HP Vertica. As a result, you may need to produce hierarchical results from relational data structures using ANSI SQL on database systems that do not fully support CONNECT_BY and CTE syntaxes.
There is fairly simple way to accomplish this using SQL’s LEFT OUTER JOIN. In this method, we flatten out hierarchical data that has a parent-child relationship. The output of this query is a wide and sparsely populated data set that has extra columns to hold the node IDs and values at different levels of the hierarchy. The number of the extra columns with non-empty values is directly related to the depth of the hierarchy.
Once again, we will use the EMP table to build hierarchical ANSI SQL query to build a flat data set with the help of the LEFT OUTER JOIN clause. This the simplest form of the query, however, the principle elaborated here can be applied to data source(s) with multiple hierarchies built within them. With that being said, here are the steps that are followed to develop the hierarchical query below:
1. Identify the root of the hierarchy. In the case of EMP table, record with MGR IS NULL represents the root of the manager-employee data hierarchy. This is the level 1 or top-most node of the entire hierarchy. Let us assign alias T1 to this in-line result set.
2. Next, traverse one level down at a time recursively using LEFT OUTER JOIN, in which left outer join T1 with EMP table (T2) on EMPNO and MGR fields respectively. This query returns the employees that report to level 1 employee(s). Now we have the list of level 2 employees along with level 1 employees. Note that each new recursion adds extra columns to the result set making it flatter and sparser. Let us assign this result set alias T2.
3. Keep on repeating step 2 until we find a level with all the field values empty. This is when we know there are no more levels; we have traversed even one level below the deepest leaf of the hierarchy. In our example, level 4 is that final level because all the fields of level 5 are empty.
4. At this point, you will be thinking, “how am I going to know where to stop and/or how many levels do I need to define in my query in order to ensure my query does not break if new nodes are added to the hierarchy?”. In most practical-use cases, the depth of the hierarchy will be defined or fall within a limit. However, in certain situations such as organizational changes or merger and acquisition, hierarchy may change. Therefore, it is good practice to define a few extra levels although it will add more columns with empty values to the result set. But, the extra fields will not change the end result, because they are empty and therefore do not change the effective length of the branch of the hierarchy tree.
Below is our query and result set. As you see, level 5 is added for redundancy but that does not change the final query outcome in actual sense.

5. Conclusion
Each of these hierarchical query formats have their places and use cases. If you are using Oracle database 10g or later, you should use the CONNECT BY method because Oracle’s hierarchical SQL features are easy to use, mature and versatile. However, if you are using a database (e.g. Microsoft SQL Server) that does not support the CONNECT BY method, you need to use CTE syntax to build the hierarchical result set. Finally, if you work with one of the Big Data SQL platforms like HP Vertica, you may need to leverage ANSI SQL to derive hierarchical data in linear format (like Excel) from relational data. At Clarity Solution Group, our data analysts tackle this and many more complicated tasks in our day-to-day work. Learn more about Clarity’s Data and Analytics Services.
This piece was originally published on the Clarity Solution Group Blog at http://clarity-us.com/guide-sql-hierarchical-queries/
Comments