Adventures in Machine Learning

Mastering Hierarchical Data: Storing Querying and Using Recursive CTEs

Are you working with complex data that has structures within it? Hierarchical data can be found in employee hierarchies, family trees, drinks menus, the taxonomy of living things, or even just the folders on your computer.

What sets hierarchical data apart is the parent-child relationship that exists between its components. This relationship allows for the creation of a hierarchy, where parents exist above children, and children exist below parents.

In this article, we’ll go over how to recognize hierarchical data, store it in a database, and query it using recursive queries and Common Table Expressions.

Recognizing Hierarchical Data

Hierarchical data has a parent-child relationship, like a tree structure or the branches of a tree. The most common examples of hierarchical data include employee hierarchies, family trees, drinks menus, the taxonomy of living things, and the folders on your computer.

In employee hierarchies, each employee reports to a manager who, in turn, reports to a higher-level manager. In family trees, individuals are descendants of their parents, grandparents, and great-grandparents.

In drinks menus, cocktails have ingredients that make up the “parent” category, while specific drinks are the “child” category.

Storing Hierarchical Data in a Database

To store hierarchical data in a database, we use a parent-child relationship. In this relational database, we store the data in tables with columns.

Each row represents a single node (a parent or child) in the hierarchy, and each column represents a piece of data related to that node. The parent-child relationship is formed by adding a column that references another node in the same table.

This creates a hierarchy linking the rows, with parents linked to their children.

Querying Hierarchical Data Using a Self-Join

Now that we have stored hierarchical data in a database, we want to query that data. To do this, we can use a self-join.

A self-join is when a table is joined to itself. In other words, we combine the rows in the same table to retrieve information about the parent-child relationship.

For example, in an employee hierarchy, we can use a self-join to get the direct subordinates for a manager. We would join the employee table to itself by matching the manager’s ID to the employee’s manager ID.

By doing this, we can return a list of the manager’s direct reports.

Recursive Queries and Common Table Expressions

Recursive queries and Common Table Expressions (CTEs) are used when there is a hierarchy with an indefinite number of levels. For example, in an employee hierarchy, we might want to return not only the direct subordinates but also all the indirect subordinate employees.

What are Recursive Queries and CTEs?

A recursive query is a type of SQL query that works on a subset of data and repeatedly processes that data, adjusting and modifying it along the way. This process continues until a specific condition is met.

A Common Table Expression (CTE) is a named temporary result set used to define a complex SQL query. CTEs are especially useful when we are working with hierarchical data because they allow us to break down the query into manageable steps.

The Recursive CTE Syntax

To use a recursive CTE, we use the “WITH RECURSIVE” statement followed by the name of the CTE and the columns we want to retrieve. We then define the anchor member, which is the starting point of the recursive query.

This is the first row that the CTE will use. We then define the recursive member, which is the next step in the recursive query.

This step is repeated for every row that is returned, creating a loop until the condition is met.

Example: Querying the Employee Hierarchy

Let’s look at an example of how to query an employee hierarchy using a recursive CTE.

We’ll start with the employee table, which contains an ID column, a name column, and a manager ID column. Our goal is to return all the direct and indirect bosses for a given employee.


WITH RECURSIVE bosses AS (
SELECT ID, name, manager_id
FROM employees
WHERE name = 'John Smith'
UNION
SELECT e.ID, e.name, e.manager_id
FROM employees e
INNER JOIN bosses b ON b.manager_id = e.id
)
SELECT * FROM bosses;

In this example, we start with the employee named John Smith. We use the “WHERE” clause to select the row where the name is John Smith.

We then use the “UNION” operator to combine this row with those returned by the recursive member. The recursive member joins our temporary table (bosses) to the employees table on the manager ID column.

We end the CTE with a select statement that retrieves all columns from our temporary table (bosses). This returns a table of all the bosses for John Smith, including both direct and indirect bosses.

Example: Querying the Folder Hierarchy

Another example of using a recursive CTE is to query the folder hierarchy on your network drive. Folders on your network drive are often arranged in a branched tree structure, making it difficult to quickly navigate through the folders and get to the files you need.

Using a recursive CTE helps to simplify this process. In this example, we have a folder table, which contains an ID column, a name column, and a parent ID column.

Our goal is to return the folder structure for a folder with a given ID.


WITH RECURSIVE folder_structure AS (
SELECT ID, name, parent_id, name as path
FROM folders
WHERE id = 1
UNION
SELECT f.ID, f.name, f.parent_id, CONCAT(fs.path, '/', f.name) as path
FROM folders f
JOIN folder_structure fs ON fs.id = f.parent_id
)
SELECT * FROM folder_structure;

In this example, we start with the folder with an ID of 1. We create a temporary table (folder_structure) that contains the ID, name, parent ID, and path of the folder.

We set the path to the name of the folder since it is the first step in the hierarchy. We then use UNION to join this temporary table with the folders table.

This is repeated for every row that we retrieve from the folders table, creating a loop until we’ve retrieved every folder in the hierarchy. The SELECT statement then retrieves all columns from our temporary table (folder_structure).

To create the path for every folder, we use the CONCAT function to join the path from the previous folder with the current folder’s name. This gives us a complete path for each folder in the hierarchy.

Conclusion

In conclusion, hierarchical data has a parent-child relationship that allows for the creation of a hierarchy, where parents exist above children, and children exist below parents. Storing hierarchical data in a database requires creating a parent-child relationship, with a column that references another node in the same table.

To query hierarchical data, we can use a self-join to retrieve information about the parent-child relationship. Recursive queries and Common Table Expressions are used when there is a hierarchy with an indefinite number of levels.

Using the recursive CTE syntax helps to break down the query into manageable steps, making it easier to retrieve specific information from the data set. This article explained the basics of hierarchical data and how to recognize it in various examples such as employee hierarchies, family trees, and network drives.

We also learned how to store hierarchical data in a database by creating a parent-child relationship and querying it using a self-join. Moreover, we explored how to query hierarchical data using recursive queries and Common Table Expressions with syntax and examples.

Being able to work with hierarchical data is essential for organizing and analyzing complex information. By learning these techniques, we can make our work more efficient and effective.

Understanding hierarchical data will also help in evaluating relationships between entities and creating comprehensive solutions.

Popular Posts