This project aims to compare different methods of retrieving hirercharical data using Python (Flask Framework) and PostgreSQL. It will benchmark query, response time and also provides brief note on pros and cons of each methods.
Following tools are used for benchmarking
- Python (3.5.2)
- Flask (0.11.1)
- Flask-SQLAlchemy (2.1)
- PostgreSQL (9.5.4)
- Marshmallow (2.9.1)
Run the following commands to set up the project and environment
git clone https://github.com/SohaibFarooqi/hierarchical-data-model.git
virtualenv <your-env>
pip install -r requirements.txt
flask run
If you wish to insert dummy records in your table, Please go through the next section before running the application.
There is a module in this application that can be use to insert test data in any specific model. task.py
defines this job. It uses PyInvoke
to execute.
Example Usage: invoke build --type aj
For all possible options please see invoke --help build
.
By default it inserts 10 rows, However you can modify this behaviour by updating NUM_RECORDS in config.py
. If you are inserting significantly large number of rows you should also consider updating CHUNK_SIZE variable in config.py
. This variable defines the number of rows after which SQLAlchemy should issue a commit command to the db. After successful completion of the job you can see records in your specified table.
Note: This job currently support single tree insertions. Future releases of this project will include multi-tree support.
Following methods are used for benchmarking response and query time.
In the adjacency list model, each item in the table contains a pointer to its parent. The topmost element, in this case electronics, has a NULL value for its parent. The adjacency list model has the advantage of being quite simple While the adjacency list model can be dealt with fairly easily in client-side code, working with the model can be more problematic in pure SQL.
Example Data:
| id | parent_id |
|-------------|-------------|
| 1 | NULL |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 1 |
| 7 | 6 |
| 8 | 7 |
| 9 | 6 |
| 10 | 6 |
Lineage Column defines a path
column which contains path of that node till root node. Path can be seperated with any character. Example None.1.2.3
. This trick is really handy when searching for child and ansestor nodes. SQL WITH
and ANY
are really useful with this approach.
With Postgres one can define GIN or GIST indexes on LTREE dtype to perform really fast filtering of data.This trick can be used in creating related searches and creating hashtags.
Drawbacks of this approach are, inserts, update and moving a node is really expensive operation since every time you have to recalculate path and update it.
Example data:
| id | parent_id | Path |
|-------------|-------------|------------------|
| 1 | NULL | None |
| 2 | 1 | None.1.2 |
| 3 | 2 | None.1.2.3 |
| 4 | 2 | None.1.2.4 |
| 5 | 2 | None.1.2.5 |
| 6 | 1 | None.1.6 |
| 7 | 6 | None.1.6.7 |
| 8 | 7 | None.1.6.7.8 |
| 9 | 6 | None.1.6.9 |
| 10 | 6 | None.1.6.10 |
The nested set model is to number the nodes according to a tree traversal, which visits each node twice, assigning numbers in the order of visiting, and at both visits. This leaves two numbers for each node, which are stored as two attributes. Querying becomes inexpensive: hierarchy membership can be tested by comparing these numbers. Updating requires renumbering and is therefore expensive. Refinements that use rational numbers instead of integers can avoid renumbering, and so are faster to update, although much more complicated
Nested sets are very slow for inserts because it requires updating left and right domain values for all records in the table after the insert. This can cause a lot of database thrash as many rows are rewritten and indexes rebuilt
Example Data:
| id | lft | rgt |
|-------------|-----|-----|
| 1 | 1 | 20 |
| 2 | 2 | 9 |
| 3 | 3 | 4 |
| 4 | 5 | 6 |
| 5 | 7 | 8 |
| 6 | 10 | 19 |
| 7 | 11 | 14 |
| 8 | 12 | 13 |
| 9 | 15 | 16 |
| 10 | 17 | 18 |
Application provides following end-points:
function | Route | Description |
---|---|---|
get_subtree | /subtree/<id> |
Get sub tree based on parent id provided in path params. It returns whole tree if no parent id is provided |
get_root | /root/ |
Get list of all root nodes. Accepts no argument |
get_leaf | /leaf/<id> |
Get leaf nodes based on a parent id. If parent id has no direct leaf nodes it recursively traverse tree and return leaf nodes of decendents |
get_child | /child/<id> |
Get all immidiate childs based on parent id provided in path params |
Note: All routes are relative to base url |
Query Params: To test the requests mentioned above, you need to provide type = aj
, type = mp
and type = ns
for adjcency list, materialized path views and nested set. If type
param is absent, the request will throw an error.
Example Request: www.mydomain.com/subtree/1/?type=ns
HTTP request time benchmarking is performed using wrk. Test is done on
getSubTree()
with 100 tree nodes.