BLOG

Prepare Your Hierarchical Data for Analysis

Updated: Mar 10

Nowadays data is more and more exchanged in a semi-structured or unstructured way through the usage of JSON, which is replacing XML. This data structure is not necessarily analytic-tool friendly and therefore requires some preparation in order to be used in more traditional BI & Analytics tools.

Mobile, Social Network, Marketing tools and Web APIs generate huge amounts of JSON content daily.

JSON may sound unfamiliar to you, but it's the structure used today behind almost all our web interactions. If you have for instance a TMall or JD store, and want to extract your data to create your analytics dashboard, you will for sure end up facing this JSON data structure.

Or if you want to extract data from services like Google Analytics, or any other Web Service APIs, you will for sure meet this JSON data structure. What if you want to do some social network (Twitter, Facebook, Linkedln, etc.) analysis to know and understand what your customers think about you, you will again find this JSON data structure.


Technically, JSON, short for JavaScript Object Notation, is a semi-structured text based and human readable data structure aiming to replace XML. JSON is now a popular choice among developers and commonly used in public APIs on the web.



All JSON documents are not equal!


One of our clients had some hierarchical data structures stored in the JSON format. In order to conduct further analysis on these data, the client wanted to transform this hierarchical JSON document type into a more traditional tabular data format and structure, and store it in a relational database.


Even the current modern relational databases like Oracle (used by our client) are able to store and query JSON document using SQL, but it is difficult or impossible to easily manage complex hierarchical and recursive JSON data structures.

This is how a hierarchy in JSON may look like.

The main challenge we faced on this project was that we had a limited set of tools (and tool versions) available. Fortunately, among those tools, we had the scripting language Python, which is well know for the data analyst and data scientist community.


We had to quickly make some changes in the overall architecture and data flow during the conception phase of the project to integrate Python as the data preparation component. This component became the most important part of the architecture, as it was is charge of transforming this JSON data structure to a tabular data structure that can then be used by the reporting tool.


In the end, choosing Python allowed us to create a customisable, flexible, and more maintainable solution than we could do with other tools (like Oracle JSON integration and IBM Datastage 8.7), which were at our disposal.



Other solutions?


Oracle Database can natively store and query data structured in the JSON format. In case of more complex hierarchical and recursive data structures, Python can be brought in to transform the data to a appropriate format for further analysis.


In fact, Oracle is not the only relational database to handle JSON. Others like Microsoft SQL Server and IBM DB2 are also able to handle such data structure. Nevertheless, they all suffer the same problem when it comes to more complex JSON data structures.


The main advantage for the business to store data in a relational database, either from a JSON structure as it is or in tabular way, is to have all the data in the same place. This grandly simplify the elaboration of reports and dashboards and enhances the ability to get valuable insights from all available data, not matter what the data structure is. The problem is the also same with most of the data preparation tools on the market, which are able to handle JSON but become short when more complex JSON structures are involved.


Other solutions will have to store those JSON documents in a more appropriate data store, like MongoDB and CouchDB, but this will add another datastore in the business architecture of the client, without bringing any real benefit for the business or the IT department.


Microsoft SQL Server in the 2017 version offers a hybrid alternative allowing users to create stored procedures using Python or R language, both loved by data analysts and data scientists. This brings the flexibility and power of the two languages directly inside the database, closer to where the data is stored, and avoids moving data around from one tool (or platform) to another in the data preparation stage. This allows data analysts and scientists to quickly try out new ideas, algorithms, and provide business users with the needed insights to optimise their decision making process.

There are other RDBMSs to handle JSON.


Have a similar problem in front of you?


All in all, no matter you are doing e-Commerce platform data extraction, accessing Web APIs from various data providers, tracking customer activities online, or using different social media for marketing purposes, you’ll have great chance to encounter data stored in the JSON format waiting to be parsed and analyse. This article only suggests some potential solutions to deal with JSON documents, and how it can be to mix the data with existing relational data to get more accurate insights and therefore bring more value to business users.


We have successfully implemented and delivered this to our clients. If you are interested to know more on these topics, please contact Explora Consulting and we will be glad to discuss with you about solutions that can bring value to both your IT department and your business users. v If you are interested to know more on these topics, please contact Explora Consulting and we will be glad to discuss with you about solutions that can bring value to both your IT department and your business users.

Copyright © 2019 Explora Consulting. All Rights Reserved. 

  • YouTube
  • Grey Facebook Icon
  • Grey LinkedIn Icon