Data hub: build and implement API connections with just one script


Jeffrey Slort

29 Apr 2022

Internal projects / 18min

Whilst we were developing a brilliant idea, the BI club hit an issue: implementing an E(T)L script takes more time than writing the script itself. Scheduling, logging, maintenance, monitoring: it’s a lot of hassle and for a junior BI person it’s often too complex. 

You’d think there’d be a solution that would make building an API connection as simple as writing the script & job done. After trying various solutions, we came to the conclusion that there was nothing for it but to build it ourselves.

In this blog I will tell you how we now build API connections with just one script and how it works. I will also take you on the journey to the end result, delve a little deeper into the choices we made along the way, and show what it looks like behind the scenes.

* Data hub is a working name whilst we search for a better one.

1. The issue for many clients with lots of different systems and different architectures 

Before I start on the solution, let’s explore the problem a little more.

As a relatively small organisation we manage quite a few different environments. Sometimes on our own infrastructure, sometimes on the client’s. We connect to many different systems and our clients themselves have wide-ranging skill sets.

You’re soon looking at quite a mishmash of different cloud and ‘on-premise’ systems, individual scripts, and mini-applications. Documenting all these solutions is difficult enough, never mind the complexity for a developer to run OPS (operations) across all these systems, programming languages, and runtimes. In essence, we created single points of failure: specific clients depended on specific developers, and OPS was virtually impossible for our juniors.

As a growing BI club, we benefit from striking a balance between flexibility in the systems and tools we use and standardisation in our way of working, so that we can put anyone on any project, particularly the less experienced developers.

2. The solution: write a script and job done 

Our starting point for the eventual solution was that building an API connection should not be more difficult than writing a script.

Other requirements: things like monitoring, logging etc. have to be automatic. Plus: high output quality, low entry costs, no maintenance, no vendor lock-in, a high level of customisation, and easy to monitor.

This is how the Data hub works for a BI professional  

For example, you come across the issue that a client has a source system that cannot be accessed directly via Azure Data Factory (ADF). The source system does have a fairly simple REST API from which you can request all the required information.

A simple Python script should do the trick. So… you copy our standard Python project (or use an empty project) and start on your script. After a little development you have a script that reads from the REST API and writes files to a data lake; we have a standard Python package for that too.
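As an illustration, here is a minimal sketch of what such a connector script could look like. The endpoint, token, field names and local output path are assumptions for the example; in practice the data lake upload is handled by our standard Python package.

```python
import json
from pathlib import Path

import requests

# Hypothetical source system endpoint and token; not a real client's API.
BASE_URL = "https://api.example-source.com/v1"
API_TOKEN = "..."  # in practice injected by the Data hub at runtime


def fetch_invoices() -> list[dict]:
    """Read all pages from the (assumed) paginated REST API."""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    results, page = [], 1
    while True:
        resp = requests.get(f"{BASE_URL}/invoices", params={"page": page}, headers=headers)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results


def write_raw(records: list[dict]) -> None:
    """Write the raw result as a JSON file.

    In the real setup this step uses our standard data lake package
    instead of the local filesystem.
    """
    out = Path("output/invoices.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))


if __name__ == "__main__":
    write_raw(fetch_invoices())
```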

The standard project template has a default Dockerfile that builds the script into a container image with one command, which you can then push to our private container registry. Then you shoot a request to the Data hub API to turn the container image into a job. You can add some configuration, such as data lake credentials, API tokens, etc. These are of course stored encrypted and injected when the container image runs.
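To give an idea of that step, here is a hedged sketch of registering a job via the Data hub API using the requests library. The endpoint path, payload fields, URLs and authentication header are illustrative assumptions, not the actual API.

```python
import requests

DATAHUB_URL = "https://datahub.internal.example"  # assumed internal base URL
DATAHUB_TOKEN = "..."  # assumed personal access token

# Illustrative payload: image reference plus configuration that the hub
# stores encrypted and injects as environment variables at runtime.
payload = {
    "name": "client-x-invoices-sync",
    "image": "registry.internal.example/client-x/invoices-sync:1.0.0",
    "config": {
        "DATALAKE_CONNECTION_STRING": "...",
        "SOURCE_API_TOKEN": "...",
    },
}

resp = requests.post(
    f"{DATAHUB_URL}/api/jobs",
    json=payload,
    headers={"Authorization": f"Bearer {DATAHUB_TOKEN}"},
)
resp.raise_for_status()
print("Created job with ID:", resp.json()["id"])
```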

Finally, you have to make sure that the script runs at the right place in the data flow. So you log into ADF, add a standard “Datahub pipeline” to a data flow and enter the ID of the job you just created. ADF makes sure the script is started, tracks whether it has completed, and reports when it has failed. And job done!

How the Data hub works behind the scenes is explained in chapter 6.

3. It all started with a SaaS platform 

The brilliant idea I referred to in the intro, and where the journey of our tool started, was a SaaS platform with standard connectors: a project we started two years ago, but which never really got off the ground. Simply because at the time we were building up a BI club with just 6 people while also developing this project. Doing both at the same time was a struggle, and our focus was on the BI club.

The result was a half-operational platform, no solution for the problem of many clients with different environments, and in the end lots of loose scripts and one-off solutions built around our own platform. It just about worked, but we were asking for trouble.

After about a year of using the platform (or rather, not using it), we had learned a lot and tightened up the vision to what it is now.

We wanted to tackle our biggest problem by building a product for internal use. With the idea that if it did work properly, we could always productise it or make it part of the bigger SaaS platform.

4. A new round, new opportunities, and above all new starting points & requirements 

In early 2021, we held a few sessions with our full team to develop the new version of the Data hub. This produced a number of starting points and technical requirements; I explain those briefly below.

Implementing a connector should not be more difficult than writing simple scripts.

In the end, it is about the process of “Data in system X –> Data in Data Lake / DWH”. We want to make that as easy as possible, so that juniors can also write scripts.

 
The system is manageable and you have complete control over the data flow.

The system has to facilitate starting and stopping data flows from ‘orchestration tools’ so that the data flows can be managed completely outside the system. That way you are not bothered by ‘timing issues’ or other hard-to-trace bugs caused by managing your data flows in several tools.
 

Low entry costs. 

It must not be too expensive to get started, so that the threshold for using it in every new project stays low.

 
It has to be scalable. 

As a BI club we have growth ambitions (vacancies are here): that means the solution has to be able to grow with us. Not to Google scale, but from the current hundreds of jobs per day to thousands of jobs per day next year.

No vendor lock-in!!  

To us flexibility is the be-all and end-all – in the systems we use and in the way in which we provide our services. So, vendor lock-ins are out.

 
The connections have to be operated from a host of different systems. 

We use Azure Data Factory to manage ETL flows. The system we will use will have to fit this flow. We do not work with ADF for all our clients, so it has to be possible to manage those connections from other systems too. 
 

Monitoring, reliability, and quality have to be good by design.  

It may be obvious, but monitoring is often an afterthought and only gets attention after the first time a connection fails to do its work properly. That is often the result of simple scripts that are written quickly and rolled out somewhere.

The new system has to ensure that monitoring is added ‘automatically’. And that we notice smart things, such as long-running jobs and steadily increasing load times, without having to reinvent the wheel for every script/connection.
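Spotting a slowly degrading sync, for example, does not require anything fancy. A simple check along these lines (a sketch of the idea, not the hub’s actual implementation) already catches most cases:

```python
from statistics import mean, stdev


def is_suspiciously_slow(latest_seconds: float, history_seconds: list[float]) -> bool:
    """Flag a run that takes much longer than recent runs of the same job.

    Illustrative rule of thumb: more than two standard deviations above
    the mean of the previous runs counts as suspicious.
    """
    if len(history_seconds) < 5:
        return False  # not enough history to judge
    avg = mean(history_seconds)
    spread = stdev(history_seconds)
    return latest_seconds > avg + 2 * spread


# Example: a job that normally takes ~60 seconds suddenly takes 4 minutes.
print(is_suspiciously_slow(240, [58, 61, 59, 63, 60, 62]))  # True
```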

We have to be able to control whether a sync may run in parallel or absolutely not. 

For some systems it is not convenient to query the API in parallel, due to time-based restrictions or a token that changes every minute, etc. You have to be able to exercise control here.

Standard connections for standard systems 

We want to make the route from the source system to the data lake/data warehouse as short as possible. We also want to develop standard connections for standard source systems so they can be reused for various clients. That means it ought to be possible to pass configuration, secrets, etc. from the system to the ‘scripts’.
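In practice that boils down to the hub injecting the job configuration into the container as environment variables, so the connector script itself can stay generic. A minimal sketch (the variable names are assumptions):

```python
import os


def load_config() -> dict:
    """Read the configuration that the Data hub (assumed) injects as
    environment variables when the container starts."""
    return {
        "source_api_token": os.environ["SOURCE_API_TOKEN"],
        "datalake_connection_string": os.environ["DATALAKE_CONNECTION_STRING"],
        # Optional, client-specific tweak with a sensible default.
        "page_size": int(os.environ.get("PAGE_SIZE", "100")),
    }


if __name__ == "__main__":
    config = load_config()
    print("Running sync with page size", config["page_size"])
```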

5. Not a single available option meets these requirements 

We looked for existing solutions that met these requirements before we started building ourselves. Not a single existing product was a good choice for the long term.

Below I will deal with the three best options we could find and why they did not meet the requirements.

Azure Functions 

There are all sorts of solutions that could meet the requirements. Azure Functions was a really strong candidate: it is scalable, it has an HTTP API to start and stop syncs, and there is integration with Azure Data Factory. Unfortunately, all that glitters is not gold. Azure Functions is primarily intended for ‘short-running’ functions. Synchronisation scripts are often not short-running, and therefore not the best match for a system like Azure Functions.

We tested extensively with Azure Functions and the platform was seriously unstable: trigger runs would be cut off for the slightest reason. The development SDKs were not the best they could be, Durable Functions have a complex set-up, and they create a lock-in on Azure and Microsoft.

 
Databricks 

You could consider a Databricks environment with standard scripts for every client. Again, there is integration with Azure Data Factory, but not necessarily with other orchestration tooling. The clusters for running Spark jobs are not cheap. In essence, Apache Spark is intended for data-analysis work, which means it is optimised for dataframe operations and calculations. Querying data via REST or SOAP is not ‘easy’ to express as a dataframe operation, so it doesn’t quite match the requirements for the system either.

 
All-in-one platforms 

Then you have the all-in-one platforms, such as Fivetran, Stitch, Airbyte, etc. Awesome tools, but they originate in America and come with connectors for the big global players, such as Salesforce, Facebook and HubSpot, but not for Exact Online, Twinfield, Nmbrs or Visma Severa. When you start using those platforms, you always need something else to build the connections they don’t cover.

These systems often have their own scheduling mechanism and configuration options in a web platform. That makes it more difficult, but not necessarily impossible, to have full control over your data flow from front to back.

6. Our approach and choices during building 

Make the solution environment-agnostic 

We started with our most important requirement: a solution that is environment-agnostic and can be operated from a range of different systems. What all these services – such as Apache Airflow or a client’s virtual machine – have in common is that they speak HTTP.

The web service has to be the central point of contact for all these services: it has to make sure the scripts run to completion, that only authenticated users can start a sync (no crypto miners, please), and it has to take care of monitoring and logging. We chose to make this functionality available via a REST API.
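To make that concrete, here is a rough sketch of what such an API surface could look like, written with FastAPI purely as an illustration. The endpoint names, models and framework choice are assumptions; the real Data hub may be organised differently.

```python
# Illustrative only: endpoint names and models are assumptions, not the
# actual Data hub API.
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Data hub (sketch)")

JOBS: dict[str, dict] = {}   # in-memory stand-in for a real database
RUNS: dict[str, dict] = {}


class JobIn(BaseModel):
    name: str
    image: str
    config: dict[str, str] = {}


@app.post("/api/jobs")
def create_job(job: JobIn) -> dict:
    """Register a container image (plus encrypted config) as a job."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = job.dict()
    return {"id": job_id}


@app.post("/api/jobs/{job_id}/runs")
def start_run(job_id: str) -> dict:
    """Start a run; a scheduler backend would launch the container here."""
    run_id = str(uuid.uuid4())
    RUNS[run_id] = {"job_id": job_id, "status": "running"}
    return {"run_id": run_id}


@app.get("/api/runs/{run_id}")
def get_run(run_id: str) -> dict:
    """Orchestration tools (ADF, Airflow, ...) poll this for completion."""
    return RUNS[run_id]
```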

Run scripts without configuring servers 

The next challenge we wanted to tackle was running a script or application without having to configure any servers for it.

The connections we write are usually very simple: copy all data or a subset of the data from one system to a data warehouse or a data lake. Generally, this can be programmed with really simple scripts – apart from a few exceptions, I’m looking at you Twinfield!

But… developing on or for a ‘platform’ can be tricky. Azure Functions is an example: it is simple to start with, but when you want to build a Durable Function, before you know it you’ve spent half a day setting up orchestrator functions, activity functions and client functions. So much for your simple script. It has to be possible to run the script in a controlled environment without too much hassle.

The solution is container images. Simple and effective. With properly developed base images and project templates it is also really easy for less experienced developers to use. 
 

Making the whole lot scalable 

Then it all has to be scalable. A web service on a single server that runs container images on API requests has only limited scalability. Fortunately, we are not the only ones to face this problem and there are lots of ways to make these types of workloads scalable with container scheduling engines.

The three major platforms are Kubernetes, Nomad and Docker Swarm. Kubernetes is really popular, but unfortunately also really complex. Docker Swarm is really simple, but is not developed actively. In terms of complexity, Nomad is between Docker Swarm and Kubernetes, but it is a relatively new player in the market.

They all have advantages and disadvantages and there is plenty of ongoing development to improve these platforms.

I wouldn’t like to predict which platform will come out on top in five years’ time. For our use case we only need a small subset of these platforms’ functionality. That makes it possible to build a modular system that could in principle run on all three platforms; for now only Docker Swarm has been implemented. It means it would be relatively easy to move when a new player turns up, when Kubernetes becomes easier to manage, or when Docker Swarm development definitively stops. And all that without having to adjust the implementations for all our clients for every sync script we ever built.
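A hedged sketch of what that modularity could look like in Python: a small backend interface with a Docker Swarm implementation based on the docker SDK. The class and method names are illustrative; the actual Data hub internals may be organised differently.

```python
from abc import ABC, abstractmethod

import docker  # pip install docker


class ContainerBackend(ABC):
    """Minimal interface the Data hub needs from a scheduling platform."""

    @abstractmethod
    def run_job(self, name: str, image: str, env: dict[str, str]) -> str:
        """Start a container for the job and return a platform-specific ID."""


class DockerSwarmBackend(ContainerBackend):
    """Docker Swarm implementation; Kubernetes or Nomad backends would
    implement the same interface."""

    def __init__(self) -> None:
        self.client = docker.from_env()

    def run_job(self, name: str, image: str, env: dict[str, str]) -> str:
        service = self.client.services.create(
            image,
            name=name,
            env=[f"{key}={value}" for key, value in env.items()],
            # One-shot job: do not restart the task when the container exits.
            restart_policy=docker.types.RestartPolicy(condition="none"),
        )
        return service.id
```

Swapping platforms then means writing one new backend class instead of touching every sync script.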

The start of All Your BI Labs 

All Your BI Labs was born from the Data-hub project. We love developing our own tools, and we like to make our own work as easy as possible. It also fits within our vision to develop more products, as we grow, alongside our project work.

Everyone in our organisation can pitch projects for our Lab. If the team thinks it has mileage and the business case is positive, we get on with it.

It started with a few small projects around the Data hub. The Data hub gives us an enormous amount of flexibility, which is both a blessing and a curse. So we built a few new utilities, such as:

  • A Python package with a pipeline builder/executor for easily extracting data from HTTP APIs (see the conceptual sketch after this list);
  • A standard Python project template with everything needed to use the Data-hub Engine;
  • A standard Azure Data Factory template for starting a job in the Data-hub Engine;
  • A “Way of Working”: delta first, raw loads to the data lake, processing in the DWH.
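To show what the pipeline builder/executor idea boils down to, here is a tiny self-contained sketch of the concept. It is not the API of the actual package; the class and the example endpoint are purely illustrative.

```python
from typing import Any, Callable

import requests


class Pipeline:
    """Conceptual extract -> transform -> load pipeline, built step by step."""

    def __init__(self) -> None:
        self.steps: list[Callable[[Any], Any]] = []

    def add(self, step: Callable[[Any], Any]) -> "Pipeline":
        self.steps.append(step)
        return self  # allow chaining

    def run(self, data: Any = None) -> Any:
        for step in self.steps:
            data = step(data)
        return data


# Example usage against a public placeholder API.
pipeline = (
    Pipeline()
    .add(lambda _: requests.get("https://jsonplaceholder.typicode.com/todos").json())
    .add(lambda todos: [t for t in todos if t["completed"]])
    .add(lambda done: print(f"{len(done)} completed todos"))
)
pipeline.run()
```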

By now we are also developing several larger projects that impact our day-to-day work. Two examples:

  1. We are developing a data warehouse automation framework that takes out 80% of the standard work (the plumbing) when you configure a Data Vault. All you have to do is translate the business logic into SQL statements.
  2. We are building a platform where everyone in the organisation has safe access to all data. It means they don’t have to wait for the developers.

I want to write another article on that next summer. At the moment, we’re in the middle of selecting and testing a number of tools. Let me know what you think about these types of articles in the comments! Or if you have questions. 

