Jeffrey Slort
BI Developer

Datahub: build and deploy API links with just one script


While developing a great idea, we as a BI club ran into a problem: implementing an E(T)L script takes more time than writing the script itself. Scheduling, logging, maintenance, monitoring; it's mostly a lot of hassle, and for a junior BI person, it's often too complex.

You would think there must be a solution that makes building an API connection as simple as: write a script and you're done. After trying various existing solutions, we concluded that we needed to build it ourselves.

In this blog, I share how we can now build API links with just a script, and how that works. I also take you through the journey to this result, going a little deeper into the choices we made along the way and how it is built on the back end.

* Datahub is a working name while we search for a better one.

1. The problem: many customers, many different systems, and different architectures

Before I kick off with the solution, let me delve a little deeper into the problem.

As a relatively small organization, we manage a lot of different environments. Sometimes on our infrastructure, sometimes on the customer's. We access many different systems and have to deal with varying skill levels on the client side.

So you soon end up with a mixture of different cloud and on-premise systems, loose scripts, and mini-applications. Documenting all these solutions is already difficult, not to mention the complexity of running OPS (operations) as a developer across all those systems, programming languages, and runtimes. As a result, we were creating single points of failure: specific customers depended on specific developers, and OPS was virtually impossible for our juniors.

As a growing BI club, we benefit from a happy medium between flexibility in the systems and tools we use and standardization in the way of working, so that everyone, especially the less experienced developers, can easily be deployed on any project. Particularly with new BI implementations, a tool like this helps us accelerate.

2. The solution: write a little script and the Datahub does the rest

For the final solution, our guiding principle was that building an API link should be no more difficult than writing a script.

Other requirements: things like monitoring, logging, etc., must be handled automatically. Plus: high-quality output, low entry cost, no maintenance, no vendor lock-in, high customization, and easy monitoring.

Here's how the Datahub works for a BI professional

For example, you run into the problem that the customer has a source system that cannot be accessed directly through Azure Data Factory (ADF). The source system does have a fairly simple REST-API where you can get all the information you need.

A simple Python script should then suffice. So you make a copy of our standard Python project (or just an empty project) and you start your script. After some development, you have a little script that reads the REST-API and can write files into a data lake (we also have a standard Python package for this).
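To give an idea of what such a script looks like, here is a minimal sketch. The source API, paths, and the use of the Azure Data Lake SDK are assumptions for illustration; in practice our standard Python package takes care of the data lake part.

```python
import json
import os

import requests
from azure.storage.filedatalake import DataLakeServiceClient

# Minimal sketch of a sync script (illustrative names and endpoints).
# 1. Read everything we need from the source system's REST-API.
response = requests.get(
    "https://api.example-source.com/v1/invoices",   # hypothetical source system
    headers={"Authorization": f"Bearer {os.environ['SOURCE_API_TOKEN']}"},
    timeout=60,
)
response.raise_for_status()
invoices = response.json()

# 2. Write the raw result as a file into the data lake
#    (our standard Python package wraps this; shown here with the Azure SDK).
lake = DataLakeServiceClient.from_connection_string(os.environ["DATALAKE_CONNECTION"])
file_client = lake.get_file_client(file_system="raw", file_path="customer-x/invoices.json")
file_client.upload_data(json.dumps(invoices), overwrite=True)
```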

The standard project template automatically includes a Dockerfile, so one command builds the script into a container image and uploads it to our private container registry. Then you fire a request at the Datahub API to turn the container image into a job. In this request, you can provide configuration such as data lake credentials, API tokens, etc. These are, of course, stored encrypted and passed to the container at run time.
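That request could look roughly like this. The endpoint, payload, and field names below are assumptions, meant only to illustrate the idea:

```python
import requests

DATAHUB_URL = "https://datahub.internal.example/api"   # hypothetical internal URL

resp = requests.post(
    f"{DATAHUB_URL}/jobs",
    headers={"Authorization": "Bearer <your-access-token>"},
    json={
        "name": "customer-x-invoice-sync",
        "image": "registry.example.com/connectors/customer-x-invoice-sync:1.0.0",
        "allow_parallel": False,              # whether this job may run in parallel
        "config": {                           # stored encrypted by the Datahub
            "SOURCE_API_TOKEN": "<secret>",
            "DATALAKE_CONNECTION": "<connection string>",
        },
    },
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["id"]                    # this ID is what you enter in ADF
```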

Finally, you need to make sure the script runs at the right place in the data flow. So you log into ADF, add the standard "Datahub pipeline" to your data flow, and enter the ID of the job you just created. ADF then starts the script, keeps track of whether it has finished, and reports if it has failed. And we're done!
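Conceptually, that standard "Datahub pipeline" does little more than start a run over HTTP and poll until it is finished. A rough sketch (endpoints and status values are assumptions):

```python
import time

import requests

def run_job_and_wait(base_url: str, token: str, job_id: str, poll_seconds: int = 30) -> None:
    """Start a Datahub job and block until it has finished (sketch, hypothetical endpoints)."""
    headers = {"Authorization": f"Bearer {token}"}

    # Start a new run of the job.
    started = requests.post(f"{base_url}/jobs/{job_id}/runs", headers=headers, timeout=30)
    started.raise_for_status()
    run_id = started.json()["id"]

    # Poll until the run succeeds or fails, so the orchestrator (ADF, Airflow, ...)
    # knows exactly when the data is ready and whether something went wrong.
    while True:
        status = requests.get(f"{base_url}/runs/{run_id}", headers=headers, timeout=30).json()["status"]
        if status == "succeeded":
            return
        if status == "failed":
            raise RuntimeError(f"Datahub run {run_id} failed")
        time.sleep(poll_seconds)
```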

How the Datahub works on the back end, we'll tell you in Chapter 6.

3. It started with a SaaS platform

The great idea I mentioned in the intro, and where our tool's journey begins, was a SaaS platform with standard connectors: a project we started two years ago but never really followed through on. Simply because at that time we were only 6 people, building a BI club and developing this project at the same time. That combination didn't work, because the focus was on our BI club.

The result was that we had a half-working platform, no solution to the problem of many customers with different environments, and we ended up building many scripts and other solutions around our platform. It just about worked, but it was asking for trouble.

After a year of using the platform (or rather not using it), we learned a lot and sharpened the vision to what it is today.

We wanted to tackle our biggest problem by building a product for internal use, with the thought that, if it works well, we can always productize it or make it part of the larger SaaS platform.

4. New round, new opportunities, and especially new principles & requirements

In early 2021, we held some sessions with our entire team for the development of the new version of the Datahub. This yielded several principles and technical requirements: brief explanations below.

Implementing a connector should be no more difficult than writing simple scripts.

What ultimately matters is the process of "Data in system X --> Data in Data Lake / DWH." We want to make this as simple as possible, so that even juniors can easily write scripts.

The system must be manageable and give you complete control over the data flow.

The system must ensure that data flows can be started and stopped from "orchestration tools", so that the complete management of data flows can be performed outside the system. As a result, you won't suffer from "timing issues" or other opaque bugs caused by managing your data flows in multiple tools.

Low entry costs.

It should not be too expensive to get started with, so that it can easily be deployed on any new project.

It has to be scalable.

As a BI club, we have growth ambitions (job openings can be found here), which means the solution must be able to grow with us. Not to Google size, but from the current hundreds of jobs per day to thousands of jobs per day next year.

No vendor lock-in!!!

We think flexibility is an absolute must, both in the systems we use and in the way we offer our services. So no vendor lock-ins.

The links must be controllable from numerous different systems.

We use Azure Data Factory to manage ETL flows. So the system we will use must fit into this flow. But we don't work with ADF for all customers, so it should be possible to drive those links through other systems as well.

Monitoring, reliability, and quality must be good by design.

It goes without saying, but monitoring is often an afterthought that only gets attention after the first time a link has failed to do its job properly. This is often the result of simple scripts written quickly and rolled out somewhere.

The new system should ensure that monitoring is added "automatically", and that we notice things like long-running jobs and spontaneously increasing load times without having to reinvent this for every little script/link.

We should be able to control whether a sync is allowed to run in parallel or not.

For some systems, it is not convenient to query the API in parallel, for example because of timing restrictions or a token that changes every minute. So you have to be able to exercise control over this.

Standard links for standard systems

We want to make the path from the source system to the data lake/data warehouse as short as possible. So we also want to develop standard links for standard source systems, so that we can reuse them for multiple customers. Therefore, it must be possible to pass configuration, secrets, and the like from the system to the scripts.
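Inside the container, a standard link could then simply pick up that configuration from its environment. A small sketch, with hypothetical variable names:

```python
import json
import os

# The Datahub decrypts the job configuration and injects it as environment
# variables when the container starts (variable names here are illustrative).
API_TOKEN = os.environ["SOURCE_API_TOKEN"]                 # secret for the source system
LAKE_CONNECTION = os.environ["DATALAKE_CONNECTION"]        # data lake credentials
SETTINGS = json.loads(os.environ.get("JOB_CONFIG", "{}"))  # free-form, per-customer settings

print(f"Syncing entities: {SETTINGS.get('entities', ['all'])}")
```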

5. No available option meets these requirements

Before building our own, we looked for existing solutions that met these requirements. Unfortunately, no existing product turned out to be a good choice for the longer term.

Below I discuss the three best options we saw and why they do not meet the requirements.

Azure Functions
To fulfill the requirements, numerous solutions are possible, and a very strong candidate is Azure Functions: it is scalable, it provides an HTTP API to start and stop syncs, and there is an integration with Azure Data Factory. But unfortunately, it's not all sunshine and rainbows. Azure Functions is primarily designed to run short-lived functions, and synchronization scripts are usually not short-lived, so they are not the best match.

On top of that, trigger runs could be cut short for the smallest of reasons, the development SDKs were not optimal, Durable Functions have a complex setup, and it creates lock-in on Azure and Microsoft.

Databricks
Another option is a Databricks environment per client with standard scripts. Again, there is integration with Azure Data Factory, but not necessarily with other orchestration tooling. The clusters for running Spark jobs are not cheap. Apache Spark is also essentially designed for data analysis work and is therefore optimized for DataFrame operations and calculations; requesting data via REST or SOAP is not easily expressed as a DataFrame operation. So it does not quite match the demands on the system either.

All-in-one platforms
Then you also have all-in-one platforms such as Fivetran, Stitch, Airbyte, etc. Fantastic tools, but they originate from America and offer connectors to major global parties such as Salesforce, Facebook, and HubSpot, not to Exact Online, Twinfield, Nmbrs, or Visma Severa. So if you use these platforms, you still always need something extra to build the connectors for the lesser-known systems.

Furthermore, these systems often have their own scheduling mechanism and configuration capabilities in a web platform. This makes it more difficult (though not always impossible) to keep complete control over your data flow from front to back.

6. Our approach and choices during construction

Making the solution environment-agnostic
We started with our main requirement: a solution that is environment-agnostic and thus can be controlled from numerous different systems. What all services - such as Apache Airflow or a client virtual machine - have in common is that they speak HTTP.

The web service must be the central point of contact for all these services: it ensures that scripts are executed, that only authenticated users can start a sync (no crypto miners, please), and it provides monitoring and logging. We chose to make this functionality available via a REST-API.
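As a sketch of that idea (FastAPI is used here purely for illustration; it is not necessarily what the Datahub runs on), the core of such a web service is an authenticated endpoint that hands a job off to the container platform:

```python
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VALID_TOKENS = {"adf-pipeline", "airflow"}        # in reality: a proper identity/secret store

@app.post("/jobs/{job_id}/runs")
def start_run(job_id: str, authorization: str = Header(default="")) -> dict:
    token = authorization.removeprefix("Bearer ").strip()
    if token not in VALID_TOKENS:                 # only authenticated users, no crypto miners
        raise HTTPException(status_code=401, detail="Not authenticated")
    run_id = schedule_container(job_id)           # hand off to the container platform
    return {"id": run_id, "status": "scheduled"}

def schedule_container(job_id: str) -> str:
    # Look up the job, decrypt its configuration, and start the container image
    # on the underlying platform (Docker Swarm in our current implementation).
    return "run-123"                              # placeholder for the sketch
```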

Running scripts without setting up servers
The next challenge we wanted to tackle: running a script/application without having to set up and maintain servers for it.

The links we write are usually very simple: copy all or a subset of the data from a system to a data warehouse or data lake. In most cases, this can be programmed with very simple scripts (the exceptions aside, I'm looking at you Twinfield!).

But developing on or for a "platform" can sometimes be tricky. Azure Functions is an example of this. It's simple to start with, but if you want to create a durable function, suddenly you spend half your day setting up Orchestrator functions, Activity functions, and Client functions... There goes your simple little script. It should be possible to execute the little script in a controlled environment without too much "fuss".

The solution to this is container images. Simple and effective. And with well-developed base images and project templates, it is also tremendously easy for less experienced developers to use.
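The project template is also where the "monitoring and logging for free" idea lives. The real template is internal; the sketch below only shows the pattern, with illustrative names:

```python
import logging
import sys
import time

def run(main) -> None:
    """Wrap a developer's main() so logging and basic monitoring come for free."""
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    start = time.monotonic()
    try:
        main()
        logging.info("Sync finished in %.1fs", time.monotonic() - start)
    except Exception:
        logging.exception("Sync failed after %.1fs", time.monotonic() - start)
        sys.exit(1)   # a non-zero exit code is what gets reported back to the orchestrator

# In a connector project, the developer only writes main() and ends the script with run(main).
```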

Making the whole thing scalable
Finally, it all has to be scalable. A Web service on a single server that runs container images on API requests has limited scalability. Fortunately, we are not the only party with this problem and there are numerous ways to make such workloads scalable via Container Scheduling engines.

The three major platforms are Kubernetes, Nomad, and Docker Swarm. Kubernetes is enormously popular and, unfortunately, also enormously complex. Docker Swarm is very simple but is no longer being developed very actively. In terms of complexity, Nomad is the golden mean between Docker Swarm and Kubernetes, but it is still a relatively new player in the market.

All have advantages and disadvantages, and hard work continues to improve these platforms.

Which platform will be the winner five years from now, I dare not predict. In our use case, we only need a small subset of the functionality of these platforms. That makes it possible to build a modular system that can run on all three (currently, only Docker Swarm is implemented). This allows us to switch relatively easily when a new player emerges, Kubernetes becomes easier to manage, or Docker Swarm development stops for good, without having to change the implementations at all clients for every sync script we've ever built.
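The modularity boils down to a small interface between the Datahub and the container platform. A sketch of that idea (the local-Docker variant below is a stand-in for illustration; our current implementation targets Docker Swarm):

```python
from abc import ABC, abstractmethod

import docker   # the docker-py SDK

class ContainerScheduler(ABC):
    """The only thing the Datahub needs from a container platform."""

    @abstractmethod
    def run(self, image: str, env: dict[str, str]) -> str:
        """Start the container image and return a platform-specific run ID."""

class LocalDockerScheduler(ContainerScheduler):
    def __init__(self) -> None:
        self.client = docker.from_env()

    def run(self, image: str, env: dict[str, str]) -> str:
        container = self.client.containers.run(image, environment=env, detach=True)
        return container.id

class NomadScheduler(ContainerScheduler):
    def run(self, image: str, env: dict[str, str]) -> str:
        raise NotImplementedError("Would submit a job via Nomad's HTTP API")
```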

The beginning of All Your BI Labs
From the Datahub project, All Your BI Labs was born. Not only do we think it's cool to develop our own tools, we also like to make our work as easy as possible. It also fits our vision to develop more products alongside all the project work as we grow.

Within our Lab, anyone in the organization is allowed to pitch projects. If the team sees potential in it and the business case is positive, we'll start working on it.

That began with some small projects around the Datahub. The Datahub offers us tremendous flexibility, which is both a blessing and a curse. So we started with some new utilities, such as:

  • A Python package with a pipeline builder/executor for easy access to HTTP APIs;
  • A standard Python project template with all the necessities to use the Datahub Engine;
  • A standard Azure Data Factory template for starting a Job in the Datahub Engine;
  • A “Way of Working”: delta first, raw loads to the data lake, processing in the DWH.

Meanwhile, we are developing several larger projects that also have a greater impact on our work. Two examples:

  1. We are developing a data warehouse automation framework that takes away 80% of the standard work (the plumbing) in setting up a Data Vault, so that you only need to concern yourself with translating the business logic into SQL statements.
  2. We are building a platform where everyone in the organization has secure access to all data, so they don't have to wait for the developers.

I plan to write a new article next summer about the latter project; we are currently in the middle of selecting and testing tools for it. Above all, let me know in the comments what you think of articles like this, or if you have any questions!