Join Emmanuel Akpan and Antonio Martinez Garin as they shed light on how leveling up the pipelines helped them build effective and engaging data products that are easy to utilize across teams at Vista.
At Vista, we evolve fast, and that means keeping pace with the latest trends and continuously adopting better solutions. In a constantly changing landscape, it is important for us to build products that meet industry benchmarks, and reliable, high-quality data products sustain that vision. Addressing data reliability and security head-on makes our data more trustworthy and easier to use.
We recognize that scalable and maintainable pipelines are crucial in realizing this goal and support creating effective and engaging data products. Hence, we invest in running clean, component-based pipelines at a large scale. This requires engineers across multiple teams to build towards a shared goal and deliver interoperable code and process components to help us achieve widespread adoption.
As Data Engineers (DEs), we always design and build systems for our data products, focusing on scalability. We do this in collaboration with expert data architects, who skillfully facilitate communication between stakeholders and the data product teams to identify use cases and everyday needs. This natural partnership of DEs and architects encourages the reuse of design components based on patterns of work and a shared desire to scale responsibly.
As explained in our previous post, we embrace the Data Mesh approach for both its technical and organizational benefits. Teams are enabled and encouraged to work independently within their domain, which speeds up development and removes cross-team dependencies.
Although working independently led to some great innovation, over time we also noticed that different teams were tackling similar problems with slightly different solutions. Reflecting on these problem statements led us to investigate a more balanced approach to pipeline execution: we needed to level up our pipelines and develop a solution that was truly production-ready.
Working this way also created other challenges:
- Solutions could not be easily leveraged across teams, so we were exposed to the risk of reinventing the wheel every time a standard request or issue popped up.
- Because work was tough to scale, engineers and analysts had to learn entirely new ways of working if they supported another team or wanted to rotate.
- Tools and pipeline approaches were so different that they became difficult to manage and had issues referencing one another (sequencing cross-team jobs).
Adopting orchestration and standardization (gather, abstract, and re-use)
To reach the economies of scale we were looking for, we first needed consensus on a tool flexible enough to cover our different use cases while aligning our core teams on a common standard. While auditing pipelines and analyzing different approaches, we also identified common mechanisms and scoped reusable components. Armed with a clear understanding of the highest-priority use cases and the features and components they required, the team settled on Airflow.
This pairing of Airflow and in-house standard code components meant our engineers spent less time building one-off components for shared use cases. It was easier to onboard new engineers and data analysts/scientists, and pipeline health and stats transparency also increased. We had to run workshops to encourage our teams to adopt this new pipeline approach and help them migrate seamlessly.
With more teams adopting this new way of working, the community was able to support each other and considerably improve their end products (cascade appropriate jobs, raise data quality alarms, increase uptime, etc.). Our solution leverages both vendor tooling and internal standard code components to accelerate teams further and meet organizational needs.
Example components and their impact
Vista-DnA-airflow Databricks operator
One of the components we built orchestrates Databricks jobs by job name rather than the conventional job ID, a long numeric string that is impractical for engineers to remember after deployment. The new Databricks component lets engineers pass the same job name across environments, and it runs the job strictly in the specified environment. Job names stay human-readable, ownership is more transparent, and engineers spend less time troubleshooting jobs.
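To illustrate the idea, here is a minimal sketch of the name-to-ID lookup such an operator can perform before triggering a run. The function name, the environment-prefixed naming convention, and the payload shape are assumptions for illustration; the payload merely mimics the structure of a Databricks Jobs API list response, and our internal operator is not shown here.

```python
# Hypothetical sketch: resolve a human-readable job name (plus a target
# environment) to the numeric Databricks job ID the Jobs API requires.
# The naming convention "<environment>-<job_name>" is an assumption.

def resolve_job_id(jobs, job_name, environment):
    """Return the job ID whose name matches '<environment>-<job_name>'.

    `jobs` mimics a Databricks Jobs API list payload: a list of dicts
    with `job_id` and `settings.name` keys.
    """
    target = f"{environment}-{job_name}"
    matches = [j["job_id"] for j in jobs if j["settings"]["name"] == target]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one job named {target!r}, found {len(matches)}"
        )
    return matches[0]


# Example payload shaped like a Jobs API list response.
jobs = [
    {"job_id": 112233445566, "settings": {"name": "dev-customer-orders"}},
    {"job_id": 998877665544, "settings": {"name": "prod-customer-orders"}},
]

print(resolve_job_id(jobs, "customer-orders", "prod"))  # -> 998877665544
```

Because the lookup is scoped to a single environment, the same DAG code can be promoted from dev to prod without engineers ever handling raw job IDs.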
Vista-DnA-airflow DBT operator
Another component was built to orchestrate dbt transformations using the commands embedded in dbt, such as dbt run, dbt test, and dbt compile. We also built additional components that let our dbt tests and models run independently as well as together. In the image below, all our models run independently of their respective tests and vice versa.
Equally, we have all our models run together alongside their tests as defined in their dbt dependency. Implementing this operator gives visibility into dependencies and the level of connectivity between different tasks. This transparency enhances the team’s ability to target (re)processing needs, offers the flexibility to run test vs. model components (they don’t need to be batched together), and captures more detailed internal logs.
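As a rough sketch of this run/test flexibility, the snippet below generates the dbt CLI invocations an operator like ours could schedule, either one run and one test task per model (so a failing test never blocks an unrelated model) or a single batched pair. The function name, the mode flag, and the model names are illustrative assumptions, not our internal API.

```python
# Hypothetical sketch: expand a list of dbt models into the CLI commands
# an orchestrator could schedule as separate tasks.

def build_dbt_commands(models, mode="independent"):
    """Build dbt CLI invocations for the given models.

    mode="independent": one `dbt run` and one `dbt test` per model, so
    runs and tests can be (re)triggered and retried separately.
    mode="batched": a single run and test over the whole selection.
    """
    if mode == "batched":
        return ["dbt run", "dbt test"]
    commands = []
    for model in models:
        commands.append(f"dbt run --select {model}")
        commands.append(f"dbt test --select {model}")
    return commands


print(build_dbt_commands(["stg_orders", "fct_revenue"]))
```

Splitting the selection this way is what makes targeted reprocessing cheap: a single failed model can be rerun and retested without touching the rest of the project.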
Our teams have adopted some or all of these components to simplify their workflows while keeping their own solutions for their domain's problems. With this adoption, we are getting immediate value in detecting and reducing the impact of orchestration issues within and across domains. Based on feedback from these teams, we also know that our current approach will keep benefiting from new components.
One of the most tangible outcomes of implementing this solution is that about 80% of our pipelines are now scalable, up from 20%. It has also reduced the number of pipeline issues: where we previously had zero data-quality checks, about 70% of our products now have efficient data-quality checks.
We are also more proactive, having significantly shortened the development cycle. Verifying code against our libraries used to take around 72 hours; running the same code block now takes less than 10 minutes. Our teams migrated easily because the components are highly modular: they can simply be combined, and they work together seamlessly.
We will likely add more specific mechanisms to benefit further pipeline use cases, enable our engineers to build reusable components, and explore more knowledge areas. We anticipate that this will come with increasing usage of the toolset and feedback shared by its users.
This project has proven the feasibility and usefulness of building a standard data toolset. It has also enabled our engineers to easily jump-start their pipelines and processes with Airflow, and it fosters an environment of creativity and knowledge-sharing across domains while preserving quality and resilience.
P.S. We would like to thank our data engineers within the Vista team for their support in implementing the features outlined in this post. We also want to thank the wider Airflow user community for helping us develop some of the concepts that served as examples.
Want to help us apply data and analytics to solve more data engineering challenges? Explore our career opportunities in Vista Data & Analytics.
Interested in data engineering? Learn more in previous installments of this series!