Want to build trustworthy data products every time? Joshua Sucsy, Principal Data Engineer at Vista, unwraps the journey from bare-bones SQL scripts to a mature engineering process managed as code.
A jumble of code
Meet the customer order table in 2020. That September, I was the new lead data engineer in the Customer and Business Performance (CBP) domain, and it could take 45 minutes to surface an error-prone SQL view of 17 component tables. Without speed and reliability, answering business questions felt impossible. Plus, my time was spent figuring out what went wrong yesterday when our mission was about the future.
Vista’s recently formed data-and-analytics team, DnA, had embraced data mesh architecture, and we started work on a data product that was as correct as your bank account – something product owners would have unshakable faith in. En route, we established a best-practice open-source framework to move Vista forward confidently.
Smashing the black box
Week one: I saw how frustrating it was to produce such critical output using a manual legacy approach – the view gave black box results with no ability to examine data. Errors couldn’t be replicated or investigated. Algorithms delivered chunks of data people no longer trusted, so we broke the model down.
By week two, we’d found the tool I’d been searching for, thanks to the customer order table’s product owner. He floated a proof of concept that transformed our data process into a software engineering track. Enter dbt – an open-source tool allowing automated validation and test-driven development at speed. In short, we now had the opportunity to treat SQL queries like managed Python code.
The potential was obvious: with two other data engineers and three data scientists, we could manage about three data products in the domain through manual slog. We’d never get ahead that way. Next, we needed buy-in for this major pivot so we could make progress fast. Data scientist Mark Andersen became one of the fiercest internal proponents of the new framework. That is unsurprising given the genesis of the data engineering role: ensuring clean, consistent, repeatable data is available for data science.
If we fix it, they will come
We’ve been busy: not just using the dbt tool, but structuring the 60 or so data pipelines to run on the framework so that this data engineering architecture reaches across domains. That’s because DnA exists to produce data products that people can trust – honed to the point you can build a business on them.
So you could say we specialize in leveraging good ideas. Today, not only is each component data set in the customer order table calculated and output individually, but data engineers across Vista build data sets using the new framework. It’s how one fix became many solutions and a transformative gold standard. On the ground, Vista benefits from automated testing that validates source data and calculations along the way. Issues can be found and solved in minutes without corrupt data ever reaching an end user, and underlying data errors or anomalies are easily rooted out.
How it works
- Version control
- Test and document
- Develop
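To make the "test" step concrete, here is a minimal sketch in plain Python of the kind of automated check that stops bad rows before they reach an end user. It is an illustration only, not Vista's framework and not dbt's API: dbt does ship generic tests named `unique` and `not_null`, but the functions, the `CHECKS` list, and the `order_id` and `customer_id` columns below are hypothetical names chosen for the example.

```python
from typing import Callable

Row = dict[str, object]
Check = tuple[str, Callable[[list[Row], str], list[Row]], str]


def not_null(rows: list[Row], column: str) -> list[Row]:
    """Return the rows whose value in `column` is missing."""
    return [row for row in rows if row.get(column) is None]


def unique(rows: list[Row], column: str) -> list[Row]:
    """Return the rows whose value in `column` has already appeared."""
    seen: set[object] = set()
    duplicates: list[Row] = []
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.append(row)
        seen.add(value)
    return duplicates


def validate(rows: list[Row], checks: list[Check]) -> dict[str, list[Row]]:
    """Run every check and return only the ones that found bad rows."""
    failures: dict[str, list[Row]] = {}
    for name, check, column in checks:
        bad_rows = check(rows, column)
        if bad_rows:
            failures[name] = bad_rows
    return failures


def publish_if_clean(rows: list[Row], checks: list[Check]) -> list[Row]:
    """Hand the rows downstream only when every check passes."""
    failures = validate(rows, checks)
    if failures:
        raise ValueError(f"publish blocked by failing checks: {sorted(failures)}")
    return rows


# Hypothetical checks for an order data set (column names are assumptions).
CHECKS: list[Check] = [
    ("order_id_unique", unique, "order_id"),
    ("order_id_not_null", not_null, "order_id"),
    ("customer_id_not_null", not_null, "customer_id"),
]
```

Calling `publish_if_clean(rows, CHECKS)` returns the rows unchanged when they are clean and raises `ValueError` naming the failing checks otherwise. Returning the offending rows, rather than just a pass or fail flag, is what makes a failure quick to investigate.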
Benefits so far
Eighteen months in, the velocity of improvement is remarkable, and we’re only getting faster.
- Customer order table: zero quality defects since 09/2021
- Troubleshooting in minutes not days
- In view: 100+ mobilized data products
- We’ve rebuilt the data universe of a 27-year-old company
What now?
The framework has continued to grow and expand. There are some automated pieces, and I’m writing underlying software so both engineers and analysts can use it. It’s secure, and all the data pipelines are run in a standardized way – I’m building out a logger to support systematic tracking. We don’t just have one or two believable data products; there’s a whole suite that can talk to each other across domains. Not perfectly, but we can check results, provide entry points with rich context, and debug upstream problems.
And as the team has grown from 10 to 45 engineers, we’ve rebuilt the data universe of a company with 5,000 employees to be 10-20 times larger than it was a year ago. Now we’re in position to unlock many new data sources, so in context, this transformation is just starting. I’m talking about dbt with colleagues throughout our parent company, Cimpress – with 99designs by Vista, for example, acquired in 2020. As DnA begins to deliver tools that mean we can truly understand the customer – with an Uber-style approach to grooming data sets and data science – the real end result of this little table fix is still to come.
Stay tuned to our new dbt blog series to hear about features in development. Have you been engineering change? Connect with me on LinkedIn – I’m keen to hear from experts making data shine within their organizations. And if DnA sounds like a good fit, check out Vista careers.