Explore how theWho team at Vista DnA solved key ML challenges and built data products that deliver jaw-dropping value to our internal and external customers.
At Vista, our customer industry segmentation data product drives customer value by providing hyper-personalization capabilities.
Customer industry segmentation serves as a lens to understand our customers. For example, comparing key report metrics across customer industry segments yields insights that help us understand customer needs better.
Stakeholders across Vista also use industry segmentation to differentiate marketing strategies (offers, content, messaging, products) and provide more personalized experiences to our customers dynamically and in a scalable manner.
For instance, if the customer's industry is agriculture and farming, they see more products and designs related to their industry. This personalized recommendation is not limited to landing pages and website experiences; it extends to every customer touchpoint, whether marketing communications or design processes within the studio environment.
The challenges: overview
Since its inception, industry segmentation has iterated quickly through the data product life cycle – from proof of concept to minimum viable product (MVP) and on to a stable, scalable data product. Along the way, the architecture has evolved to solve challenges related to development workflow, system performance and robustness.
Here are three challenges typical of most ML systems and our approach to solving them:
- Lack of training data
- Long model training time
- MLOps challenges
  - Model metrics and artefacts storage
  - Reproducing training/inference runs
Addressing the challenges
How does industry segmentation work?
The industry segmentation model is a multi-class classification model that takes input features flowing from multiple data domains, such as transactions, product categories and designs, to predict the industry accurately. Segmenting customers based on their similarities helps create more value through better customization.
Looking a bit further into the data sources, transactions and product categories correspond to frequency, order value, type of products and other transactional features. The most valuable feature for predicting the industry, however, is the customer's order designs.
Customers can either modify existing design templates when ordering or upload their own designs. Both of these choices lead to ingenious personalized designs.
However, the designs require some processing before the models can understand the information they provide. The pre-processing stage starts with an Optical Character Recognition (OCR) step to identify textual information from the design.
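The post doesn't name the OCR engine we use, but as a minimal illustration of this step, here is a sketch using the open-source Tesseract engine via pytesseract (the image filename is hypothetical):

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed

# Extract the raw textual information from a customer's design image.
design_text = pytesseract.image_to_string(Image.open("customer_design.png"))
print(design_text)
```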
The textual information is then passed through a text classification algorithm that helps predict the industry. The text classification model we use here is a custom-trained version of a zero-shot learning model (Hugging Face has a zero-shot learning transformer pipeline). This model can classify text against a custom list of categories – in our case, industries refined for our use case.
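As an illustration, here is the off-the-shelf Hugging Face pipeline at work (our production model is a custom-trained version; the sample text and label list below are made up):

```python
from transformers import pipeline

# Zero-shot classification: score a text against an arbitrary,
# user-supplied list of candidate labels with no task-specific training.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

ocr_text = "Green Valley Farms - fresh organic produce and tractor hire"
industries = ["Agriculture and Farming", "Retail", "Construction", "Healthcare"]

result = classifier(ocr_text, candidate_labels=industries)
print(result["labels"][0], round(result["scores"][0], 3))  # top industry + score
```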
Design OCR text is a powerful feature for industry classification. When mixed with transactional and product-level features, we observe a strong signal for predicting industries accurately.
1. Lack of training data
Challenges in gathering training data
The final industry prediction model is a supervised model, which means we need relevant labelled data to train and test it.
The problem, however, is that ML use cases with a pre-curated, stable source of accurate training data are rare. Most use cases deal with missing and incomplete training data. To solve an ML problem efficiently, we must first solve the data (sourcing) problem.
When the industry prediction model was conceived, we had a limited dataset comprising:
- Hand-labelled datasets: A hand-curated and balanced sample that represents the population
- Internal spot checks: Each team member in our data product labelled 100 docs/day
- Limited industry data: From when customers register on the website
- Surveys: Without enough ground truth
While this was enough to get us started, we soon ran into the limitation of not having a stable, continuous stream of accurate data that could be used for training and testing.
So, how did we solve it?
An early investment in analyzing training data availability and quality can help a team decide to first gather training data and only then invest in modelling.
There is power in acting like owners: we influenced teams across the organization to establish an organic data source, triggering conversations about data sourcing at every level.
Initially, though, we had to work with proxy data. After a few brainstorming conversations, we were successful at integrating a steady and accurate data stream along with several other secondary data sources acting as accurate proxies.
One example of such a training data source comes from customer data enrichment efforts using public government records that contain basic company registration information – a dataset that is quite precise and accurate.
After we showcased its value, our stakeholders and partners came to appreciate this data too. We are now starting to use one of the best sources of self-reported industry: the information customers fill out when registering on the website.
But the training data gathered might still be imbalanced. The representation of all industries (the Y labels) needs to be balanced; otherwise the model is biased toward predicting the majority class rather than the most accurate one. Fortunately, there are algorithms that help balance a dataset: SMOTE (Synthetic Minority Over-sampling Technique) artificially generates new data points that look like the rest of our minority-class data points, as shown in the sketch below.
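Here is a minimal sketch of SMOTE using the imbalanced-learn library on a toy dataset (the generated dataset merely stands in for our real features and industry labels):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy 3-class dataset with a heavily skewed label distribution.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=42)
print("before:", Counter(y))  # roughly 800 / 150 / 50

# SMOTE synthesizes new minority-class points by interpolating between
# a minority sample and its nearest minority-class neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_resampled))  # all classes balanced
```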
2. Long model training time
It is hard to iterate fast when a single model training run takes a day. Worse, achieving higher accuracy with hyperparameter optimization requires many training runs. Slow iteration is detrimental to the value delivered and makes for a terrible development experience for data scientists.
The model in question was an XGBoost classification model. We achieved significant speed-ups by applying the following two optimization techniques:
- Migration of sequential training (with pandas) to a natively parallel and distributed training pipeline (with Spark MLlib)
- Parallelization of cross-validation and reducing hyperparameter search space
Hyperopt is helpful for reducing the hyperparameter search space. We used the sparkdl XGBoostClassifier along with the Spark ML CrossValidator. This distributed training workflow is orders of magnitude faster than our initial sklearn + pandas sequential workflow. There is some infrastructure pre-work to provision and maintain clusters, but in terms of cost we break even since training time is reduced so dramatically.
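A hedged sketch of that workflow, assuming the Databricks ML runtime's sparkdl XgboostClassifier (parameter names follow the xgboost Python API; the DataFrame, column names, and grid values are hypothetical):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sparkdl.xgboost import XgboostClassifier  # Databricks ML runtime

# Distributed XGBoost: num_workers spreads training across the cluster.
xgb = XgboostClassifier(featuresCol="features", labelCol="industry",
                        num_workers=8)

# A deliberately small grid -- Hyperopt helped us prune the search space first.
grid = (ParamGridBuilder()
        .addGrid(xgb.max_depth, [6, 10])
        .addGrid(xgb.n_estimators, [200, 400])
        .build())

evaluator = MulticlassClassificationEvaluator(labelCol="industry",
                                              metricName="f1")

# parallelism > 1 evaluates candidate models for the folds concurrently.
cv = CrossValidator(estimator=xgb, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, parallelism=4)

cv_model = cv.fit(train_df)  # train_df: Spark DataFrame with a "features" vector
```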
3. MLOps challenges
a. Model metrics and artefacts storage
With multiple training runs and continuous iterations over time, tracking model metrics and cross-checking training/test runs became a challenge. Moreover, transitioning models from the development to the production stage became more complex.
MLOps experts have discussed these challenges and their solutions in depth. As MLOps gains traction, we can attest that ML lifecycle tools like MLflow are essential in any data scientist's toolkit, offering out-of-the-box automated experiment tracking, a model repository, and a model registry.
These tools offer:
- Tracking: With the experiment tracking feature, each training run automatically logs runtime metrics like start/end time, script links, and artifacts like data files, images, etc. MLflow also offers plenty of flexibility with custom user-defined metrics and artifacts. It becomes much easier to track model metrics when they can be visualized across multiple runs in graphs (see the sketch after this list).
- Models (Format for storing models): Each model has its own signature (model input schema) and examples of how to send data to the model. Models can also be traced back to their source using a job ID.
- Projects: Projects provide a way to package the modeling code. This packaging helps provide reproducibility for runs across multiple platforms. Generic models can also be wrapped with custom classes to provide out-of-the-box pre/post-inference processing.
- Model Registry: Provides dev/stg/prd versions and stages models across environments, enabling continuous experimentation in a decoupled way. Models only transition to production when they are explicitly promoted in the registry.
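To make this concrete, here is a minimal MLflow sketch (the experiment and registered model names are hypothetical, and the toy sklearn model stands in for our real one):

```python
import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=200, random_state=0)

mlflow.set_experiment("industry-segmentation")  # hypothetical name

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

    # Custom params/metrics on top of the automatically captured run metadata.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))

    # Log the model with its input signature and register it; stage
    # transitions (Staging -> Production) then happen explicitly in the registry.
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature,
                             registered_model_name="industry-segmentation")
```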
b. Reproducing training/inference runs
Another challenge that became apparent was debugging and understanding how changing data impacted model training and performance.
A reproducible training/inference system allows data scientists to dive deep into a single model run, which leads to deeper understanding when debugging inconsistent behaviours. Since data is constantly updated and features are added and removed, it is essential to version input data/features to reproduce both training and inference.
Enter the feature store – with a feature store like Feast or SageMaker Feature Store, versioning features is easier than ever, and point-in-time joins offer reproducibility.
The feature store works with both incremental and full-refresh data loading patterns. It also adds meta-information about features. For instance, whenever we push transactional features to the feature store – in addition to the "transaction date" – it logs a "feature created date." The features are stored as files indexed on "feature created date": each day is a new file.
To understand point-in-time joins a bit better, let's imagine that we have a feature store containing feature tables for transactions, products, and designs.
In this simple example, Feast contains features for 3 days: Day 1, Day 2, and Day 3.
Consider two users, Rob and Tori. On Day 1, we see transactions for both. On Day 2, only Rob has transactions, and on Day 3, we see transactions for both Tori and Rob again.
When we request transaction features as of Day 2, the feature store internally joins across all its storage files and returns the latest feature values on or before Day 2 – Day 2 values for Rob and Day 1 values for Tori.
If we request features for Day 3, we naturally get the latest features as of Day 3.
Hence, we get built-in historical reproducibility of feature data for diving into any training or inference run.
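A minimal sketch of such a point-in-time retrieval with Feast (the repo path, feature view, and feature names below are hypothetical):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # hypothetical local feature repository

# Entity dataframe: which customers we want features for, and "as of" when.
entity_df = pd.DataFrame({
    "customer_id": ["rob", "tori"],
    "event_timestamp": [pd.Timestamp("2022-01-02")] * 2,  # "Day 2"
})

# Point-in-time join: for each row, Feast returns the latest feature values
# created on or before event_timestamp -- Day 2 values for Rob, Day 1 for Tori.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["transactions:order_count", "transactions:total_order_value"],
).to_df()
```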
Feature stores are flexible, offering an offline store (like S3) for bulk queries and an online store (like Redis or DynamoDB) for fetching features in low-latency, real-time inference systems. With this, the move from bulk scoring jobs to a real-time scoring API is faster to design and implement.
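For the online path, the same hypothetical features can be fetched with a low-latency lookup, roughly like this:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # same hypothetical repository as above

# Single-entity, low-latency fetch from the online store (e.g. Redis/DynamoDB)
# at real-time inference time.
online_features = store.get_online_features(
    features=["transactions:order_count", "transactions:total_order_value"],
    entity_rows=[{"customer_id": "rob"}],
).to_dict()
```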
Final Workflow Diagrams
Final thoughts
As MLOps evolves, we see its tooling and principles gaining wider adoption. Improved developer productivity translates into faster lead times for new model releases, allowing us to stay nimble and experiment to win big and/or fail fast.
All for one – to offer our customers engaging and relevant personalized experiences across all touchpoints in their journey with Vista.
theWho band squad – Miguel Dumont, Vibhusheet Tripathi, Zach Thomas, Mouloud Lounaci, Ehud Hermony
CBPM Customer Predictions team – Zach, Gregg, Stan, Laurent
#shoutouts
Partners – The Cimpress Artwork Intelligence team, with their state-of-the-art AI lab, and the Cimpress Customer Enrichment team, for their amazing support in enhancing our training data.
Data Platform and Data Science Platform teams – Data products are just one part of the data mesh and are scalable only with a supercharged platform team.
Thinking of making a career leap into a data-driven organization? Explore possible career opportunities with Vista DnA.
References:
- https://en.wikipedia.org/wiki/Industrial_market_segmentation
- https://arxiv.org/pdf/1712.05972.pdf
- https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.ZeroShotClassificationPipeline
- https://arxiv.org/abs/1106.1813
- https://hyperopt.github.io/hyperopt/
- https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html
- https://docs.databricks.com/applications/machine-learning/train-model/xgboost.html#distributed-training
- https://ml-ops.org/
- https://mlflow.org/
- https://mlflow.org/docs/latest/tracking.html
- https://mlflow.org/docs/latest/models.html
- https://mlflow.org/docs/latest/projects.html
- https://mlflow.org/docs/latest/model-registry.html
- https://aws.amazon.com/sagemaker/feature-store
- https://ml-ops.org/content/motivation
- https://ml-ops.org/content/mlops-principles