Hear from data scientists Jennifer Gaskins, Damien Juery, and Alvaro Martinez about how they provided fast, insightful support for a new product introduction using natural language processing (NLP) to identify what matters the most to our customers.
March 2020
The world was grappling with a fast-spreading pandemic. Small business and consumer needs had shifted drastically, and globally almost every person urgently needed something almost no one needed before: a face mask. Recognizing this emerging demand, Vista saw the opportunity to repurpose some of its manufacturing capabilities, and in a matter of weeks was ready to produce and launch face masks.
The sales of face masks took off, and just as quickly, our customers began contacting our CARE team about them. Our Voice of the Customer (VoC) team wanted to understand our customers’ reasons for contacting us, in order to improve their experience. Naturally, many of the customer contacts fell into the usual buckets: questions about shipping, order status, and promotions. We also knew that there would be mask-specific issues, including ones that our internal teams never imagined.
As Data Scientists, we jumped in to support the VoC team with this challenge. Since the product had just launched, we had limited historical data and none of it was labeled, which led us to unsupervised topic models.
A multifunctional team was assembled, including:
- A Product Owner, bringing our stakeholders’ needs to our team;
- A Data Engineer, handling the data ingestion pipeline;
- An Analyst, supporting our reporting and monitoring needs; and
- Us (Data Scientists), formulating the problem and building a Machine Learning solution.
Our Data Engineer set up a pipeline to regularly pull in new transcripts of customer contacts by phone, email, and chat, formatting them and protecting our customers’ privacy by removing sensitive and personally identifiable information. While we iterated on the model (described more below), our Analyst and PO worked with stakeholders to understand how they would use our output and created a dashboard to expose the results of our topic model in a useful way. Once the model was ready to go, the Data Engineer put it into production, scheduling the job to run daily and update the data in the dashboard.
Data Set
We used data from three different CARE contact channels (email, chat, and phone), and decided to work only with English-language contacts about the new face masks, adding up to 10k transcripts in total. English covered some of our largest markets, and we presumed that the issues we uncovered in English contacts would also be present in contacts in other languages.
Text Cleaning
Building our topic model required each transcript to be cleaned while retaining enough context to distill its topic. As described in the modeling section, the choices made in text cleaning had a large impact on the model output, and we found it necessary to iterate many times on the cleaning process to get to a high-quality result. There were two important parts of our text cleaning: a channel-specific cleaning routine, and a channel-agnostic one.
Channel Specific: Because text formatting and conversation styles in each of our CARE contact channels are different, we had to create tailored approaches for each of them (e.g., emails contain html headers, whereas chat and phone transcripts do not). For chat transcripts, we kept only the first three messages of the CARE agent and all the messages of the customer, aiming to keep the most relevant text about why the customer initially contacted us while removing information about the proposed resolution from the agent. Similarly, for email transcripts we kept only the original inbound email, and for phone transcripts we only kept the messages of the customer.
Channel Agnostic: This routine first removed non-alphanumeric characters and stop words (language-dependent words that carry little information about the topic of a text, e.g., “the”, “a”, “an”). It then converted each word into its lemma form based on its part of speech, e.g., “help”, “helped”, and “helping” all map to “help”. An example of an email transcript before and after being processed by the text-cleaning pipeline can be seen in the following figure:
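To make the channel-agnostic step concrete, here is a minimal sketch using spaCy. It is an illustration under our assumptions, not the production pipeline; the library, model, and exact filtering rules we used may have differed.

```python
# Minimal sketch of a channel-agnostic cleaning step using spaCy.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(raw_text: str) -> str:
    """Lowercase, drop non-alphabetic tokens and stop words, and lemmatize."""
    doc = nlp(raw_text.lower())
    lemmas = [
        token.lemma_                       # lemma chosen based on the part-of-speech tag
        for token in doc
        if token.is_alpha and not token.is_stop
    ]
    return " ".join(lemmas)

print(clean_text("I have a question about mask sizes"))
# -> "question mask size" (roughly; the exact output depends on the spaCy model)
```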
Feature creation
Once the text-cleaning pipeline finished, we created features from the transcripts (which we will refer to individually as “documents” and collectively as a “corpus”) to feed into the model. We adopted a Bag-of-Words (BOW) approach (also called count vectorization). Although it is the simplest representation of text, we chose this strategy for two main reasons. First, BOW is straightforward to implement and understand. Second, BOW results are easily interpretable. BOW is commonly used as the zero-order strategy for handling text before proceeding to more sophisticated representations.
Beyond individual words, we constructed phrases as n-grams. An n-gram is a sequence of n consecutive words in a text: 1-grams (also called unigrams) are single words, 2-grams (also called bigrams) are two consecutive words, and so on. As an example, let us look at the n-gram representation (after text cleaning) of the following text: “I have a question about mask sizes”.
- Resulting words after text cleaning: question, mask, size (also called unigram representation)
- Bigram representation: question mask, mask size
- Trigram representation: question mask size
From the previous example we can see that, in general, the maximum n in an n-gram representation of a text sequence equals the number of words in the (cleaned) sequence.
In BOW, each document is represented by a vector of the count of each of its tokens (i.e., words and, if applicable, phrases). The feature matrix of a collection of documents in the BOW context is illustrated in the following figure:
Each row of the matrix is a vector representation of a given document and each column gives the number of times (counts) a token appears in that document, e.g., in the figure, token 1 appears once in document 1 and document 2 and not at all in document K.
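As a sketch of how such a feature matrix can be built, here is an example using scikit-learn’s CountVectorizer on a handful of toy, already-cleaned documents; the documents and the n-gram range are illustrative, not our actual configuration.

```python
# Sketch of building a BOW / n-gram feature matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

cleaned_docs = [
    "question mask size",        # toy documents, assumed already text-cleaned
    "cancel order mask",
    "order status ship date",
]

# ngram_range=(1, 3) produces unigrams, bigrams, and trigrams as columns (tokens).
vectorizer = CountVectorizer(ngram_range=(1, 3))
bow_matrix = vectorizer.fit_transform(cleaned_docs)  # sparse (documents x tokens) count matrix

print(vectorizer.get_feature_names_out())  # the token (column) vocabulary
print(bow_matrix.toarray())                # one row per document, one column per token
```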
Unsupervised topic modeling
For our unsupervised topic model, we used Latent Dirichlet Allocation (LDA). When applied to topic modeling, LDA assumes that each document contains terms (here, tokens) belonging to one or more topics, and each topic is defined by a set of terms that appear with varying frequencies. The LDA model takes our vectorized text data as input and “discovers” the topics contained in the corpus, returning the term weights within each topic. Applying the model to our data set resulted in the topic distribution per document.
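Continuing from the vectorization sketch above (reusing `bow_matrix`), here is a minimal sketch of fitting LDA with scikit-learn. The number of topics is a hypothetical tuning choice, and our actual implementation may have used a different library or settings.

```python
# Minimal LDA fit on the BOW matrix from the previous snippet.
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 5  # hypothetical value; in practice chosen by inspecting topic quality
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# doc_topic_dist[i, k] is the weight of topic k in document i.
doc_topic_dist = lda.fit_transform(bow_matrix)

# lda.components_[k, j] is the (unnormalized) weight of token j in topic k.
topic_term_weights = lda.components_
```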
When we inspected the discovered topics, we quickly realized that we could substantially improve the usefulness of the model output by more thoroughly cleaning and adjusting the input. Although we had already excluded standard English stop words, our data set included other frequently appearing but useless words and phrases that were specific to our corpus, such as set phrases CARE agents would commonly say and greetings that were unrelated to the topic of the contact. We also found that the length of n-grams we considered strongly influenced the usefulness of the discovered topics.
Interpreting and using the discovered topics
The topics surfaced by LDA can be described by a list of tokens. For instance, in the following figure, each highlighted word corresponds to a token frequently appearing in a given topic, whereas different color highlights represent different topics.
We can then identify the following topics:
Topic #1 – age, child, kid, suitable…
Topic #2 – status, order number, order, ship, date…
Topic #3 – wear, time, hours, filter, particles…
Topic #4 – difficult, donating, organization…
Topic #5 – cancel, order…
We can immediately relate these sets of keywords to meaningful topics. For instance, topic #1 refers to questions about the size of the kids’ masks.
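Keyword lists like the ones above can be read directly off the fitted model. Continuing from the LDA sketch above (reusing `lda` and `vectorizer`), here is one way to list the most heavily weighted tokens per topic; the number of tokens shown is arbitrary.

```python
# List the top tokens per topic from the fitted LDA model.
import numpy as np

feature_names = vectorizer.get_feature_names_out()
top_n = 5

for topic_idx, term_weights in enumerate(lda.components_):
    top_token_ids = np.argsort(term_weights)[::-1][:top_n]  # highest-weight tokens first
    top_tokens = [feature_names[i] for i in top_token_ids]
    print(f"Topic #{topic_idx + 1}: {', '.join(top_tokens)}")
```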
One challenge for our stakeholders in using the results was that every time the LDA model was retrained on a fresh batch of contacts, the discovered topics would change. We wanted to keep the ability to discover newly appearing topics, while at the same time keeping track of the evolution of contact volumes for already-discovered topics. To accomplish this, we decided to keep updating the LDA model daily with new contacts, and implemented an approach to connect the topics over time.
Given that our corpus vocabulary changed daily, we couldn’t directly compare the topic vectors that were discovered day-to-day. So, we used a pre-trained word embedding model to transform each topic vector into a single embedding vector by averaging the embeddings for each term in the topic weighted by the term frequencies. The cosine similarity of the topic embedding vectors gave us a measure of how close the topics were to each other. We determined a similarity threshold to decide when two topics discovered on different days were actually the same or represented different topics. In this way, we were able to have continuity in our reporting of topic volumes over time, while allowing new topics to be discovered. This approach is illustrated in the figure below, where discovered topics are labeled each day with a letter.
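The sketch below illustrates this linking step under stated assumptions: `embed(term)` stands in for a lookup into some pre-trained word-embedding model (the post does not name the one we used), and the 0.9 threshold is an illustrative value rather than the one we actually tuned.

```python
# Sketch of linking topics across days: embed each topic as the weighted average of its
# term embeddings, then treat topics as the same when cosine similarity exceeds a threshold.
import numpy as np

def topic_embedding(terms, weights, embed):
    """Weighted average of term embeddings; `embed` maps a term to a vector."""
    vectors = np.array([embed(t) for t in terms])
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * vectors).sum(axis=0) / w.sum()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

SIMILARITY_THRESHOLD = 0.9  # hypothetical value; tuned by inspecting matched topics

def same_topic(topic_a, topic_b, embed):
    """topic_a / topic_b are (terms, weights) pairs discovered on different days."""
    emb_a = topic_embedding(*topic_a, embed)
    emb_b = topic_embedding(*topic_b, embed)
    return cosine_similarity(emb_a, emb_b) >= SIMILARITY_THRESHOLD
```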
Illustration of the evolution of discovered topics over time. Note that the topic labels (letters) are not meaningful, and in fact sometimes connect to a different label from day-to-day (topic labels B and C switch from days 1 to 2, topic labels A and B switch from days 4 to 5, and topic label D on day 2 becomes E on day 3 and D again on day 4). The figure also illustrates that (1) a discovered topic may disappear (topic label E from day 1 to 2), (2) a new topic may appear (topic label E on day 4), (3) a topic may split into 2 topics (topic label C splits into topic labels C and D from day 2 to 3), and (4) two topics may merge into a single topic (topic labels C and D merge into topic label C from days 3 to 4).
Business Impact
In just a few weeks, we had a working product. Using our insights, the VoC team was able to proactively address customer questions on sizes, designs, and technical specifications of mask filters. We saw an immediate decline in our contact rate related to these issues.
This product improved our customers’ mask experience, but it had even broader impacts on how we use transcripts at Vista. Fast forward two years, we now have multiple product teams whose primary purpose is to automate and better leverage what our customers are telling us. For example, every contact is now categorized by the products being discussed and the reason we’re being contacted so that we can understand customer trends and create (often via automation) better customer experiences.
Interested in Data Science topics? Make sure you follow us on LinkedIn! We have several other blogs on Data Science and other Data & Analytics subjects! Check them out.
We are also hiring! Have a look at our open positions.