Training Data Management
Improving the ground truth training data management experience by increasing transparency, adding functionality for troubleshooting and self-service, and reducing the time it takes customers to get to a working data extraction model in the platform.
Improved guidance for getting to a better model
When the ground truth manager was first developed it was just that: a page for seeing what documents the model was being trained on. However, there was little help for improving or troubleshooting a model. Training data analysis lets customers quickly see the diversity and quality of their data, along with recommendations for getting to a better model on their own, rather than searching for a needle in a haystack.
Smarter document subsampling
Allowing a good model to be trained on fewer documents by surfacing which documents are the most important to annotate in order to build a solid golden training set. Prior to this, customers were always told more documents were better, but that wasn't necessarily true if the documents were all alike and not representative of what the model would encounter. There is also a point of diminishing returns where annotating more documents only improves the model slightly and may use more of the customer's time than necessary.
Model management consistency and simplicity
Each type of model in Hyperscience was managed in a different location with a very different UX. This was understandably hard for our implementation team to teach new customers, since it was unintuitive. The work adds model overviews with better primary actions for the user to take, clearer information on model state and performance, and consistency across all models in the platform.
Increased transparency for document filtering
One of the main efforts of the model management improvements has been better transparency. One of the biggest black boxes in Hyperscience was showing the number of documents available but not how many would be eligible for training the model. Customers would train a model thinking they had the required number of documents when in reality it was far fewer. During the data analysis step we were able to give a more accurate representation of the documents that would be eligible for training, as well as the reasons documents were not eligible (some of which the user could fix and others they could not).
Adding model versioning capabilities
Allowing customers to see old versions of a model in the platform. This enables more testing of models after upgrades before pushing to a production environment. It also helps reduce customers' fear of losing model performance after upgrading their Hyperscience environment.
Roles
Julie Byers | Main designer
Meg Pirrung | Designer on classification models. Currently pairing on the model management consistency project.
David Liang and Kristina Liapchin | Product managers
Tomo Berry | Engineering manager
Jocelyn Beauchesne | Machine Learning manager
Requirements
Users can get to a good working model faster. Users can troubleshoot models that are having problems. Users can easily upgrade models without fear of not being able to return to a previously working model.
Historical context & Problems to solve
Once a user upgraded a model there was no going back.
Troubleshooting a model was like finding a needle in a haystack. There was no way to know where annotations were inconsistent without going through each annotation and carefully comparing it to the others.
Submission data and training data were connected, so if a keyer performed poor-quality QA it could affect the model's performance.
Classification, identification, and transcription models were all managed differently with vastly different UX, which was unintuitive.
Cold start model cases were very intimidating, with little guidance and transparency.
Model management principles
After interviewing customers, partners, CX, and sales, we worked on defining a set of guiding principles for the new model management experience. We ran them by ML, engineering, cross-functional PMs, and others to get feedback, prioritize, and finalize the guiding principles for the next few releases.
Step 1: Document Clustering
The engineering foundation for all the future functionality we wanted to explore relied on two things: being able to separate training data from submission data, and being able to create clusters of like documents. This also helped address the customer problem of context switching while annotating. Say a customer had 5 groups of documents: in the old system, documents appeared in the order they were submitted to the platform, so an annotator could be working through several documents from group 1, have other groups pop up in between, and forget where their annotations were, leading to inconsistent annotations. Clustering also helped when multiple people pitched in on initial annotations: customers could give each person one group or a few groups, instead of having everyone work across many groups. This again allowed for better consistency, since each person knew where they had annotated certain fields, rather than guessing from someone else's annotations and creating two conflicting sets, or having to maintain an annotation guide to constantly refer to.
This was also the first attempt at creating an experience that could convey data diversity (more groups equals more document representation).
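As a rough sketch of the underlying idea (the source doesn't describe Hyperscience's actual clustering approach, so the TF-IDF features and k-means below are stand-in assumptions), grouping like documents can look something like this:

```python
# Illustrative sketch only: TF-IDF + k-means as a stand-in for Hyperscience's
# real document clustering, which is not described here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(texts, n_groups=5):
    """Assign each document to a group of textually similar documents."""
    features = TfidfVectorizer(max_features=5000).fit_transform(texts)
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(features)
    return labels  # labels[i] is the group id for texts[i]

# Annotators can then be assigned whole groups, so each person sees similar
# documents back to back instead of context switching between document types.
docs = ["invoice number total amount due",
        "wages tips and other compensation",
        "invoice number balance due"]
print(cluster_documents(docs, n_groups=2))
```

The group labels are also what lets the experience talk about representation: more distinct groups means more document variety is covered.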
Iterations started with breaking all documents into their clusters and showing the status of each cluster. The ML team did not like this because they felt it put the emphasis on groups rather than the document set as a whole. Groups were also ever-evolving: if the customer uploaded a document intended for group 1, rerunning data analysis might instead produce an entirely new set of groups.
The MVP ended up showing groups as a piece of metadata in the table, along with a summary of over- and under-represented groups.
Pros:
Customers liked that the over- and under-represented groups gave them a better idea of what actions to take to get to a better model.
Customers and our implementation team liked being able to split up annotations by group.
Cons:
The list of groups got quite long when there were many groups. In this initial proof of concept we did not know how many groups were realistic for our customers' data sets; it turned out many had 30+ groups, so the list became long and unruly.
Once customers were able to cluster their documents themselves, they wanted more control over the groups, including which documents belonged to each group and what the groups were named.
In the MVP interface, it was not apparent to customers that they had to rerun data analysis to recluster after uploading more documents; they assumed new documents automatically went into the group they had filtered down to.
Step 2: Guided labeling
The next step was helping customers build a better training data set by helping them annotate their data. The first approach leveraged the groups by adding suggestions for where fields were: after one or two documents in a cluster were annotated, Hyperscience could start suggesting where the fields were on the rest. Because the ML team was not sure how accurate the suggestions would be until we could test with real customer documents, we decided to make the suggestions toggleable on/off and not to automatically select the fields for the customer.
We explored several different visual treatments for this guided labeling experience. We landed on a heat-map-like visual in the same color as the field label, showing the suggestion only when that field was selected.
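As a minimal sketch of how cluster-based suggestions could work (an assumption on my part, since the actual suggestion model isn't described here): reuse the field positions from the one or two annotated documents in the same cluster.

```python
# Illustrative sketch only: suggests a field's location on an unannotated
# document by averaging its normalized bounding boxes from the one or two
# already-annotated documents in the same cluster. The real suggestion model
# is not described in the source.
from statistics import mean

def suggest_field_box(annotated_boxes):
    """annotated_boxes: (x0, y0, x1, y1) tuples in page-relative [0, 1] coords,
    or None where the field was marked as not present. Returns a suggested box,
    or None when every annotated document lacked the field."""
    boxes = [b for b in annotated_boxes if b is not None]
    if not boxes:
        return None  # suggestion: field is not present on this page type
    return tuple(mean(coords) for coords in zip(*boxes))

# The field sat near the bottom right on both annotated documents, so the
# heat map highlights roughly that region on the next document in the cluster.
print(suggest_field_box([(0.70, 0.85, 0.95, 0.90), (0.72, 0.84, 0.94, 0.89)]))
```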
After user testing, some of the things we learned were:
Customers trusted our suggestions almost too much: even when a suggestion was wrong, users tended to click on it anyway simply because it was suggested. For future iterations we began exploring how to show the machine's confidence in a suggestion, in the hope that lower confidence would prompt a human to look at that field a little more closely.
It was also confusing to customers when the suggestion was that a field was not present on the page; they thought it meant that no suggestion was available.
Step 3: Smarter subsampling and document filtering
Next I looked at the document and group recommendations as a whole. One piece of customer feedback was confusion around the number of documents in the set and the lack of transparency into which documents the model would actually be trained on. I took the generic progress bar we had for document recommendations and reworked it to give customers more meaning.
This recommendation bar redesign dovetailed with discovery efforts from the ML team, who wanted to help customers get to a better model faster by requiring fewer annotated documents up front. Originally we gave a generic recommendation of 400-1200 documents to get to a good working model. The ML team showed me a graph where, past a certain point, annotating more documents no longer meaningfully improves the model. Depending on the model, they found the customer could get to a fairly good model with 50-100 documents; encouraging more documents wasn't a bad thing, but the smaller number helped with a quicker time to value.
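A toy curve can convey that diminishing-returns shape (the numbers below are hypothetical, not the ML team's actual graph):

```python
# Toy illustration only: a saturating curve as a stand-in for the ML team's
# learning-curve graph, which is not reproduced in the source.
import math

def toy_model_quality(n_docs, ceiling=0.95, scale=40.0):
    """Hypothetical quality curve: rises quickly, then flattens out."""
    return ceiling * (1 - math.exp(-n_docs / scale))

for n in (10, 50, 100, 400, 1200):
    print(f"{n:>5} docs -> ~{toy_model_quality(n):.2f}")
# Most of the gain arrives by 50-100 documents; going from 400 to 1200 adds
# almost nothing, which is the diminishing-returns point described above.
```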
The first iteration path, while valuable, was not going to work out because it tried to combine too many concepts into one idea, making it confusing and not providing customers with genuinely useful information. Some documents could fall into multiple categories (a document could be one of the initial 10 minimum required documents and still count toward the basic recommendation). I had to find a way to separate requirements that would block the user from suggestions meant to help the user get to a better model on their own.
The next iteration focused on two parts: the most basic recommendation, and then, if the customer had time, an optimization of the model experience. Optimization would not be available until the customer had met all the basic requirements. This allowed us to separate recommendations for the quickest time to value from recommendations for the best possible model if the customer had more time. When optimization was enabled, each document got an importance rank to make sure the most important documents were annotated first; if the model was not performing as expected, the customer could go back and work through the lower-ranked documents.
After testing with customers we learned that:
"Optimize" was too exciting a word. People wanted to click it as soon as it became enabled, without understanding what they were optimizing for.
The rank system was too complex, and too detailed for the average business user.
Group recommendations were confusing alongside the overall document set recommendation: each group recommended 15 documents by default, so the group recommendations added together could exceed the overall data set recommendation. Overall, it felt too prescriptive and less like a recommendation.
Ineligibility of documents was much clearer with the progress bar changes; however, the copy for viewing the ineligibility details was confusing and not where customers expected it.
Also the ML and development teams had some feedback on this iteration:
Individual groups were less important than having an appropriate number of groups (which leads to better data diversity).
Group-level document recommendations were not possible or helpful with the current system.
Optimization recommendations made it seem like customers should delete documents, which the team didn't want to encourage, since more documents is not necessarily a bad thing.
At this point I also learned that documents that had already been annotated would automatically rank higher than those that had not, and that the rankings would change frequently, which could be even more confusing to a customer.
Based on this feedback, and since time was running out, the MVP version of the designs kept the group recommendations as they were in the system, knowing we would iterate on their prescriptive feel in the next release. We focused more of our effort on the data filtering and the smarter subsampling.
For the data filtering, I worked with the technical writing team to come up with better copy for the ineligibility reasons and calls to action. We were also able to surface the ineligibility reason in the hover state, in the document filter, and when the customer clicked into the document. I kept the revised document progress bar to show the total number of documents in the set, along with how many were eligible and ineligible.
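A minimal sketch of the data the UI surfaces, assuming hypothetical eligibility rules (the actual reasons and copy aren't listed here): each document carries an eligible/ineligible flag plus a human-readable reason.

```python
# Illustrative sketch only: the eligibility rules and reason copy below are
# hypothetical examples, not Hyperscience's actual checks.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Document:
    id: str
    is_annotated: bool
    ocr_succeeded: bool

def check_eligibility(doc: Document) -> Tuple[bool, Optional[str]]:
    """Return (eligible, reason); the reason is what the hover state, document
    filter, and document detail view would display."""
    if not doc.ocr_succeeded:
        return False, "Text could not be read from this document."  # not user-fixable
    if not doc.is_annotated:
        return False, "Document has not been annotated yet."        # user-fixable
    return True, None

docs = [Document("a", True, True), Document("b", False, True)]
eligible = [d for d in docs if check_eligibility(d)[0]]
print(f"{len(eligible)} of {len(docs)} documents eligible for training")
```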
For the smarter subsampling, I simplified the ranking system to high and low importance (the rankings still exist on the backend, they're just not surfaced to users). I created a call to action to annotate the higher-importance documents first. Additionally, if the customer had fewer than 10 groups overall, we created a call to action to add more groups/diversity to the data set. And while the overall data set could be larger than the recommendation, we showed that count but did not actively encourage users to delete any documents. It took many iterations, but we finally reached a solution that satisfied both the ML engineers and the customers.
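As a minimal sketch of the coverage-first idea, under the assumption that "high importance" is spread across groups (the backend's real importance scoring is not described here):

```python
# Illustrative sketch only: flags a couple of documents per group as "high"
# importance so annotation effort covers every group first. The real backend
# ranking is not described in the source.
from collections import defaultdict

def flag_high_importance(doc_groups, per_group=2):
    """doc_groups maps document id -> group id. Returns doc id -> 'high'/'low',
    marking up to `per_group` documents per group as high importance."""
    by_group = defaultdict(list)
    for doc_id, group in doc_groups.items():
        by_group[group].append(doc_id)
    labels = {}
    for group, doc_ids in by_group.items():
        for i, doc_id in enumerate(sorted(doc_ids)):
            labels[doc_id] = "high" if i < per_group else "low"
    return labels

labels = flag_high_importance({"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1})
print(labels)  # two documents per group flagged high; annotate those first
```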
Step 4: Model management [Current work in progress]
The next stage was improving the model management experience as a whole. Some of the feedback we had gotten from customers:
It was confusing that different types of models were managed differently.
They wanted to be able to upgrade models and play around with the upgraded functionality, but go back to the working version if they couldn't get a new model working.
See old versions of models. (In the current system, archived models just disappear from the customer view, so unless the customer downloads the model there is no way to access it again in the platform.)
Have better calls to action on what to do next, and empower users to test, troubleshoot, and build models themselves.
Have groups and recommendations feel less prescriptive.
Be able to upload to all models that connect to the same layout.
To do this, I started pairing with Meg, the designer working on the classification models, to figure out the requirements for a new model overview page that would work for all types of models in the system. The first step for the development team was bringing the training data manager to classification models, since it previously existed only for extraction models.
From there I started working on designs that were less prescriptive for the group recommendations. I was heavily inspired by FICO credit scores, which show various categories of factors helping or hurting your credit, even though changing them offers no guarantee your credit will actually improve. Our models were quite similar: you could add more diversity, and in theory that would help you get to a better-performing model, but there was no guarantee if the data set provided was not a good set in general. I settled on three categories: diversity, importance, and quality. Diversity is the number of groups a data set has, where anything over 10 groups is probably okay. Importance is the number of high-importance documents that have been annotated. And quality is the number of annotation anomalies found in the data set. When testing with the implementation partners, this approach felt a lot more like a recommendation versus black-and-white thinking.
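A minimal sketch of that FICO-style breakdown, assuming simple threshold rules for the sake of illustration (the category names come from the text; the exact thresholds and messages are my assumptions):

```python
# Illustrative sketch only: category names (diversity, importance, quality)
# follow the text; the threshold rules and messages are assumptions.
def data_set_health(n_groups, high_importance_annotated, high_importance_total,
                    n_annotation_anomalies):
    """Return a FICO-style breakdown of factors helping or hurting the model."""
    return {
        # Diversity: anything over ~10 groups is probably okay.
        "diversity": "good" if n_groups > 10 else "add more document variety",
        # Importance: have the high-importance documents been annotated?
        "importance": ("good" if high_importance_annotated >= high_importance_total
                       else "annotate the remaining high-importance documents"),
        # Quality: anomalies point to inconsistent annotations worth reviewing.
        "quality": "good" if n_annotation_anomalies == 0 else "review flagged annotations",
    }

print(data_set_health(n_groups=12, high_importance_annotated=18,
                      high_importance_total=24, n_annotation_anomalies=3))
```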
Then I started working on an overview page where customers could quickly see the status of the model and basic high-level calls to action. This allows for a separate training data tab, where all the training data management lives, and a history tab where the customer can see the various versions of their model. This broke up the page, making it feel less long and reducing the cognitive overload of information for the customer. This section and process are still being iterated on and will continue to be worked on.