“How long will it take for me to build and launch my first guide?”
This is one of the most common questions we hear, and it’s one of the most important problems we’re trying to solve. With the evolution of AI and LLMs, we now have technology powerful enough to rethink how we approach it.
Internally, we’ve long talked about this as the “blank slate problem”, a familiar UX quandary many of us face. Traditionally, it’s solved with templates and services. We’ve also seen AI applied to the blank slate problem by asking users for a general title or an outline. But the results always feel off, and it still requires the user to create the necessary inputs.
But what if the blank slate problem isn’t about “blank slates” at all, but is actually a translation problem?
It’s not a blank slate problem
One of the most challenging problems our customers face is what to write in their onboarding checklist, upsell card, or contextual help. It’s not that they don’t know how to describe a user’s path to value; it’s that they get stuck on the right words, the specific calls to action, and more.
Yet the content around onboarding and activation often already exists: in Loom videos, in help articles, and certainly in the ability to show a customer how to do something.
Therefore, it's not a blank slate problem. It’s a translation problem across modalities.
Using AI to translate across modalities
We started incorporating AI into Bento six months ago, beginning with editing capabilities that make existing content better.
Then, as our understanding of this “blank slate” evolved, we thought, “What if we could teach GPT how to use Bento and build guides from that existing content?”
By taking inputs like video, software interactions, and text, we’re now able to:
- Turn a video into a multi-step checklist
- Turn a series of clicks into a pre-built flow that guides users around an interface (a simplified sketch follows)
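Here’s a rough illustration of the clicks-to-flow idea, assuming we’ve already captured each click as a CSS selector plus a human-readable label. The skeleton of the flow falls out of the recording, and the LLM’s job is to rewrite the step copy so it reads like a guide rather than a log. The type and field names below are illustrative, not Bento’s actual data model:

```typescript
// Illustrative only: a captured click and the flow step we derive from it.
interface CapturedClick {
  selector: string;       // CSS selector of the element the user clicked
  label: string;          // visible text or aria-label of that element
  url: string;            // page the click happened on
}

interface FlowStep {
  title: string;          // short instruction shown to the end user
  anchorSelector: string; // where the step should attach in the UI
  page: string;
}

// Turn a recorded click-through into an ordered set of flow steps.
function clicksToFlow(clicks: CapturedClick[]): FlowStep[] {
  return clicks.map((click, index) => ({
    title: `Step ${index + 1}: Click "${click.label}"`,
    anchorSelector: click.selector,
    page: click.url,
  }));
}
```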
In doing so, we’re also leveraging one of the most exciting aspects of AI agents: the ability of LLMs to learn tools and interface with APIs.
Over the past few months, we’ve taught GPT not only Bento’s mechanics, but also the same best practices we share with customers: the appropriate length of content, the tone of content, and even what a good call to action looks like. This way, the output it generates isn’t just “correct”, it bends toward effectiveness.
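To make that concrete, here’s a heavily simplified sketch of what “teaching best practices through the prompt” can look like, using the OpenAI Node SDK. The prompt, model name, and function are illustrative rather than our production setup:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// A condensed version of the kind of guidance we bake into the prompt:
// Bento's mechanics plus the best practices we share with customers.
const SYSTEM_PROMPT = `You turn existing onboarding content into Bento checklists.
Rules:
- Produce 3-6 steps; keep each step title under 60 characters.
- Write in the second person, active voice, and a friendly tone.
- End each step with one clear call to action (a verb phrase, e.g. "Invite a teammate").
Return the steps as a JSON array of { "title": string, "body": string, "cta": string }.`;

export async function transcriptToChecklist(transcript: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4", // illustrative; pick whatever fits your latency and quality budget
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: `Video transcript:\n${transcript}` },
    ],
  });
  // The reply is a draft checklist, ready for a human to review and tweak.
  return completion.choices[0].message.content;
}
```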
Peeking under the hood
Five years ago, I started working on AI-powered products as the Head of Product at atSpoke. With an in-house ML team, we built and trained our own models that could take IT and HR requests from chat or email and either answer them automatically with a knowledge base article, or route them intelligently to the right team and kick off the right workflow.
But with Bento, we’re not just dealing with text. We’re trying to support as many existing form factors as possible and turn them into something that fits our product.
Wrangling the multimodality of possible inputs opened up a number of challenges: how to manage the context window, and how to return output that’s not only text, but includes details like CSS selectors and calls to action.
Today, we’re handling three primary types of inputs (see the sketch after this list):
- Speech from video recordings
- User interaction events when they click through an app
- Text content in the form of articles or blogs
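You can think of these as one normalized input shape; everything is eventually flattened into text the model can read. A minimal sketch, with illustrative names:

```typescript
// The three input modalities, normalized into one shape before prompting.
type GuideInput =
  | { kind: "video"; transcript: string; durationSeconds: number }
  | { kind: "interactions"; clicks: { selector: string; label: string; url: string }[] }
  | { kind: "article"; title: string; bodyText: string; sourceUrl: string };

// Everything is flattened into text the LLM can read within its context window.
function toPromptText(input: GuideInput): string {
  switch (input.kind) {
    case "video":
      return `Video transcript (${input.durationSeconds}s):\n${input.transcript}`;
    case "interactions":
      return input.clicks
        .map((c, i) => `${i + 1}. Clicked "${c.label}" (${c.selector}) on ${c.url}`)
        .join("\n");
    case "article":
      return `Article "${input.title}" (${input.sourceUrl}):\n${input.bodyText}`;
  }
}
```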
These inputs are then fed into the LLM’s context window, and the model is prompted for a set of outputs that includes guide content (whether checklists, flows, etc.), actions the user should take, and even where an element should be attached.
In these cases, the LLM is acting as an AI agent that we’ve taught to use Bento and produce guides that are themselves multimodal (a sketch of the output shape follows this list):
- Text
- CSS selectors, e.g. where a step should live within an application’s UI
- Rich interactions like CTA buttons
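Put together, the shape we ask the model to produce looks roughly like the sketch below; the real guide schema has many more fields, so the names here are illustrative:

```typescript
// A single generated step: copy, placement, and interaction in one object.
interface GeneratedStep {
  title: string;            // step heading
  body: string;             // rich text content shown to the end user
  anchorSelector?: string;  // CSS selector for where the step attaches in the app's UI
  cta?: {
    label: string;          // e.g. "Invite a teammate"
    url?: string;           // optional destination for the button
  };
}

interface GeneratedGuide {
  type: "checklist" | "flow";
  steps: GeneratedStep[];
}
```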
But doing all this has been challenging. For example, when we pass in the content of a help article, superfluous content like navigation menus also gets captured. To solve this, we have to consider the implications of the scraper tools we’re using, whether and how we pre-process the result before passing it to GPT, and the token limits of the models we use, along with their impact on latency and quality.
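As a sketch of what that pre-processing can look like (cheerio stands in here for whatever scraper you prefer, and the four-characters-per-token heuristic is only a rough approximation):

```typescript
import * as cheerio from "cheerio";

// Strip the chrome of a scraped help article before it ever reaches the prompt.
export function cleanArticleHtml(html: string, maxTokens = 3000): string {
  const $ = cheerio.load(html);

  // Drop navigation menus, footers, sidebars, and anything non-content.
  $("nav, header, footer, aside, script, style, form").remove();

  // Collapse the remaining body into plain text.
  const text = $("body").text().replace(/\s+/g, " ").trim();

  // Rough token estimate (~4 characters per token for English text);
  // truncating keeps us under the model's context limit and controls latency.
  const maxChars = maxTokens * 4;
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```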
Running the exact same query multiple times already produces inconsistent results. When you factor in the half-dozen variables we’re playing with here, knowing what to tweak and refine becomes difficult, and slow.
Teaching the LLM to act as an agent is also non-trivial. First, we have to teach it the format of data Bento can ingest, which can differ across guide types. Then, to introduce more advanced capabilities like formatting and styling, we have to account for all the libraries and formats we currently use.
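One way to make that teaching explicit is function calling: describe “create a guide” as a tool with a schema, so the expected fields for each guide type are spelled out rather than implied. A simplified sketch, with illustrative tool and field names:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Describe "creating a Bento guide" as a tool the model can call,
// so the expected format for each guide type is explicit.
const createGuideTool = {
  type: "function" as const,
  function: {
    name: "create_guide",
    description: "Create a Bento guide from the provided source content.",
    parameters: {
      type: "object",
      properties: {
        guideType: { type: "string", enum: ["checklist", "flow"] },
        steps: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              body: { type: "string" },
              anchorSelector: { type: "string" },
              ctaLabel: { type: "string" },
            },
            required: ["title", "body"],
          },
        },
      },
      required: ["guideType", "steps"],
    },
  },
};

export async function draftGuide(sourceText: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4", // illustrative
    messages: [{ role: "user", content: `Create a guide from:\n${sourceText}` }],
    tools: [createGuideTool],
    tool_choice: { type: "function", function: { name: "create_guide" } },
  });
  const call = completion.choices[0].message.tool_calls?.[0];
  // The arguments come back as JSON that already matches the schema above.
  return call && "function" in call ? JSON.parse(call.function.arguments) : null;
}
```

In a setup like this, the schema acts as the contract: supporting a new guide type is mostly a matter of extending the enum and its fields rather than rewriting the prompt.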
What’s next?
The road ahead is full of possibilities, and we’re excited to leverage the wealth of new tools like Rivet that help teams like ours iterate faster. Prior to using it, each tweak would require us to run through the whole flow manually, which could take 5-10 minutes. That’s no way to iterate or experiment quickly.
Now, we can isolate each of these variables and see its impact in Rivet’s graph UI. It’s not only faster, but the debugging experience is far superior. Being able to quickly see the impact of a change on token usage, speed, and subjective measures of quality lets us discard individual hypotheses or dive deeper into them.
In the next iterations, we’ll continue to focus on the quality of output, but we’ll also tackle richer output, like producing media or generating interactive steps like branching.
Follow along and join in as we continue our everboarding journey to ensure that the software we build, and the software people buy, is adopted and deeply integrated.