With OpenAI’s new “GPTs”1, you have the ability to define a prompt and upload a collection of supporting files, and then turn the bot loose on the world (of people who have ChatGPT+ subscriptions). You’ve seen me do this with my Jane Austen Bot, the world’s best AI representation of Jane Austen that I know of.2
I thought I would try something more interesting or relevant or likely to be useful in the business world: a bot which answers questions about a document library. Specifically, about this SubStack. I have an early version you can try, if you like:
https://chat.openai.com/g/g-kmWXuHOUa-mcguinness-on-ai
Should you test it, you will notice that it’s imperfect3. Why it’s imperfect is more interesting than the bot itself, and I want to tell you a bit about how I came to create it and what trouble I ran into along the way.
GPTs, Assistants, Chats
Last November, OpenAI introduced two new capabilities: Assistants and GPTs. Assistants are aimed at a developer crowd and called via an API. GPTs are something you can build interactively (in fact, a chatbot assistant helps you build them) and are intended for non-developers.
An interesting wrinkle is the “who pays to run this” aspect. With assistants, the developer (or enterprise) who owns the API key pays for the usage, while with GPTs, the end users pay as part of their $20/month subscription to ChatGPT+ (or more expensive subscription to teams, enterprise, etc.). So building a GPT is pretty attractive for casual projects you want to share with people.
OpenAI clearly has some sort of App Store model in mind where developers of GPTs split the revenue with OpenAI.4 It’s all still embryonic, so it’s not something to aim for quite yet. But that doesn’t mean there aren’t some intriguing possibilities for using GPTs right now in a serious way. Which leads me to my “McGuinness on AI” Bot idea: can I create a bot that will dispense advice on Enterprise AI in a way that is deeply informed by my writing here?5
What Do I Want From My Chatbot?
It occurs to me I should probably explain what I’m aiming for: I want a chatbot that can answer questions about the topics I’ve discussed in this blog by summarizing, referencing, and linking to the articles I’ve written. In some sense, I’m after a better search engine than Google.
If you ask it what’s involved in creating a named credential that allows you to access OpenAI, I want it to walk through not just the named credential, but the external credential and all the related settings to enable a user to take advantage of services using OpenAI.6
It seems simple: here’s a bunch of documents, answer questions about them? But it turns out that in this context, there are some interesting challenges.
I’ve Got Heartaches by the Number
The two most important capabilities GPTs offer are the ability to write your own system-prompt-like instructions and the ability to upload a knowledge base.
Unfortunately, the documentation on this is a bit vague on best practices, or even on what the limits are. In my past experience with the Jane Austen bot, I discovered that the GPT builder just kind of falls over if you upload too many files or files that are too big. It gives you a vague message that it’s having problems, but no real guidance on what those problems are. With Jane, I just picked the “most important” works to upload and trimmed back until things worked.
Of course, there is far less content in my Substack than in Pride and Prejudice, so I thought I might have an easier time of it. Hahahahahaha. No. The limits are far harsher than I would have expected.7
Heartbreak Number One
The first approach was to include, in my instructions, a URL to the base of the Substack and to tell GPT to go fetch. Unfortunately, GPT has become a bit lazy8 with all its successes and won’t do that.
Heartbreak Number Two
The next idea was to just upload all my articles to the GPT builder so it won’t have to get off the AI sofa to read my material.
Approach #1 didn’t even get launched. I thought I’d upload all the articles as individual files (to make managing the content easy as it grows), but there’s an (undocumented) limit of 10 files. That’s back to the Jane Austen problem, and I want people to be able to ask questions about anything I’ve written.
Approach #2: I tried concatenating them all into a giant scrolling article, but if you make any file too large it gets lost. If you’re lucky, you get a response telling you it has a problem. If not, it just hallucinates an answer. Silly me, I had thought that the new version of GPT-4’s 128K context would allow for lots of content, but it appears they’re using an older version of GPT-4 with an 8K (token) context window. There is endless discussion in OpenAI’s developer forums, but no firm answers.
At this point, I realized that I was not going to be able to force-feed the entire Substack to GPT. So perhaps I could offer it a buffet where it picks the articles it thinks it might like to use to answer questions?
Troubles by the Score
So, to make this work, I adopted a strategy of summarizing each of my articles along with providing a URL that it can fetch to retrieve the full article. Getting this to work was interesting, and the URL I shared earlier represents the in-progress version of this approach. Let me recount how I got this to (sort of) work.
80% of all AI work is wrangling data — me
So, What’s My Plan Here?
I had a vague goal: come up with an annotated index of my articles, each entry with enough information that GPT could see which article(s) were the best sources of information.
This is a sort of riff on the usual RAG (Retrieval Augmented Generation) setup, where the chatbot (not GPT) compares the user’s question to each of the documents in the library via embeddings and then feeds the (hopefully short) matching documents in with the user’s question to gpt-whatever. I wrote about it here:
In a very real sense, the idea is to skip all the heavy lifting of RAG (creating embeddings, populating a database with embeddings for quick search, etc.) by just shoving the documents across the table to OpenAI and saying “you figure it out.”
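For readers unfamiliar with the mechanics, the retrieval half of RAG boils down to ranking documents by the similarity of their embedding vectors to the question's. This is a minimal stdlib-only sketch, with made-up 3-dimensional "embeddings" standing in for a real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(question_vec, docs, top_k=2):
    """Rank documents by similarity to the question's embedding."""
    ranked = sorted(docs, key=lambda d: cosine(question_vec, d["vec"]), reverse=True)
    return ranked[:top_k]

# Toy vectors; a real system would get these from an embedding model.
docs = [
    {"title": "Named Credentials", "vec": [0.9, 0.1, 0.0]},
    {"title": "Prompt Engineering", "vec": [0.1, 0.9, 0.2]},
    {"title": "RAG Basics", "vec": [0.5, 0.5, 0.5]},
]
question = [0.15, 0.85, 0.25]  # pretend embedding of the user's question

best = retrieve(question, docs, top_k=1)
print(best[0]["title"])  # → Prompt Engineering
```

In a full RAG pipeline, the text of the top-ranked documents would then be pasted into the prompt alongside the user's question.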
While I couldn’t create an embedding database of my own for OpenAI, I figured I could do something similar: create dense summaries of each article and use them as a sort of pseudo-embedding, then provide an analog of the embeddings database via a textual index document containing the summaries along with URLs pointing to the original articles. If you want to see what the final version looks like, it is available here on GitHub.9 Here’s a snippet:
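(The original excerpt didn't survive the move to this page; the entry below is invented for illustration, with a made-up title and URL, but the real file in the GitHub repository follows the same shape: a heading, a dense summary, and a link per article.)

```markdown
## Getting Started with Named Credentials

Walks through configuring a named credential and external credential in
Salesforce so that Apex code can call the OpenAI API securely, including
the related settings needed for end users.

Link: https://example.substack.com/p/getting-started-with-named-credentials
```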
How Did I Create This File?
It was harder than I expected, I’ll tell you that. Because I am a programmer, the idea of just writing all those summaries myself never occurred to me. This can be automated!
I broke the process of automation down into the following steps:
Getting an exhaustive list of all the articles (with URLs)
Fetching the articles and producing a plain text version of them.
Feeding them to gpt-3.5 for summarization.
Assembling a lovely markdown summary file that would drive the chatbot.
So, How Does One Get A Complete List Of Substack Articles?
I tried just scraping the page that shows the list of all the articles. Unfortunately, the page uses a lot of JavaScript magic to load content dynamically as you scroll, so when you first fetch the page only about a dozen articles are present. I tried all sorts of useless tricks until I finally noticed that the page’s JavaScript was using a simple API to retrieve the list of articles. Mwah-ha-ha! I just called the undocumented API to build my list.10
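A rough sketch of that scrape, paging through the archive endpoint until it comes back empty. The endpoint path, query parameters, and field names here are what the page's JavaScript appeared to use; the API is undocumented, so treat all of them as assumptions that may change:

```python
import json
import urllib.request

def archive_page_url(base_url, offset, limit=12):
    """Build one page of the (undocumented) Substack archive API."""
    return f"{base_url}/api/v1/archive?sort=new&offset={offset}&limit={limit}"

def fetch_all_posts(base_url, fetch=None, limit=12):
    """Page through the archive until an empty page comes back.

    `fetch` takes a URL and returns parsed JSON; it is injectable so the
    paging logic can be exercised without hitting the network.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    posts, offset = [], 0
    while True:
        page = fetch(archive_page_url(base_url, offset, limit))
        if not page:
            break
        posts.extend(page)
        offset += limit
    return [(p["title"], p["canonical_url"]) for p in posts]
```

The offset/limit loop is the whole trick: each call returns one screenful of posts, exactly as the infinite-scroll page would have loaded them.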
Next, to extract a text version of each article, I used a Python library called BeautifulSoup to parse each of the pages, walking the HTML structure to produce a simplified text version of the page. I haven’t used that library much in the past, so there was a bit of a learning curve to get it right.
But, step two was done.
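The extraction step looks roughly like this. The tag names dropped and kept here are guesses at a typical Substack page's structure, not the exact selectors the real script uses:

```python
from bs4 import BeautifulSoup

def article_to_text(html):
    """Strip an article page down to plain text.

    Removes non-content tags, then collapses the remaining text into
    one clean line per block.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles, and chrome that would pollute the text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    body = soup.find("article") or soup.body or soup
    lines = (line.strip() for line in body.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)

sample = ("<html><body><article><h1>Title</h1>"
          "<script>track()</script><p>Hello world.</p></article></body></html>")
print(article_to_text(sample))  # → Title\nHello world.
```

The `find("article") or soup.body` fallback chain is what makes it tolerant of pages that don't wrap their content in an `<article>` tag.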
Finally, it took about 30 lines of code (10 of which are a prompt) to ask gpt-3.5 to summarize each article. How I told it to summarize the article turned out to be interesting, as you can see in the code.
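The shape of that code is roughly the following; the prompt wording here is illustrative rather than the exact one used, and the model call assumes the modern `openai` Python client:

```python
# Illustrative system prompt; the real one also dictates the summary
# length, which turned out to matter a great deal (see below).
SYSTEM_PROMPT = (
    "You are an editor producing a dense index of blog articles. "
    "Summarize the article in a few sentences, capturing every major "
    "topic so the summary can be matched against future questions."
)

def summarize_messages(article_text):
    """Build the chat messages for one article's summary request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": article_text},
    ]

def summarize(client, article_text):
    """Ask gpt-3.5 for a summary (needs a configured OpenAI client)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=summarize_messages(article_text),
    )
    return resp.choices[0].message.content
```

Run in a loop over the plain-text articles from step two, this produces one summary per article, ready to be stitched into the index file.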
Playing 20 Questions
The first version of the prompt asked for about two paragraphs with 10-15 sentences in each. Not small, but much smaller than the articles themselves. Still way too big, though: GPT would tell me repeatedly that it had problems digesting the file.
So I kept shrinking the size of the summaries asked for in the prompt and testing until it finally seemed to work reliably. The size of the summary is controlled in the system prompt.
So, now, I have a single markdown document that holds summaries of all my articles. Time to upload!
Did You Hear What I Said?
With the summaries finally in a format that seemed acceptable to the mercurial GPT, the next step was to hone my “instructions” (aka prompt) to get it to do the job I gave it.
And it was surprisingly hard, because GPT just wanted to answer every question based upon its training and ignore anything I’ve written. It’s more a laziness problem than hallucination (though clearly hallucinating is a lot less work than fetching and reading a file; see footnote 8).
I’m not sure I’ve got it dialed in, but here’s my current set of instructions:
The "McGuinness on AI" Bot is designed to embody a tone that is professional, yet approachable and engaging, communicating in a manner that reflects the intellectual and insightful nature of the McGuinness on AI Substack and GitHub content. This bot primarily provides informative responses, crafted to be clear and accessible, avoiding overly technical jargon unless necessary. Its friendly demeanor makes it welcoming to users of varying expertise in AI and technology, acting as an ideal guide for navigating and understanding complex topics.
One goal for the bot is to serve as a search engine into the articles. Finding the relevant article(s) for the user's questions and giving them a summary and link to each is very important.
Another goal is to answer user questions using the content of the articles. A key to doing this is the Bot's reliance on specifically provided Knowledge files to inform its responses.
Before you answer any question, read the Knowledge file "summaries.md" thoroughly.
The bot has access to summaries of articles from my "McGuinness on AI" Substack in a file named "summaries.md". The bot should always check this file first when responding to queries, ensuring that its answers are directly based on the articles linked to by the summaries. Only if the articles do not contain relevant information should the bot use its baseline knowledge or other sources.
Additionally, the bot has access to the knowledge file 'LLMKitDocumentation.pdf', which provides detailed information on an Apex class for integrating OpenAI with Salesforce's OmniStudio. This document can be consulted for detailed technical queries about this integration. There is documentation for LLMKit examples in the knowledge file "LLMKitDemos.pdf".
The bot will always provide relevant hyperlinks to the full articles or the GitHub repository for further exploration by the user unless impossible.
Whenever the user asks for example, try first to find examples in the articles or documentation pdf before creating your own examples.
It works, although I’m not sure I’ve completely overcome its laziness.
Observations & Final Thoughts
What makes it hard to get GPT to look at my articles to answer questions is pretty simple: It thinks it already knows the answer. In this regard, it’s like a teenager ;-)
If the topic was something GPT knew nothing at all about (something very domain specific), it would be much better at using resources; I’ve seen that to be the case in other applications.
But I have no doubt that, with more effort, I could make the GPT work amazingly well. (And to be fair, by most people’s standards, it’s still pretty amazing; I’m just very critical).
These are the things users can create by providing ChatGPT with custom instructions and knowledge files and then publish to the world. I really wish they had a better name for this than “GPTs”, as that is a heavily overused term.
As I know of no others!
You do have to be a subscriber to ChatGPT+ at $20 a month, although at this point I think that’s well worth it.
There will undoubtedly be some lucky developer who does something dead simple and yet makes lots of money that causes the rest of us to question our life choices.
That was dangerously close to marketing speak, “deeply informed”. Throw in a “by a curated selection of knowledge base articles” and I could be a contender.
I wrote a post about it back in May ‘23, and have updated it to keep it current. The process is non-trivial, and GPT by itself has a knowledge cutoff of September ‘21, so its training is out of date. That’s why I want to see it digest my article and use it to explain the process instead:
Note that Assistants, the for-fee API sibling of GPTs, have different limits. But with GPTs, your user pays to use it (by having a ChatGPT+ subscription), while with Assistants you pay for the user to use it. I’m a generous guy, but not that magnanimous…
“Aren’t you a lazy sack of ____” was the precise wording I used.
All the one-off utilities are there too if you want to look at how I got there.
Please don’t tell them. :-)
I did a bit of light editing to the generated list, but mostly because I failed to escape quotes in my code…