After my YFJ Bot post last week, several people asked me how I built the bot, since the code I provided only covered generating responses. It is quite a simple application, so here I will walk you through how it is built. If you are a Substack writer, you can replicate what I did and create a chatbot based on your own content in minutes.
In a nutshell, the chatbot answers questions based on the context provided. Ideally, I would provide all my blog content as the context, but that would be too long for ChatGPT to handle. After all, I have written 233 posts and 500K+ characters so far. Instead, I leverage OpenAI’s embeddings API to decide which posts to use as the context for a question. The embeddings API represents a piece of content as a numeric vector. If two embedding vectors are similar to each other, their underlying content is likely similar as well. So to find posts that are relevant to a question, I look for posts whose embedding vectors are similar to the question’s embedding vector, using that similarity as a proxy for relevance.
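To make the "similar vectors" idea concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and their names are made up for illustration. Real embedding vectors come from the embeddings API and are much longer.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed values, not API output).
question = [0.9, 0.1, 0.0]
post_about_vc = [0.8, 0.2, 0.1]
post_about_cooking = [0.0, 0.1, 0.9]

vc_score = cosine_similarity(question, post_about_vc)
cooking_score = cosine_similarity(question, post_about_cooking)
print(vc_score > cooking_score)  # the VC post points in nearly the same direction
```

A score close to 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.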
To generate embedding vectors for a post, I first sent the content of the post to the API and got an embedding vector back. To make the similarity scores line up better with real questions, I then generated additional embedding vectors from each post. Given a post, I asked ChatGPT to generate a list of questions and answers from the post content. I then generated an embedding vector from each (question, answer) pair and attributed it to the original post. After all the processing, I ended up with a list of (embedding vector, post_id) pairs. From the 233 posts, I generated a total of 2404 (embedding vector, post_id) pairs.
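The indexing step can be sketched roughly as follows. This is not the exact code behind YFJ Bot: build_index, embed and generate_qa are hypothetical names, and the two callables are stand-ins for the embeddings API call and the ChatGPT question-generation call, so the sketch runs without any network access.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def build_index(
    posts: Dict[str, str],
    embed: Callable[[str], Vector],
    generate_qa: Callable[[str], List[Tuple[str, str]]],
) -> List[Tuple[Vector, str]]:
    """Return (embedding vector, post_id) pairs for every post.

    Each post contributes one vector for its full content plus one
    vector per generated (question, answer) pair.
    """
    index: List[Tuple[Vector, str]] = []
    for post_id, content in posts.items():
        index.append((embed(content), post_id))
        for question, answer in generate_qa(content):
            index.append((embed(question + "\n" + answer), post_id))
    return index


# Fake stand-ins so the example is runnable offline (assumptions, not real APIs).
def fake_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


def fake_generate_qa(content: str) -> List[Tuple[str, str]]:
    return [("What is this post about?", content[:20])]


posts = {"post-1": "Venture capital in 2023", "post-2": "Hiring engineers"}
index = build_index(posts, fake_embed, fake_generate_qa)
print(len(index))  # one content vector + one Q&A vector per post
```

In a real run, embed would call the embeddings API and generate_qa would prompt ChatGPT, which is how 233 posts can grow into a couple of thousand pairs.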
Then, given a question, I calculate the embedding vector of the question and find the posts that are most similar to it. This is done by calculating the cosine similarity between the question embedding vector and every embedding vector in the list mentioned above. From there, I fetch the top N results with the highest similarity scores and look up the corresponding posts. Using the content of those posts as context, the chatbot is able to answer questions pretty well when my blog has relevant content about the question. If my blog has no relevant content, it will answer “I don’t know” or “I don’t have an opinion”. Overall, ChatGPT is quite amazing at processing the content and coming up with a good answer.
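Here is a minimal sketch of that retrieval step, assuming the index is a plain list of (embedding vector, post_id) pairs. The function names, the toy vectors and the prompt wording are my illustration only, not the exact prompt YFJ Bot uses.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_post_ids(question_vector, index, n=3):
    """Take the n highest-scoring index entries and return their distinct post ids."""
    ranked = sorted(
        index,
        key=lambda pair: cosine_similarity(question_vector, pair[0]),
        reverse=True,
    )
    post_ids = []
    for _, post_id in ranked[:n]:
        if post_id not in post_ids:  # several vectors can map to the same post
            post_ids.append(post_id)
    return post_ids


def build_prompt(question, contexts):
    """Assemble a prompt from the retrieved posts (wording is an assumption)."""
    context = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below. "
        "If the context is not relevant, say \"I don't know\".\n\n"
        "Context:\n" + context + "\n\nQuestion: " + question
    )


# Toy index with made-up vectors and post ids.
index = [
    ([0.8, 0.2, 0.1], "vc-2023"),
    ([0.9, 0.0, 0.1], "vc-2023"),
    ([0.1, 0.9, 0.0], "hiring"),
    ([0.0, 0.1, 0.9], "cooking"),
]
question_vector = [0.9, 0.1, 0.0]
best = top_post_ids(question_vector, index, n=3)
prompt = build_prompt("What will happen to venture capital in 2023?", best)
print(best)
```

With only a few thousand vectors, scoring every entry and sorting is fast enough, so no vector database is needed for a blog of this size.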
Here is an example of a question with no relevant content.
Q: What do you think of the 2028 LA olympics?
YFJ Bot: I don't have an opinion on the 2028 LA Olympics as I haven't written about it yet.
And an example with relevant content.
Q: What will happen to venture capital in 2023?
YFJ Bot: I believe 2023 will be a very tough year for startups and the venture capital landscape. The current environment of higher interest rates and lower valuations will make it increasingly difficult for startups to raise funds, and many of them may end up folding. The overall sentiment will be more cautious, and investors will become more disciplined in their investments. However, great founders will showcase their resourcefulness and make the most out of the challenging circumstances. There may be a slow return of initial public offerings (IPOs) for some late-stage companies, but valuations are expected to be significantly lower than in recent years.
Here is the complete code that makes the YFJ Bot happen. If you are a Substack writer, you can export your content, download and unzip it to a directory, and change the post_dir variable in the code to build your own bot. We truly live in an incredible time where great technology is accessible at our fingertips. I believe we will see incredible productivity boosts at software companies. Every software engineer is gonna become a 10X engineer by leveraging this amazing technology.