
Introduction

Last week I kicked off a 6-week exploration to see if I could incorporate “personality” into an LLM.

The primary gaps I wanted to address were:

  1. The model’s ability to identify and retain relevant information.
  2. The model’s ability to “evolve” over time based on our interactions. 

I broke this into 4 primary steps, alternating between the goals above.

  1. Enhanced short/mid-term memory.
  2. Introducing “Personality”.
  3. Long-term memory.
  4. Introducing “Character”.



For the last week, I’ve been playing with different approaches to effectively enhance the bot’s short-term memory. But before I take you through my findings, let’s take a moment to explain how we interact with LLMs.

Prompting LLMs

When we want to interact with an LLM, we begin with a simple prompt: a natural-language input that serves as the seed from which the model generates its response.

This prompt is broken down into smaller components known as tokens. These tokens are not just words but can include parts of words or punctuation, representing the fundamental units the model processes.
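To make that concrete, here is a tiny sketch using OpenAI’s tiktoken library (my choice for illustration; it isn’t necessarily the tokenizer behind the model I’m experimenting with):

```python
import tiktoken  # pip install tiktoken

# Load a tokenizer (cl100k_base is the encoding used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Tell me a story."
token_ids = enc.encode(prompt)

# Each ID maps back to a whole word, a word fragment, or punctuation.
print(token_ids)
print([enc.decode([t]) for t in token_ids])
```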

Once tokenized, these units are fed into the machine learning model, which tries to understand how the tokens relate to one another.

But this is where the challenges begin. The ‘attention mechanism’ compares each token to every other token within the entire context, so its cost grows with the square of the input length. If I move from a simple 3-token input to a 300-token paragraph, that’s 100x the tokens but roughly 10,000x as much memory just to process the prompt!
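Here is the back-of-the-envelope arithmetic behind that claim (a simplified view; real implementations batch and optimize this heavily, but the quadratic growth is the point):

```python
def attention_score_entries(num_tokens: int) -> int:
    # Self-attention compares every token against every other token,
    # so the score matrix holds num_tokens * num_tokens entries.
    return num_tokens * num_tokens

small = attention_score_entries(3)    # 9 entries
large = attention_score_entries(300)  # 90,000 entries

print(large / small)  # 10000.0 -> 100x the tokens costs ~10,000x the attention memory
```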

Researchers, however, are developing innovative methods to manage longer ‘contexts’ without overwhelming computational resources, thereby pushing the boundaries of what AI models can understand and generate.

With this “context” in mind (pun intended), let’s delve into the progress made this past week.

Baseline Experimentation

This week, I set a clear objective: develop a method that allows for sustained, day-long conversations with a chatbot without losing the thread of the discussion. The target was ambitious — the system should handle over 100 messages seamlessly.

To kick things off, I established a baseline for the model’s performance. I devised a simple test involving five straightforward questions to engage the model. First, I gave it information I wanted it to remember, and then saw how long we could interact before it forgot or began to confuse those facts.

With an “out of the box” implementation, each prompt was seen in a “vacuum”: unlike the conversations you may have with ChatGPT / Gemini / Claude, nothing carried over automatically from one message to the next, so the model essentially “forgot” the information immediately. To address this, I started with a naive approach: I simply appended the previous prompts and responses to a running “cumulative context” and used that as the model input.
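In code, the naive approach looks roughly like this (a minimal sketch; `generate()` is a stand-in for whatever model call you’re actually making, not a real API):

```python
def generate(prompt: str) -> str:
    # Stand-in for the actual model call; swap in whatever LLM you're running.
    return "(model response)"

class NaiveMemoryBot:
    def __init__(self):
        self.context = ""  # running transcript of the whole conversation

    def chat(self, user_message: str) -> str:
        prompt = self.context + f"User: {user_message}\nAssistant:"
        response = generate(prompt)
        # Append both sides of the exchange so the next prompt "remembers" it.
        self.context += f"User: {user_message}\nAssistant: {response}\n"
        return response

bot = NaiveMemoryBot()
bot.chat("My dog's name is Biscuit.")  # a fact to remember (made-up example)
bot.chat("What's my dog's name?")      # the fact is still in the prompt next turn
```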

Surprisingly, the model performed quite well under these constraints, correctly answering four out of the five questions before memory exhaustion at approximately 80 messages. However, this didn’t align with my experience of interacting with the various models/chatbots. I reviewed my approach and found that my first set of questions was overly precise and lacked the nuance of natural conversation.

Motivated by this, I decided to challenge the model further by reformulating the same questions in more natural language. The results were drastically different: the model struggled significantly, giving me a baseline similar to what I had experienced with the other tools.

With this, it was time to start experimenting.

Naive Distillation

My first approach was to implement a naive distillation step. The idea was to refine each incoming comment to extract and condense the most crucial information, mirroring the clarity achieved when inputs were highly explicit.

After each user comment, the model would not only generate a response but also engage in a reflective process, pinpointing the key pieces of information from the user’s input. This distilled information was then integrated into the ongoing conversation context, effectively enriching the model’s understanding and recall capabilities.
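In Python-flavored pseudocode, that loop looks something like this (a sketch only; `generate()` is a stand-in model call, and the distillation prompt wording is illustrative rather than the exact one I used):

```python
def generate(prompt: str) -> str:
    # Stand-in for the actual model call.
    return "(model output)"

class DistillingBot:
    def __init__(self):
        self.context = ""  # rolling context fed to the model

    def chat(self, user_message: str) -> str:
        # Normal reply, using the rolling context.
        response = generate(self.context + f"User: {user_message}\nAssistant:")

        # Reflective pass: ask the model to pull out the key facts
        # from the user's message.
        distilled = generate(
            "List the key facts worth remembering from this message:\n" + user_message
        )

        # Both the exchange and the distilled facts land in the rolling context.
        self.context += f"User: {user_message}\nAssistant: {response}\n"
        self.context += f"[Distilled: {distilled}]\n"
        return response
```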

The results were immediately noticeable. The model began to respond with significantly improved accuracy, echoing the success seen in the initial explicit-input trials. However, this solution introduced a new complication. Each user message generated not just a response from the bot, but also a “distilled context” update.

For every 10 messages sent by me, approximately 1300 tokens were added to the rolling context. Considering my GPU memory limit of about 5000 tokens, this meant that the system would become overloaded in fewer than 40 messages, leading to a crash.

Meta-Distillation

My next challenge was to figure out how to keep the benefits of distillation without overwhelming the system or making the user experience clunky. Enter “Meta-distillation.” I kept the initial distillation process, but every 10th message, I would compress just the distilled info into a single, neat entry.

The result: the best performance so far, and instead of adding a mountain of tokens every few messages, I was adding only about 100 tokens per 10 messages.
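Sketched out, the bookkeeping looks roughly like this (simplified; `generate()` is again a stand-in model call, and exactly which pieces stay in the rolling context is an approximation of what I actually run):

```python
def generate(prompt: str) -> str:
    # Stand-in for the actual model call.
    return "(model output)"

class MetaDistillingBot:
    def __init__(self):
        self.memory = ""        # compressed, long-lived memory entries
        self.recent_notes = []  # per-message distilled facts since last compression
        self.recent_turns = []  # raw exchanges since last compression
        self.message_count = 0

    def chat(self, user_message: str) -> str:
        prompt = self.memory + "".join(self.recent_turns) + f"User: {user_message}\nAssistant:"
        response = generate(prompt)
        self.recent_turns.append(f"User: {user_message}\nAssistant: {response}\n")

        # First-level distillation of the user's message, as before.
        self.recent_notes.append(
            generate("Extract the key facts from this message:\n" + user_message)
        )
        self.message_count += 1

        # Every 10th message, squash the accumulated notes into one short entry
        # (~100 tokens) and drop the raw turns they came from.
        if self.message_count % 10 == 0:
            entry = generate(
                "Condense these notes into one short memory entry:\n"
                + "\n".join(self.recent_notes)
            )
            self.memory += f"[Memory: {entry}]\n"
            self.recent_notes = []
            self.recent_turns = []

        return response
```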

The best part is that all of this is happening behind the scenes. I prompt the bot and after it shoots back a response, it quietly handles the distillation, so the user gets their reply without any delay. By the time I finish writing back, the distillation is complete and the model is just waiting for me to catch up.
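One way to arrange that is to fire the distillation off on a background thread right after the reply is returned (a simplified sketch; my actual setup may sequence this differently):

```python
import threading

def generate(prompt: str) -> str:
    # Stand-in for the actual model call.
    return "(model output)"

memory = {"context": ""}  # shared rolling context

def chat(user_message: str) -> str:
    # Answer first, so the reply isn't delayed by the extra distillation call...
    response = generate(memory["context"] + f"User: {user_message}\nAssistant:")

    # ...then distill in the background while the user types their next message.
    def distill():
        facts = generate("List the key facts from this message:\n" + user_message)
        memory["context"] += f"User: {user_message}\nAssistant: {response}\n"
        memory["context"] += f"[Distilled: {facts}]\n"

    threading.Thread(target=distill, daemon=True).start()
    return response
```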

(PS. I had ChatGPT make up some of the facts about me, so don’t judge me for anything in there…)

At this point, I was ready to pressure test it and see if it could meet my 100-message challenge. The result: it retained the info, answered my questions, and didn’t even crash my instance! This met my goal for the week, but it obviously wouldn’t scale into the thousands of messages. That’s OK, though, because I have a plan for that: RAG (Retrieval-Augmented Generation). We’ll get there in a few weeks.

Next Steps

Now that I have a working solution for short-term memory, the next step is to introduce some personality into my bot. There was still a major gap: the model constantly reminded me that it was an “AI language model” and struggled to remember the name I gave it.

So my plan is to “seed” some personality with fine-tuning, and then introduce a process to continually refine that personality based on our ongoing conversations.

Stay tuned!