Sweet Streams

One of the gifts and challenges of working at a consultancy like Flagrant is that we sometimes find ourselves between projects with time on our hands. This can be a good opportunity to slow down a bit and catch up on personal projects away from the computer. For me, that might mean making and freezing another batch of home-grown tomato sauce or pesto to use in the winter.

It is also a great time to dig into a side project!

A big benefit of working at Flagrant is having other thoughtful, creative folks who see and know you and look out for opportunities that might interest you. One of our mottos is “Genuinely desire success in those around us and do what we can to make it happen”. Last month, Jim did exactly that when he proposed that I take some of this ‘slow’ time to help build out an AI-driven prototype.

I had just come off of helping with the Stripe integration for blueprints, which lets you “Curate your own personalized collection of blueprints and leverage the power of LLMs to generate code in your style on our site, in your editor, or through our API”. Jim noticed my curiosity about the underlying AI components of that app and sensed I would enjoy a deeper dive.

Jim also has his finger on the pulse of the political moment we are in, and he was hoping we could build a prototype for an app that digests lengthy, complex documents (e.g. Project 2025) and answers questions about them in simple, plain English.

For me, this was a great opportunity to become familiar with some of the LLMs out there, the common patterns for interacting with them, and the APIs they offer. In particular, I learned a lot about Retrieval Augmented Generation, aka RAG. A common pattern for driving AI chat applications, RAG leans on the vast general knowledge of a trained LLM while including some additional context to tailor the results to a more specific problem.

Most AI companies these days offer APIs for working with their LLMs (e.g. OpenAI’s GPT models, Anthropic’s Claude models). While there are ways to host some models yourself, it is often much faster and cheaper when starting out to lean on their services. After considering a few options, we chose to stick with OpenAI, since they seem to be actively developing powerful developer tools and endpoints along with some of the best-known, most versatile models out there. Clearly, though, this is a very fertile space, with new players and options surfacing at an accelerating pace.

While the inner workings of these LLMs are complex and somewhat mysterious, it turns out that interacting with them can be surprisingly simple. One discovery for me was that the first step is often to generate and store embeddings: vector representations of the data you want to use to inform your chat conversations with the LLM.
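
Generating an embedding is a single API call. Here is a minimal sketch, assuming the ruby-openai gem and the same OPENAI_CLIENT constant that shows up later in this post; the model name and chunk_text variable are illustrative:

  # A minimal sketch of embedding one chunk of text with the ruby-openai gem.
  response = OPENAI_CLIENT.embeddings(
    parameters: {
      model: 'text-embedding-3-small', # one of OpenAI's embedding models
      input: chunk_text                # the chunk of document text to embed
    }
  )

  embedding = response.dig('data', 0, 'embedding') # => an array of 1536 floats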

In our case, we broke documents up into chunks and stored the embedding for each chunk in our local Postgres database as a vector, thanks to the handy pgvector extension. These embedding vectors can then be queried for similarity using a gem like neighbor, or an even larger wrapper like langchainrb_rails, which abstracts much of the logic and communication required to build a RAG-based Rails app.
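
Our actual setup has a bit more going on, but the basic pattern with pgvector and neighbor looks roughly like this; the Chunk model and the 1536-dimension limit are illustrative:

  # Illustrative migration: pgvector adds a `vector` column type
  class AddEmbeddingToChunks < ActiveRecord::Migration[7.1]
    def change
      enable_extension 'vector'
      add_column :chunks, :embedding, :vector, limit: 1536 # match the embedding model's dimensions
    end
  end

  # app/models/chunk.rb
  class Chunk < ApplicationRecord
    has_neighbors :embedding # provided by the neighbor gem
  end

  # Find the five stored chunks most similar to an embedded question:
  Chunk.nearest_neighbors(:embedding, question_embedding, distance: 'cosine').first(5)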

I would never have guessed that some of the power of LLMs ends up residing in your own familiar Postgres database in a Rails app, as you leverage vector math to select the best chunks of text to send along as messages when asking the LLM to generate answers to questions. Part of the motivation for this approach is that there are data, speed, and cost limits to how much text you can send to an LLM in a single API request, so it is more efficient to do some of the matching and pruning of information locally. I suspect that as these LLMs become even faster and more powerful, some of these limitations will change, and more robust options for having an LLM receive and digest large documents in a single request will become available.
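
For completeness, here is the general shape of how those locally selected chunks can end up in the chat request. This is a sketch rather than our exact code, and the messages_for helper along with the chunk attributes are assumptions:

  # A sketch of folding the nearest chunks into the chat messages; the helper
  # name, chunk attributes, and prompt wording are illustrative.
  def messages_for(question, question_embedding)
    context = Chunk
      .nearest_neighbors(:embedding, question_embedding, distance: 'cosine')
      .first(5)
      .map { |chunk| "#{chunk.identifier}: #{chunk.content}" }
      .join("\n\n")

    [
      { role: 'system', content: "Answer using only the following chunks.\n\n#{context}" },
      { role: 'user', content: question }
    ]
  end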

Another big discovery from working with these models was realizing how much of the effectiveness of the results depends on the system prompt that is fed along with the contextual user messages in the chat API requests. Indeed, it is for this reason that prompt engineering has already become a new career path, and it may well become one of the many hats most developers will need to wear in the coming age of AI-driven development.

For example, we wanted the AI chat responses to include citations pointing back to the text chunks that we sent in our requests. There is currently no official API support for this, but no matter: you can simply require it in plain English by appending something like this to your system prompt:

   Each chunk will be given a unique identifier and you
   must cite the chunk that you used to derive your answers by placing the
   unique identifier in square brackets after each note in your answer.
   The identifier should look something like [Document-1-Chunk-1-2] and
   each identifier should always be wrapped with its own square brackets.
   

Believe it or not, this pretty much just worked. It makes you wonder what the future of programming will look like.
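
Since the identifiers come back inline, a little post-processing can pick them out for rendering as citations. A hypothetical helper might be as simple as:

  # Hypothetical post-processing: pull the bracketed chunk identifiers out of
  # an answer so they can be rendered as citations or links.
  CITATION_PATTERN = /\[(Document-\d+-Chunk-\d+-\d+)\]/

  def cited_chunk_ids(answer)
    answer.scan(CITATION_PATTERN).flatten.uniq
  end

  cited_chunk_ids('The plan calls for broad changes [Document-1-Chunk-1-2].')
  # => ["Document-1-Chunk-1-2"]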

In the end, some of the most fun and interesting parts of building our chat application were working with the newer tools provided by Rails and Hotwire. One little challenge to solve was how to get the GPT answers to stream to the UI.

A rather convenient by-product of how LLMs work is that they generate their responses progressively, on the fly. It is still a marvel to me that LLMs are able to build coherent paragraphs of content incrementally without any sense of the overall arc of what they are crafting. So while you can wait for a complete response from the chat endpoint, you can create a much better user experience by providing a stream proc that processes the response tokens as they are generated and passed along. The puzzle is then how best to update the UI, since a long-running request like this clearly needs to happen in the background.

Fortunately, with the introduction of Action Cable and Turbo Streams in recent versions of Rails, solving this has become almost trivial. Since Action Cable can lean on the pub/sub features of Redis, you can broadcast to the UI from virtually anywhere in your app. If controllers, models, and even the Rails console can push stream data, then certainly an Active Job backed by Sidekiq, which leans heavily on Redis itself, can do it as well.
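
To see just how trivial, the same broadcast call works from the console, a model callback, or a job; the stream name and target below are made up:

  # Works from the Rails console, a model, or a Sidekiq-backed job alike.
  Turbo::StreamsChannel.broadcast_update_to(
    'demo_stream',            # any streamable object or stream name
    target: 'demo_target',    # the DOM id of the element to update
    html: '<p>Hello from anywhere in the app</p>'
  )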

Moreover, if you are using Turbo in your Rails app, then subscribing to a channel for receiving streams is dead simple. In our case it looked like this:

 <%= turbo_stream_from rag %>
 <div class="form-group" data-controller="rag">
   <div class="flex justify-between">
     <label for="answer"><%= rag.question %></label>
     <label for="asked_by">
       Asked by <%= rag.asked_by.name %> at <%= l(rag.created_at, format: :rag) %>
     </label>
   </div>
   <span id="<%= dom_id(rag, :answer) %>" name="answer" class="form-control rag">
     <%= rag.answer_html %>
   </span>
 </div>
 

You might have guessed that we have a Rag model that stores the questions and answers of our chat application. The turbo_stream_from helper adds all the magic, generating something like the following custom HTML element provided by Turbo, which handles the channel identification and subscription:

<turbo-cable-stream-source channel="Turbo::StreamsChannel" signed-stream-name="IloybGtPaTh2Y21GbmRHbHRaUzlTWVdjdk1qQXgi--8df146aa46cdbfe757c51fd6a368011239dc85d1e4ea0cdbaafacf3002d2d900" connected=""></turbo-cable-stream-source>
 

Now we just need to broadcast to the channel identified by the rag and target the correct unique HTML element for updating:

  def stream_rag
    rag.answer = ''
    OPENAI_CLIENT.chat(
      parameters: {
        model: 'gpt-4o-mini',
        messages:,
        # process each chunk of tokens as OpenAI streams it back
        stream: ->(chunk) { read_stream(chunk.dig('choices', 0)) }
      }
    )
  end

  def read_stream(chunk)
    if chunk['finish_reason'].present?
      # the stream is finished; persist the fully assembled answer
      rag.save
    else
      # append the newly streamed content (to_s guards against role-only deltas)
      rag.answer << chunk.dig('delta', 'content').to_s
      Turbo::StreamsChannel.broadcast_update_to(
        rag,
        target: "answer_rag_#{rag.id}", # matches dom_id(rag, :answer) in the view
        html: rag.answer_html
      )
    end
  end
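
Those two methods live naturally inside the StreamRagJob that gets enqueued a bit further down. A minimal version of that wrapper might look something like this sketch; our actual job differs in the details:

  # A sketch of the job that wraps the two methods above.
  class StreamRagJob < ApplicationJob
    def perform(rag_id:)
      @rag = Rag.find(rag_id)
      stream_rag
    end

    private

    attr_reader :rag

    # stream_rag and read_stream as shown above; `messages` would assemble the
    # system prompt plus the nearest chunks described earlier in the post.
  end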
 

And the results are thrilling.

Updating an individual user’s web page is cool, but it is just as easy to let any user receive updates, which is exactly the sort of behavior you would expect from a multi-user chat interface. Basically, we just need to make sure that anyone following a particular discussion subscribes to a stream that gets updates for new rags:

<%= turbo_stream_from @discussion if @discussion.persisted? %>

And in our discussion model we can do something like:

  def ask(question:, asked_by:)
    rags.build(question:, asked_by:).tap do |rag|
      save!
      # prepend the new rag partial for everyone subscribed to this discussion
      broadcast_prepend(
        target: dom_id(self), # dom_id needs ActionView::RecordIdentifier included in the model
        partial: 'discussions/rag',
        locals: { rag: }
      )
      # generate and stream the answer in the background
      StreamRagJob.perform_later(rag_id: rag.id)
    end
  end

The broadcast_prepend method is a convenient helper that now comes baked into Active Record models (courtesy of the turbo-rails gem) for just this sort of streaming update.

I am grateful that Rails has kept up with the leading edge of web technologies, making it a snap to integrate the latest powerful AI tools. It is cool to see how well Turbo Streams handle responses from OpenAI, and I will be curious to see how things progress from here. I am also grateful to Flagrant for providing the learning opportunity to explore these new technologies and get a feel for some of what it takes to build an AI-driven Rails application.

Now time to package up that tomato sauce and pesto for the freezer.

If you’re looking for a team to help you discover the right thing to build and help you build it, get in touch.

Published on October 10, 2024