Chat with your custom data through Drupal & PHP: harnessing OpenAI and a vector database

In today's digital world, having immediate access to information is crucial. This blog post explores how you can chat with your custom data (or knowledge base) using Drupal & PHP, all while harnessing the power of OpenAI and a vector database. Dive in to discover a seamless interaction with your data, unlocking a new level of engagement and efficiency.

I am not an expert in this field, so bear that in mind as you read through this blog post. This piece emerged from diving into the topic and experimenting to see what works. I wrote it down to solidify what I've learned and to retain the insights better.

So, without further ado, let's get started and delve into the mechanics of how it all works.

1. Get text from Drupal nodes

The first step is extracting the text from your content in Drupal. In this example, we'll use a content type called Article, which has just two fields. The first field is named Title, and the second one is Paragraphs (machine name: field_paragraphs). Paragraphs is a multi-value field, where we store the text we want to index. Here is an example of a node with just two paragraphs:

[Image: example Article node with two paragraphs]

Using paragraphs instead of a single, large Body field to store your textual data is a smart move. This way, your content is naturally split into clear sections, making the searches more accurate and finding the right information easier. To programmatically pull out the text, just load the article node (in my example, the node ID is 1) and loop through the paragraphs one by one:

/** @var \Drupal\node\NodeInterface $article */
$article = \Drupal::entityTypeManager()->getStorage('node')->load(1);

$text = [];
 
/** @var \Drupal\paragraphs\ParagraphInterface $paragraph */
foreach ($article->get('field_paragraphs')->referencedEntities() as $paragraph) {
  $text[] = strip_tags($paragraph->get('field_text')->value);
}

With our node's text ready, we can now delve into the wonders of OpenAI.

2. Get embeddings using OpenAI embeddings endpoint

The next step involves interacting with the OpenAI embeddings endpoint. This process is straightforward: we send over textual data and, in return, receive vectors, which are numerical representations of our data (basically arrays of numbers). For this step, ensure you have a valid OpenAI API key and some credit, as accessing this endpoint comes with a cost. The pricing currently stands at $0.0001 per 1,000 tokens for the recommended Ada v2 model (also known as text-embedding-ada-002), which translates to roughly 15,000 pages (assuming an average of around 500 words per page) for just $1.
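To sanity-check that estimate: 15,000 pages at about 500 words each is roughly 7.5 million words, or about 10 million tokens (using the rule of thumb that one token is about ¾ of a word), and 10,000 blocks of 1,000 tokens at $0.0001 each comes out to $1.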

2.1 Token limit

The text-embedding-ada-002 model has a cap of 8,191 input tokens, so it's vital to ensure you're not sending too much text in a single request. A useful guideline is that one token typically equates to around 4 characters of standard English text. To count the tokens in your text programmatically, the TikToken library is a handy tool. Alternatively, if you prefer not to do it in code, you can use OpenAI's online tokenizer.

To use the programmatic approach, first install the library with Composer. Use this command:

composer require yethee/tiktoken

After you install it, you can use the library to count the number of tokens in your text. Here's how to update the earlier code to include token counting:

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('text-embedding-ada-002');

/** @var \Drupal\node\NodeInterface $article */
$article = \Drupal::entityTypeManager()->getStorage('node')->load(1);

$text = [];
 
/** @var \Drupal\paragraphs\ParagraphInterface $paragraph */
foreach ($article->get('field_paragraphs')->referencedEntities() as $delta => $paragraph) {
  $field_text = strip_tags($paragraph->get('field_text')->value);
  $tokens = $encoder->encode($field_text);

  $text[] = [
    'delta' => $delta,
    'text' => $field_text,
    'token_count' => count($tokens),
  ];
}

In this example, we don't have to worry about the API input token limit. But if you're working with long text, you'll need to break it into smaller pieces to fit within the limit, as sketched below.
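Here is a minimal sketch of one way to do that, assuming the same TikToken library and its decode() method: encode the text to tokens, break the token array into fixed-size groups, and decode each group back into plain text. The $long_text variable and the 1,000-token chunk size are placeholders for illustration; in practice, splitting on paragraph or sentence boundaries usually gives better search results than cutting mid-sentence.

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('text-embedding-ada-002');

// Hypothetical long text that would exceed the 8,191-token input limit.
$long_text = 'Some very long text...';

// Encode to tokens, split into groups of at most 1,000 tokens,
// and decode each group back into plain text.
$tokens = $encoder->encode($long_text);

$chunks = [];
foreach (array_chunk($tokens, 1000) as $token_chunk) {
  $chunks[] = $encoder->decode($token_chunk);
}

// Each element of $chunks can now be embedded separately.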

2.2 Get embeddings

Now we can send our text to the OpenAI embeddings endpoint to see the results. First, we need to install another library that makes it easier to use the OpenAI API with PHP. Use this Composer command to install it:

composer require openai-php/client

To send the text to the endpoint, do something like this:

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('text-embedding-ada-002');

/** @var \Drupal\node\NodeInterface $article */
$article = \Drupal::entityTypeManager()->getStorage('node')->load(1);

$text = [];
 
/** @var \Drupal\paragraphs\ParagraphInterface $paragraph */
foreach ($article->get('field_paragraphs')->referencedEntities() as $delta => $paragraph) {
  $field_text = strip_tags($paragraph->get('field_text')->value);
  $tokens = $encoder->encode($field_text);

  $text[] = [
    'delta' => $delta,
    'text' => $field_text,
    'token_count' => count($tokens),
  ];
}

$openai_client = \OpenAI::client('YOUR-OPENAI-API-KEY');

foreach($text as $item) {
  $response = $openai_client->embeddings()->create([
    'model' => 'text-embedding-ada-002',
    'input' => $item['text'],
  ]);

  $embeddings = $response->toArray();
}

For each paragraph, the $embeddings variable holds a vector that is 1,536 dimensions long, along with information about token usage. Let's dump the variable and see what it looks like:

[Image: dump of the $embeddings array]
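In case the screenshot is hard to read, the array returned by toArray() has roughly this shape (the numbers here are illustrative, not real values):

[
  'object' => 'list',
  'data' => [
    [
      'object' => 'embedding',
      'index' => 0,
      // 1536 floating-point numbers.
      'embedding' => [0.0023, -0.0091, /* ... */],
    ],
  ],
  'model' => 'text-embedding-ada-002',
  'usage' => [
    // Number of tokens in the text we sent.
    'prompt_tokens' => 51,
    'total_tokens' => 51,
  ],
]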

So what can we do with all these vectors? The answer is to save them in a vector database. And my vector database of choice for this blog post is Pinecone.

3. Store embeddings in Pinecone

The first thing to do when saving embeddings in the Pinecone database is to create a $vectors variable. This variable holds both the vector values and any extra details you want to add, called metadata. Metadata is bonus information about your original text; you can use it later to look up the source content or to narrow down your search.

$vectors = [
  'id' => '...some ID',
  'values' => $embeddings['data'][0]['embedding'],
  'metadata' => [
    'entity_type' => $article->getEntityTypeId(),
    'bundle' => $article->bundle(),
    'entity_id' => $article->id(),
    'field_delta' => '...field delta',
  ],
];

Now we can insert the record into the database:

$pinecone_client = \Drupal::service('http_client_factory')->fromOptions([
  'headers' => [
    'Content-Type' => 'application/json',
    'API-Key' => 'YOUR PINECONE API KEY',
  ],
  'base_uri' => 'YOUR PINECONE HOSTNAME',
]);

$pinecone_client->post('/vectors/upsert', [
  'json' => [
    // Pinecone expects a list of vector objects, even for a single vector.
    'vectors' => [$vectors],
  ],
]);

Let's combine this snippet with all the previous ones into one big block of code that gets the text from Drupal, fetches the embeddings from OpenAI, and finally stores those embeddings in the Pinecone database:

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('text-embedding-ada-002');

/** @var \Drupal\node\NodeInterface $article */
$article = \Drupal::entityTypeManager()->getStorage('node')->load(1);

$text = [];
 
/** @var \Drupal\paragraphs\ParagraphInterface $paragraph */
foreach ($article->get('field_paragraphs')->referencedEntities() as $delta => $paragraph) {
  $field_text = strip_tags($paragraph->get('field_text')->value);
  $tokens = $encoder->encode($field_text);

  $text[] = [
    'delta' => $delta,
    'text' => $field_text,
    'token_count' => count($tokens),
  ];
}

$openai_client = \OpenAI::client('YOUR-OPENAI-API-KEY');

$pinecone_client = \Drupal::service('http_client_factory')->fromOptions([
  'headers' => [
    'Content-Type' => 'application/json',
    'API-Key' => 'YOUR PINECONE API KEY',
  ],
  'base_uri' => 'YOUR PINECONE HOSTNAME',
]);

foreach($text as $item) {
  $response = $openai_client->embeddings()->create([
    'model' => 'text-embedding-ada-002',
    'input' => $item['text'],
  ]);

  $embeddings = $response->toArray();

  $vectors = [
    'id' => $article->getEntityTypeId() . ':' . $article->id() . ':' . $item['delta'],
    'values' => $embeddings['data'][0]['embedding'],
    'metadata' => [
      'entity_type' => $article->getEntityTypeId(),
      'bundle' => $article->bundle(),
      'entity_id' => $article->id(),
      'delta' => $item['delta'],
    ],
  ];

  $pinecone_client->post('/vectors/upsert', [
    'json' => [
      // Pinecone expects a list of vector objects.
      'vectors' => [$vectors],
    ],
  ]);
}

Clearly, in a real project, you'd split this up into separate methods. But for this example, we're keeping it simple and easy to test with Devel PHP.

Let's take a peek at Pinecone's index dashboard and see what it looks like:

[Image: Pinecone index dashboard showing the upserted vectors]

Now that we have our data indexed, we can start searching it.

4. Query the vector database with semantic search

If you've been closely following this article, you'll probably know what comes next. First, we need a search term. We then send this term to OpenAI's embedding endpoint to convert it into a vector representation. Lastly, we execute a query against the vectors in our specialized vector database. This search is called semantic search and is also known as neural or vector search. We are not searching for keywords as in traditional relational databases. Instead, we are performing searches based on the meaning of the content. The search tries to understand what the words mean together, which allows for more accurate and relevant results.

The primary reason for utilizing a vector database is to efficiently perform similarity searches among vectors. Traditional relational databases like MySQL aren't designed for this type of operation. Vector databases, on the other hand, are optimized for high-speed similarity searches and can scale to very large data sets quickly and easily. Pinecone offers three distance metrics: cosine similarity, dot product, and Euclidean distance. In my setup, I've opted for cosine similarity, which is often the recommended approach for many use cases.
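For intuition about what such a similarity search does under the hood, here is a minimal sketch of cosine similarity between two vectors. Pinecone computes this at scale using approximate nearest-neighbour indexes, so this is purely illustrative. (As a side note, OpenAI embeddings are normalized to length 1, so cosine similarity and dot product produce the same ranking.)

/**
 * Cosine similarity between two equal-length vectors.
 * Returns a value between -1 and 1; higher means more similar.
 */
function cosine_similarity(array $a, array $b): float {
  $dot = 0.0;
  $norm_a = 0.0;
  $norm_b = 0.0;
  foreach ($a as $i => $value) {
    $dot += $value * $b[$i];
    $norm_a += $value ** 2;
    $norm_b += $b[$i] ** 2;
  }
  return $dot / (sqrt($norm_a) * sqrt($norm_b));
}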

Let's execute a query against the Pinecone database:

$openai_client = \OpenAI::client('YOUR-OPENAI-API-KEY');

$pinecone_client = \Drupal::service('http_client_factory')->fromOptions([
  'headers' => [
    'Content-Type' => 'application/json',
    'API-Key' => 'YOUR PINECONE API KEY',
  ],
  'base_uri' => 'YOUR PINECONE HOSTNAME',
]);

$search_term = 'What day do we have English practice?';

$response = $openai_client->embeddings()->create([
  'model' => 'text-embedding-ada-002',
  'input' => $search_term,
]);

$embeddings = $response->toArray();

$payload = [
  'vector' => $embeddings["data"][0]["embedding"],
  'topK' => 1,
  'includeMetadata' => TRUE,
  'includeValues' => FALSE,
];

$pinecone_query = $pinecone_client->post('/query',
[
  'json' => $payload,
]);

$result = json_decode($pinecone_query->getBody()->getContents());

$metadata = $result->matches[0]->metadata;
$entity_type = $metadata->entity_type;
$entity_id = $metadata->entity_id;
$delta = $metadata->delta;

$article = \Drupal::entityTypeManager()->getStorage($entity_type)->load($entity_id);
$paragraph = $article->get('field_paragraphs')->get($delta)->entity;
$text = strip_tags($paragraph->get('field_text')->value);

As you can see, my search term is 'What day do we have English practice?' Let's see what we get as a response by dumping the $result and $text variables:

[Image: dump of the $result and $text variables]

We requested only one result (topK = 1) from the query, and it's the one that contains the information we searched for (the second paragraph from our first screenshot above). So the search works well, but the final result needs some improvement; we obviously cannot show this raw text to the end user.

5. Improve results using OpenAI completion endpoint

We can now use OpenAI's chat completion endpoint to make the answer sound more natural. You can use the older GPT-3.5 Turbo model or the newer and more capable GPT-4 model. GPT-3.5 Turbo (the 4K-context model) is cost-effective, with pricing set at $0.0015 per 1,000 input tokens and $0.002 per 1,000 output tokens. GPT-4 (the 8K-context model) is definitely better, but it comes at a higher price: $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens.

Just in case it's not clear to you, the text you send to the model is made up of input tokens, and the text you get back from the model is made up of output tokens.
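For a rough sense of scale (made-up but realistic numbers for this example): a question plus retrieved context of about 200 input tokens and an answer of about 50 output tokens costs roughly 0.2 × $0.0015 + 0.05 × $0.002 = $0.0004 with GPT-3.5 Turbo, and 0.2 × $0.03 + 0.05 × $0.06 = $0.009 with GPT-4.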

Let's improve our final result:

$openai_client = \OpenAI::client('YOUR-OPENAI-API-KEY');

$pinecone_client = \Drupal::service('http_client_factory')->fromOptions([
  'headers' => [
    'Content-Type' => 'application/json',
    'API-Key' => 'YOUR PINECONE API KEY',
  ],
  'base_uri' => 'YOUR PINECONE HOSTNAME',
]);

$search_term = 'What day do we have English practice?';

$response = $openai_client->embeddings()->create([
  'model' => 'text-embedding-ada-002',
  'input' => $search_term,
]);

$embeddings = $response->toArray();

$payload = [
  'vector' => $embeddings["data"][0]["embedding"],
  'topK' => 1,
  'includeMetadata' => TRUE,
  'includeValues' => FALSE,
];

$pinecone_query = $pinecone_client->post('/query',
[
  'json' => $payload,
]);

$result = json_decode($pinecone_query->getBody()->getContents());

$metadata = $result->matches[0]->metadata;
$entity_type = $metadata->entity_type;
$entity_id = $metadata->entity_id;
$delta = $metadata->delta;

$article = \Drupal::entityTypeManager()->getStorage($entity_type)->load($entity_id);
$paragraph = $article->get('field_paragraphs')->get($delta)->entity;
$text = strip_tags($paragraph->get('field_text')->value);

$response = $openai_client->chat()->create([
  'model' => 'gpt-4',
  'messages' => [
    [
      'role' => 'user', 
      'content' => 'I will provide you with a question and some related context that may have the answer. If you cannot find the answer, just say I do not know the answer. Do not mention that you are using the contextual data for the answer. The result should be neatly formatted, showing the question first and then the answer. Question: "' . $search_term . '". Context: "' . $text . '"'
    ],
  ],
]);


$response_message = $response->choices[0]->message->content;

Let's dump the $response_message variable and see what our answer looks like.

[Image: dump of the $response_message variable]

Given the limited original textual data, we can't ask too many questions, but let's try a few more.

[Images: three more example questions and answers]

Seems like a pretty solid set of answers. Now let's see what happens if we use a non-English language.

What about non-English languages?

GPT models work best with English; there's no doubt about that. Now, let's see what happens when we use another language. In my case, I'm going to use Serbian. I will add a few more paragraphs to the original article and then start asking questions.

Here's a JSON representation of the updated article with Serbian text:

{
  "title": "Pogodnosti za zaposlene",
  "paragraphs": [
    {
      "content": "Pregled očiju. Znamo da provođenje najmanje 8 sati dnevno na računaru može biti loše za naše oči. Da bismo očuvali zdravlje očiju, organizujemo pregled očiju. Takođe ćemo pomoći da se plate nove naočare."
    },
    {
      "content": "Časovi engleskog jezika. Svake srede vežbamo engleski zajedno. Imamo nastavnika engleskog jezika da nam pomogne. Takođe postoje časovi engleskog za ljude koji žele da nauče više."
    },
    {
      "content": "Hrana i piće (za sve). Kompanija će povremeno sponzorisati doručke i ručkove. Uvek ima besplatne vode, sode, kafe, mleka, piva, Nesquik-a, žitarica i čaja."
    },
    {
      "content": "Sportske aktivnosti. Ako ste sportski tip, kompanija će sponzorisati vaše rekreativne aktivnosti sa mesečnim budžetom od 2500 RSD. Samo treba da predstavite fakturu za sportsku aktivnost (npr. članarinu za teretanu) na kraju meseca i dobićete povraćaj novca. Plus, momci organizuju rekreativni fudbal jednom nedeljno, da se malo opuste."
    },
    {
      "content": "Konferencije (za sve). Kompanija će vam obezbediti ulaznice za konferencije (možda ponekad čak i platiti prevoz). Ovo je moguće dogovoriti, zavisno od lokacije konferencije, itd."
    },
    {
      "content": "Masaža gornjeg dela tela (torsa) (za sve). Svakog meseca imamo internu sesiju masaže sa fizioterapeutom. Možete rezervisati svoje mesto i dobićete fenomenalan tretman. Eterična ulja i opuštajuća muzika su uključeni."
    }
  ]
}

Now let's ask a few questions to see how well the model answers in a non-English language.

[Images: example questions and answers in Serbian]

Seems like GPT models are not that bad for use with non-English languages, and I'm completely satisfied with the provided results.

What about fine-tuning a GPT model?

Can we use fine-tuning to train GPT on custom data so that we can ask it questions? No, that's not the purpose of fine-tuning. You might then ask, "What is fine-tuning for?" It's designed to teach a model a new task, not to add new information.

Fine-tuning is more difficult and more expensive than using embeddings and semantic search. You'll need to pay both for training the model and for its usage, which includes both input and output tokens.

If you want to dive deeper into this topic, check out the following YouTube video: Finetuning GPT-3 vs Semantic Search.

Summary

In summary, I can say that after a lot of reading and trying out different stuff, this looks promising. I've just scratched the surface of what can be done, and I think even better results are possible.

Using embeddings to get vector representations of your textual data is really cheap. Basically, you can send 15,000 pages and get embeddings for just 1 dollar. Storing those embeddings can have some cost. Pinecone offers a free plan for smaller apps, but there are many more vector databases out there. Some are open-source, like Milvus, so if you have expertise in managing services, it can also be very inexpensive.

Using chat completion endpoints can be a bit more expensive because the prices are slightly higher, and you also have to pay for both input and output tokens. If your data isn't properly chunked, you could end up sending too much text to the endpoint, which could result in higher costs. However, if your indexed data is optimized, the cost doesn't have to be high. A rough estimate is that you can perform 30 to 2,000 requests for 1 dollar, depending on the model used, the length of the context, and the length of the text you are sending and receiving.

For organizations that want to restrict access to certain sensitive data, it's obvious that a record's metadata can be used to control who has access to what. For each piece of information, you can attach a role field that specifies the required user role to get the results. Later, you can limit your vector search based on this metadata.
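As a sketch of how that could work, Pinecone queries accept a metadata filter. Assuming you stored a hypothetical role value with each vector during indexing, you could restrict the query to records the current user is allowed to see, something like this:

// Hypothetical: each vector was indexed with a 'role' metadata value
// naming the user role required to see it.
$payload = [
  'vector' => $embeddings['data'][0]['embedding'],
  'topK' => 1,
  'includeMetadata' => TRUE,
  'includeValues' => FALSE,
  // Only match records whose required role is one of the current user's roles.
  'filter' => [
    'role' => ['$in' => \Drupal::currentUser()->getRoles()],
  ],
];

$pinecone_query = $pinecone_client->post('/query', [
  'json' => $payload,
]);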

Whew, we've reached the end of what is likely my longest article to date! Thank you for sticking with me through this comprehensive journey. I hope you found the content both insightful and useful. That's a wrap for now, but stay tuned for more in-depth discussions and explorations in the future. Cheers!

About the Author

Goran Nikolovski is an experienced web and AI developer skilled in Drupal, React, and React Native. He founded this website and enjoys sharing his knowledge.