The pg_tiktoken extension

Efficiently tokenize data in your Postgres database using OpenAI's `tiktoken` library

The pg_tiktoken extension enables fast and efficient tokenization of data in your Postgres database using OpenAI's tiktoken library.

This topic describes how to install the extension, use its functions to tokenize text and count tokens, and integrate the extension with ChatGPT models.

What is a token?

Language models process text in units called tokens. A token can be as short as a single character or as long as a complete word, such as "a" or "apple." In some languages, tokens may comprise less than a single character or even extend beyond a single word.

For example, consider the sentence "Neon is serverless Postgres." It can be divided into seven tokens: ["Ne", "on", "is", "server", "less", "Post", "gres"].

pg_tiktoken functions

The pg_tiktoken extension offers two functions:

  • tiktoken_encode: Accepts text inputs and returns tokenized output, allowing you to seamlessly tokenize your text data.
  • tiktoken_count: Counts the number of tokens in a given text. This feature helps you adhere to text length limits, such as those set by OpenAI's language models.

Install the pg_tiktoken extension

You can install the pg_tiktoken extension by running the following CREATE EXTENSION statement in the Neon SQL Editor or from a client such as psql that is connected to Neon.
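A minimal form of the statement is sketched below. The IF NOT EXISTS clause is an optional addition, included here so the statement can be re-run without raising an error.

```sql
-- Enable pg_tiktoken in the current database.
-- IF NOT EXISTS is optional; it makes the statement safe to run more than once.
CREATE EXTENSION IF NOT EXISTS pg_tiktoken;
```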

For information about using the Neon SQL Editor, see Query with Neon's SQL Editor. For information about using the psql client with Neon, see Connect with psql.

Use the tiktoken_encode function

The tiktoken_encode function tokenizes text input and returns a tokenized output. The function accepts encoding names and OpenAI model names as the first argument and the text you want to tokenize as the second argument, as shown:
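A call might look like the following sketch. The sample sentence is arbitrary, and cl100k_base is one of the encoding names listed under Supported models. In tiktoken, encoded tokens are integer IDs rather than text fragments, so expect a list of numbers in the result.

```sql
-- First argument: an encoding name or an OpenAI model name.
-- Second argument: the text to tokenize (sample text chosen for illustration).
SELECT tiktoken_encode('cl100k_base', 'Neon is serverless Postgres.');
```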

The function tokenizes text using the Byte Pair Encoding (BPE) algorithm.

Use the tiktoken_count function

The tiktoken_count function counts the number of tokens in a text. The function accepts encoding names and OpenAI model names as the first argument and text as the second argument, as shown:
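A call might look like the following sketch, which uses the same argument order as tiktoken_encode. The sample sentence and the choice of cl100k_base are for illustration only.

```sql
-- Returns the number of tokens the text occupies under the chosen encoding.
SELECT tiktoken_count('cl100k_base', 'Neon is serverless Postgres.');
```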

Supported models

The tiktoken_count and tiktoken_encode functions accept both encoding and OpenAI model names as the first argument:

The following models are supported:

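| Encoding name       | OpenAI model                                                             |
|---------------------|--------------------------------------------------------------------------|
| cl100k_base         | ChatGPT models, text-embedding-ada-002                                   |
| p50k_base           | Code models, text-davinci-002, text-davinci-003                          |
| p50k_edit           | Use for edit models like text-davinci-edit-001, code-davinci-edit-001    |
| r50k_base (or gpt2) | GPT-3 models like davinci                                                |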
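As a quick check of the mapping, the sketch below passes an encoding name and a model name that the list above pairs together. Assuming the mapping holds, both columns should report the same count. The sample text is arbitrary.

```sql
-- text-davinci-003 is listed under p50k_base, so the two counts are expected to match.
SELECT tiktoken_count('p50k_base', 'Neon is serverless Postgres.')        AS by_encoding,
       tiktoken_count('text-davinci-003', 'Neon is serverless Postgres.') AS by_model;
```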

Integrate pg_tiktoken with ChatGPT models

The pg_tiktoken extension allows you to store chat message history in a Postgres database and retrieve messages that comply with OpenAI's model limitations.

For example, consider the message table below:
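One possible definition is sketched here. The column names (role, content, created, n_tokens) and their types are assumptions made for illustration; adapt them to your own schema.

```sql
-- Hypothetical chat history table; names and types are illustrative.
CREATE TABLE message (
    role     VARCHAR(50) NOT NULL,               -- 'system', 'user', or 'assistant'
    content  TEXT        NOT NULL,               -- the message text
    created  TIMESTAMP   NOT NULL DEFAULT NOW(), -- used to order the conversation
    n_tokens INTEGER                             -- token count of content
);
```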

The gpt-3.5-turbo chat model requires specific parameters:

The messages parameter is an array of message objects. Each object contains two pieces of information: the role of the message sender (system, user, or assistant) and the message content. A conversation can be as brief as a single message or span multiple pages, as long as the combined message tokens do not exceed the 4096-token limit.
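To make the shape of that array concrete, the sketch below assembles it with Postgres JSON functions. The inline rows stand in for a stored conversation and are invented for illustration, as is the created column used for ordering. Only the role and content fields and the gpt-3.5-turbo model name come from the description above.

```sql
-- Inline sample rows stand in for a stored conversation (hypothetical data).
WITH message(role, content, created) AS (
    VALUES
        ('system', 'You are a helpful assistant.', TIMESTAMP '2023-01-01 00:00:00'),
        ('user',   'What is Neon?',                TIMESTAMP '2023-01-01 00:00:05')
)
SELECT json_build_object(
           'model', 'gpt-3.5-turbo',
           -- One {role, content} object per message, oldest first.
           'messages', json_agg(
               json_build_object('role', role, 'content', content)
               ORDER BY created
           )
       ) AS request_body
FROM message;
```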

To insert role, content, and the number of tokens into the database, use the following query:
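A sketch of such a query follows. It assumes a message table with role, content, and n_tokens columns, and it uses the cl100k_base encoding, which the Supported models list associates with ChatGPT models. The message text is a placeholder.

```sql
-- Store the message together with its token count, computed at insert time.
INSERT INTO message (role, content, n_tokens)
VALUES (
    'user',
    'Hello, how are you?',
    tiktoken_count('cl100k_base', 'Hello, how are you?')
);
```

Computing n_tokens once at insert time avoids re-tokenizing every message each time the history is read.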

Manage text tokens

When a conversation contains more tokens than a model can process (e.g., over 4096 tokens for gpt-3.5-turbo), you will need to truncate the text to fit within the model's limit.

Additionally, lengthy conversations may result in incomplete replies. For example, if a gpt-3.5-turbo conversation spans 4090 tokens, the response will be limited to just six tokens.

The following query retrieves messages up to your desired token limits:
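One way to write such a query is with a running total over the newest messages, as in this sketch. It assumes a message table with role, content, created, and n_tokens columns. The literal 4000 is a stand-in for <MAX_HISTORY_TOKENS>; substitute your own budget.

```sql
-- Walk the history from newest to oldest, accumulating token counts.
WITH history AS (
    SELECT role,
           content,
           created,
           n_tokens,
           SUM(n_tokens) OVER (ORDER BY created DESC) AS cumulative_sum
    FROM message
)
SELECT role, content, created, n_tokens, cumulative_sum
FROM history
WHERE cumulative_sum <= 4000   -- stand-in for <MAX_HISTORY_TOKENS>
ORDER BY created;              -- return the kept messages in conversation order
```

Because the total accumulates from the most recent message backward, the oldest messages are the first to be dropped when the budget is exceeded.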

<MAX_HISTORY_TOKENS> represents the conversation history you want to keep for chat completion, following this formula:

For example, assume the desired completion length is 90 tokens (NUM_COMPLETION_TOKENS=90).
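One plausible way to work out the budget is to subtract the tokens reserved for the system message and for the completion from the model's limit. In the sketch below, the 6-token system message is an assumption made for illustration. The 4096 limit and NUM_COMPLETION_TOKENS=90 come from the text above.

```sql
-- Illustrative budget calculation; the system message size is an assumed value.
SELECT 4096   -- token limit cited above for gpt-3.5-turbo
     - 6      -- assumed size of the system message, in tokens
     - 90     -- NUM_COMPLETION_TOKENS
     AS max_history_tokens;   -- 4000
```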

Conclusion

The pg_tiktoken extension is a valuable tool for tokenizing text data and managing tokens within Postgres databases. By leveraging OpenAI's tiktoken library, it simplifies tokenization and working with token limits, enabling you to integrate more easily with OpenAI's language models.

As you explore the capabilities of the pg_tiktoken extension, we encourage you to provide feedback and suggest features you'd like to see added in future updates. We look forward to seeing the innovative natural language processing applications you create using pg_tiktoken.
