Why do we need vector databases? The proliferation of embeddings immediately created the need to efficiently store, index, and search these arrays of floats. However, those steps are just one small piece of the overall technology stack required to make use of embeddings. Transforming source data into embeddings, and serving the transformer models that perform that transformation, is often left to the application developer. If that developer is part of a large organization, they might have a machine learning or data engineering team to help them. In any case, generating embeddings is not a one-time task but a lifecycle that must be maintained: every search request requires transforming the query into an embedding, and inevitably new source data is generated or updated, requiring embeddings to be recomputed.
Consistency between model training and inference
Traditionally, machine learning projects have two distinct phases: training and inference. In training, a model is generated from a historical dataset. The data that go into model training are called features, and they typically undergo transformations.
At inference, the model is used to make predictions on new data. Data coming into the model for inference requires precisely the same transformations that were applied at training. For example, in classical ML, imagine you have a text classification model trained on TF-IDF vectors. At inference, any new text must undergo the same preprocessing (tokenization, stop word removal) and then be transformed into a TF-IDF vector using the same vocabulary as during training. If there is a discrepancy in this transformation, the model's output will be unreliable.
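The TF-IDF scenario above can be sketched with scikit-learn. The key point is that the vectorizer fitted at training time is the object reused at inference; the corpus and query strings here are made-up examples:

```python
# Minimal sketch of train/inference consistency with scikit-learn.
# (Corpus and query are toy data, not from any real dataset.)
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Training time: fit the vectorizer, learning the vocabulary and IDF weights.
corpus = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(corpus)

# --- Inference time: reuse the SAME fitted vectorizer. transform() applies the
# training vocabulary; fitting a fresh vectorizer on the query would produce
# vectors in a different, incompatible feature space.
query = ["a cat chased the dog"]
X_query = vectorizer.transform(query)

# Because the vectorizer is shared, both matrices live in one column space.
assert X_train.shape[1] == X_query.shape[1]
```

In practice this means persisting the fitted vectorizer (or its vocabulary) alongside the model, so the identical transformation is available wherever inference runs.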
Similarly, in a vector database used for embedding search, a new text query must be converted into an embedding using the same model and preprocessing steps that were used to create the embeddings already in the database. Embeddings stored in the database using OpenAI’s text-embedding-a[...]
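One way to enforce this consistency is to record which model produced the stored vectors and reject inserts or queries embedded with anything else. The sketch below is a toy, pure-Python illustration of that idea (the class, method names, and vectors are hypothetical, not any real vector database's API):

```python
# Hypothetical sketch: a tiny in-memory "vector store" that tracks the
# embedding model its vectors came from, and refuses mismatched queries.
from dataclasses import dataclass, field

@dataclass
class EmbeddingStore:
    model_name: str                       # model that produced every stored vector
    vectors: dict = field(default_factory=dict)

    def insert(self, doc_id: str, vector: list, model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(f"store expects {self.model_name}, got {model_name}")
        self.vectors[doc_id] = vector

    def search(self, query_vector: list, query_model: str) -> str:
        # Similarity scores are meaningless across models: two models place
        # text in unrelated vector spaces, so reject mismatched queries.
        if query_model != self.model_name:
            raise ValueError("query must use the same embedding model as the store")
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        return max(self.vectors, key=lambda d: dot(self.vectors[d], query_vector))

store = EmbeddingStore(model_name="all-MiniLM-L6-v2")
store.insert("doc1", [0.1, 0.9], "all-MiniLM-L6-v2")
store.insert("doc2", [0.9, 0.1], "all-MiniLM-L6-v2")
print(store.search([0.8, 0.2], "all-MiniLM-L6-v2"))  # doc2 scores highest
```

Real systems apply the same principle by storing the model identifier as metadata and embedding every incoming query with that same model before searching.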
We need vector databases to efficiently store, index, and search arrays of floats, and the process of generating and searching embeddings must stay consistent for the model's output to be reliable. pg_vectorize addresses this problem: it tracks the transformer model used to generate embeddings and provides methods for managing those transformations. pg_vectorize also supports both scheduled and realtime updates of embeddings. It can generate embeddings using different transformer models, supporting both OpenAI and Hugging Face embedding models. pg_vectorize is open source and available on GitHub.