5.6.1 About Vector Store and Vector Processing

This section describes the Vector Store functionality available with GenAI.

About Vector Store

A vector store is a relational database that lets you load unstructured data. It automatically parses files in unstructured data formats, including PDF (including scanned PDF files), PPT, TXT, HTML, and DOC, from the local filesystem. It then segments the parsed data, creates vector embeddings, and stores them so that GenAI can perform semantic searches.
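
The segmentation step can be pictured with a minimal Python sketch. This section does not describe how GenAI actually segments parsed data, so the function name `segment_text` and the `max_words` and `overlap` parameters are hypothetical and chosen only for illustration. The sketch splits text into overlapping word windows, a common approach that keeps some shared context between neighboring segments.

```python
def segment_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows.

    Illustration only: max_words and overlap are hypothetical parameters,
    not GenAI settings.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return segments
```

Each returned segment would then be passed to an embedding model, and the resulting vector stored alongside the segment text.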

A vector store uses the native VECTOR data type to store unstructured data in a multidimensional space. Each point in a vector store represents the vector embedding of the corresponding data. Semantically similar data is placed closer in the vector space.

The large language models (LLMs) available in GenAI are trained on publicly available data, so the responses they generate are based on publicly available information. To generate content relevant to your proprietary data, you must convert your proprietary enterprise data to vector embeddings and store them in a vector store. This enables the in-database retrieval-augmented generation (RAG) system to perform a semantic search on the proprietary data in the vector store and find appropriate content, which is then fed to the LLM to help it generate more accurate and relevant responses.

About Vector Processing

To create vector embeddings, GenAI uses in-database embedding models, which are encoders that convert sequences of words and sentences from documents into numerical representations. These numerical values are stored as vector embeddings in the vector store and capture the semantics of the data and its relationships to other data.

A vector distance function measures the similarity between vectors by calculating the mathematical distance between two multidimensional vectors.
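
As an illustration, the following Python sketch implements two widely used distance metrics, Euclidean distance and cosine distance. This section does not state which metrics GenAI uses internally, so treat these as generic examples of the idea rather than as a description of the product. The function names are chosen for this sketch only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points in the vector space."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors.

    0 means the vectors point the same way; 1 means they are orthogonal.
    """
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot(a, b) / norm
```

With either metric, a smaller value means the two embeddings, and therefore the underlying pieces of data, are more similar.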

GenAI encodes your queries using the same embedding model that was used to encode the ingested data when creating the vector store. It then uses an appropriate distance function to find content in the vector store that is semantically similar to the query, and uses that content to perform RAG.
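
The retrieval step can be sketched in Python as follows. Everything here is an assumption made for illustration: `toy_embed` is a stand-in word-count encoder over a hypothetical vocabulary, not a real embedding model, and `build_store` and `search` are hypothetical helper names. The point of the sketch is that the query and the stored segments pass through the same encoder, and that the segments with the smallest distance to the query are returned.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real embedding model learns its own features.
VOCAB = ["mysql", "vector", "store", "invoice", "payment", "embedding"]


def toy_embed(text):
    """Stand-in encoder: counts vocabulary words. Not a real embedding model."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine_distance(a, b):
    """1 minus cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def build_store(segments):
    """Encode every segment once, at ingestion time."""
    return [(segment, toy_embed(segment)) for segment in segments]


def search(query, store, k=2):
    """Encode the query with the same encoder and return the k nearest segments."""
    query_vector = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine_distance(query_vector, item[1]))
    return [segment for segment, _ in ranked[:k]]
```

In a RAG flow, the segments returned by a search like this are what would be supplied to the LLM as context for generating a response.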

About Accelerated Processing of Queries on Vector-Based Tables

GenAI lets you accelerate queries on tables that contain vector embeddings by offloading them to the MySQL AI Engine (AI engine). For query offload to succeed, the vector table must first be offloaded to AI engine using the SECONDARY_LOAD clause of the ALTER TABLE statement, and the query (a SELECT statement) must use at least one vector function in the select list, filter condition, or ORDER BY expression. Additionally, only simple SELECT statements with LIMIT/OFFSET, filter, and ORDER BY operations are offloaded to AI engine for accelerated processing.

To offload the vector table to AI engine, use the following statement:

mysql> ALTER TABLE tbl_name SECONDARY_LOAD;

Following are examples of queries that are offloaded to AI engine for accelerated processing:

  • mysql> SELECT name, STRING_TO_VECTOR(embedding) FROM demo_table;
  • mysql> SELECT name, STRING_TO_VECTOR(embedding) FROM demo_table LIMIT 10;
  • mysql> SELECT name, ROUND(DISTANCE(@query_embedding_16, STRING_TO_VECTOR(embedding)), 4)
    AS distance FROM demo_table ORDER BY distance DESC;

Other SQL operations, such as JOIN, UNION, INTERSECT, GROUP BY, aggregate functions, and window functions, are not supported for accelerated processing. Following are examples of queries that are not offloaded to AI engine for accelerated processing:

  • Query containing no vector function:

    mysql> SELECT COMPRESS(embedding) FROM demo_table1;

  • Query containing GROUP BY or aggregates:

    mysql> SELECT name, COUNT(DISTINCT embedding) FROM demo_table1 GROUP BY name;

  • Query containing a JOIN operation:

    mysql> SELECT ROUND(DISTANCE(demo_table1.embedding, UNHEX("8679613f")), 4) FROM demo_table1 JOIN demo_table2 ON
    demo_table1.name = demo_table2.name;
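
The offload rules described in this section can be summarized as a rough Python heuristic. This is only a reading aid under stated assumptions: `may_be_offloaded` is a hypothetical helper, the list of vector function names is an assumption about what counts as a vector function, and the keyword matching is plain text matching that can be fooled by string literals or identifiers. It does not reflect how the server actually decides whether to offload a query.

```python
import re

# Assumption: these names are treated as "vector functions" for this sketch.
VECTOR_FUNCTIONS = ("STRING_TO_VECTOR", "VECTOR_TO_STRING", "VECTOR_DIM", "DISTANCE")

# Operations the section lists as unsupported for accelerated processing.
UNSUPPORTED_PATTERNS = (
    r"\bJOIN\b",
    r"\bUNION\b",
    r"\bINTERSECT\b",
    r"\bGROUP\s+BY\b",
    r"\bOVER\s*\(",                     # window functions
    r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(",  # common aggregate functions
)


def may_be_offloaded(query: str) -> bool:
    """Heuristic check of the documented rules; not the server's real logic."""
    sql = query.upper()
    if not sql.lstrip().startswith("SELECT"):
        return False
    uses_vector_function = any(
        re.search(rf"\b{name}\s*\(", sql) for name in VECTOR_FUNCTIONS
    )
    uses_unsupported = any(re.search(p, sql) for p in UNSUPPORTED_PATTERNS)
    return uses_vector_function and not uses_unsupported
```

Applied to the examples in this section, the heuristic accepts the simple SELECT statements that call STRING_TO_VECTOR or DISTANCE and rejects the COMPRESS, GROUP BY, and JOIN queries. The table must still be loaded with SECONDARY_LOAD before any query can be offloaded, which this sketch does not check.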

About Optical Character Recognition

Optical Character Recognition (OCR) lets you extract and encode text from images stored in unstructured documents. The extracted text is converted into vector embeddings and stored in a vector store in the same way as regular text from unstructured documents.

OCR is enabled by default when you ingest files into a vector store.

However, when OCR is enabled, the loading process slows down because GenAI scans all images available in the files and pages of scanned documents that you are ingesting into the vector store. If OCR is not required for the documents that you are ingesting, you can disable OCR to speed up the loading process.

GenAI supports OCR in the following unstructured data formats: PDF (including scanned PDF files), DOC, DOCX, PPT, and PPTX. However, GenAI doesn't support OCR in TXT and HTML files. Images stored in TXT and HTML files are ignored while ingesting the files.

OCR in GenAI also has the following limitations:

  • GenAI might not be able to extract and process the text from images with 100% accuracy. However, if there are minor character recognition errors, the overall meaning of the text is still preserved.

  • In some cases, text-like figures in images might incorrectly be treated as regular text.

  • GenAI doesn't support OCR for Scalable Vector Graphic (SVG) images in PDF files.

What's Next

Learn how to Ingest Files into a Vector Store.