This section describes the Vector Store functionality available with GenAI.
A vector store is a relational database that lets you load unstructured data. It automatically parses unstructured data formats, which include PDF (including scanned PDF files), PPT, TXT, HTML, and DOC file formats, from the local filesystem. Then, it segments the parsed data, creates vector embeddings, and stores them for GenAI to perform semantic searches.
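The segmentation step can be pictured with a small sketch. The source does not say how GenAI segments parsed text, so the word-window strategy, the `segment_text` name, and the `max_words` and `overlap` parameters below are all illustrative assumptions and not the product's actual behavior.

```python
def segment_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most `max_words` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one segment. This is a
    hypothetical strategy, not the one GenAI is documented to use.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return segments


# Toy "parsed document" of 12 words: w0 w1 ... w11
parsed = " ".join(f"w{i}" for i in range(12))
segments = segment_text(parsed, max_words=5, overlap=2)
for s in segments:
    print(s)
```

Each resulting segment would then be passed to an embedding model, and the segment text and its embedding stored together.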
A vector store uses the native VECTOR data type to store unstructured data in a multidimensional space. Each point in a vector store represents the vector embedding of the corresponding data. Semantically similar data is placed closer together in the vector space.
The large language models (LLMs) available in GenAI are trained on publicly available data. Therefore, the responses generated by these LLMs are based on publicly available information. To generate content relevant to your proprietary data, you must store your proprietary enterprise data, which has been converted to vector embeddings, in a vector store. This enables the in-database retrieval-augmented generation (RAG) system to perform a semantic search in the proprietary data stored in the vector stores to find appropriate content, which is then fed to the LLM to help it generate more accurate and relevant responses.
To create vector embeddings, GenAI uses in-database embedding models, which are encoders that convert sequences of words and sentences from documents into numerical representations. These numerical values are stored as vector embeddings in the vector store and capture the semantics of the data and its relationships to other data.
A vector distance function measures the similarity between vectors by calculating the mathematical distance between two multidimensional vectors.
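As an illustration of what such a function computes, the following sketch implements two commonly used metrics, Euclidean distance and cosine distance, in plain Python. The function names are hypothetical, and this section does not state which metric GenAI applies in a given case.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points; 0 means identical."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; depends on direction only, not on length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot(a, b) / norm


a = [1.0, 0.0]
b = [0.0, 1.0]   # perpendicular to a
c = [2.0, 0.0]   # same direction as a, but twice as long

print(cosine_distance(a, b), cosine_distance(a, c))
print(euclidean_distance(a, b), euclidean_distance(a, c))
```

The two metrics can disagree: cosine distance treats `a` and `c` as identical because they point the same way, while Euclidean distance still separates them by their difference in length.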
GenAI encodes your queries using the same embedding model that was used to encode the ingested data when the vector store was created. It then uses an appropriate distance function to find content with similar semantic meaning in the vector store, which it uses to perform RAG.
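The retrieval step can be sketched as follows. The encoder here is a toy word-count model that stands in for a real embedding model (it only matches exact words and captures no real semantics), and `make_encoder`, `search`, and the sample segments are hypothetical names and data introduced for illustration. The point it shows is that the query and the stored segments must go through the same encoder for their distances to be comparable.

```python
import math
from collections import Counter


def make_encoder(corpus: list[str]):
    """Build a toy encoder: one dimension per distinct word in the corpus."""
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    def encode(text: str) -> list[float]:
        vec = [0.0] * len(index)
        for word, count in Counter(text.lower().split()).items():
            if word in index:          # words outside the vocabulary are ignored
                vec[index[word]] = float(count)
        return vec

    return encode


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm  # zero vector: treat as unrelated


def search(query: str, store, encode, k: int = 2) -> list[str]:
    """Return the k stored segments closest to the query."""
    q = encode(query)  # same encoder as the one used at ingestion time
    ranked = sorted(store, key=lambda item: cosine_distance(q, item[1]))
    return [text for text, _ in ranked[:k]]


segments = [
    "refunds are issued within ten days",
    "the office is closed on public holidays",
    "contact support to request a refund",
]
encode = make_encoder(segments)
store = [(s, encode(s)) for s in segments]   # (text, embedding) pairs

top = search("how are refunds issued", store, encode, k=1)
print(top)
```

In a RAG flow, the segments returned by the search would be added to the prompt that is sent to the LLM.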
GenAI lets you run queries on tables that contain vector embeddings at an accelerated pace by offloading them to the MySQL AI Engine (AI engine). However, for query offload to be successful, the vector table must be offloaded to AI engine using the SECONDARY_LOAD clause with the ALTER TABLE statement, and the query (SELECT statement) must use at least one vector function in the SELECT LIST, FILTER, or ORDER BY expression. Additionally, only simple SELECT statements with LIMIT_OFFSET, FILTER, and ORDER BY operations are offloaded to AI engine for accelerated processing.
To offload the vector table to AI engine, use the following statement:
mysql> ALTER TABLE tbl_name SECONDARY_LOAD;
Following are examples of queries that are offloaded to AI engine for accelerated processing:
mysql> SELECT name, STRING_TO_VECTOR(embedding) FROM demo_table;
mysql> SELECT name, STRING_TO_VECTOR(embedding) FROM demo_table LIMIT 10;
mysql> SELECT name, ROUND(DISTANCE(@query_embedding_16, STRING_TO_VECTOR(embedding)), 4) AS distance FROM demo_table ORDER BY distance DESC;
Other SQL operations such as JOIN, UNION, INTERSECT, GROUP BY, AGGREGATE, WINDOW, and so on, are not supported for accelerated processing. Following are examples of queries that are not offloaded to AI engine for accelerated processing:
- Query containing no vector distance function:
  mysql> SELECT COMPRESS(embedding) FROM demo_table1;
- Query containing GROUP BY or aggregates:
  mysql> SELECT name, COUNT(DISTINCT embedding) FROM demo_table1 GROUP BY name;
- Query containing a JOIN operation:
  mysql> SELECT ROUND(DISTANCE(demo_table1.embedding, UNHEX("8679613f")), 4) FROM demo_table1 JOIN demo_table2 ON demo_table1.name = demo_table2.name;
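The offload rules described above can be restated as a rough client-side check. This is only a keyword heuristic under stated assumptions, not how AI engine decides: `may_offload` is a hypothetical helper, the vector function list contains only the two functions that appear in the examples in this section, and the aggregate function names are an assumed sample.

```python
import re

# Assumption: only the vector functions that appear in this section's examples.
VECTOR_FUNCTIONS = ("STRING_TO_VECTOR", "DISTANCE")

# Operations the section lists as unsupported for accelerated processing.
# The aggregate names (COUNT, SUM, AVG, MIN, MAX) are an assumed sample.
UNSUPPORTED_PATTERNS = (
    r"\bJOIN\b",
    r"\bUNION\b",
    r"\bINTERSECT\b",
    r"\bGROUP\s+BY\b",
    r"\bOVER\s*\(",                      # window function call
    r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(",   # aggregate function call
)


def may_offload(query: str) -> bool:
    """Heuristic: True if the query uses a vector function and none of the
    unsupported operations. It matches keywords textually, so a string
    literal that contains a word such as JOIN is misclassified."""
    q = query.upper()
    uses_vector_fn = any(re.search(rf"\b{fn}\s*\(", q) for fn in VECTOR_FUNCTIONS)
    blocked = any(re.search(pattern, q) for pattern in UNSUPPORTED_PATTERNS)
    return uses_vector_fn and not blocked


examples = [
    "SELECT name, STRING_TO_VECTOR(embedding) FROM demo_table LIMIT 10",
    "SELECT COMPRESS(embedding) FROM demo_table1",
    "SELECT name, COUNT(DISTINCT embedding) FROM demo_table1 GROUP BY name",
]
for sql in examples:
    print(may_offload(sql), sql)
```

A check like this is useful only as a quick sanity test while writing queries; whether a query is actually offloaded is decided by the server.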
Optical Character Recognition (OCR) lets you extract and encode text from images stored in unstructured documents. The text extracted from images is converted into vector embeddings and stored in a vector store in the same way as the regular text in those documents.
OCR is enabled by default when you ingest files into a vector store.
However, when OCR is enabled, the loading process slows down because GenAI scans all images available in the files and pages of scanned documents that you are ingesting into the vector store. If OCR is not required for the documents that you are ingesting, you can disable OCR to speed up the loading process.
GenAI supports OCR in the following unstructured data formats: PDF (including scanned PDF files), DOC, DOCX, PPT, and PPTX. However, GenAI doesn't support OCR in TXT and HTML files. Images stored in TXT and HTML files are ignored while ingesting the files.
OCR in GenAI also has the following limitations:
- GenAI might not be able to extract and process the text from images with 100% accuracy. However, if there are minor character recognition errors, the overall meaning of the text is still preserved.
- In some cases, text-like figures in images might incorrectly be treated as regular text.
- GenAI doesn't support OCR for Scalable Vector Graphic (SVG) images in PDF files.
Learn how to Ingest Files into a Vector Store.