The VECTOR_STORE_LOAD routine generates
vector embedding for the specified files or folders that are
stored in the bucket, and loads the embeddings into a new vector
store table.
This routine creates an asynchronous task which loads vector store tables in the background. It also returns a query that you can run to track the status of the vector store load task that is running in the background.
It is recommended that you create five or less vector store load tasks at a time. Too many tasks running at the same time might cause overloading issues.
This topic contains the following sections:
To learn about the privileges you need to run this routine, see Section 7.3, “MySQL HeatWave GenAI Roles and Privileges”.
mysql> CALL sys.VECTOR_STORE_LOAD('URI'[, options]);
options: JSON_OBJECT(keyvalue[, keyvalue]...)
keyvalue:
{
'formats', JSON_ARRAY('Format'[, 'Format'] ...)
|'schema_name', 'SchemaName'
|'table_name', 'TableName'
|'region', 'Region'
|'task_name', 'TaskName'
|'language', 'Language'
|'embed_model_id', {'EmbeddingModelID'}
|'description', 'Description'
|'uris', JSON_ARRAY(urioptions[, urioptions] ...)
|'ocr', {true|false}
|'chunking', chunkingoptions
}
Following are VECTOR_STORE_LOAD parameters:
-
URI: specifies the unique reference index (URI) or pre-authenticated request (PAR) of the bucket files or folders to be ingested into the vector store.A URI is considered to be one of the following:
A glob pattern, if it contains at least one unescaped
?or*character.A prefix, if it is not a pattern and ends with a
/character like a folder path.A file path, if it is neither a glob pattern nor a prefix.
To learn how to create PAR for your , see Creating a PAR request in .
-
options: specifies optional parameters as key-value pairs in JSON format. It can include the following parameters:-
formats: specifies the list of formats to be loaded. The supported file formats arepdf,ppt,txt,html, anddoc. By default, the routine uses all the supported formats.If the routine detects multiple files with the same or different file formats in a single load, it creates a separate table for every format it finds. The table name for each format is the specified or default table name followed by the format:
TableName_Format. schema_name: specifies the name of the schema where the vector embeddings are to be loaded. By default, this procedure uses the current schema from the session.table_name: specifies the name of the vector store table to create. By default, the routine generates a unique table name with formatvector_store_data_x, wherexis a counter.region: specifies the region of the bucket. Default value is the region where the current DB System is running.task_name: specifies a name for the loading task to be mentioned in the task status. Default value isVector Store Loader.-
language: specifies the text content language used in the files to be ingested into the vector store. To set the value of thelanguageparameter, use the two-letterISO 639-1code for the language. This parameter is available as of MySQL 9.0.1-u1.Default value is
en.For possible values, to view the list of supported languages, see Languages.
-
embed_model_id: specifies the embedding model to use for encoding the text.As of 9.3.0, default value is
multilingual-e5-small. In earlier versions of MySQL, default value isminilmfor English text andmultilingual-e5-smallfor text in languages other than English.For possible values, to view the list of available embedding models, see MySQL HeatWave In-Database Embedding Models and OCI Generative AI Service Embedding Models. OCI Generative AI Service models are supported as of MySQL 9.5.0.
The
VECTOR_STORE_LOADroutine does not support OCI Generative AI Service Embedding Models. If you want to use OCI Generative AI Service Embedding Models, then you can use the HEATWAVE_LOAD routine to ingest files into a vector store.This parameter is available as of MySQL 9.2.1.
description: specifies a description of document collection being loaded to be mentioned in the task status. Default value isNULL.-
uris: specifies a list of additional URIs to include along with an optional name for the vector store table to be created for the specified URI. Default value isNULL.Each object in the list can include the following parameters:
urioptions: JSON_OBJECT(urioptskeyvalue[, urioptskeyvalue]) urioptskeyvalue: { 'uri', 'URI' |'table_name', 'TableName' }uri: specifies the additional URI. If onlyuriis provided, the routine uses the specified URI as an additional URI, and loads it into the main tableoptions.table_nameor the generated table with the unique table name.-
table_name: if bothuriandtable_nameare provided, the routine loads the specified URI into the specified table.If only
table_nameis provided, the routine loads the specified vector store table into MySQL HeatWave.
-
ocr: specifies whether to enable or disable Optical Character Recognition (OCR). If set tofalse, disables OCR. Default value istrue, which means OCR is enabled by default.This parameter is available as of MySQL 9.1.1.
As of MySQL 9.1.2, the default value is
true.As of MySQL 9.1.1, default value is
false.
-
chunking: specifies parameter values for customized text segmentation during vector store creation.It can include the following parameters:
chunkingoptions: JSON_OBJECT(chunkingkeyvalue[, chunkingkeyvalue]...) chunkingkeyvalue: { 'split_by', {'page'|'paragraph'|'sentence'|'document'|'recursive'} |'max', MaxValue |'by', {'characters'|'words'} |'truncate', {true|false} |'overlap', OverlapValue }-
split_by: specifies the method to use for splitting the text into segments. It can be one of the following:page: for text segmentation based on the pagination available in the document. This segmentation method is supported for PDF and PPT documents only. If used for other documents, the routine falls back to thedocumenttext segmentation method for unsupported documents.paragraph: for text segmentation based on the paragraphs identified in the document. Wherein, a paragraph is a piece of text separated from another piece of text using\n\ncharacters.sentence: for text segmentation based on the sentences identified in the document. Wherein, a sentence is a sequence of words that is separated from another sequence of words using a punctuation that marks the end of a sentence:.,!, or?. For PDF files, when OCR is enabled, any sentence that spills into the next page is considered as a two separate sentences.document: for putting an entire document into one text segment. In case themaxparameter is not set, you cannot overlap across documents.recursive: for using the default text segmentation method used while creating vector store tables in previous versions of MySQL. This method provides backward compatibility with vector store tables created in previous versions of MySQL.
Default value is
recursive. max: specifies the maximum number of characters or words to be included in each segment. If left unspecified, there is no maximum size enforcement on the chunks of text. For words, this value can be up to100000and for characters, this value can be up to1000000. This is unspecified by default.by: specifies the unit to use for defining the maximum and overlap text segment limits using themaxandoverlapparameter. It can be set tocharactersorwords. Default value ischaracters.-
truncate: if set totrue, enables truncation of text segments that are too large for the embedding model to handle without truncation. However, when tructation is enabled, the end of the text segment is trimmed out, and you might lose the information from the part of the text segment that is trimmed out.If set to
false, disables truncation and instead throw an error when a text segment that is too large for the embeeding model is found. In this case, the entire load fails and exits with an error. You can adjust the text segmentation method used to ensure that the size of the text segments is within the segment size limit of the embedding model.Default value is
true. overlap: specifies the maximum number of characters or words to overlap between segments on each side. For words, this value can be up to50000and for characters, this value can be up to500000. Default value is0.
The
chunkingparameters are available as of MySQL 9.5.0. -
-
-
Specifying the file to ingest using the URI in
VECTOR_STORE_LOAD:mysql> CALL sys.VECTOR_STORE_LOAD('oci://demo_bucket@demo_namespace/heatwave-en.pdf', '{"table_name": "demo_embeddings"}'); -
Specifying the file to ingest using the PAR in
VECTOR_STORE_LOAD:mysql> CALL sys.VECTOR_STORE_LOAD('https://demo.objectstorage.us-ashburn-1.oci.customer-oci.com/p/demo-url/n/demo/b/demo-bucket/o/heatwave-en.pdf', '{"table_name": "demo_embeddings_par"}'); -
Specifying additional options such the schema name, table name, language, format, and table description in
VECTOR_STORE_LOAD:mysql> CALL sys.VECTOR_STORE_LOAD('oci://demo_bucket@demo_namespace/german_files/de*', '{"schema_name": "demo_db", "table_name": "german_embeddings", "language": "de", "formats": ["pdf"], "description": "Vector store table containing German PDF files."}'); -
Tracking the progress of a load task by running the task query displayed as output for the
VECTOR_STORE_LOADroutine:-
As of MySQL 9.3.1:
SELECT mysql_tasks.task_status_brief("TaskID");The output looks similar to the following:
+-----------------------------------------------------------------------------------------+ | mysql_tasks.task_status_brief("TaskID") | +-----------------------------------------------------------------------------------------+ | {"data": {"tables_to_load": "[{\"table_name\": \"quickstart_embeddings\", | | \"load_progress\": 40.0}]"}, "status": "RUNNING", "message": "Loading in progress...", | | "progress": 40} | +-----------------------------------------------------------------------------------------+ -
In earlier versions:
mysql> SELECT id, name, message, progress, status, scheduled_time,estimated_completion_time, estimated_remaining_time, progress_bar FROM mysql_task_management.task_status WHERE id=1\GThe output looks similar to the following:
id: 1 name: Vector Store Loader message: Task starting. progress: 0 status: RUNNING scheduled_time: 2024-07-02 14:42:38 estimated_completion_time: 2024-07-22 10:19:53 estimated_remaining_time: 52.50000 progress_bar: __________Getting more details about the load task by querying the task logs for the given task id:
SELECT * from mysql_task_management.task_log where task_id = 1;The output looks similar to the following:
+------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+ | id | task_id | log_time | message | data | progress | status | +------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+ | 0x11EF799F5D99054288CC020017091C01 | 1 | 2024-09-23 11:31:24.884514 | Task created by user. | NULL | 0 | SCHEDULED | | 0x11EF799F5D99812188CC020017091C01 | 1 | 2024-09-23 11:31:24.887685 | Task starting. | {"engine_attributes": "{\"demo_embeddings\":\"'{\\n\\\"dialect\\\": {\\\"format\\\": \\\"pdf\\\", \\\"language\\\": \\\"en\\\", \\\"is_strict_mode\\\": false },\\n\\\"file\\\": [\\n{\\n \\\"name\\\": \\\"demo_folder/demo_file.pdf\\\",\\n \\\"bucket\\\": \\\"demo_bucket\\\",\\n \\\"region\\\": \\\"demo_region\\\",\\n \\\"namespace\\\": \\\"demo_namespace\\\"\\n}\\n]}'\"}"} | 0 | RUNNING | | 0x11EF799F6390FD9788CC020017091C01 | 1 | 2024-09-23 11:31:34.898219 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 10.0}]"} | 10 | RUNNING | | 0x11EF799F668C172F88CC020017091C01 | 1 | 2024-09-23 11:31:39.899271 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING | | 0x11EF799F6987348588CC020017091C01 | 1 | 2024-09-23 11:31:44.900419 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING | | 0x11EF799F6C82547D88CC020017091C01 | 1 | 2024-09-23 11:31:49.901634 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING | +------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+
-