MySQL :: HeatWave User Guide :: 7.7.2 Ingest Files into a Vector Store

HeatWave User Guide / ... / Ingest Files into a Vector Store

7.7.2 Ingest Files into a Vector Store

This section describes how to generate vector embeddings for files or folders stored in Object Storage, and load the embeddings into a vector store table.

The following sections in this topic describe how to ingest files into a vector store:

Before You Begin

Review the MySQL HeatWave GenAI requirements and privileges.
If not already done, create an Oracle Cloud Infrastructure (OCI) bucket for storing files that you want to ingest into the vector store. Then, upload the files to the bucket.

Vector store can ingest files in the following formats: PDF, PPT, TXT, HTML, and DOC.

To run the steps in this topic, create an Object Storage bucket with the name demo_bucket. Download the MySQL HeatWave user guide PDF, then upload it to demo_bucket.
If you want to use the OCI Generative AI Service Embedding Models when you Ingest Files Using Auto Parallel Load, complete the steps to authenticate OCI Generative AI Service.
Connect to your MySQL HeatWave DB System.
Note

X protocol is not supported for Ingest Files Using Asynchronous Load. To set up a vector store using this method, ensure that you use the classic MySQL protocol while connecting to the database:
```
mysqlsh -uDBSystemAdminName -pDBSystemPassword -hDBSystemPrivateIP --sqlc
```
To create and store vector store tables using the steps described in this topic, you can create a new database demo_db:
```
mysql> CREATE DATABASE demo_db;
```

Ingest Files Using Asynchronous Load

The VECTOR_STORE_LOAD routine creates and loads vector embeddings asynchronously into the vector store. You can ingest the source files into the vector store using the following methods:

Using the Uniform Resource Identifier with Asynchronous Load

This section describes how to load source documents from the bucket into the vector table using the uniform resource identifier (URI) of the object.

Note

To use this method, you need to enable the DB system to access the OCI Object Storage bucket. For more information, see Resource Principals.

To set up a new vector store using an object URI, perform the following steps:

To create the vector store table, use a new or existing database:
```
mysql> USE DBName;
```
Replace DBName with the database name.

For example:
```
mysql> USE demo_db;
```
As of MySQL 9.3.1, this step is not required. MySQL HeatWave no longer uses the mysql_task_management schema, so you can delete the schema if you are not using it.

In earlier versions of MySQL, if you are loading a vector store table on a database system for the first time, create a new schema dedicated for task management:
```
SELECT mysql_task_management_ensure_schema();
```
This creates a new schema, mysql_task_management, which keeps track of the tasks that run in the background. This schema contains internal tables that store the task details such as task ID and task logs. These internal tables must not be altered as it can cause the asynchronous task created by VECTOR_STORE_LOAD to fail.
Optionally, to specify a name for the vector store table and language to use, set the @options variable:
```
mysql> SET @options = JSON_OBJECT("table_name", "VectorStoreTableName", "language", "Language");
```
Replace the following:
- VectorStoreTableName: the name you want for the vector store table.
- Language: the two-letter ISO 639-1 code for the language you want to use. Default language is en, which is English. To view the list of supported languages, see Languages.
  
  The language parameter is supported as of MySQL 9.0.1-u1.
To view the list of available routine options, see Section 10.3.3, “VECTOR_STORE_LOAD”.

For example:
```
mysql> SET @options = JSON_OBJECT("table_name", "demo_embeddings", "language", "en");
```
As of MySQL 9.1.2, Optical Character Recognition (OCR) is enabled by default.
To ingest the file from the Object Storage, create vector embeddings, and load the vector embeddings into MySQL HeatWave, use the VECTOR_STORE_LOAD routine:
```
mysql> CALL sys.VECTOR_STORE_LOAD("oci://BucketName@Namespace/Path/Filename", @options);
```
Replace the following:
- BucketName: the OCI Object Storage bucket name.
- Namespace: the name of the bucket namespace.
- Path: path to the folder that contains the source file.
- Filename: the filename with the file extension.
For example:
```
mysql> CALL sys.VECTOR_STORE_LOAD("oci://demo_bucket@demo_namespace/heatwave-en.pdf", @options);
```
This creates an asynchronous task that runs in background and loads the vector embeddings into the specified vector store table. The output of the VECTOR_STORE_LOAD routine contains the following:
- An ID of the task which was created.
- A task query that you can use to track the progress of task.
If MySQL HeatWave GenAI detects multiple files with the same or different file formats in a single load, it creates a separate table for every format it finds. The table name for each format is the specified or default table name followed by the format. For example, demo_embeddings_pdf is the name of the table that contains PDF files.
After the task is completed, verify that embeddings are loaded in the vector store table:
```
mysql> SELECT COUNT(*) FROM VectorStoreTableName;
```
For example:
```
mysql> SELECT COUNT(*) FROM demo_embeddings;
```
If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.

To view the details of the vector store table, use the following statement:

mysql> DESCRIBE demo_embeddings;
+-------------------+---------------+------+-----+---------+-------+
| Field             | Type          | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name     | varchar(1024) | NO   |     | NULL    |       |
| metadata          | json          | NO   |     | NULL    |       |
| document_id       | int unsigned  | NO   | PRI | NULL    |       |
| segment_number    | int unsigned  | NO   | PRI | NULL    |       |
| segment           | varchar(1024) | NO   |     | NULL    |       |
| segment_embedding | vector(384)   | NO   |     | NULL    |       |
| segment_metadata  | json          | NO   |     | NULL    |       |
+-------------------+---------------+------+-----+---------+-------+

Using a Pre-Authenticated Request with Asynchronous Load

This section describes how to ingest source documents from the using pre-authenticated requests (PAR). Use this method if OCI bucket access is not enabled on your DB system.

Note

For confidential data, Using the Uniform Resource Identifier with Asynchronous Load is recommended for ingesting the source files into the vector store as it is a more secure method.

To set up a new vector store, perform the following steps:

To create the vector store table, use a new or existing database:
```
mysql> USE DBName;
```
Replace DBName with the database name.

For example:
```
mysql> USE demo_db;
```
As of MySQL 9.3.1, this step is not required.

In earlier versions of MySQL, if you are loading a vector store table on a database system for the first time, create a new schema dedicated for task management:
```
SELECT mysql_task_management_ensure_schema();
```
This creates a new schema, mysql_task_management, which keeps track of the tasks that run in the background. This schema contains internal tables that store the task details such as task ID and task logs. These internal tables must not be altered as it can cause the asynchronous task created by VECTOR_STORE_LOAD to fail.
Optionally, to specify a name for the vector store table and language to use, set the @options variable:
```
mysql> SET @options = JSON_OBJECT("table_name", "VectorStoreTableName", "language", "Language");
```
Replace the following:
- VectorStoreTableName: the name you want for the vector store table.
- Language: the two-letter ISO 639-1 code for the language you want to use. Default language is en, which is English. To view the list of supported languages, see Languages.
  
  The language parameter is supported as of MySQL 9.0.1-u1.
To view the list of available routine options, see Section 10.3.3, “VECTOR_STORE_LOAD”.

For example:
```
mysql> SET @options = JSON_OBJECT("table_name", "demo_embeddings_par", "language", "en");
```
As of MySQL 9.1.2, Optical Character Recognition (OCR) is enabled by default.
To ingest the file from the Object Storage, create vector embeddings, and load the vector embeddings into MySQL HeatWave, use the VECTOR_STORE_LOAD routine:
```
mysql> CALL sys.VECTOR_STORE_LOAD("PAR", @options);
```
Replace PAR with PAR of the bucket, folder, or file that you want to use to set up the vector store.

To learn how to create PAR for your Object Storage bucket, see Creating a PAR Request in .

Note

If you are creating a PAR for a folder or Object Storage bucket, then select Enable Object Listing in the Create Pre-Authenticated Request dialog to enable object listing.

For example:
```
mysql> CALL sys.VECTOR_STORE_LOAD("https://demo.objectstorage.demo-region.oci.customer-oci.com/p/demo-url/n/demo/b/demo-bucket/o/heatwave-en.pdf",
@options);
```
This creates a task that runs in background and loads the vector embeddings into the specified vector store table. The output of the VECTOR_STORE_LOAD routine contains the following:
- An ID of the task which was created.
- A task query that you can use to track the progress of task.
If MySQL HeatWave GenAI detects files with different file formats in a single load, it creates a separate table for every format it finds. The table name for each format is the specified or default table name followed by the format. For example, demo_embeddings_par_pdf is the name of the table that contains PDF files.
After the task is completed, verify that embeddings are loaded in the vector store table:
```
mysql> SELECT COUNT(*) FROM VectorStoreTableName;
```
For example:
```
mysql> SELECT COUNT(*) FROM demo_embeddings_par;
```
If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.

To view the details of the vector store table, use the following statement:

mysql> DESCRIBE demo_embeddings_par;
+-------------------+---------------+------+-----+---------+-------+
| Field             | Type          | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name     | varchar(1024) | NO   |     | NULL    |       |
| metadata          | json          | NO   |     | NULL    |       |
| document_id       | int unsigned  | NO   | PRI | NULL    |       |
| segment_number    | int unsigned  | NO   | PRI | NULL    |       |
| segment           | varchar(1024) | NO   |     | NULL    |       |
| segment_embedding | vector(384)   | NO   |     | NULL    |       |
| segment_metadata  | json          | NO   |     | NULL    |       |
+-------------------+---------------+------+-----+---------+-------+

Ingest Files Using Auto Parallel Load

The HEATWAVE_LOAD routine creates and loads vector embeddings into the vector store using auto parallel load.

To ingest files using the HEATWAVE_LOAD routine, perform the following steps:

To create a vector store table, use a new or existing database:
```
mysql> USE DBName;
```
Replace DBName with the name you want for the new database.

For example:
```
mysql> USE demo_db;
```
To ingest the file from Object Storage and create vector embeddings in a new vector store table, set the @dl_tables variable:
```
SET @dl_tables = '[{
        "db_name": "DBName",
        "tables": [{
                "table_name": "VectorStoreTableName",
                "engine_attribute": {
                    "dialect": {
                        "format": "FileFormat",
                        "language": "Language",
                        "ocr": OCRBoolValue,
                        "document_parser_model": "DocumentParserModel",
                      	"embed_model_id": "EmbeddingModel"
                    },
                    "file": [{"par": "PAR"}],
                    "chunking": {
                        "split_by": "SplitBy",
                        "max": Max,
                        "by": "By",
                        "truncate": Truncate,
                        "overlap": Overlap
                    }
                }
                  
            }]
    }]';
```
Replace the following:
- DBName: the database name.
- VectorStoreTableName: the name you want for the vector store table.
- FileFormat: the formats of the files to be ingested into the vector store table. The supported file formats are html, pdf, ppt, pptx txt, doc, and docx. To ingest multiple files with different unstructured data file formats into the vector store table in a single load, replace FileFormat with auto_unstructured.
- Language: the two-letter ISO 639-1 code for the language you want to use. Default language is en, which is English. To view the list of supported languages, see Languages.
  
  The language parameter is supported as of MySQL 9.0.1-u1.
- OCRBoolValue: set to true to enable Optical Character Recognition (OCR) or false to disable OCR.
  
  As of MySQL 9.1.2, OCR is enabled by default. In MySQL 9.1.0 and 9.1.1, to enable OCR, set the ocr parameter to true.
- DocumentParserModel: ID of an LLM that you want to use for parsing the files to be ingested. The specified LLM must be capable of processing both textual and visual inputs. This helps improve parsing of PDF documents that have a complex structure, or contain complex tables and images. By default, this is an empty string. For possible values, to learn how to view the list of Oracle Cloud Infrastructure (OCI) GenAI Service LLMs available in MySQL HeatWave, see OCI Generative AI Service LLMs.
  
  To use a document parsing model, OCR must be enabled. Therefore, if you are specifying a document parsing model, ensure that the ocr parameter is not set to false.
  
  Note
  
  This parameter is supported for PDF files only.
  
  The document_parser_model parameter is available as of MySQL 9.4.2.
- EmbeddingModel: ID of the embedding model you want to use. Default value is multilingual-e5-small.
  
  To view the lists of available embedding models, see MySQL HeatWave In-Database Embedding Models and OCI Generative AI Service Embedding Models. OCI Generative AI Service models are supported as of MySQL 9.4.1.
- PAR: the pre-authenticated request (PAR) detail of the bucket, folder, or file that you want to use to set up the vector store.
  
  To learn how to create PAR for your Object Storage bucket, see Creating a PAR request in Object Storage.
  
  Note
  
  If you are creating a PAR for a folder or Object Storage bucket, then select Enable Object Listing to enable object listing in the Create Pre-Authenticated Request dialog while creating the PAR.
- As of MySQL 9.3.2, you can optionally set the following chunking parameters for customized text segmentation during vector store creation:
  - SplitBy: method to use for splitting the text into segments. It can be one of the following:
    
    page: for text segmentation based on the pagination available in the document. This segmentation method is supported for PDF and PPT documents only. If used for other documents, the routine falls back to the document text segmentation method for unsupported documents.
    
    paragraph: for text segmentation based on the paragraphs identified in the document. Wherein, a paragraph is a piece of text separated from another piece of text using \n\n characters.
    
    sentence: for text segmentation based on the sentences identified in the document. Wherein, a sentence is a sequence of words that is separated from another sequence of words using a punctuation that marks the end of a sentence: ., !, or ?. For PDF files, when OCR is enabled, any sentence that spills into the next page is considered as a two separate sentences.
    
    document: for putting an entire document into one text segment. In case the max parameter is not set, you cannot overlap across documents.
    
    recursive: for using the default text segmentation method used while creating vector store tables in previous versions of MySQL. This method provides backward compatibility with vector store tables created in previous versions of MySQL.
    
    Default value is recursive. As of MySQL 9.4.2, if the document_parser_model parameter is set, the default value of the split_by parameter changes to page.
  - Max: maximum number of characters or words to be included in each segment. If left unspecified, there is no maximum size enforcement on the chunks of text. For words, this value can be up to 100000 and for characters, this value can be up to 1000000. This is unspecified by default.
  - Overlap: maximum number of characters or words to overlap between segments on each side. For words, this value can be up to 50000 and for characters, this value can be up to 500000. Default value is 0.
  - By: the unit to use for defining the maximum and overlap text segment limits using the max and overlap parameter. It can be set to characters or words. Default value is characters.
  - Truncate: set to true to enable truncation of text segments that are too large for the embedding model to handle without truncation. However, when tructation is enabled, the end of the text segment is trimmed out, and you might lose the information from the part of the text segment that is trimmed out.
    
    Set to false to disable truncation and instead throw an error when a text segment that is too large for the embeeding model is found. In this case, the entire load fails and exits with an error. You can adjust the text segmentation method used to ensure that the size of the text segments is within the segment size limit of the embedding model.
    
    Default value is true.
For example:
```
mysql> SET @dl_tables = '[{
        "db_name": "demo_db",
        "tables": [{
                "table_name": "demo_embeddings_apl",
                "engine_attribute": {
                    "dialect": {
                        "format": "pdf",
                      	"language": "en",
                      	"embed_model_id": "multilingual-e5-small"
                    },
                    "file": [{"par": "https://demo.objectstorage.demo-region.oci.customer-oci.com/p/demo-url/n/demo/b/demo_bucket/o/heatwave-en.pdf"}],
                    "chunking": {
                        "split_by": "recursive"
                    }
                }                                   
            }]
    }]';
```
To prepare for loading the vector embeddings into the MySQL HeatWave system, set the @options variable:
```
mysql> SET @options = JSON_OBJECT('mode', 'normal');
```
To load the vector embeddings into MySQL HeatWave, use the HEATWAVE_LOAD routine:
```
mysql> CALL sys.HEATWAVE_LOAD(CAST(@dl_tables AS JSON), @options);
```
This creates and stores the vector embeddings in the specified vector store table.
Verify that embeddings are loaded in the vector store table:
```
mysql> SELECT COUNT(*) FROM VectorStoreTableName;
```
For example:
```
mysql> SELECT COUNT(*) FROM demo_embeddings_apl;
```
If you see a numerical value in the output, your embeddings are successfully loaded in the table.

To view the details of the vector store table, use the following statement:

mysql> DESCRIBE demo_embeddings_apl;
+-------------------+---------------+------+-----+---------+-------+
| Field             | Type          | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| document_name     | varchar(1024) | NO   |     | NULL    |       |
| metadata          | json          | NO   |     | NULL    |       |
| document_id       | int unsigned  | NO   | PRI | NULL    |       |
| segment_number    | int unsigned  | NO   | PRI | NULL    |       |
| segment           | varchar(1024) | NO   |     | NULL    |       |
| segment_embedding | vector(384)   | NO   |     | NULL    |       |
| segment_metadata  | json          | NO   |     | NULL    |       |
+-------------------+---------------+------+-----+---------+-------+

Clean Up

To avoid being billed for the resources that you created for this topic, ensure that you delete them:

Delete the database:
```
mysql> DROP DATABASE demo_db;
```
Delete demo_bucket. For more information, see Deleting the Bucket.

What's Next

Learn how to Update the Vector Store.
Learn how to Perform Vector Search With Retrieval-Augmented Generation.
Learn how to Generate Text Summaries for Documents Ingested Using Auto Parallel Load.
Learn how to Start a Conversational Chat.

PREV HOME UP NEXT