HeatWave User Guide  /  ...  /  Setting Up a Vector Store

4.4.2 Setting Up a Vector Store

This section describes how to generate vector embeddings for files or folders stored in Object Storage, and load the embeddings into a vector store table.

HeatWave GenAI supports the following methods to ingest files from the Object Storage bucket:

Before You Begin

  • If not already done, create an Oracle Cloud Infrastructure (OCI) Object Storage bucket for storing files that you want to ingest into the vector store.

    Then, upload the files to the Object Storage bucket.

    Vector store can ingest files in the following formats: PDF, PPT, TXT, HTML, and DOC.

  • Connect to your HeatWave Database System.

    Ensure that you pass the --sqlc flag while connecting to the database to use the classic MySQL protocol:

    mysqlsh -uAdmin -pPassword -hPrivateIP --sqlc

    Replace the following:

    • Admin: the database system admin name.

    • Password: the database system password.

    • PrivateIP: the private IP address of the database system.

Ingesting Files Using Asynchronous Load

The VECTOR_STORE_LOAD routine creates and loads vector embeddings asynchronously into the vector store. You can ingest the source files into the vector store using the following methods:

Using the Uniform Resource Identifier with Asynchronous Load

This section describes how to load source documents from the Object Storage bucket into the vector table using the uniform resource identifier (URI) of the object.

Note

To use this method, you need to enable the database system to access an Oracle Cloud Infrastructure Object Storage bucket. For more information, see Resource Principals.

To set up a new vector store using an object URI, perform the following steps:

  1. To create the vector store table, use a new or existing database:

    use DBName;

    Replace DBName with the database name.

  2. If you are loading a vector store table on a database system for the first time, call the following procedure to create a schema used for task management:

    select mysql_task_management_ensure_schema();
  3. Optionally, to specify a name for the vector store table and language to use, set the @options session variable:

    set @options = JSON_OBJECT("table_name", "VectorStoreTableName", "language", "Language");

    Replace the following:

    • VectorStoreTableName: the name you want for the vector store table.

    • Language: the two-letter ISO 639-1 code for the language you want to use. Default language is en, which is English. To view the list of supported languages, see Languages.

    For example:

    set @options = JSON_OBJECT("table_name", "demo_embeddings", "language", "en");

    To learn more about the available routine options, see VECTOR_STORE_LOAD Syntax.

  4. To ingest the file from the object storage, create vector embeddings, and load the vector embeddings into HeatWave, use the VECTOR_STORE_LOAD routine:

    call sys.VECTOR_STORE_LOAD('oci://BucketName@Namespace/Path/Filename', @options);

    Replace the following:

    • BucketName: the OCI Object Storage bucket name.

    • Namespace: the name of the Object Storage bucket namespace.

    • Path: path to the folder that contains the source file.

    • Filename: the filename with the file extension.

    For example:

    call sys.VECTOR_STORE_LOAD('oci://demo_bucket@demo_namespace/demo_folder/demo_file.pdf', @options);

    This creates an asynchronous task that runs in background and loads the vector embeddings into the specified vector store table. The output of the VECTOR_STORE_LOAD routine contains the following:

    • An ID of the task which was created.

    • A task query that you can use to track the progress of task.

    If HeatWave GenAI detects multiple files with the same or different file formats in a single load, it creates a separate table for every format it finds. The table name for each format is the specified or default table name followed by the format. For example, demo_embeddings_pdf is the name of the table that contains PDF files.

  5. After the task is completed, verify that embeddings are loaded in the vector store table:

    select count(*) from VectorStoreTableName;

    For example:

    select count(*) from demo_embeddings;

    If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.

Using a Pre-Authenticated Request with Asynchronous Load

This section describes how to ingest source documents from the object storage using pre-authenticated requests (PAR). Use this method if OCI Object Storage bucket access is not enabled on your database system.

Note

For confidential data, Using the Uniform Resource Identifier with Asynchronous Load is recommended for ingesting the source files into the vector store as it is a more secure method.

To learn how to create PAR for your object storage, see Creating a PAR Request in Object Storage.

If you are creating a PAR for a folder or the object storage, then select Enable Object Listing in the Create Pre-Authenticated Request dialog to enable object listing.

To set up a new vector store, perform the following steps:

  1. To create the vector store table, use a new or existing database:

    use DBName;

    Replace DBName with the database name.

  2. If you are loading a vector store table on a database system for the first time, call the following procedure to create a schema used for task management:

    select mysql_task_management_ensure_schema();
  3. Optionally, to specify a name for the vector store table and language to use, set the @options session variable:

    set @options = JSON_OBJECT("table_name", "VectorStoreTableName", "language", "Language");

    Replace the following:

    • VectorStoreTableName: the name you want for the vector store table.

    • Language: the two-letter ISO 639-1 code for the language you want to use. Default language is en, which is English. To view the list of supported languages, see Languages.

    For example:

    set @options = JSON_OBJECT("table_name", "demo_embeddings", "language", "en");

    To learn more about the available routine options, see VECTOR_STORE_LOAD Syntax.

  4. To ingest the file from the object storage, create vector embeddings, and load the vector embeddings into HeatWave, use the VECTOR_STORE_LOAD routine:

    call sys.VECTOR_STORE_LOAD('PAR', @options);

    Replace PAR with PAR of the bucket, folder, or file that you want to use to set up the vector store.

    For example:

    call sys.VECTOR_STORE_LOAD('https://demo.objectstorage.us-ashburn-1.oci.customer-oci.com/p/demo-url/n/demo/b/demo-bucket/o/demo_doc.pdf', @options);

    This creates a task that runs in background and loads the vector embeddings into the specified vector store table. The output of the VECTOR_STORE_LOAD routine contains the following:

    • An ID of the task which was created.

    • A task query that you can use to track the progress of task.

    If HeatWave GenAI detects files with different file formats in a single load, it creates a separate table for every format it finds. The table name for each format is the specified or default table name followed by the format. For example, demo_embeddings_pdf is the name of the table that contains PDF files.

  5. After the task is completed, verify that embeddings are loaded in the vector store table:

    select count(*) from VectorStoreTableName;

    For example:

    select count(*) from demo_embeddings;

    If you see a numerical value in the output, your embeddings are successfully loaded in the vector store table.

Ingesting Files Using Auto-Parallel Load

The HEATWAVE_LOAD routine creates and loads vector embeddings into the vector store using auto parallel load.

To ingest files using the HEATWAVE_LOAD routine, perform the following steps:

  1. In your HeatWave Database System, create and use a new database:

    create database DBName;
    use DBName;

    Replace DBName with the name you want for the new database.

  2. To ingest the file from the object store and create vector embeddings in a new vector store table, set the @dl_tables session variable:

    set @dl_tables = '[
      {
        "db_name": "DBName",
        "tables": [
        {
            "table_name": "VectorStoreTableName",
            "engine_attribute": {
                "dialect": {"format": "FileFormat"},
                "file": [
        {"par": "PAR"}
      ]
            }
          }
        ]
      }]';

    Replace the following:

    • DBName: the database name.

    • VectorStoreTableName: the name you want for the vector store table where the vector embeddings are stored.

    • FileFormat: the formats of the files you uploaded to the Object Storage bucket. The supported file formats are pdf, ppt, txt, and doc.

    • PAR: the pre-authenticated request (PAR) detail of the bucket, folder, or file that you want to use to set up the vector store.

    To learn how to create PAR for your object storage, see Creating a PAR request in Object Storage.

    Note

    If you are creating a PAR for a folder or the object store, then select Enable Object Listing to enable object listing in the Create Pre-Authenticated Request dialog while creating the PAR.

    For example:

    set @dl_tables = '[
      {
        "db_name": "demo_db",
        "tables": [
        {
            "table_name": "demo_embeddings",
            "engine_attribute": {
                "dialect": {"format": "pdf"},
                "file": [
        {"par": "https://demo.objectstorage.us-ashburn-1.oci.customer-oci.com/p/demo-url/n/demo/b/demo_bucket/o/heatwave-en.a4.pdf"}
      ]
            }
          }
        ]
      }]';
  3. To prepare for loading the vector embeddings into the HeatWave system, set the @options session variable:

    set @options = JSON_OBJECT('mode', 'normal');
  4. To load the vector embeddings into HeatWave, use the HEATWAVE_LOAD routine:

    call sys.HEATWAVE_LOAD(CAST(@dl_tables AS JSON), @options);

    This creates and stores the vector embeddings in the specified vector store table.

  5. Verify that embeddings are loaded in the vector store table:

    select count(*) from VectorStoreTableName;

    For example:

    select count(*) from demo_embeddings;

    If you see a numerical value in the output, your embeddings are successfully loaded in the table.