
4.3.11.1 Refresh Data Using Event-Based Incremental Load

As of MySQL 9.4.1, you can load data from Object Storage by using event-based incremental load. When a change to an object in an external file location emits an event, Lakehouse automatically refreshes the data in the corresponding external table. This feature is currently only available for MySQL HeatWave on OCI.

Event-based incremental load automates the process of updating external tables based on changes to objects in Object Storage buckets. To do this, you set up a stream in OCI and configure external tables to read from that stream to listen for Object Storage object-level events. To learn more about OCI streams, see Streaming. When an event is emitted, Lakehouse incrementally loads the object changes to the corresponding external tables. These changes include adding new files, removing deleted files, and updating modified files.

When setting up event-based incremental load, you specify the stream OCID in each external table by using the AUTO_REFRESH_SOURCE parameter when using SQL syntax, or the ENGINE_ATTRIBUTE parameter auto_refresh_event_source when using JSON syntax.

You can set up event-based incremental load before or after the external table is loaded. You can also update or remove event-based incremental load while an external table is loaded.

When a stream OCID for auto_refresh_event_source or AUTO_REFRESH_SOURCE is set or updated for a loaded external table, Lakehouse attempts to fetch stream information and call the get-messages API to verify that it can read from the stream. If either of these operations fails, a WARNING is returned with one of the following messages:

  • Could not fetch stream information. Please check the Stream OCID and your access policies.

  • Could not read messages from the stream. Please check your access policies.
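For example, a minimal sketch of attaching a stream to a loaded table and then checking for these warnings (the table name and stream OCID are placeholders):

mysql> ALTER TABLE table_1
        ENGINE_ATTRIBUTE = '{"auto_refresh_event_source": "ocid1.stream.oc1.iad..."}';
mysql> SHOW WARNINGS;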

You can modify the auto_refresh_event_source or AUTO_REFRESH_SOURCE stream OCID for a loaded external table with one of the following results:

  • OCID Added: Validations and potential warnings as above.

  • OCID Removed: Lakehouse drops all remaining queued events for this table. Any auto refresh that is currently in progress runs to completion.

  • OCID Updated: Validations and potential warnings as above. Lakehouse performs a full incremental load of the table to guarantee consistency.
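For example, a sketch of pointing a loaded table at a different stream (both OCIDs here are placeholders): this triggers the validations above, and Lakehouse then performs a full incremental load of the table.

mysql> ALTER TABLE table_1
        ENGINE_ATTRIBUTE = '{"auto_refresh_event_source": "ocid1.stream.oc1.phx..."}';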

Keep in mind the following when setting up event-based incremental load:

  • There is no guarantee that updates to external tables will be applied in the same order as updates to the files in Object Storage.

  • For event-based incremental load, strict mode (enabled by the STRICT_LOAD and is_strict_mode parameters) does not verify whether the external path used for loading data becomes empty. This can lead to loaded external tables being empty.

  • The setting for the ALLOW_MISSING_FILES parameter is ignored for event-based incremental load, which means it is possible to have an empty external table.

This topic contains the following sections:

  • Before You Begin

  • Create and Set Up Streams

  • Set Up Policy Statements

  • Set Up Object-Level Events

  • Set Up Event Rules

  • Set Up External Table with Event-Based Incremental Load

  • What's Next

Before You Begin

Review the following requirements and recommendations before setting up event-based incremental load:

    • Object Storage object-level events must be enabled for the bucket that stores the files for the external tables. See Set Up Object-Level Events.

    • External tables and stream OCIDs have the following restrictions:

      • Each external table can only have one stream OCID.

      • A single stream OCID can be configured for multiple external tables.

    • When setting up an external table with event-based incremental load, it is not possible to use the following types of PARs:

      • Bucket PARs

      • Prefix PARs

      • PAR URIs that do not have an object path (end in /o/)

      If an external table specifies one of these types of PARs and also specifies a stream for event-based incremental load, an error is returned.

    • We recommend not sharing stream OCIDs across DB Systems, and limiting each stream OCID to event-based incremental load only.

    • Renaming an object in an Object Storage bucket does not trigger an event. To trigger an event, you must delete and re-upload the object, or overwrite the object.

    • It is typically sufficient to set up streams with only one partition. Set up a stream with multiple partitions only when you plan many object operations that need to be incrementally loaded through events: Lakehouse reads at most 1 MB of raw stream message payload per partition per second, so consider this limit when calculating how many partitions you need (a rough worked estimate follows this list). See Limits on Streaming Resources to learn more. Object Storage events are JSON objects; for relevant examples, see Contents of an Events Message and Services that Produce Events: Object Storage.
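As a rough, illustrative estimate: if Object Storage event messages average about 1 KB (an assumption; actual message sizes vary), a single partition's 1 MB per second read limit sustains on the order of 1,000 events per second. One partition is therefore enough unless your sustained object-operation rate approaches that figure.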

Create and Set Up Streams

The Events service requires an intermediary to persist and forward events. Lakehouse uses the OCI Streaming service for this purpose because it provides persistence guarantees and asynchronous event delivery.

To learn more about streaming, refer to the Oracle Cloud Infrastructure Documentation.

To create the stream, you can use the OCI console, the command line, or the API. See Creating a Stream.
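For example, a sketch of creating a single-partition stream with the CLI (the stream name and compartment OCID are placeholders):

$ oci streaming admin stream create --name lakehouse-events --partitions 1 --compartment-id ocid1.compartment.oc1...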

The default limit on the number of partitions you can create is 200. To check your current limit, use the following command:

$ oci limits value list --compartment-id ocid1.tenancy.oc1... --service-name streaming

Replace ocid1.tenancy.oc1... with the appropriate compartment-id value.

Set Up Policy Statements

After creating the required streams, you need to set up policy statements that give appropriate dynamic groups access to the streams.

To learn more about policies and dynamic groups, refer to the Oracle Cloud Infrastructure Documentation.

You need to update the appropriate policy with policy statements that give dynamic groups access to the streams. The dynamic group you specify must be the one that the DB System belongs to. You can include multiple stream OCIDs in one policy statement. To get the required stream OCID, select the stream from the Streams page in the OCI console.

To create the policy statement, use the following syntax:

Allow dynamic-group dynamic_group_name to use stream-pull in compartment compartment_name where any {target.stream.id = 'stream_OCID_1', target.stream.id = 'stream_OCID_2'}

Replace dynamic_group_name, compartment_name, and stream_OCID_N with the DB System's dynamic group, the compartment, and the streams you use for Lakehouse, respectively.
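For example, with hypothetical names (a dynamic group named heatwave_dbsystems, a compartment named lakehouse_compartment, and a single stream):

Allow dynamic-group heatwave_dbsystems to use stream-pull in compartment lakehouse_compartment where any {target.stream.id = 'ocid1.stream.oc1.iad...'}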

To learn more about stream details and policy statements, refer to the Oracle Cloud Infrastructure Documentation.

Set Up Object-Level Events

The Object Storage buckets that store the files must have event emission enabled. See Managing Emitting Events for Object State Changes in an Object Storage Bucket.
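For example, a sketch of enabling object-level events from the CLI (the namespace and bucket names are placeholders):

$ oci os bucket update --namespace-name mynamespace --bucket-name mybucket --object-events-enabled true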

Set Up Event Rules

You need to set up rules to route events from Object Storage to the streams. It is important to filter events appropriately before they are sent to the stream, so that only events relevant to the external file location are emitted; it is not possible to filter events after they are sent to the stream. See Matching Events with Filters to learn more.

The default limit on the number of event rules you can create is 50. To check your current limit, use the following command:

$ oci limits value list --compartment-id ocid1.tenancy.oc1... --service-name events

Replace ocid1.tenancy.oc1... with the appropriate compartment-id value.

Creating a rule involves the following steps:

  1. Create the rule.

  2. Extend the rule with attribute filters.

When creating the event rule in the OCI console, do the following:

  • For Condition, select Event Type.

  • For Service Name, select Object Storage.

  • For Event Type, select all of Object - Create, Object - Update, and Object - Delete.

  • For Action Type, select Streaming. Then, also select the appropriate compartment and streams. You can add multiple destinations by clicking + Another Action.

  • Optionally, add tags to the rule in Advanced options.

  • Click Create Rule.

After creating an event-type filter condition, add more filters on Object Storage-specific attributes by clicking + Another Condition. Select Attribute for Condition, resourceName for Attribute Name, and enter the prefix string for Attribute Values. Add more filters only for the following attributes; you can use an asterisk (*) as a wildcard if needed. A sketch of a resulting rule condition follows this list.

  • bucketId

  • bucketName

  • compartmentId

  • compartmentName

  • namespace

  • resourceName
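As a sketch, a rule condition that matches create, update, and delete events for objects under a hypothetical data_files/ prefix in a bucket named mybucket might look like the following JSON. The attribute placement here follows the Object Storage event schema, so verify it against Contents of an Events Message:

{
  "eventType": [
    "com.oraclecloud.objectstorage.createobject",
    "com.oraclecloud.objectstorage.updateobject",
    "com.oraclecloud.objectstorage.deleteobject"
  ],
  "data": {
    "additionalDetails": {"bucketName": "mybucket"},
    "resourceName": "data_files/*"
  }
}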

Note that Lakehouse cannot cross-check file locations specified for the external table in ENGINE_ATTRIBUTE options.

It is also possible to change the event attribute filters later. To do so, go to the rule's details page, and under Resources on the left select Event Matching. Then, under Attributes, select Add Attribute.

See Creating an Events Rule to learn more.

Set Up External Table with Event-Based Incremental Load

The final step is to set up the external table with a stream so that it can load data automatically with event-based incremental load. For the initial load, you must specify at least one external file location with the file (JSON) or FILES (SQL) option.

If you create the table manually, you can use the CREATE EXTERNAL TABLE command to set it up before loading data into the table. If the table was previously created, use ALTER TABLE.

To set up the external table with a stream, use the AUTO_REFRESH_SOURCE parameter when using SQL syntax, or the ENGINE_ATTRIBUTE parameter auto_refresh_event_source when using JSON syntax.

The following example uses Auto Parallel Load to automatically create a new table and set up a stream.

mysql> SET @input_list = '[{
  "db_name": "lakehouse_db",
  "tables": [{
    "table_name": "table_1",
    "engine_attribute": {
      "dialect": {"format": "csv"},
      "auto_refresh_event_source": "ocid1.stream.oc1.iad...",
      "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]
    }
  }]
}]';
mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL);

The following example uses SQL syntax to create a new external table and set up a stream. The HEATWAVE_LOAD command loads the table with Auto Parallel Load.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        FILE_FORMAT = (FORMAT csv)
        FILES = (URI = 'oci://mybucket@mynamespace/data_files/data_file_1.csv')        
        AUTO_REFRESH_SOURCE = 'ocid1.stream.oc1.iad...';
mysql> CALL sys.HEATWAVE_LOAD('[ {"db_name": "lakehouse_db", "tables": ["table_1"]} ]', NULL);

The following example uses SQL syntax to create a new external table and set up a stream. The SECONDARY_LOAD command loads the table with Guided Load.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        FILE_FORMAT = (FORMAT csv)
        FILES = (URI = 'oci://mybucket@mynamespace/data_files/data_file_1.csv')        
        AUTO_REFRESH_SOURCE = 'ocid1.stream.oc1.iad...';
mysql> ALTER TABLE table_1 SECONDARY_LOAD;

The following example uses JSON syntax and updates a table with a stream. The SECONDARY_LOAD command loads the table with Guided Load.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        FILE_FORMAT = (FORMAT csv)
        FILES = (URI = 'oci://mybucket@mynamespace/data_files/data_file_1.csv');
mysql> ALTER TABLE table_1
        ENGINE_ATTRIBUTE = '{"auto_refresh_event_source": "ocid1.stream.oc1.iad..."}';
mysql> ALTER TABLE table_1 SECONDARY_LOAD;

What's Next

After successfully setting up a table with event-based incremental load, see the related topics in this guide to learn more.