10.2.13 NL2ML

MySQL 9.4.0 introduces the NL2ML (natural language to machine learning) routine, which provides step-by-step guidelines on using MySQL HeatWave AutoML for a particular business problem. Use this routine to ask general questions about MySQL HeatWave AutoML and to generate AutoML queries you can copy and run.

Note
  • Before MySQL 9.6.0, NL2ML requires Oracle Cloud Infrastructure Generative AI, which is available in specific regions. To confirm if your region has Generative AI, see Regions with Generative AI.

    As of MySQL 9.6.0, if your region does not have OCI Generative AI, or you want to bypass OCI Generative AI and instead use in-database LLMs to generate responses, you can set the skip_generate option to true. See Use NL2ML with In-Database LLMs.

  • NL2ML uses pretrained foundational models supported by OCI Generative AI.

    • Before MySQL 9.6.0, MySQL HeatWave in-database LLMs are not supported. To review available OCI Generative AI models, see Pretrained Foundational Models in Generative AI. For every major MySQL release, NL2ML creates a new vector store from the MySQL HeatWave User Guide documentation. After creating each new vector store, the old vector store gets deleted. Due to the creation of vector stores, some NL2ML calls may take longer to complete.

    • As of MySQL 9.6.0, the skip_generate option allows you to use in-database LLMs instead of foundational models supported by OCI Generative AI. See Use NL2ML with In-Database LLMs.

NL2ML Syntax

mysql> CALL sys.NL2ML (query, response);

NL2ML parameters:

  • query: Enter a question in natural language related to MySQL HeatWave AutoML. For example, "What are the different types of machine learning models I can create?".

  • response: The name of the JSON object session variable that contains the response to the question.

The nl2ml_options Session Variable

The @nl2ml_options session variable allows you to do the following:

  • Set the skip_generate option to true or false. The default value is false. You must set this to true to bypass OCI Generative AI and only provide citations. This also allows you to leverage the citations as context to generate model training guidance and SQL queries with in-database LLMs. To learn more, see Use NL2ML with In-Database LLMs.

  • View the chat history of the last 20 questions and current OCI Generative AI model:

    mysql> SELECT JSON_PRETTY(@nl2ml_options);
  • Reset the chat history. Use one of the following commands:

    mysql> SET @nl2ml_options = NULL;
    
    mysql> SET @nl2ml_options = JSON_REMOVE(@nl2ml_options, '$.chat_history');
  • Change the current OCI Generative AI model. To review available options for model_id, see Pretrained Foundational Models in Generative AI.

    Change the current model and reset the chat history:

    mysql> SET @nl2ml_options = JSON_OBJECT("model_id", "meta.llama-3.1-405b-instruct");

    Change the current model and maintain the current chat history:

    mysql> SET @nl2ml_options = JSON_SET(@nl2ml_options, '$.model_id', "meta.llama-3.1-405b-instruct");

Syntax Examples

  • Ask a question about using MySQL HeatWave AutoML:

    mysql> CALL sys.NL2ML("Is there any limit on the size of training table?", @output);
    Query OK, 0 rows affected (11.0086 sec)
    
    mysql> SELECT @output;
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | @output |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"text": "Yes, there are limits on the size of the training table for HeatWave AutoML models. The table used to train a model cannot exceed 10 GB in size, 100 million rows, or 1017 columns."} |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
  • Ask a question, and then ask a follow-up question based on the first one:

    mysql> CALL sys.NL2ML("What is the purpose of task in ML_TRAIN?", @output);
    Query OK, 0 rows affected (32.4279 sec)
    
    mysql> SELECT @output;
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | @output |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"text": "The purpose of 'task' in ML_TRAIN is to specify the machine learning task, such as classification, regression, forecasting, anomaly detection, recommendation, or topic modeling."} |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
    mysql> CALL sys.NL2ML("Can you give me an example of using that?", @output);
    Query OK, 0 rows affected (11.7221 sec)
    
    mysql> SELECT @output;
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | @output |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"text": "Here's an example of specifying a machine learning task in ML_TRAIN:
               \n\n```sql\nCALL sys.ML_TRAIN('mlcorpus.emails', 'spam_label', JSON_OBJECT('task', 'classification'), @model);
               \n```\n\nIn this example, the machine learning task is specified as 'classification' and the target column is 'spam_label' in the 'emails' table."} |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
  • Ask for an appropriate table in the DB System to use for a training query:

    mysql> CALL sys.NL2ML("Can you give me a training query for email spam detection?", @output);
    Query OK, 0 rows affected (8.5084 sec)
    
    mysql> SELECT @output;
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | @output |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    | {"text": "```sql\nCALL sys.ML_TRAIN('mlcorpus.emails', 'spam_label', \n JSON_OBJECT('task', 'classification'), @email_spam_model);\n```"} |
    +-------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
  • Request debugging assistance for an error message:

    mysql> CALL sys.NL2ML("When I tried to train a model I got this error: can not train a regression or recommendation task with non-numeric targte column What does this mean? How can I resolve the issue?", @output);
    Query OK, 0 rows affected (18.0766 sec)
    
    mysql> SELECT @output;
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | @output                                                                                                                                                                                                                                                                                                                                |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {"text": "The error \"can not train a regression or recommendation task with non-numeric target column\" means that you are trying to train a regression or recommendation model on a dataset where the target column (the column you are trying to predict) is not numeric.
                \n\nRegression models require a numeric target column, as they are designed to predict continuous values. Recommendation models also require numeric target columns, as they are designed to predict ratings or scores.
                \n\nTo resolve this issue, you need to either:\n\n1. Choose a different machine learning task that can handle non-numeric target columns, such as classification.\n2. Convert your target column to a numeric data type, if possible. 
                For example, if your target column is a categorical variable with a limited number of categories, you can use one-hot encoding or label encoding to convert it to a numeric representation.\n3. Select a different column as the target column that is numeric.
                \n\nIn your case, since you are using the \"food_delivery\" table, you may want to check the data type of the column you are trying to predict and see if it can be converted to a numeric data type. If not, you may want to choose a different machine learning task or select a different column as the target column."} |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
  • Retrieve the chat history of questions and responses, along with the current OCI Generative AI model, with the @nl2ml_options session variable:

    mysql> SELECT JSON_PRETTY(@nl2ml_options);
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | JSON_PRETTY(@nl2ml_options) |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | {
    "model_id": "meta.llama-3.1-405b-instruct",
    "chat_history": [
    {
    "user_message": "Is there any limit on the size of training table?",
    "chat_bot_message": "Yes, there is a size limit for training tables in HeatWave. The table used to train a model cannot exceed 10 GB, 100 million rows, or 1017 columns."
    }
    ]
    } |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.0003 sec)
  • Use the skip_generate option to bypass OCI Generative AI content generation and only provide citations to relevant documentation. Then, retrieve a relevant table schema related to the question (ML_RETRIEVE_SCHEMA_METADATA), build a compact context string from the citations (GROUP_CONCAT), and generate a response that includes the citations, context, and retrieved table schema (ML_GENERATE).

    mysql> SET @input = "How can I train a model to predict net worth of a singer?";
      
    mysql> SET @nl2ml_options = JSON_OBJECT("skip_generate", true);
    
    mysql> CALL sys.NL2ML(@input, @out);
    
    mysql> SELECT JSON_PRETTY(@out);
    JSON_PRETTY(@out)
    {
      "citations": [
        {
          "segment": "<segment content>",
          "distance": 0.1023,
          "document_name": "https://.../heatwave-en-ml-9.6.0.pdf",
          "segment_number": <segment number>
        },
        ...
      ],
      "vector_store": [
        "`ML_SCHEMA_userOne`.`hwml_vector_store_heatwave_en_ml_9_6_0`"
      ],
      "retrieval_info": {
        "method": "n_citations",
        "threshold": 0.114
      }
    }
    
    mysql> CALL sys.ML_RETRIEVE_SCHEMA_METADATA(@input, @retrieved, NULL);
      
    mysql> SELECT @retrieved;
    @retrieved
    CREATE TABLE `mlcorpus`.`singer`(
      `Singer_ID` int,
      `Name` varchar,
      `Birth_Year` double,
      `Net_Worth_Millions` double,
      `Citizenship` varchar
    );
    
    mysql> SELECT GROUP_CONCAT(seg SEPARATOR '\n') INTO @ctx
    FROM JSON_TABLE(JSON_EXTRACT(@out,'$.citations'),
    '$[*]' COLUMNS (seg LONGTEXT PATH '$.segment')) AS jt;
      
    mysql> SET @final_ctx = CONCAT(@ctx, '\n\nRetrieved tables:\n', @retrieved);
      
    mysql> SELECT sys.ML_GENERATE(
    @input,
    JSON_OBJECT(
    "task", "generation",
    "model_id", "llama3.2-3b-instruct-v1",
    "context", @final_ctx
    )
    ) INTO @result;
    
    mysql> SELECT JSON_UNQUOTE(JSON_EXTRACT(@result,'$.text')) AS generated_sql;
    generated_sql
    To train a model to predict the net worth of a singer, you can use the ML_TRAIN routine. First, prepare your dataset, 
    which in this case seems to be the 'singer' table in the 'mlcorpus' schema. Ensure that the table has the necessary columns, 
    such as 'Singer_ID', 'Name', 'Birth_Year', 'Net_Worth_Millions', and 'Citizenship'. 
    
    The 'Net_Worth_Millions' column will be your target column for prediction. You may need to preprocess your data, 
    for example, converting categorical variables like 'Name' and 'Citizenship' into numerical variables if necessary.
    
    Then, you can call the ML_TRAIN routine with the appropriate options. For a regression task like predicting net worth, 
    you would specify the task as 'regression' in the JSON options. Here's a simplified example:
    
    ```sql
    CALL sys.ML_TRAIN('mlcorpus.singer', 
                      @model_handle, 
                      'Net_Worth_Millions', 
                      JSON_OBJECT('task', 'regression', 
                                  'algorithm', 'XGBRegressor'));
    ```
    
    Replace '@model_handle' with your actual model handle variable. This will train a model to predict the 'Net_Worth_Millions' 
    based on the other columns in your 'singer' table. After training, you can use the ML_PREDICT_ROW or ML_PREDICT_TABLE routine 
    to generate predictions for new, unseen data.