
# Model build and predict

This example walks you through the steps involved in building an ML model using historic data and predicting on new incoming data.

The example provides historic sensor data from wind turbines and their failures. A model is built from this historic data, and new daily sensor data is passed through the model to predict failures.

Download the project files here: Reference Project

Building an ML model and predicting on RapidCanvas involves the following steps:

  • Import functions

  • Authenticate your client

  • Create a custom environment

  • Create a new project

    • Fetch pre-built templates

  • Set project variables and create scenarios

    • Set project variables

    • Create relevant scenarios

      • Create a build scenario

      • Create a predict scenario

  • Create a build pipeline

    • Add Input Datasets

    • Transform your raw data

      • Create recipe to fill nulls

      • Create recipe to clean data

    • Create recipe for tsfresh features

    • Build ML model

      • Create a recipe to add labels to the dataset

      • Create a recipe to build a random forest model

  • Create a predict pipeline

    • Add Input Datasets

    • Transform your raw data

      • Create recipe to fill nulls

      • Create recipe to clean raw data

    • Create recipe to add prediction features

    • Create recipe for tsfresh features

    • Model Prediction

      • Create a recipe for model prediction

  • Run scenarios

    • Run predict scenario for model prediction

    • Run build scenario for model building

Import functions


    from utils.rc.client.requests import Requests
    from utils.rc.client.auth import AuthClient
    
    from utils.rc.dtos.project import Project
    from utils.rc.dtos.dataset import Dataset
    from utils.rc.dtos.recipe import Recipe
    from utils.rc.dtos.transform import Transform
    from utils.rc.dtos.template import Template
    from utils.rc.dtos.template import TemplateTransform
    from utils.rc.dtos.template import TemplateInput
    from utils.rc.dtos.env import Env
    from utils.rc.dtos.env import EnvType
    from utils.rc.dtos.template_v2 import TemplateV2, TemplateTransformV2
    from utils.rc.dtos.global_variable import GlobalVariable
    from utils.rc.dtos.scenario import RecipeExpression
    from utils.rc.dtos.scenario import Operator
    from utils.rc.dtos.scenario import Scenario
    
    from utils.rc.dtos.dataSource import DataSource
    from utils.rc.dtos.dataSource import DataSourceType
    from utils.rc.dtos.dataSource import GcpConfig
    
    import logging
    
    logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

## Authenticate your client
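
Before creating any resources, authenticate your notebook client against your RapidCanvas environment. Below is a minimal sketch assuming token-based authentication with the Requests and AuthClient helpers imported above; the exact method names, host URL, and token handling depend on your deployment, so refer to the How to Authenticate page for the calls supported by your SDK version.

```ipython3

    # A hedged sketch: Requests.setRootHost and AuthClient.setToken are
    # assumptions about the SDK surface, not confirmed calls. Replace the
    # host URL and token with the values for your own environment.
    Requests.setRootHost("https://staging.dev.rapidcanvas.net/api/")
    AuthClient.setToken(token="<your-api-token>")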

## Create a custom environment

Here are the available custom environments and their usage guidelines:

  • SMALL: 1 Core, 2GB Memory
  • MEDIUM: 2 Cores, 4GB Memory
  • LARGE: 4 Cores, 8GB Memory
  • CPU_LARGE: 8 Cores, 16GB Memory
  • MAX_LARGE: 12 Cores, 32GB Memory
  • EXTRA_MAX_LARGE: 12 Cores, 48GB Memory

```ipython3

    ## Environment Creation
    env = Env.createEnv(
        name="env_build_predict",
        description="Max large env for running build and predict",
        envType=EnvType.MAX_LARGE,
        requirements=""
    )
    env.id

## Create a Project

Create a new project under your tenant.

```ipython3

    project_name = "Build and Predict"
    description = "One project for build and predict with 2 pipelines"
    icon = "https://rapidcanvas.ai/wp-content/uploads/2022/09/windturbine_med.jpg"
    project = Project.create(
        name=project_name,
        description=description,
        icon=icon,
        envId=env.id,
    #     createEmpty=True
    )
    project.id

**This has now created a new project named “Build and Predict” under
your tenant. You can check the created project on the RapidCanvas UI by logging in
here:** [RapidCanvas UI](https://staging.dev.rapidcanvas.net/)

### Getting Templates

You can use pre-built RapidCanvas templates as part of your project.
In this section, we fetch the pre-built templates that will be used
in the build pipeline.

```ipython3

    # This gets all available templates
    templates = TemplateV2.get_all()

```ipython3

    # Relevant templates for this project are being fetched
    fill_null_template = TemplateV2.get_template_by('Fill Null Timeseries')
    undersample_timeseries_template = TemplateV2.get_template_by('Undersample Timeseries Data')
    tsfresh_template = TemplateV2.get_template_by('Tsfresh Features')
    time_to_event_template = TemplateV2.get_template_by('Time To Event')
    RandomForest_template = TemplateV2.get_template_by('Random Forest')

## Set project variables and create scenarios

### Add project variables

Project variables are stored as key-value pairs at the project level, and
a variable can be referenced using the “@variable_name” notation to pass
its value. In this case we are creating a global variable called
mode_global, which is used to determine whether to run the build pipeline
or the predict pipeline.

```ipython3

    globalVariable = GlobalVariable(
        name="mode_global",
        project_id=project.id,
        type="string",
        value="build"
    )
    globalVariable.create()

### Create relevant scenarios

A scenario is created within a project and allows you to run a pipeline or
a recipe only when certain conditions are met. We are using scenarios in
this example to run either just the build pipeline or just the predict
pipeline.

#### Build Scenario

As part of the build scenario, our global variable mode_global is set to
“build”, which runs only the build pipeline and skips the predict
pipeline. After your first build run, you will typically only re-run your
predict pipeline. However, if you have new historic data or want to
rebuild the model, you can re-run the build scenario.

```ipython3

    build_scenario = project.add_scenario(
        name='build_scenario',
        description='Model Build',
        shared_variables=dict(mode_global="build")
    )

#### Predict Scenario

As part of the predict scenario, our global variable mode_global is set to
“predict”, which runs only the predict pipeline and skips the build
pipeline.

In our example, we will run the predict scenario every time we have a new
file to predict. During prediction we use the model that was already built
by the build pipeline.

```ipython3

    predict_scenario = project.add_scenario(
        name='predict_scenario',
        description='Model Predict',
        shared_variables=dict(mode_global="predict")
    )

## Create a build pipeline

In this section, we follow all the relevant steps to build an ML model
using historic data.

### Add Input Datasets - Build pipeline

As part of the build pipeline, we add two datasets to the project: sensor
data and failures data. The sensor data contains all the historic data
collected from wind turbine sensors for a given time period.

The failures data contains the list of turbine components and their
corresponding failure timestamps, along with remarks.

```ipython3

    sensorsDataset = project.addDataset(
        dataset_name="sensor_events",
        dataset_description="Sensor data of wind turbines",
        dataset_file_path="data/sensor_edp.csv"
    )
    
    labelsDataset = project.addDataset(
        dataset_name="incident_events",
        dataset_description="Labels data of wind turbines",
        dataset_file_path="data/failures_edp.csv"
    )

## Transform your raw data

### Create a recipe to fill nulls

This recipe cleans up the sensor data by identifying any nulls and
filling them using the chosen method.

Note that we have added a line before the recipe to define a build_mode
condition. This condition ensures the recipe runs only when the value of
“mode_global” is set to “build”; otherwise, the recipe run is skipped.
Please note that if this recipe run is skipped, everything downstream of
this recipe is also skipped.

```ipython3

    build_mode = RecipeExpression(field='@mode_global', operator=Operator.EQUAL_TO, value='build')
    fill_null_recipe = project.addRecipe([sensorsDataset], name='fill_null_recipe', condition=build_mode)
    fill_null = Transform()
    fill_null.templateId = fill_null_template.id
    fill_null.name='fill_null'
    fill_null.variables = {
        'inputDataset': "sensor_events",
        'columns':'',
        'Group_by':'Turbine_ID',
        'how':'ffill',
        'Timestamp': 'Timestamp',
        'outputDataset':'fill_null_output'
    }
    fill_null_recipe.add_transform(fill_null)
    fill_null_recipe.run()

```ipython3

    fill_null_dataset = fill_null_recipe.getChildrenDatasets()['fill_null_output']
    fill_null_dataset.getData(5)

### Create a recipe to clean data

This recipe takes the fill-null output and uses undersample timeseries
to clean the sensor dataset.

Note that we have not added a condition to run this only for the build
pipeline. A condition is optional at this point because this recipe is
connected to the output of the previous recipe, which already has the
condition in place. If the condition is satisfied on the first recipe,
everything downstream runs; if it is not met, the first recipe is skipped
along with everything downstream.

```ipython3

    sensor_cleaning_recipe = project.addRecipe([fill_null_dataset], name='sensor_cleaning_recipe')
    undersample_timeseries = Transform()
    undersample_timeseries.templateId = undersample_timeseries_template.id
    undersample_timeseries.name='undersample_timeseries'
    undersample_timeseries.variables = {
        'inputDataset': "fill_null_output",
        'Col_to_undersample_by':'Turbine_ID',
        'Timestamp':"Timestamp",
        'Frequency': "D",
        'Resample_type': "MEAN",
        'outputDataset':'sensor_cleaned'
    }
    sensor_cleaning_recipe.add_transform(undersample_timeseries)
    sensor_cleaning_recipe.run()

#### Output dataset and review sample

```ipython3

    sensor_cleaned = sensor_cleaning_recipe.getChildrenDatasets()['sensor_cleaned']
    sensor_cleaned.getData(5)

### Create a recipe for tsfresh features

This recipe takes the cleaned historic sensor output and generates
30-day aggregates for each row of data. This produces all the
additional features we need.

Please note that rows for which 30 days of historic data are not
available will be dropped at this step.

```ipython3

    sensor_tsfresh = project.addRecipe([sensor_cleaned], name='sensor_tsfresh')
    tsfresh = Transform()
    tsfresh.templateId = tsfresh_template.id
    tsfresh.name='tsfresh'
    tsfresh.variables = {
        'inputDataset': "sensor_cleaned",
        "max_timeshift":30,
        "min_timeshift":30,
        "entity":'Turbine_ID',
        "time":'Timestamp',
        "large":"True",
        "outputDataset": "sensor_ts_fresh"
    }
    sensor_tsfresh.add_transform(tsfresh)
    sensor_tsfresh.run()

#### Output dataset and review sample

```ipython3

    sensor_tsfresh_dataset = sensor_tsfresh.getChildrenDatasets()['sensor_ts_fresh']
    sensor_tsfresh_dataset.getData(5)

## Build ML model

### Create a recipe to add labels to the dataset

As part of the model building step, we first join the dataset containing
the new features with the failures dataset that we uploaded at the start
of the build pipeline.

```ipython3

    join_time_to_failure_recipe=project.addRecipe([sensor_tsfresh_dataset, labelsDataset], name='join_time_to_failure_recipe')
    time_to_failure = Transform()
    time_to_failure.templateId = time_to_event_template.id
    time_to_failure.name='time_to_failure'
    time_to_failure.variables = {
        'EventDataset':labelsDataset.name,
        'TimeSeriesDataset':'sensor_ts_fresh',
        'Eventkey':'Turbine_ID',
        'TimeSerieskey':'Turbine_ID',
        'EventTimestamp':'Timestamp',
        'TimeSeriesTimestamp':'Timestamp',
        'UnitOfTime':'days',
        'outputDataset':'time_to_failure_dataset'
    }
    join_time_to_failure_recipe.add_transform(time_to_failure)
    join_time_to_failure_recipe.run()

#### Output dataset and review sample

```ipython3

    time_to_failure_dataset = join_time_to_failure_recipe.getChildrenDatasets()['time_to_failure_dataset']
    time_to_failure_dataset.getData(5)

### Create a recipe to build a random forest model

In this step we build the ML model. Please note that once the model is
built, it is automatically stored in the RapidCanvas repository and can be
retrieved for prediction in later steps.

Note that this marks the end of the build pipeline.

```ipython3

    template = TemplateV2(
        name="LocalRandomForest", description="LocalRandomForest", project_id=project.id, source="CUSTOM", status="ACTIVE", tags=["Number", "datatype-long"]
    )
    template_transform = TemplateTransformV2(type = "python", params=dict(notebookName="Local-Random-Forest.ipynb"))
    template.base_transforms = [template_transform]
    template.publish("transforms/Local-Random-Forest.ipynb")

```ipython3

    randomforest_recipe=project.addRecipe([time_to_failure_dataset], 'LocalRandomForest')
    RandomForest = Transform()
    RandomForest.templateId = template.id
    RandomForest.name='RandomForest'
    RandomForest.variables = {
        'inputDataset': "time_to_failure_dataset",
        'target':'time_to_event',
        'train_size':0.8,
        'model_to_save':'ml_random_forest_v15'
    }
    randomforest_recipe.add_transform(RandomForest)

```ipython3

    randomforest_recipe.run()

#### Output dataset and review sample

```ipython3

    children = randomforest_recipe.getChildrenDatasets()

```ipython3

    children['Test_with_prediction'].getData(5)

```ipython3

    children['Train_with_prediction'].getData(5)

## Model Prediction Pipeline

Now that we have built an ML model, we can start building our pipeline to
utilize the model for predicting on new sensor data.

### Add input dataset - Daily prediction files

This is the new sensor data which has not been used by the model and
will be predicted on.

```ipython3

    dailyDataset = project.addDataset(
        dataset_name="daily_events",
        dataset_description="Daily data of wind turbine for prediction",
        dataset_file_path="data/daily_data/data-with-features.csv"
    )

## Transform your raw data

### Create recipe to fill nulls

We follow the same data cleaning steps for the new data as we did during
the build pipeline. Do note that we have added a line before the recipe
to define a predict_mode condition.

This condition ensures the recipe runs only when the value of
“mode_global” is set to “predict”; otherwise, the recipe run is skipped.
Please note that if this recipe run is skipped, everything downstream of
this recipe is also skipped.

```ipython3

    predict_mode = RecipeExpression(field='@mode_global', operator=Operator.EQUAL_TO, value='predict')
    fill_null_recipe_predict = project.addRecipe([dailyDataset], name='fill_null_recipe_predict', condition=predict_mode)
    fill_null = Transform()
    fill_null.templateId = fill_null_template.id
    fill_null.name='fill_null'
    fill_null.variables = {
        'inputDataset': "daily_events",
        'columns':'',
        'Group_by':'Turbine_ID',
        'how':'ffill',
        'Timestamp': 'Timestamp',
        'outputDataset':'fill_null_output_predict'
    }
    fill_null_recipe_predict.add_transform(fill_null)
    fill_null_recipe_predict.run()

```ipython3

    fill_null_predict_dataset = fill_null_recipe_predict.getChildrenDatasets()['fill_null_output_predict']
    fill_null_predict_dataset.getData(5)

### Create recipe to clean data

This recipe takes the fill-null output of the new data and uses
undersample timeseries to clean the dataset.

Note that we have not added a condition to run this only for the
predict pipeline. A condition is optional at this point because this
recipe is connected to the output of the previous recipe, which already
has the condition in place. If the condition is satisfied on the first
recipe, everything downstream runs; if it is not met, the first recipe is
skipped along with everything downstream.

```ipython3

    sensor_cleaned_recipe_predict = project.addRecipe([fill_null_predict_dataset], name='sensor_cleaned_recipe_predict')
    undersample_timeseries = Transform()
    undersample_timeseries.templateId = undersample_timeseries_template.id
    undersample_timeseries.name='undersample_timeseries'
    undersample_timeseries.variables = {
        'inputDataset': "fill_null_output_predict",
        'Col_to_undersample_by':'Turbine_ID',
        'Timestamp':"Timestamp",
        'Frequency': "D",
        'Resample_type': "MEAN",
        'outputDataset':'sensor_cleaned_predict'
    }
    sensor_cleaned_recipe_predict.add_transform(undersample_timeseries)
    sensor_cleaned_recipe_predict.run()

```ipython3

    sensor_cleaned_predict_dataset = sensor_cleaned_recipe_predict.getChildrenDatasets()['sensor_cleaned_predict']
    sensor_cleaned_predict_dataset.getData(5)

### Create recipe to add prediction features

This step is new to the predict pipeline compared to the build pipeline.
During the build pipeline, all the historic data was available to
generate aggregates. However, during the predict pipeline we only have
access to that day's data. To generate 30-day aggregates, we need to pull
the relevant 30 days of historic data for each of these rows.

This recipe goes to the feature store and pulls the data needed to
generate the 30-day aggregates.

**Do note that this feature is still in beta and might change in the
future.**

```ipython3

    addFeaturesRecipe = project.addRecipe([sensor_cleaned_predict_dataset], name="addFeatures")

```ipython3

    template = TemplateV2(
        name="AddFeaturesPredict", description="AddFeaturesPredict", project_id=project.id, source="CUSTOM", status="ACTIVE", tags=["Number", "datatype-long"]
    )
    template_transform = TemplateTransformV2(type = "python", params=dict(notebookName="Add-Features-Predict.ipynb"))
    template.base_transforms = [template_transform]
    template.publish("transforms/Add-Features-Predict.ipynb")

```ipython3

    transform = Transform()
    transform.templateId = template.id
    transform.name = "addFeatures"
    transform.variables = {
        "cleanedDataset": "sensor_cleaned_predict",
        "outputDataset": "addFeaturesOutput"
    }

```ipython3

    addFeaturesRecipe.add_transform(transform)
    addFeaturesRecipe.run()

```ipython3

    added_features_dataset = addFeaturesRecipe.getChildrenDatasets()['addFeaturesOutput']
    added_features_dataset.getData(5)

### Create recipe for tsfresh features

Now that we have the necessary historic data available, this recipe
generates 30-day aggregates for each row of data. This produces all the
tsfresh features we need.

Please note that rows for which 30 days of historic data are not
available will be dropped at this step.

```ipython3

    sensor_tsfresh_predict = project.addRecipe([added_features_dataset], name='sensor_tsfresh_predict')
    tsfresh = Transform()
    tsfresh.templateId = tsfresh_template.id
    tsfresh.name='tsfresh'
    tsfresh.variables = {
        'inputDataset': "addFeaturesOutput",
        "max_timeshift":30,
        "min_timeshift":30,
        "entity":'Turbine_ID',
        "time":'Timestamp',
        "large":"True",
        "outputDataset": "sensor_ts_fresh_predict"
    }
    sensor_tsfresh_predict.add_transform(tsfresh)
    sensor_tsfresh_predict.run()

```ipython3

    sensor_tsfresh_predict_dataset = sensor_tsfresh_predict.getChildrenDatasets()['sensor_ts_fresh_predict']

```ipython3

    sensor_tsfresh_predict_dataset.getData(5)

### Create a recipe for model prediction

In this step, we pass the feature-enriched daily dataset to our
previously stored model. All you need to provide is the model name, and
RapidCanvas runs the dataset through it.

```ipython3

    prediction_template = TemplateV2(
        name="Model Prediction", description="Pick a model to run the prediction on the dataset",
        source="CUSTOM", status="ACTIVE", tags=["UI", "Aggregation"], project_id=project.id
    )
    prediction_template_transform = TemplateTransformV2(
        type = "python", params=dict(notebookName="Prediction.ipynb"))
    
    prediction_template.base_transforms = [prediction_template_transform]
    prediction_template.publish("transforms/Prediction.ipynb")

```ipython3

    predictor_transform = Transform()
    predictor_transform.templateId = prediction_template.id
    predictor_transform.name='predictor'
    predictor_transform.variables = {
        'inputDataset': "sensor_ts_fresh_predict",
        "modelName":"ml_random_forest_v15"
    }

```ipython3

    predictor = project.addRecipe([sensor_tsfresh_predict_dataset], name='predictor')

```ipython3

    predictor.add_transform(predictor_transform)
    predictor.run()

#### Output dataset and review sample

```ipython3

    predictions = predictor.getChildrenDatasets()['prediction']
    predictions.getData(5)

## Run only predict scenario

If you get new data on a daily basis, you can update the daily_events
dataset and run just the predict scenario, which ensures that only the
predict pipeline runs and the build pipeline is skipped.

You can review this by switching the scenario dropdown to
predict_scenario in the canvas view on the [RapidCanvas
UI](https://staging.dev.rapidcanvas.net/).

```ipython3

    #project.run_scenario(predict_scenario._id)
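
A minimal sketch of a full daily run is shown below. It assumes that re-adding a dataset named daily_events with the new file refreshes the data behind the existing node, and the file path used here is hypothetical; check your SDK version for the recommended way to update an existing dataset before relying on this.

```ipython3

    # Hedged sketch of a daily prediction run. Assumption: calling addDataset
    # with the existing dataset name points the daily_events node at the new
    # file; the CSV path below is a placeholder.
    dailyDataset = project.addDataset(
        dataset_name="daily_events",
        dataset_description="Daily data of wind turbine for prediction",
        dataset_file_path="data/daily_data/latest-daily-file.csv"
    )
    project.run_scenario(predict_scenario._id)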

## Run only build scenario

If you want to rebuild the model for any reason, you can run just the
build scenario, which ensures the build pipeline runs and the predict
pipeline is skipped.

You can review this by switching the scenario dropdown to
build_scenario in the canvas view on the [RapidCanvas
UI](https://staging.dev.rapidcanvas.net/).

```ipython3

    #project.run_scenario(build_scenario._id)