How OVO delivers Data Observability at scale with Monte Carlo and GitHub Actions
Introduction
OVO has long had a strategy of delegating data ownership to the teams that know the data best. While not quite a Data Mesh - yet - it has enabled Domains to move fast by reducing dependencies on central teams and encouraging squads of engineers to make their own decisions.
We chose Monte Carlo to deliver Data Observability across all of OVO’s Domains because of its high degree of automation, making it easier for us to set benchmarks, measure our performance, and ultimately drive data quality improvements across all our data.
Monte Carlo at OVO
Monte Carlo provides three automated monitors out of the box: Freshness, Volume and Schema. It also makes it easy to create additional monitors through the platform’s user interface, giving deeper insight and field-level data quality monitoring, similar to the ‘assertion’ based monitoring available in other well-known tools.
Adoption of Monte Carlo at OVO has been quick; the Incident Management and Root Cause Analysis features provide substantial benefit to the teams that have already onboarded, and dashboards give us greater visibility into the health of our data and areas for improvement.
We didn’t just want more visibility into our data quality - we wanted to embed data observability directly into our data product development lifecycle.
To do that, we turned to Monte Carlo’s ‘Monitors as Code’ API.
Leveraging Monitors as Code to drive adoption
We wanted to make it super-easy for Data Engineers to adopt Monte Carlo as the preferred approach for Data Quality and Observability - having everything in one place delivers cumulative benefits to OVO by allowing us to keep an eye on the state of our data, identify key datasets and set targets for improving our overall data quality.
Monitors as Code is a YAML-based monitors configuration that helps teams deploy monitors as part of their CI/CD process. At OVO we’ve built some Python code around this YAML configuration to give us even more flexibility.
A ‘monitors as code’ approach was critical to our team for several reasons:
- Testing and validation - Tests give us trust in the consistency of our config; clicking buttons in a web UI is always going to result in some degree of inconsistency. If there’s a setting that should be present for every monitor, we can introduce a unit test to verify that this is indeed the case (see the sketch after this list).
- Version control - Monte Carlo’s web console gives us some history, but having history in git makes it much easier to understand what changed and when in a format that’s consistent with how our engineers are used to working. It also means it’s very easy to revert to an older version if we ever need to.
- Avoid accidental deletion - If monitors are accidentally deleted or modified, it’s very easy to re-run the CI/CD pipeline and redeploy. Monte Carlo even blocks changes in the UI for monitors in code, forcing these changes to go through the repo.
- Change review and collaboration - Sometimes when a monitor detects a false positive, it requires the logic to be changed. This logic is important to us! Having a pull request gives our colleagues a chance to review and approve these changes before we push them live.
- Consistency - We use Terraform to define our cloud infrastructure and other monitoring, so using YAML to define monitors stays close to this approach.
- Dry runs - With the dry run flag enabled we’re able to see what changes we’re about to make before we make them, and check it’s not doing something unexpected!
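As an illustration of the testing point above, a test over the generated YAML might look something like this sketch. The directory layout, the pytest style and the particular setting being checked are illustrative assumptions rather than our exact tests:

from pathlib import Path

import yaml  # PyYAML


def load_custom_sql_monitors(monitors_dir: str = "monitors") -> list:
    """Collect every custom_sql monitor defined in YAML under the monitors directory."""
    monitors = []
    for yaml_file in Path(monitors_dir).rglob("*.yml"):
        config = yaml.safe_load(yaml_file.read_text())
        if not config or "montecarlo" not in config:
            continue  # skips montecarlo.yml project files and anything else
        monitors.extend(config["montecarlo"].get("custom_sql", []))
    return monitors


def test_every_monitor_notifies_on_rule_run_failure():
    for monitor in load_custom_sql_monitors():
        assert monitor.get("notify_rule_run_failure") is True, monitor["name"]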
GitHub Actions is our preferred continuous integration and continuous delivery (CI/CD) platform, so it made sense to build an Action that lets us develop and deploy our monitors as quickly and easily as possible.
How we did it
YAML Monitors as Code
Monte Carlo’s YAML-based monitor configuration lets teams deploy monitors as part of their CI/CD process. At OVO we have set up a central GitHub repository with a folder for each team, where they specify their monitors as YAML files, which are then applied by a GitHub Actions workflow.
These YAMLs can also be seen in the Monte Carlo console so it’s possible to copy an existing monitor and add it to our repo, or compare one we’ve specified in code with our console-created ones.
Here’s an example.
---
montecarlo:
  custom_sql:
    - name: My Monitor [Monitors-As-Code]
      description: Something looks wrong with the data
      labels:
        - My Team
      notify_rule_run_failure: true
      event_rollup_until_changed: true
      severity: SEV-2
      sql: "SELECT..."
      comparisons:
        - type: threshold
          operator: GT
          threshold_value: 0.0
      schedule:
        type: fixed
        start_time: '2023-07-05T11:00:17.400000+00:00'
        timezone: UTC
        interval_minutes: 1440
Python Generated YAMLs
Defining monitors as YAMLs in code is a great step in the right direction, but we decided to go one step further and allow teams to configure their monitors in Python, which then generates YAML monitors. We still allow teams to write YAML files, but the Python code offers some additional benefits.
We created a class for each monitor type that exists in Monte Carlo. This keeps us consistent across monitors and makes it easier to set the required fields without having to refer to the Monte Carlo docs each time. It’s also less error prone: your IDE will warn you if you’ve misspelt a field name or used the wrong type, mistakes that could easily slip through in a YAML config.
We have tests on our classes and supporting functions, and with Python it’s easy for teams creating monitors to build tests around those too.
We could also add additional validation to ensure we conform to the monitor specifications. We have added some defaults to make it easier for teams to set up a monitor, which they can then tweak over time as they learn more about how it behaves. As we learn more about the monitors we need to create, we should be able to identify patterns and opportunities to reuse and simplify the creation of new monitors.
As a general principle, Python gives us increased flexibility. We can, for example, create helpful reusable functions, or use loops to create many similar monitors, which wouldn’t be doable using YAML alone.
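As a made-up illustration of the loop point, a list of tables can be turned into one monitor per table from a single template. MCSqlRule is the custom SQL class covered later in this post; the table names and query here are invented:

# Illustrative only: one null-check monitor per table
# (MCSqlRule is the frozen dataclass for custom SQL monitors shown later in this post)
TABLES = ["accounts", "meter_readings", "tariffs"]

monitors = [
    MCSqlRule(
        name=f"Null customer ids in {table}",
        alert_text=f"{table} contains rows with a null customer_id",
        labels=["My Team"],
        sql=f"SELECT COUNT(*) FROM my_dataset.{table} WHERE customer_id IS NULL",
    )
    for table in TABLES
]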
The below code snippet shows the base monitor, which gives us the main fields used across all monitors, and some functionality to convert to YAML. You will see in a few places we’ve tried to introduce some code to make it easier for users, such as passing a number for severity and casting this to a string in the format Monte Carlo requires, or deriving the filename from the monitor name so filenames are consistent across teams. We also append a [Monitors-As-Code] suffix to each monitor name so we can see in the UI which monitors were created in code.
from abc import ABC, abstractmethod
from dataclasses import field
from typing import List

from base.helpers.yaml import to_yaml  # helper shown below


class MCMonitor(ABC):
    """
    Defines the outline of a Monte Carlo monitor.

    Subclasses (one per monitor type) are frozen dataclasses that redeclare
    the fields they expose in their __init__.
    """

    name: str
    alert_text: str
    labels: List[str]
    type: str
    comparison_flag: bool
    comparison_threshold: float = 0.0
    comparison_operator: str = "GT"
    comparison_type: str = "threshold"
    severity: int = 4
    additional_fields: dict = field(default_factory=dict)
    interval_minutes: int = 1440
    event_rollup_until_changed: bool = True
    schedule_type: str = "fixed"
    notify_rule_run_failure: bool = True

    @property
    @abstractmethod
    def specific_fields(self) -> dict:
        """
        Dict of fields specific to the monitor type
        """

    @property
    def comparisons(self) -> dict:
        return {
            "comparisons": [
                {
                    "type": self.comparison_type,
                    "operator": self.comparison_operator,
                    "threshold_value": self.comparison_threshold,
                }
            ]
        }

    @property
    def schedule(self) -> dict:
        schedule = {
            "schedule": {
                "type": self.schedule_type,
                "start_time": "2023-07-05T11:00:17.400000+00:00",
                "timezone": "UTC",
            }
        }
        if self.schedule_type in ["fixed", "loose"]:
            schedule["schedule"]["interval_minutes"] = self.interval_minutes
        return schedule

    @property
    def yaml(self) -> str:
        monitor_dict = {
            "name": f"{self.name} [Monitors-As-Code]",
            "description": self.name,
            "notes": self.alert_text,
            "labels": self.labels,
            "notify_rule_run_failure": self.notify_rule_run_failure,
            "event_rollup_until_changed": self.event_rollup_until_changed,
            "severity": f"SEV-{self.severity}",
        }
        # Only include the comparisons block for monitor types that use comparisons
        comparisons_block = self.comparisons if self.comparison_flag else {}
        monitor = {
            **monitor_dict,
            **self.specific_fields,
            **comparisons_block,
            **self.schedule,
            **self.additional_fields,
        }
        complete_monitor = {"montecarlo": {f"{self.type}": [monitor]}}
        return to_yaml(complete_monitor)

    @property
    def filename(self) -> str:
        return f"{self.name.replace(' ', '_')}.yml"
Note the above class uses a function called to_yaml to convert our dict to a YAML, which just looks like this:
import io
from typing import Any

from ruamel.yaml import YAML

ruamel_yaml = YAML()
ruamel_yaml.explicit_start = True


def to_yaml(data: Any) -> str:
    stream = io.StringIO()
    ruamel_yaml.dump(data, stream)
    return stream.getvalue()
We have created classes for each monitor type which inherit from the base class above. As an example, here is the class for custom SQL rules. We set the default monitor type to custom_sql and add the fields required for this monitor: a sql field for the SQL query itself, and an optional sampling_sql query that can be run when the monitor triggers to help understand what’s going on.
from dataclasses import dataclass, field
from typing import List

from base.monitors.base_monitor import MCMonitor


@dataclass(frozen=True)
class MCSqlRule(MCMonitor):
    """
    Base class for configuring a Monte Carlo custom SQL Monitor
    """

    name: str
    alert_text: str
    labels: List[str]
    sql: str
    # Defaults mirror the base class, so teams only need to override the fields they care about
    interval_minutes: int = 1440
    severity: int = 4
    schedule_type: str = "fixed"
    comparison_threshold: float = 0.0
    comparison_operator: str = "GT"
    comparison_type: str = "threshold"
    event_rollup_until_changed: bool = True
    investigation_query: str = ""
    additional_fields: dict = field(default_factory=dict)
    type: str = "custom_sql"
    comparison_flag = True

    @property
    def specific_fields(self) -> dict:
        return {
            "sql": self.sql,
            "sampling_sql": self.investigation_query,
        }
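To give a feel for how this is used, instantiating the class and reading its filename and yaml properties produces output in the same shape as the hand-written YAML example earlier. The monitor below is made up purely for illustration:

monitor = MCSqlRule(
    name="Null customer ids",
    alert_text="accounts contains rows with a null customer_id",
    labels=["My Team"],
    sql="SELECT COUNT(*) FROM my_dataset.accounts WHERE customer_id IS NULL",
)

print(monitor.filename)  # Null_customer_ids.yml
print(monitor.yaml)      # a montecarlo -> custom_sql block, ready to be written to a file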
One specific benefit we get for the SQL monitor is being able to pass in a SQL file, which gives us SQLFluff formatting as well as Jinja templating on our SQL query. We can reuse the same query across a number of monitors, rather than copying and pasting it into multiple YAML files and risking mistakes or the copies drifting out of sync. SQL formatting also makes the query easier to review, and helps us spot validation errors before we try to create the monitor.
The following is an example of a class that might be created by a team, inheriting from the MCSqlRule class specified above. This uses another function we’ve defined that allows us to pass in a SQL file, and replace a table_name parameter with the table name. In this example we have a separate folder of SQL files, which helps us organise our code.
SQL_PATH = "namespaces/my_namespace/sql/"


class MySQLMonitor(MCSqlRule):
    name = "My Monitor"
    alert_text = "Something looks wrong with the data"
    labels = [Audience.MY_TEAM.value]
    sql = apply_sql_template(
        SQL_PATH,
        "my_sql_file.sql",
        {"table_name": "my_table"},
    )

    def __init__(self) -> None:
        super().__init__(
            name=self.name,
            alert_text=self.alert_text,
            severity=self.severity,
            labels=self.labels,
            sql=self.sql,
        )
The definition of this function is below. It’s a good example of code that can now be shared and used by all teams to turn SQL template files into reusable queries using Jinja templating.
from pathlib import Path

from jinja2 import Environment, FileSystemLoader


def apply_sql_template(template_file_path: str, template_file_name: str, params: dict) -> str:
    """
    Return SQL from template

    :param template_file_path: folder containing sql templates
    :param template_file_name: name of sql file to use
    :param params: dict of key pairs to pass into template {{ key }}
    """
    environment = Environment(loader=FileSystemLoader(Path(template_file_path)))
    template = environment.get_template(template_file_name)
    sql = template.render(**params)
    return sql
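For example, rendering a hypothetical template for one table looks like this; the template path, file name and query are all invented for illustration:

# namespaces/my_namespace/sql/null_customer_ids.sql (hypothetical template):
#   SELECT COUNT(*) FROM my_dataset.{{ table_name }} WHERE customer_id IS NULL

rendered = apply_sql_template(
    "namespaces/my_namespace/sql/",
    "null_customer_ids.sql",
    {"table_name": "accounts"},
)
# rendered == "SELECT COUNT(*) FROM my_dataset.accounts WHERE customer_id IS NULL"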
Configuring Namespaces
For monitors as code to work, we also need a montecarlo.yml file for each namespace, which specifies the default data source (resource) and the namespace. Namespaces are a Monte Carlo concept that allows us to group monitors; at OVO we generally have one namespace per team.
We have a Python function for generating the montecarlo.yml file, so when a new team is added they can use it to generate their file, with their namespace, in a new folder of the same name (a sketch of such a function follows the example file below). You’ll see later that the GitHub Action loops through the namespace folders and applies the monitor configurations within each one, using the namespace and resource defined.
---
version: 1
default_resource: data_source_id
namespace: my_namespace
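Our actual generation code isn’t shown here, but a minimal sketch of a top-level config class and the file-writing function might look like the following. The MCTopLevelConfig fields follow the example file above; the write_top_level_config name and target path are assumptions for illustration:

from dataclasses import dataclass
from pathlib import Path

from base.helpers.yaml import to_yaml  # helper shown earlier


@dataclass(frozen=True)
class MCTopLevelConfig:
    """Top-level config that becomes a namespace's montecarlo.yml."""

    default_resource: str
    namespace: str
    version: int = 1


def write_top_level_config(config: MCTopLevelConfig, monitors_dir: str = "monitors") -> None:
    # Writes monitors/<namespace>/montecarlo.yml for the given namespace
    target = Path(monitors_dir) / config.namespace
    target.mkdir(parents=True, exist_ok=True)
    content = to_yaml(
        {
            "version": config.version,
            "default_resource": config.default_resource,
            "namespace": config.namespace,
        }
    )
    (target / "montecarlo.yml").write_text(content)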
There is also a function we use across all namespaces which takes a list of monitor classes to generate, the namespace and the top-level config class, and creates the required YAML files. Within this function we add a prefix to the Python-generated YAML files, to differentiate them from YAMLs created directly.
Each team will have something that looks like the following code, where they specify their namespace name, default resource and then list out their monitor classes. The send_all_objects_to_yaml_files function takes care of the rest, generating files and file names consistently across namespaces.
def my_namespace():
    namespace = "my_namespace"
    default_resource = "resource_id"

    monitors = [
        group_of_monitors.MyMonitorOne(),
        group_of_monitors.MyMonitorTwo(),
    ]

    top_level_config = MCTopLevelConfig(default_resource=default_resource, namespace=namespace)

    send_all_objects_to_yaml_files(
        namespace=namespace,
        monitors_to_generate=monitors,
        top_level_config=top_level_config,
    )
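For context, here is a simplified sketch of what send_all_objects_to_yaml_files might do under the hood. The GENERATED_PREFIX constant, the import paths and the write_top_level_config call are illustrative assumptions; the filename and yaml properties come from the base class shown earlier:

from pathlib import Path
from typing import List

from base.helpers.top_level_config import MCTopLevelConfig, write_top_level_config  # hypothetical module
from base.monitors.base_monitor import MCMonitor

GENERATED_PREFIX = "python_generated_"  # illustrative prefix for Python-generated YAMLs


def send_all_objects_to_yaml_files(
    namespace: str,
    monitors_to_generate: List[MCMonitor],
    top_level_config: MCTopLevelConfig,
    monitors_dir: str = "monitors",
) -> None:
    target = Path(monitors_dir) / namespace
    target.mkdir(parents=True, exist_ok=True)

    # Write (or refresh) the namespace's montecarlo.yml
    write_top_level_config(top_level_config, monitors_dir)

    # Write one prefixed YAML file per monitor, using each class's filename and yaml properties
    for monitor in monitors_to_generate:
        (target / f"{GENERATED_PREFIX}{monitor.filename}").write_text(monitor.yaml)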
The Workflow
For teams using this repo we provide a detailed README on how to set up their environment. This is critical because, in its current form, when you create a monitor in Python a pre-commit hook runs the Python to generate the YAML monitors. This step first removes all Python-generated YAML files (using the prefix mentioned above so it doesn’t remove monitors written directly in YAML), then regenerates them from the updated Python code. This ensures that deleted or updated monitors are picked up as well.
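Assuming the illustrative prefix from the earlier sketch, that clean-up step could be as small as:

from pathlib import Path


def remove_generated_yamls(monitors_dir: str = "monitors") -> None:
    # Delete only Python-generated files; hand-written YAML monitors don't carry the prefix
    for yaml_file in Path(monitors_dir).rglob(f"{GENERATED_PREFIX}*.yml"):
        yaml_file.unlink()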
Below is a section of our pre-commit hooks config, which runs the monitor generation followed by a compile step to validate our YAML monitors, to identify issues with our monitors before we push to GitHub.
...
- repo: local
  hooks:
    - id: montecarlo-monitors
      name: montecarlo_monitors_generate
      entry: bash -c 'pushd src && python3 main.py && popd'
      language: system
      verbose: false
- repo: local
  hooks:
    - id: montecarlo-compile
      name: montecarlo_monitors_compile
      entry: |
        bash -c '
        for dir in monitors/* ; do
          pushd $dir && montecarlo monitors compile && popd
        done'
      language: system
      verbose: true
...
This means that at the point you push to GitHub, the YAML configuration and Python configuration are in sync. We use a GitHub Actions workflow to deploy our monitors; it currently works by applying the YAML and doesn’t need to run the Python code.
There are additional pre-commit hooks that run Python and SQL formatting and the Python tests. These also run as part of our GitHub Actions workflow, to catch cases where the hooks weren’t installed or didn’t run successfully on the developer’s machine.
Once pushed to a branch in our GitHub repo, the developer can open a Pull Request and ask someone in their team to review the changes. We use a CODEOWNERS file and branch protection rules to ensure that all changes are reviewed by someone within the relevant team.
GitHub Actions - Dry Run
We have published the Monte Carlo monitors deploy GitHub Action that we use in our workflow to the GitHub Marketplace (here). It configures Monte Carlo and then applies the monitors, looping through the namespaces in the monitors directory. The action has an optional input to enable a dry run, which means it will just log what the changes in a PR would do.
In our workflow we have two separate Monte Carlo deploy steps: the first does a dry run on feature branch pull requests to check what would happen if the PR were merged; the second runs only on the main branch, without the dry run flag, so the monitors are actually deployed. The Monte Carlo API ID and API token are stored as GitHub Actions secrets in our repo.
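Our workflow isn’t reproduced here, but its shape is roughly as follows. This sketch calls the Monte Carlo CLI directly rather than our published Action, and the trigger configuration, secret names and CLI invocation are illustrative assumptions rather than a copy of our pipeline:

# Illustrative workflow sketch - not our production pipeline
name: monte-carlo-monitors

on:
  pull_request:
  push:
    branches: [main]

jobs:
  deploy-monitors:
    runs-on: ubuntu-latest
    env:
      MCD_DEFAULT_API_ID: ${{ secrets.MCD_DEFAULT_API_ID }}
      MCD_DEFAULT_API_TOKEN: ${{ secrets.MCD_DEFAULT_API_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install montecarlodata
      # Dry run on pull requests so the reviewer can see what would change
      - name: Monte Carlo dry run
        if: github.event_name == 'pull_request'
        run: |
          for dir in monitors/*; do
            (cd "$dir" && montecarlo monitors apply --dry-run)
          done
      # Real apply only on main
      - name: Monte Carlo apply
        if: github.ref == 'refs/heads/main'
        run: |
          for dir in monitors/*; do
            (cd "$dir" && montecarlo monitors apply)
          done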
Below is example output when the workflow runs on our feature branch. The dry run log tells us a new monitor named My Monitor will be created. As mentioned above, if this is what the team wants they can approve and merge into main; the same workflow then runs on the main branch without the dry run flag and actually creates the monitor.
----------------------------------------------------------------
Computing changes for namespace: my_namespace
----------------------------------------------------------------
Modifications:
- [DRY RUN] ResourceModificationType.CREATE
SQL Rule: name=My Monitor [Monitors-As-Code], sql=SELECT ...
Local Development
One of the challenges with this project is enabling developers to build and iterate on their monitors until they’re happy. We’ve set up a local development workflow for this, but there’s lots more to be done in this area as adoption increases.
We have some additional setup instructions that also allow users to write their Python, generate the YAML and then copy this YAML into a new file under their own personal namespace. This means monitors can be applied in their own namespace and torn down once they’ve checked the configuration. This is done by asking them to copy an example .envrc.template file when they first clone the repo, and update the following environment variables with their own values.
It’s worth noting that Monte Carlo will always use the namespace in the montecarlo.yml file over one passed in as a flag. We therefore have a local template folder containing a montecarlo.yml with no namespace specified, plus a file to copy a monitor into. A make command applies the monitor from the developer’s local environment and creates it in their own namespace, and another destroys all monitors in that namespace. There is a risk that giving users an API key means they could wipe out a whole namespace, but with the monitors in code it’s easy to recreate them!
export MCD_DEFAULT_API_ID=xxx
export MCD_DEFAULT_API_TOKEN=xxxx
export MC_MONITORS_NAMESPACE=mylocalnamespace
GitHub Repo Structure
If you’re interested in how this looks, below is an outline of the repository structure. At the top level we have a directory called monitors which contains the YAML, separated into a folder per namespace.
There is also a src folder containing the Python code. This is split into a base directory holding the monitor classes, and helpers, where we add shared functions such as the SQL templating function described above. src also contains a directory per namespace where teams configure their monitors; each namespace has a main.py holding the boilerplate for generating the top-level config and converting the monitors to YAML.
The src/main.py is the entrypoint for generating the YAML monitors and runs the functions defined in the namespace main files (see the sketch after the tree below).
There is also a folder for tests, to ensure our classes are generating the expected YAML files.
.
└── monte-carlo/
    ├── .github
    ├── monitors/
    │   └── namespace_1/
    │       ├── montecarlo.yml
    │       └── monitor_1.yml
    └── src/
        ├── base/
        │   ├── monitors/
        │   │   ├── base_monitor.py
        │   │   └── sql_rule.py
        │   └── helpers/
        │       ├── sql_generate.py
        │       └── yaml.py
        ├── namespaces/
        │   └── namespace_1/
        │       ├── monitor_1.py
        │       └── main.py
        ├── tests/
        └── main.py
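To make that entrypoint concrete, src/main.py might look something like the sketch below. The import paths and the clean-up helper reuse the illustrative pieces sketched earlier, and my_namespace is the per-team function shown above; our real file differs:

# src/main.py - illustrative sketch of the entrypoint, not our exact code
from base.helpers.cleanup import remove_generated_yamls  # hypothetical home for the clean-up helper
from namespaces.my_namespace import main as my_namespace_main


def main() -> None:
    # Clear out previously generated YAMLs (identified by their prefix) before regenerating
    remove_generated_yamls()

    # Call each namespace's entry function; each builds its monitors and writes the YAML files
    my_namespace_main.my_namespace()


if __name__ == "__main__":
    main()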
Where can I find this Action?
We think the GitHub Action might be useful for other Monte Carlo customers, so we have made the code available on the GitHub Marketplace here.
This allows other Monte Carlo users to set up a GitHub repo with monitors specified in YAML files, execute a dry run on their branches, and apply on their main branch. Please feel free to raise issues on GitHub or contribute to the code.
The future
We have only recently started working on this, so we have a lot to learn about how it will work at OVO as adoption increases. However, there are things we can already anticipate and start thinking about.
Handling changes to monitor classes
- If we decide to make changes to classes, such as removing a default value, this could have widespread impact when we have hundreds or even thousands of monitors.
Non “developer” users
- We might have users who are less familiar with Python or GitHub, so we will explore how to make this easy for them to adopt and how to support them as well.
Relying on developers’ local environment
- Because of the pre-commit YAML generation, teams need to have their local environment set up properly.
- We will likely want to explore options where the CI/CD pipeline runs the Python code and generates the YAML, either writing back to the repo or writing all YAMLs to an external store such as S3 and applying from there.
Scalability
- Every time we make a change to the code and re-generate the YAML files it removes all Python generated files and re-creates them. This is fine for now with a small number of monitors, but we may want to improve this in future by only updating the ones that have changed.
Developer feedback loop
- There is more work to be done to make the user experience for Data Engineers better. Being able to test new or updated monitors and get feedback will be crucial to the success of this project.
- Another improvement we can make is to output the dry run to a PR comment. This works well in other repos where we create a Terraform plan comment so the developer doesn’t have to leave the PR to see what will be applied when merged.
Thank you
We hope you enjoyed this write up on our approach to increasing the adoption of Monte Carlo’s Data Observability platform at OVO.
We’d like to thank Anselm Foo (Data Engineer) for taking an initial look into Monte Carlo monitors as code and providing feedback throughout the project.
We’d also like to thank Andy Herd (Data Engineer) for setting up the repository and initial code structure and working with Chloe and others throughout to develop and test the Python code.
Another thank you to Oscar Bocking (Data Engineer) who contributed to the project and was a key code reviewer.
Finally, we’d like to thank all those involved in setting up Monte Carlo at OVO and helping to build, review and discuss Monte Carlo monitors as code and we look forward to even more contributions going forward!
Authors
Chloe Connor - Platform Site Reliability Engineer
Max Illis - Senior Principal Product Manager