Example Projects¶
Introduction¶
ALFA has a very wide scope in Data Architecture and Engineering.
To help showcase the breadth of coverage, example projects will be added here. Most, if not all, of these examples have been created based on feedback from ALFA users. If you have an interesting data domain or technical challenge around data, please reach out to us. We would be happy to create a sample application.
Projects¶
Name | Description |
---|---|
Feature Tour Sample | Project showcasing ALFA features |
Energy Trading Data - Validate, Transform and Standardize | Model data ingestion to standard model transformation |
Financial Regulations - Importing and extending an external model | Helps deliver regulatory reporting effortlessly by extending the Suade FIRE regulatory format |
Cloud Integration - Use ALFA in Cloud Messaging, Storage, Processing | Help with your Cloud journey, and on-premise at the same time |
Running Model-driven Data Quality checks from CSV to Spark/Cloud | Run Data Quality checks on data supplied in multiple data formats |
Complex Business Rules - Risk in Banking Book example | Data ingestion, bucketing and aggregation, all in ALFA |
Empower Data Domain / Product Owners to define Data Products | Align your Data Models to Data Products to enable Data Mesh adoption |
Imported Model Examples¶
ALFA supports importing your existing models from XML Schema or JSON Schema.
See the link below for imported schemas from FpML, ISO20022, EMIR, LEI and FIRE.
https://schemarise.github.io/alfa-sample-models-doc
1. Feature Tour Sample¶
The ALFA Sample Project is a feature tour project showing models, transformation, metadata, rules etc. This project is useful for familiarizing yourself with ALFA.
The project is available to download from https://github.com/Schemarise/demo-alfa-project (click the green Code button and download the zip). The project also contains Java, Python and Spark code that works with code generated off the model.
2. Energy Trading Data - Validate, Transform and Standardize¶
In this project, source data (Elexon, https://www.elexon.co.uk/data/) is provided as multiple CSV files in different formats, which may have DQ (data quality) issues. This data needs to be combined with separate header data to create the final target datasets.
ALFA is used to define the source data, the data transformation and the canonical data format, alongside validation rules.
This example defines the source formats, applying constraints where required in order to ensure poor quality data is identified early. Given ALFA's 'Active DQ' approach, bad data is identified before it is sent for further processing.
The entire pipeline is executed using the ALFA CLI - Command-line utility (see the Readme file), and a DQ report is produced for each run.
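As a rough illustration of the kind of record-level check the generated DQ rules perform, the sketch below uses pandas to flag missing and out-of-range values in a CSV extract. The file and column names are assumptions for illustration only; in the project the checks are generated from the ALFA model and run via the ALFA CLI.

```python
# Illustrative record-level DQ check on a CSV extract. pandas is used here
# purely for demonstration; the real checks are generated from the ALFA model.
import pandas as pd

# Hypothetical file and column names for a settlement data extract
df = pd.read_csv("settlement_data.csv")

issues = []

# Mandatory-field check: SettlementDate must be present
missing_dates = df["SettlementDate"].isna()
if missing_dates.any():
    issues.append(f"{missing_dates.sum()} rows missing SettlementDate")

# Range check: Volume must be non-negative
bad_volume = df["Volume"] < 0
if bad_volume.any():
    issues.append(f"{bad_volume.sum()} rows with negative Volume")

print("DQ issues:", issues or "none")
```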
Click here to download the sample project as a zip to uncompress and open using ALFA VSCode editor.
Taking it further
ALFA assert definitions further help to keep track of SLAs on batches/CSV files, by keeping SLA configuration and using that data alongside DQ rules. SLAs are modelled in ALFA, and the SLA data can be made available for use during DQ validation.
3. Financial Regulations - Importing and extending an external model¶
The Financial Regulatory (FIRE) Data Standard created by Suade Ltd is a regulatory reporting standard described in JSON Schema. The choice of JSON Schema gives rise to many limitations, which are overcome by automatically importing those schemas into ALFA.
The `README.md` file of this project goes through a few quick steps showing how to import the latest FIRE JSON Schema definitions into ALFA, and how ALFA extends those with metadata and validation rules. The project also shows how to execute ALFA DQ on JSON data in the FIRE JSON format. This is an excellent example of the power and flexibility of ALFA in reusing existing models. It showcases how quickly ALFA adds significant value to data quality and model governance.
4. Cloud Integration - Use ALFA in Cloud Messaging, Storage, Processing¶
Video of deploying these projects on GCP.
The aim of this project is to define ALFA models and use them to drive a Cloud based application.
The video above shows how the project is used to deploy ALFA generated models to support GCP based application development.
There are 2 zip files for this project -
- ALFA Models - contains the set of ALFA models that get built and published
- ALFA GCP Build Projects - consumes the pre-built ALFA models and generates Google PubSub, BigQuery, DataProc deployment and application support code
In order to deploy this project, you need to have a GCP account/project and the GCP tools (`gsutil`, `bq` etc.) installed. As shown in the video, a major benefit created by ALFA is the consistency in the data models and automation across different cloud platforms, formats and configurations.
5. Running Model-driven Data Quality checks - on CSV, JSON, JSONL, Zipped CSV or JSON, Avro, Spark¶
Data is regularly distributed in one of the above formats, and ALFA helps ensure that the data conforms to the model and DQ rules.
This project demonstrates how ALFA models and DQ rules are used to validate data in a variety of formats.
The validation is not only 'record level'; it also validates across records in the file - for example, to check that OrderID is unique across the entire batch.
A full DQ check on a 1 million row CSV file completes in 5 to 10 seconds.
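The sketch below illustrates the batch-level OrderID uniqueness check described above, using pandas purely for demonstration. The file name is a placeholder; in the project this is a generated ALFA DQ rule that runs against any of the supported formats.

```python
# A sketch of a cross-record (batch-level) check: verify that OrderID is
# unique across an entire CSV batch. pandas is used only for illustration.
import pandas as pd

df = pd.read_csv("orders.csv")                       # placeholder file name
duplicates = df[df.duplicated(subset="OrderID", keep=False)]

if duplicates.empty:
    print("OrderID is unique across the batch")
else:
    print(f"{duplicates['OrderID'].nunique()} duplicated OrderID values found")
```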
6. Define and Execute Complex Business Rules - Risk in Banking Book example¶
Interest Rate Risk in the Banking Book (IRRBB) is an important measure for financial institutions.
This example demonstrates going beyond data modelling to capture what data should be bucketed and how, by which dimensions/fields, and then aggregated to produce Risk Result summaries that are used to drive reporting and business decisions.
The project defines the ingestion data models, the canonical risk and aggregate models, the mapping from the former to the latter, and finally the bucketing and aggregation rules.
This is a paradigm shift in how business rules are maintained and implemented. Being able to isolate and capture the mapping and bucketing logic independently of the target implementation/deployment (e.g. Spark, Pandas, Databricks etc.) means it can be iteratively validated, discussed, and updated quickly. Once satisfied, it is generated into the target implementation for production application rollout.
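The sketch below gives a rough idea of the bucketing-and-aggregation pattern in a Pandas target: cashflows are bucketed by remaining maturity and notional is aggregated per currency and bucket. The column names and bucket boundaries are assumptions for illustration, not the project's actual definitions.

```python
# Illustrative bucketing and aggregation in pandas: bucket cashflows by
# remaining maturity, then aggregate notional per currency and bucket.
# Column names and bucket edges are assumptions for illustration only.
import pandas as pd

cashflows = pd.read_csv("cashflows.csv")    # placeholder input

# Bucket remaining maturity (in years) into time bands
buckets = pd.cut(
    cashflows["MaturityYears"],
    bins=[0, 1, 3, 5, 10, 30],
    labels=["0-1Y", "1-3Y", "3-5Y", "5-10Y", "10-30Y"],
)

summary = (
    cashflows.assign(Bucket=buckets)
    .groupby(["Currency", "Bucket"], observed=True)["Notional"]
    .sum()
    .reset_index()
)
print(summary)
```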
Furthermore, updates to the rules are made in the model, re-validated, and released. Compare this to the often-used approach of documenting such rules in wiki/Confluence/Excel, where they eventually become out-of-date and are ignored.
7. Empower Data Domain / Product Owners to define Data Products¶
ALFA’s support for defining Data Products is demonstrated in this example. See dataproduct for how it is supported in ALFA.
Consider 3 independent Data Domains within an organization - Governance Data, Common Data and Trade Data. They can be owned and managed independently by different parts of the organization.
Some models defined within a domain can be for external use, such as for publishing data; these are defined as part of the Data Domain's DataProduct, as published definitions. Data Domains may also consume models, e.g. common definitions such as Currency and Country. What is consumed is also captured as part of the consuming Domain's DataProduct. Metadata such as SLAs can be captured as part of the dataproduct definition.
The 3 projects are contained in the following zip files.
Each of these projects depends on the previous one. The projects are versioned and published to an organization's repository, for dependent projects to download as dependencies. Versioning, release frequency and other SDLC aspects can follow the organization's best practice. ALFA does not influence or enforce a particular toolset or methodology.
The last of the consuming Data Products, Trade in this case, will have visibility of all Data Products, and they are displayed as shown below.
See the Data Products tab in https://alfa-demo.github.io/dataproduct-trade/edm-index.html