Data Quality¶
DQ can be expressed in 2 forms - at the field data type level (fine grain) and object (coarse grain) level. When combined, every aspect of data validation can be expressed.
Often simply having well-defined constraints is sufficent to enforce good DQ.
Field Constraints¶
Constraints for a datatype are expressed as values within brackets following the data type.
dataTypeName< optionalTypeParameters > ( constraints )Multiple parameters may be specified for constraints, for example a min and max length or size.
Optional or mandatory field¶
A value can be declared as being optional by ending the datatype with a
?
Surname : string Middlename : string?
String size, format¶
Minimum/Maximum size and regular expressions for text can be define against
string
data type to enforce data conforms to the given pattern.Currency : string(3, 3) CalculationBasis :string("/(\\d+|Actual)\\/(\\d+)/u") FileDate : date("MM/dd/yy")
Collection capacity and nesting¶
The minimum/Maximum size of collections (can be unbound) can be enforced and are applicable when nested as shown below.
Batch : list<Payload>(0, 1000) Metadata : map< string, list< string(0,10) > >
Permitted values¶
In-lined
enum
definitions can be used to enforce data conforms to a set of well-defined values. If the enum is reusable, it is better to have it defined as a separate definition.AssetClass: enum< CO, CR, CU, EQ, IR >
Externally validated types¶
Existing external code can reused to validate data expressed in ALFA by defining them as
native
types. The native libraries can implement the required validation logic.typedefs { // https://www.iso20022.org/market-identifier-codes ISOMarketId = native org.iso.types.MktIdISO10383 } record Trade { ... Mkt : ISOMarketId ... }
Expression-based fields¶
Fields can be defined as being composed by values from other fields.
bookingCounterpartyId : string = IdOfOtherCounterparty + ":" + sourceSystem
Range/Size¶
Any datatype that that can have possible range of values or size can be defined. The start and end can be unbounded if required.
Name : string(3, 30) Age : int(0, 130) DateOfBirth : date( "1900-01-01", * ) // Set max list size to prevent memory issues Orders : list< Order >(1, 10000)
Format¶
Text-based fields can define a format to be enforced on the data. This is typically used for date based fields.
MonthByYear : date(*, *, "YYYY-MM" ) DateFromFile : date(*, *, "YYYYMMDD" )
Decimal precision and scale¶
Decimal types can be annotate with the required scale and precision for data accuracy.
Price : decimal(12,8)
Object Validations - Named Rules¶
While ALFA constraints allow field and data type level validations, by using assert, it is possible to express object level validations.
Multiple assert can be defined for an object. Each can raise
multiple errors or warnings. Having smaller assert
definitions
with a clearly named intent is recommended over a single large assert.
The raise
function can be used to record errors or warnings as part of the validation. See assert for details.
Understanding basics of Expressions will be useful before creating assert definitions.
// Outer type definition ... assert <Unique assertion identifier> { < Expressions > } ...
Expressions in the assert
can enforce various types of validation.
Inter-field dependencies¶
A rule based on 2 or more fields
if ( TradeDate > MaturityDate ) raise error( Conformity, "TD > MD is invalid") ...
Multi-optional required field¶
Optionality based on values of multiple fields
if ( isNone(email) && isNone(phoneNo) ) raise error( Completeness, "Email or phone reqd") ...
Data integrity (key lookup)¶
External lookup of values. Generated code will use a datastore interface that has implementations to lookup data in backend data stores.
unknownCcys = filter(currency, r => !keyExists(Ccy, new CcyKey(r))) if ( ! keyExists( Trade, new TradeKey( id ) ) ) raise warning(Integrity, "Trade ${id} lookup failed") ...
Complex expressions¶
Rules based on multiple conditions and values
financials = { "C", "F", "I", "L", "O", "R", "U" } if (NatureOfTheBookingCty == NatureOfBookingCptyType.F && len(filter(corpSector, e => !contains(financials, e))) > 0) ...
Aggregations¶
Aggregate values and use the result for validation
regionProfits : map< string, double > = aggregate(rows, r=>r.Region, 0, (acc,e) => acc+e.TotalProfit) ...
Data patterns, analytics (percentiles, standard deviation)¶
Use builtin functions to perform common analytics calculations
let profits = values(regionProfits) let p10 = percentile(profits, 10) let outlierProfits = filter( regionProfits, (k, v) => v <= p10 ) ...
External library/service integration¶
Integrate with services to perform validation.
service ObjectSizeChecker { objectSizeInMB( m : $record) : int } ... let mb = ObjectSizeChecker::objectSizeInMB( this ) if (mb > 100) raise error(Consistency, "Payload ${mb} exceed size limit 100MB")
Executing DQ Rules¶
Rules can be applied against data using 2 methods:
alfa
command line tool with-dq
parameter and arguments.Using the command line, ALFA supports validating data in CSV, JSON, JSONL (JSON objects as lines in file), Avro, XML, Spark DataFrames.
See the DQ Project in the examples section to see these in action.
Use generated code (with all rules generated into code) to validate.
When using the API, there are several options to validate streams, batches, one at a time etc. Errors are captured as
ValidationAlert
objects shown below, and can be serialized to JSON and other formats.Please contact us for details of API level integration.
DQ Run Results¶
Results of the DQ run are output as ALFA defined objects. From the command line execution, these are written as a
validation-results-<timestamp>.json
and validation-results-<timestamp>.html
file.
Validation Results, natually, are fully defined in ALFA. The JSON file will conform to the definitions below. The HTML file is a tablular formatted file, which can be use for basic reporting.
entity ValidationReport key( Id : uuid ) {
Timestamp : datetime
SourceSystem : string?
SourceSubsystem : string?
SourceFeed : string?
SourceSubfeed : string?
SourceInfo : string?
DataFormat : string?
DataFormatInfo : string?
TotalRecords : long
TotalErrors : long
TotalWarnings : long
Alerts : list< ValidationAlert >
}
record ValidationAlert {
// Error or Warning
Severity : SeverityType
Timestamp : datetime
Message : string
// ALFA type being processed
TypeName : string?
// An identification of the validated object
Identification : string?
// Assert that caused alert
AssertName : string?
// Path to field
FieldName : string?
ViolatedConstraint : ConstraintType?
DataQualityCategory : DataQualityType?
// Info regarding the source of the data, such as line number
SourceInfo : string?
// Details such a stack trace
ExceptionDetails : string?
}
enum ConstraintType {
MandatoryFieldNotSet OutsidePermittedRange InvalidPattern InvalidConstant Duplicate
InvalidTypeForField UnknownField InvalidDecimalScale InvalidDecimalPrecision DataFormatError
UserDefinedAssert Unknown
}
enum SeverityType {
Error Warning
}
enum DataQualityType {
Accuracy Completeness Conformity Consistency Coverage Integrity
Provenance Timeliness Uniqueness Validity Unclassified
}