Data Quality

DQ can be expressed in 2 forms - at the field data type level (fine grain) and object (coarse grain) level. When combined, every aspect of data validation can be expressed.

Often simply having well-defined constraints is sufficent to enforce good DQ.

Field Constraints

Constraints for a datatype are expressed as values within brackets following the data type.

dataTypeName< optionalTypeParameters > ( constraints )

Multiple parameters may be specified for constraints, for example a min and max length or size.

Optional or mandatory field

A value can be declared as being optional by ending the datatype with a ?

Surname : string
Middlename : string?

String size, format

Minimum/Maximum size and regular expressions for text can be define against string data type to enforce data conforms to the given pattern.

Currency : string(3, 3)
CalculationBasis :string("/(\\d+|Actual)\\/(\\d+)/u")
FileDate : date("MM/dd/yy")

Collection capacity and nesting

The minimum/Maximum size of collections (can be unbound) can be enforced and are applicable when nested as shown below.

Batch : list<Payload>(0, 1000)
Metadata : map< string, list< string(0,10) > >

Permitted values

In-lined enum definitions can be used to enforce data conforms to a set of well-defined values. If the enum is reusable, it is better to have it defined as a separate definition.

AssetClass: enum< CO, CR, CU, EQ, IR >

Externally validated types

Existing external code can reused to validate data expressed in ALFA by defining them as native types. The native libraries can implement the required validation logic.

typedefs {
    // https://www.iso20022.org/market-identifier-codes
    ISOMarketId = native org.iso.types.MktIdISO10383
}

record Trade {
   ...
   Mkt : ISOMarketId
   ...
}

Expression-based fields

Fields can be defined as being composed by values from other fields.

bookingCounterpartyId : string = IdOfOtherCounterparty + ":" + sourceSystem

Range/Size

Any datatype that that can have possible range of values or size can be defined. The start and end can be unbounded if required.

Name : string(3, 30)
Age : int(0, 130)
DateOfBirth : date( "1900-01-01", * )
// Set max list size to prevent memory issues
Orders : list< Order >(1, 10000)

Format

Text-based fields can define a format to be enforced on the data. This is typically used for date based fields.

MonthByYear : date(*, *, "YYYY-MM" )
DateFromFile : date(*, *, "YYYYMMDD" )

Decimal precision and scale

Decimal types can be annotate with the required scale and precision for data accuracy.

Price : decimal(12,8)

Object Validations - Named Rules

While ALFA constraints allow field and data type level validations, by using assert, it is possible to express object level validations.

Multiple assert can be defined for an object. Each can raise multiple errors or warnings. Having smaller assert definitions with a clearly named intent is recommended over a single large assert.

The raise function can be used to record errors or warnings as part of the validation. See assert for details.

Understanding basics of Expressions will be useful before creating assert definitions.

// Outer type definition
...
   assert <Unique assertion identifier> {
       < Expressions >
   }
...

Expressions in the assert can enforce various types of validation.

Inter-field dependencies

A rule based on 2 or more fields

if ( TradeDate > MaturityDate )
   raise error( Conformity, "TD > MD is invalid")
...

Multi-optional required field

Optionality based on values of multiple fields

if ( isNone(email) && isNone(phoneNo) )
   raise error( Completeness, "Email or phone reqd")
...

Data integrity (key lookup)

External lookup of values. Generated code will use a datastore interface that has implementations to lookup data in backend data stores.

unknownCcys = filter(currency, r => !keyExists(Ccy, new CcyKey(r)))
if ( ! keyExists( Trade, new TradeKey( id ) ) )
   raise warning(Integrity, "Trade ${id} lookup failed")
...

Complex expressions

Rules based on multiple conditions and values

financials = { "C", "F", "I", "L", "O", "R", "U" }
if (NatureOfTheBookingCty == NatureOfBookingCptyType.F &&
   len(filter(corpSector, e => !contains(financials, e))) > 0)
  ...

Aggregations

Aggregate values and use the result for validation

regionProfits : map< string, double > = aggregate(rows,
                 r=>r.Region, 0, (acc,e) => acc+e.TotalProfit)
...

Data patterns, analytics (percentiles, standard deviation)

Use builtin functions to perform common analytics calculations

let profits = values(regionProfits)
let p10 = percentile(profits, 10)
let outlierProfits = filter( regionProfits, (k, v) => v <= p10 )
...

External library/service integration

Integrate with services to perform validation.

service ObjectSizeChecker {
   objectSizeInMB( m : $record) : int
}
...

let mb = ObjectSizeChecker::objectSizeInMB( this )
if (mb > 100)
    raise error(Consistency, "Payload ${mb} exceed size limit 100MB")

Executing DQ Rules

Rules can be applied against data using 2 methods:

  • alfa command line tool with -dq parameter and arguments.

    Using the command line, ALFA supports validating data in CSV, JSON, JSONL (JSON objects as lines in file), Avro, XML, Spark DataFrames.

    See the DQ Project in the examples section to see these in action.

  • Use generated code (with all rules generated into code) to validate.

    When using the API, there are several options to validate streams, batches, one at a time etc. Errors are captured as ValidationAlert objects shown below, and can be serialized to JSON and other formats.

    Please contact us for details of API level integration.

DQ Run Results

Results of the DQ run are output as ALFA defined objects. From the command line execution, these are written as a validation-results-<timestamp>.json and validation-results-<timestamp>.html file.

Validation Results, natually, are fully defined in ALFA. The JSON file will conform to the definitions below. The HTML file is a tablular formatted file, which can be use for basic reporting.

entity ValidationReport key( Id : uuid ) {
    Timestamp : datetime

    SourceSystem : string?
    SourceSubsystem : string?
    SourceFeed : string?
    SourceSubfeed : string?
    SourceInfo : string?

    DataFormat : string?
    DataFormatInfo : string?

    TotalRecords : long
    TotalErrors : long
    TotalWarnings : long

    Alerts : list< ValidationAlert >
}

record ValidationAlert {
    // Error or Warning
    Severity : SeverityType

    Timestamp : datetime
    Message : string

    // ALFA type being processed
    TypeName : string?

    // An identification of the validated object
    Identification : string?

    // Assert that caused alert
    AssertName : string?

    // Path to field
    FieldName : string?

    ViolatedConstraint : ConstraintType?
    DataQualityCategory : DataQualityType?

    // Info regarding the source of the data, such as line number
    SourceInfo : string?

    // Details such a stack trace
    ExceptionDetails : string?
}

enum ConstraintType {
    MandatoryFieldNotSet OutsidePermittedRange InvalidPattern InvalidConstant Duplicate
    InvalidTypeForField UnknownField InvalidDecimalScale InvalidDecimalPrecision DataFormatError
    UserDefinedAssert Unknown
}

enum SeverityType {
    Error Warning
}

enum DataQualityType {
    Accuracy Completeness Conformity Consistency Coverage Integrity
    Provenance Timeliness Uniqueness Validity Unclassified
}