Data Normalization Task Force
- Trans*mation
The ability to change the structure / shape / representation of data
Transcription: Changing the serialization format of data, e.g. between XML and JSON, or between XML and SQL (see the sketch after this list).
Transrepresentation: Changing the representation of data between two semantically equivalent, but structurally different, forms.
Paraphrasing: The source and target forms are expressed using the same language, e.g. transforming a pre-coordinated form into a post-coordinated one.
Translation: The source and target forms are expressed using different languages, e.g. converting FHIR to RIM, or a source system's data schema to FHIR.
Mapping / Transformation: Any change in structure that also involves a change in semantics.
Needs to quantify the impact/nature of the change in semantics (e.g. "broader than", "approximate match", etc.).
Normalizing Trans*mation: A Trans*mation process that targets a canonical representation of data, e.g. using standard terminologies and schemas.
- Clinical Semantization: A Normalizing Trans*mation process that outputs semantic clinical data models (i.e. schemas enhanced with a semantically rich canonical interpretation).
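For transcription, a minimal Python sketch that re-serializes the same logical record between JSON and XML; the record shape and field names are illustrative assumptions, not any FHIR-defined structure. Only the serialization format changes, the semantics are untouched.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical flat record; transcription changes the format only.
record = {"resourceType": "Patient", "id": "p-001", "birthDate": "1980-04-12"}

def to_json(rec: dict) -> str:
    """Serialize the record as JSON."""
    return json.dumps(rec, indent=2)

def to_xml(rec: dict) -> str:
    """Serialize the same record as XML: one element per field."""
    root = ET.Element(rec["resourceType"])
    for key, value in rec.items():
        if key == "resourceType":
            continue
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")

print(to_json(record))
print(to_xml(record))
```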
Validation
Conformance checking:
The ability to check the consistency of a piece of data with respect to some criteria, usually defined in terms of integrity constraints.
The outcome is a set, ideally empty, of violated criteria (or, dually, the set of fulfilled criteria).
Syntactic validation: check whether the data is structurally correct, without involving the semantics of the data.
E.g. check that a patient’s date of birth record entry is expressed using a given date format, such as DD-MM-YYYY.
Semantic validation: check whether the data is semantically consistent with respect to the reference domain it is modelling.
E.g. disallow dates of birth which would make a patient’s age inconsistent with biological laws.
Verification: check whether a candidate "source" and a candidate "target" actually map to each other with respect to a given Trans*mation. (A sketch of the syntactic and semantic validation checks above follows.)
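A minimal sketch of the two validation flavours, assuming a hypothetical patient record with a birthDate field; the DD-MM-YYYY format and the biological-plausibility rule come from the examples above, the 130-year bound is an illustrative assumption, and the outcome is the set of violated criteria.

```python
import re
from datetime import datetime

DATE_FORMAT = re.compile(r"^\d{2}-\d{2}-\d{4}$")  # DD-MM-YYYY
MAX_HUMAN_AGE = 130  # hypothetical biological bound

def validate(patient: dict) -> set[str]:
    """Return the set of violated criteria (empty means conformant)."""
    violated = set()
    dob = patient.get("birthDate", "")
    # Syntactic validation: structure only, no domain semantics.
    if not DATE_FORMAT.match(dob):
        violated.add("birthDate-format-DD-MM-YYYY")
        return violated  # cannot check semantics of malformed data
    # Semantic validation: consistency with the modelled domain.
    born = datetime.strptime(dob, "%d-%m-%Y")
    age = (datetime.now() - born).days / 365.25
    if age < 0 or age > MAX_HUMAN_AGE:
        violated.add("birthDate-biologically-plausible")
    return violated

print(validate({"birthDate": "12-04-1980"}))  # set() -> conformant
print(validate({"birthDate": "1980-04-12"}))  # syntactic violation
```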
- Trimming:
The ability to remove (“trim”) any data element that does not satisfy a conformance check; the result is the maximal subset of the original data structure that satisfies the integrity constraints (see the sketch below).
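A sketch of trimming over a flat record, under the assumption that integrity constraints are per-field predicates (the constraint set and field names here are hypothetical); every element failing its check is dropped, leaving the maximal conformant subset.

```python
from typing import Callable

# Hypothetical per-field integrity constraints: field name -> predicate.
CONSTRAINTS: dict[str, Callable[[object], bool]] = {
    "id": lambda v: isinstance(v, str) and v != "",
    "heartRate": lambda v: isinstance(v, (int, float)) and 0 < v < 300,
}

def trim(record: dict) -> dict:
    """Keep only the elements that satisfy their constraint.

    Fields with no declared constraint are kept as-is; the result is
    the maximal subset of `record` passing all checks.
    """
    return {
        key: value
        for key, value in record.items()
        if CONSTRAINTS.get(key, lambda _: True)(value)
    }

print(trim({"id": "obs-1", "heartRate": -7, "note": "post-op"}))
# -> {'id': 'obs-1', 'note': 'post-op'}  (heartRate trimmed)
```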
Enrichment
The ability to create ("materialize") new data that is implied, but not explicitly asserted, as part of a database. Can be considered a special kind of inference.
Classification: the ability to recognize a piece of data as (representing) an instance of some kind, or as being a member of some class, by virtue of its properties and relations, e.g. recognizing an Observation as an HgbObservation (see the sketch below).
Qualification: Classification based on the role an entity plays in relationship to another entity (e.g. post-op day 1 fluids, pre-op Hgb)
Contextualization: Classification based on a well-defined, multi-variate set of roles/relationships.
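A minimal classification sketch: recognizing a generic observation as a hemoglobin observation by virtue of its coding. The record shape is a hypothetical simplification rather than a real FHIR Observation; 718-7 is the LOINC code for hemoglobin in blood.

```python
# Hypothetical set of codes that characterize a hemoglobin observation.
# 718-7 is the LOINC code for "Hemoglobin [Mass/volume] in Blood".
HGB_CODES = {"718-7"}

def classify(observation: dict) -> str:
    """Recognize an Observation as an HgbObservation by its properties."""
    codes = {c["code"] for c in observation.get("coding", [])}
    if codes & HGB_CODES:
        return "HgbObservation"
    return "Observation"

obs = {"coding": [{"system": "http://loinc.org", "code": "718-7"}],
       "value": 13.2, "unit": "g/dL"}
print(classify(obs))  # -> HgbObservation
```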
Completion: the ability to infer the values of properties and attributes when not explicitly asserted, e.g. inferring the gender of an ob-gyn patient.
Feature extraction: Using NLP/ML techniques to identify concepts and make them explicit, e.g. from an unstructured piece of data.
Correlation / “Linkage”: the ability to infer the existence of a relationship of some kind between (the entities represented by) two data elements, usually by virtue of their properties, or because of the existence of a third entity which acts as a relator.
E.g. inferring the “pre-op-ness” of an Observation with respect to a Surgery (see the sketch below).
Production/Assertion: the ability to materialize (the representation of) entities that are not explicitly part of the data.
E.g. inferring the existence of a bleed in a post-surgical patient.
E.g. estimating a patient’s CHADS2 score at a given point in time, and giving it an interpretation.
State Identification: qualitative (exposed, at-risk, elevated-risk, suspected, confirmed, treated, resolved, etc.) vs. quantitative (such as tied to a risk and/or severity score).
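A sketch of linkage, inferring the “pre-op-ness” of an Observation with respect to a Surgery by comparing timestamps; the field names and the purely temporal criterion are illustrative assumptions (a real relator might also require a shared patient and encounter).

```python
from datetime import datetime

def is_pre_op(observation: dict, surgery: dict) -> bool:
    """Infer the 'pre-op' relationship between an Observation and a
    Surgery by comparing their timestamps (hypothetical field names)."""
    obs_time = datetime.fromisoformat(observation["effectiveTime"])
    surgery_start = datetime.fromisoformat(surgery["startTime"])
    return obs_time < surgery_start

hgb = {"effectiveTime": "2024-03-01T07:30:00", "value": 13.2}
surgery = {"startTime": "2024-03-01T09:00:00"}
print(is_pre_op(hgb, surgery))  # -> True: the Hgb reading is pre-op
```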
Measurements: process and (intermediate) outcomes, often based on the aggregation of data
Calculations: Productions based on quantitative mathematical formulas
- Scoring: Calculations whose output can be interpreted as a value on a scale, usually with an associated interpretation, e.g. risk scores (a CHADS2 sketch follows)
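A sketch of scoring using the CHADS2 example above. The standard CHADS2 weights (CHF, hypertension, age ≥ 75, and diabetes at one point each; prior stroke/TIA at two) are used, while the patient-record field names and interpretation thresholds are illustrative assumptions.

```python
def chads2(patient: dict) -> int:
    """Compute the CHADS2 stroke-risk score (0-6).

    Standard weights: CHF, hypertension, age >= 75, diabetes = 1 point
    each; prior stroke or TIA = 2 points. Field names are hypothetical.
    """
    score = 0
    score += 1 if patient.get("chf") else 0
    score += 1 if patient.get("hypertension") else 0
    score += 1 if patient.get("age", 0) >= 75 else 0
    score += 1 if patient.get("diabetes") else 0
    score += 2 if patient.get("prior_stroke_or_tia") else 0
    return score

def interpret(score: int) -> str:
    """Map the score onto a scale (illustrative thresholds)."""
    return "low" if score == 0 else "intermediate" if score == 1 else "high"

p = {"age": 78, "hypertension": True, "prior_stroke_or_tia": True}
print(chads2(p), interpret(chads2(p)))  # -> 4 high
```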
Auditing: Tracing the source of a piece of data, original or inferred
Provenance: who/when/where the data was gathered, asserted, or inferred
Pedigree: correlating inferred data to the evidence on which it is based (a sketch follows)
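A sketch of auditing metadata, assuming a hypothetical wrapper type rather than the FHIR resources listed in the table below; provenance records who/when the value was asserted, and pedigree links an inferred value back to its evidence.

```python
from dataclasses import dataclass, field

@dataclass
class Traced:
    """A piece of data with audit metadata (a hypothetical wrapper,
    not the FHIR Provenance/AuditEvent resources)."""
    value: object
    # Provenance: who/when the data was gathered, asserted, or inferred.
    author: str = "unknown"
    recorded_at: str = ""
    # Pedigree: the evidence an inferred datum is based on.
    derived_from: list = field(default_factory=list)

hgb = Traced(value=13.2, author="lab-system",
             recorded_at="2024-03-01T07:45:00")
pre_op_flag = Traced(value=True, author="inference-engine",
                     recorded_at="2024-03-01T08:00:00",
                     derived_from=[hgb])
print(pre_op_flag.derived_from[0].value)  # trace back to the evidence
```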
- (Security) Labeling/Tagging: Adding annotations (metadata) for the purpose of enforcing consent and other policies
Managing Imperfection: handling any type of (un)certainty, vagueness/fuzziness, imprecision, confidence, strength of evidence, and belief associated with a piece of data (a minimal sketch follows).
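A minimal sketch of managing imperfection, under the simplifying assumption that belief can be captured as a single [0, 1] confidence value; real systems may need richer uncertainty models (fuzzy membership, evidence strength, belief functions).

```python
from dataclasses import dataclass

@dataclass
class Uncertain:
    """A value paired with a degree of belief (hypothetical scheme:
    one scalar confidence; 1.0 = certain, 0.0 = no support)."""
    value: object
    confidence: float

    def at_least(self, threshold: float) -> bool:
        """Keep only assertions believed strongly enough."""
        return self.confidence >= threshold

bleed = Uncertain(value="post-surgical bleed", confidence=0.65)
print(bleed.at_least(0.8))  # -> False: too weak to act on
```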
| Capability | Specs / APIs | Knowledge | Implementations |
| --- | --- | --- | --- |
| Trans*mation / Normalization | | | |
| Transcription | FHIR De/Serializers | | |
| Paraphrasing | OMG MDMI / QVT | FHIR "Mapping Language" + fhir:StructureMap; RDF: ShEx + Graph Production | |
| Translation | OMG MDMI / QVT | | |
| Transformation | OMG MDMI / QVT | | |
| Validation | | | |
| Structural Conformance | $validate | fhir:StructureDefinition | |
| Semantic Conformance | | | |
| Trimming | | | |
| Enrichment | | | |
| Classification | Ontology Languages | | |
| Completion | Production (Rule) Languages; Predictive Modeling (e.g. PMML); Complex Event Processing; (Indirect) Expression Languages: HL7 CQL; (Indirect) Terms / Ontologies / Valuesets | | |
| Linkage | | | |
| Inference | | | |
| Auditing | fhir:Audit + fhir:Provenance | | |
| Uncertainty | | | |