Hadrian Data Format
PFA by itself does not define a data representation, only a type system (Avro's type system). Hadrian, as a software library rather than an application, does not require data to be serialized in a particular format. Three input formats are defined so far (Avro, JSON, and CSV), but applications using the library are encouraged to use their own input formats: anything that is appropriate for the workflow that Hadrian is to be embedded in.
However, data has to be represented in some form for processing by PFA functions. The following table lists the data format that Hadrian uses internally for each Avro type.
| Avro type | Hadrian's internal format |
|---|---|
| null | null Java Object |
| boolean | java.lang.Boolean |
| int | java.lang.Integer |
| long | java.lang.Long |
| float | java.lang.Float |
| double | java.lang.Double |
| string | Java String |
| bytes | Java array of bytes |
| array | com.opendatagroup.hadrian.data.PFAArray |
| map | com.opendatagroup.hadrian.data.PFAMap |
| record | subclass of com.opendatagroup.hadrian.data.PFARecord |
| fixed | subclass of com.opendatagroup.hadrian.data.PFAFixed |
| enum | subclass of com.opendatagroup.hadrian.data.PFAEnumSymbols |
| union | Java Object |
Input to a scoring engine's action method must be of this form, and output from that method will be of this form. This is not the format that the Avro library produces when you deserialize an Avro file (Hadrian uses a custom org.apache.avro.specific.SpecificData called com.opendatagroup.hadrian.data.PFASpecificData). However, it is a format that can be passed directly to the Avro library to serialize an Avro file.
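For the primitive rows of the table, the internal format is just the boxed Java type. The following is a minimal sketch of that mapping using only JDK classes; the object name `PrimitiveBoxingSketch` and the helper `box` are hypothetical and are not part of Hadrian. The compound types (array, map, record, fixed, enum, union) are deliberately left out because they need Hadrian's own classes.

```scala
object PrimitiveBoxingSketch {
  // Hypothetical helper (not a Hadrian API): converts a Scala value into the
  // boxed Java form listed in the table for primitive Avro types.
  def box(avroType: String, value: Any): AnyRef = (avroType, value) match {
    case ("null", null)            => null
    case ("boolean", b: Boolean)   => java.lang.Boolean.valueOf(b)
    case ("int", i: Int)           => java.lang.Integer.valueOf(i)
    case ("long", l: Long)         => java.lang.Long.valueOf(l)
    case ("float", f: Float)       => java.lang.Float.valueOf(f)
    case ("double", d: Double)     => java.lang.Double.valueOf(d)
    case ("string", s: String)     => s   // already a Java String
    case ("bytes", b: Array[Byte]) => b   // already a Java byte array
    case _ =>
      throw new IllegalArgumentException(
        s"unsupported type/value pair for Avro type: $avroType")
  }
}
```

A call such as `PrimitiveBoxingSketch.box("int", 3)` yields a `java.lang.Integer`. The sketch does not widen values, so an `Int` passed with the type name `"long"` is rejected rather than converted.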
Three of the above, PFARecord, PFAFixed, and PFAEnumSymbols, are compiled specifically for each PFA engine class. (If you run PFAEngine.fromJson with multiplicity > 1, all of the scoring engines returned share the same class; if you run PFAEngine.fromJson multiple times, the scoring engines belong to different classes.) You must use the subclass that belongs to the engine you are feeding. Since these subclasses are compiled at runtime, they must be accessed through a special ClassLoader.
Here is an example of creating a PFARecord for a given engine (of class com.opendatagroup.hadrian.jvmcompiler.PFAEngine) and a recordType (of class com.opendatagroup.hadrian.datatype.AvroRecord). Assume that the fields of this record have already been converted into the appropriate types and are stored, in field order, in an array of Objects called fieldData.

    // Initialization phase: look up the engine-specific record class and its constructor.
    val recordTypeName = recordType.fullName
    val classLoader = engine.classLoader
    val subclass = classLoader.loadClass(recordTypeName)
    val constructor = subclass.getConstructor(classOf[Array[AnyRef]])
    // Runtime: build one record from the prepared field values.
    constructor.newInstance(fieldData)

Only the last line needs to be executed at runtime; the rest can be saved from an initialization phase. In fact, calling constructor.setAccessible(true) can speed up constructor.newInstance(fieldData) by skipping access checks at runtime.
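The split between an initialization phase and a runtime phase can be sketched as follows. To keep the example runnable without Hadrian, `DemoRecord` is a hypothetical stand-in for a runtime-compiled PFARecord subclass, and its class loader stands in for `engine.classLoader`; neither name comes from the library.

```scala
import java.lang.reflect.Constructor

object ConstructorCacheSketch {
  // Hypothetical stand-in for an engine-specific PFARecord subclass.
  class DemoRecord(val fields: Array[AnyRef])

  // Initialization phase: resolve the class and constructor once.
  // With Hadrian, the loader would be engine.classLoader and the name
  // would be recordType.fullName.
  val classLoader: ClassLoader = classOf[DemoRecord].getClassLoader
  val subclass: Class[_] = classLoader.loadClass(classOf[DemoRecord].getName)
  val constructor: Constructor[_] =
    subclass.getConstructor(classOf[Array[AnyRef]])
  constructor.setAccessible(true) // skip access checks on later calls

  // Runtime phase: only this call runs for each datum. The array is passed
  // as a single constructor argument, not expanded into varargs.
  def make(fieldData: Array[AnyRef]): AnyRef =
    constructor.newInstance(fieldData).asInstanceOf[AnyRef]
}
```

Calling `ConstructorCacheSketch.make(Array[AnyRef]("a", java.lang.Integer.valueOf(1)))` returns a new `DemoRecord` without repeating the class lookup.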
Here is an example of creating a PFAFixed from a given engine (of class PFAEngine) and a fixedType (of class com.opendatagroup.hadrian.datatype.AvroFixed). Assume that the data is stored as an array of byte primitives called bytesData.

    // Initialization phase: look up the engine-specific fixed class and its constructor.
    val fixedTypeName = fixedType.fullName
    val classLoader = engine.classLoader
    val subclass = classLoader.loadClass(fixedTypeName)
    val constructor = subclass.getConstructor(classOf[Array[Byte]])
    // Runtime: wrap the raw bytes.
    constructor.newInstance(bytesData)

Here is an example of creating a PFAEnumSymbol from a given engine (of class PFAEngine) and an enumType (of class com.opendatagroup.hadrian.datatype.AvroEnum). Assume that the data is given as a string called symbolName.

    // Initialization phase: look up the engine-specific enum class and its constructor.
    val enumTypeName = enumType.fullName
    val classLoader = engine.classLoader
    val subclass = classLoader.loadClass(enumTypeName)
    val constructor = subclass.getConstructor(classOf[org.apache.avro.Schema], classOf[String])
    // Runtime: build the symbol from the enum's schema and the symbol name.
    constructor.newInstance(enumType.schema, symbolName)

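Because each call to PFAEngine.fromJson yields its own classes, a constructor looked up for one engine should not be reused for another. One way to keep them apart, sketched here under the assumption that the ClassLoader identifies the engine class, is to cache constructors by loader and type name. The object `PerLoaderConstructorCache` and its method `constructorFor` are hypothetical and not part of Hadrian.

```scala
import java.lang.reflect.Constructor
import scala.collection.concurrent.TrieMap

object PerLoaderConstructorCache {
  // Key on the ClassLoader as well as the type name: two loaders can each
  // define a class with the same name, and those classes are distinct.
  // The key ignores parameter types, so this assumes one constructor
  // signature per type name.
  private val cache = TrieMap.empty[(ClassLoader, String), Constructor[_]]

  def constructorFor(loader: ClassLoader,
                     typeName: String,
                     paramTypes: Class[_]*): Constructor[_] =
    cache.getOrElseUpdate((loader, typeName), {
      val c = loader.loadClass(typeName).getConstructor(paramTypes: _*)
      c.setAccessible(true) // skip access checks on later calls
      c
    })
}
```

With Hadrian, the call would presumably look like `PerLoaderConstructorCache.constructorFor(engine.classLoader, recordType.fullName, classOf[Array[AnyRef]])`, using the names from the examples above.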
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).