Generate SDMX-ML from Python objects
sdmx was developed to retrieve SDMX-formatted data from web services and convert it to pandas objects.
The opposite, creating SDMX messages from Python or pandas objects, is also possible. (Helper code to simplify this process is planned but not yet implemented.) This page gives a minimal demonstration.
Create some example data
This data has:
8 observations.
2 dimensions with the identifiers (IDs) “TIME_DETAIL” and “REF_AREA”.
1 attribute with the ID “UNIT_MEASURE”.
The measure—i.e. the quantity for which each observation provides a single value—has the generic ID “OBS_VALUE”.
We store the data in a pandas.DataFrame, with each row corresponding to one observation.
For example, the first observation has:
Dimension values TIME_DETAIL=2016, REF_AREA=1. Together, these form the key for the observation.
The value “PT” for the UNIT_MEASURE attribute.
A value of 50 for the primary measure.
In [1]: import pandas as pd
# List of dimensions
In [2]: D = ["TIME_DETAIL", "REF_AREA"]
# List of measures
In [3]: M = ["OBS_VALUE"]
# List of attributes
In [4]: A = ["UNIT_MEASURE"]
# Keys, attributes, and values together in a single data frame
In [5]: data = pd.DataFrame(
   ...:     columns=D + M + A,
   ...:     data=[
   ...:         [2016, 1, 50, "PT"],
   ...:         [2017, 1, 60, "PT"],
   ...:         [2016, 2, 70, "PT"],
   ...:         [2017, 2, 80, "PT"],
   ...:         [2016, 1, 5000, "USD"],
   ...:         [2017, 1, 6000, "USD"],
   ...:         [2016, 2, 7000, "USD"],
   ...:         [2017, 2, 8000, "USD"],
   ...:     ],
   ...: )
   ...:
In [6]: data
Out[6]:
   TIME_DETAIL  REF_AREA  OBS_VALUE UNIT_MEASURE
0         2016         1         50           PT
1         2017         1         60           PT
2         2016         2         70           PT
3         2017         2         80           PT
4         2016         1       5000          USD
5         2017         1       6000          USD
6         2016         2       7000          USD
7         2017         2       8000          USD
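To make the three column roles concrete, the following sketch splits one row into its key, its attributes, and its measure value using plain Python only. The helper name `split_row` and the upper-case constants are hypothetical; they are not part of sdmx or pandas.

```python
# Column roles, matching the description of the example data
DIMENSIONS = ["TIME_DETAIL", "REF_AREA"]
MEASURES = ["OBS_VALUE"]
ATTRIBUTES = ["UNIT_MEASURE"]


def split_row(row):
    """Split a mapping of column name -> value into (key, attributes, value).

    Hypothetical helper for illustration; not part of sdmx or pandas.
    """
    key = tuple((d, row[d]) for d in DIMENSIONS)
    attributes = {a: row[a] for a in ATTRIBUTES}
    value = row[MEASURES[0]]
    return key, attributes, value


# The first observation of the example data
first = {"TIME_DETAIL": 2016, "REF_AREA": 1, "OBS_VALUE": 50, "UNIT_MEASURE": "PT"}
key, attributes, value = split_row(first)
print(key)         # (('TIME_DETAIL', 2016), ('REF_AREA', 1))
print(attributes)  # {'UNIT_MEASURE': 'PT'}
print(value)       # 50
```

The same separation of roles is what the SDMX objects created in the following sections express formally.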
Create a data structure definition (DSD)
The module sdmx.model contains the classes needed to describe the structure of this data set.
A DSD collects objects that describe the structure of the data.
There are different classes to describe dimensions, measures, and attributes.
In [7]: import sdmx
In [8]: from sdmx.model import (
   ...:     DataStructureDefinition,
   ...:     Dimension,
   ...:     PrimaryMeasure,
   ...:     DataAttribute,
   ...: )
   ...:
# Create an empty DSD
In [9]: dsd = DataStructureDefinition(id="CUSTOM_DSD")
# Add 1 Dimension object to the DSD for each dimension of the data.
# Dimensions must have an explicit order for make_key(), below.
In [10]: for order, id in enumerate(D):
   ....:     dsd.dimensions.append(Dimension(id=id, order=order))
   ....:
# `A` only has 1 element, but this code will work with 2 or more.
In [11]: for id in A:
   ....:     dsd.attributes.append(DataAttribute(id=id))
   ....:
# Add the primary measure in the same way.
In [12]: for id in M:
   ....:     dsd.measures.append(PrimaryMeasure(id=id))
   ....:
# No longer needed
In [13]: del D, M, A
Note
This is a minimal example, so we don’t further describe the structure, even though sdmx.model offers the full SDMX information model.
We could, for instance, use a Codelist to add internationalized names, annotations, and other information to the codes “PT” and “USD” used for the “UNIT_MEASURE” attribute, and thus restrict the values of this attribute to the codes in that list.
Or, we could add Concept objects to give a full description of what is meant by “REF_AREA”, regardless of whether it appears as a dimension or an attribute.
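As a rough illustration of what restricting an attribute to a code list means, the sketch below uses a plain Python set as a stand-in. It deliberately does not use the Codelist class from sdmx.model; the names `UNIT_CODES` and `check_unit` are assumptions made for this example only.

```python
# Stand-in for a code list: the set of permitted code IDs.
# This is NOT the sdmx.model API, only a conceptual sketch.
UNIT_CODES = {"PT", "USD"}


def check_unit(value):
    """Return `value` if it is a permitted code; otherwise raise ValueError."""
    if value not in UNIT_CODES:
        raise ValueError(f"{value!r} is not in the code list {sorted(UNIT_CODES)}")
    return value


check_unit("PT")  # accepted

try:
    check_unit("EUR")  # not one of the permitted codes
except ValueError as e:
    message = str(e)
    print(message)  # 'EUR' is not in the code list ['PT', 'USD']
```

A real code list carries more than the set of IDs, such as names and annotations for each code, which this sketch omits.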
Populate a data set with observations
The next step is to convert the data frame to Observation objects.
We define a new function, make_obs, that operates on one row of the data frame.
The function generates a single Observation object by using the different columns as key values (for dimensions), attributes, or the observation value, as appropriate.
In [14]: from sdmx.model import Key, AttributeValue, Observation
# `key` is a Key that gives values for each dimension.
# `attrs` is a dictionary of attribute values (here, only 1).
# `value_for` refers to the measure.
# `value` is the observation value for that measure.
In [15]: def make_obs(row):
   ....:     key = dsd.make_key(Key, row[[d.id for d in dsd.dimensions]])
   ....:     attrs = {
   ....:         a.id: AttributeValue(value_for=a, value=row[a.id])
   ....:         for a in dsd.attributes
   ....:     }
   ....:     return Observation(
   ....:         dimension=key,
   ....:         attached_attribute=attrs,
   ....:         value_for=dsd.measures[0],
   ....:         value=row[dsd.measures[0].id],
   ....:     )
   ....:
Note
Because the DSD completely describes the structure of the data, make_obs can use its properties to retrieve the IDs for dimensions, attributes, and the primary measure.
The variables D, M, and A were already deleted and are no longer needed.
Next, we use the built-in method pandas.DataFrame.apply() to run this function on each row of data.
# Convert each row of `data` to an Observation
# apply() returns a pd.Series; convert to a list
In [16]: observations = data.apply(make_obs, axis=1).to_list()
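To see the row-wise mechanism in isolation, the following sketch applies a simpler function to a two-row frame. It assumes only pandas; the function `label` is a hypothetical stand-in for make_obs and returns a string rather than an Observation.

```python
import pandas as pd

# Two rows with the same columns as the example data
frame = pd.DataFrame(
    columns=["TIME_DETAIL", "REF_AREA", "OBS_VALUE", "UNIT_MEASURE"],
    data=[[2016, 1, 50, "PT"], [2017, 1, 60, "PT"]],
)


def label(row):
    """Hypothetical stand-in for make_obs: build one string per row."""
    # With axis=1, `row` is a pandas.Series indexed by column name
    return f"{row['TIME_DETAIL']}/{row['REF_AREA']}: {row['OBS_VALUE']} {row['UNIT_MEASURE']}"


result = frame.apply(label, axis=1)  # a pandas.Series, one element per row
labels = result.to_list()            # a plain Python list
print(labels)  # ['2016/1: 50 PT', '2017/1: 60 PT']
```

In the session above, the same pattern yields the list `observations`, with one Observation per row of `data`.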
This list of Observation objects can now be used to create DataSet objects.
Because of the structure of our data, there are only 4 unique keys for 8 observations.
For instance, the key TIME_DETAIL=2016, REF_AREA=1 appears twice, each time with a different value for the UNIT_MEASURE attribute.
The SDMX information model requires that every observation in a data set has a unique key.
We meet this requirement by creating two data sets, so that each data set contains a set of unique keys.
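The reasoning above can be checked with a short sketch in plain Python. It repeats the example rows as tuples so that it is self-contained; the variable names `rows`, `key_counts`, and `parts` are assumptions of this example and do not appear elsewhere on this page.

```python
from collections import Counter, defaultdict

# (TIME_DETAIL, REF_AREA, OBS_VALUE, UNIT_MEASURE), as in the example data
rows = [
    (2016, 1, 50, "PT"),
    (2017, 1, 60, "PT"),
    (2016, 2, 70, "PT"),
    (2017, 2, 80, "PT"),
    (2016, 1, 5000, "USD"),
    (2017, 1, 6000, "USD"),
    (2016, 2, 7000, "USD"),
    (2017, 2, 8000, "USD"),
]

# Count how often each (TIME_DETAIL, REF_AREA) key occurs across all rows
key_counts = Counter((t, area) for t, area, _, _ in rows)
print(len(key_counts))           # 4 distinct keys
print(set(key_counts.values()))  # {2}: every key occurs twice

# Partition the keys by UNIT_MEASURE
parts = defaultdict(list)
for t, area, _, unit in rows:
    parts[unit].append((t, area))

# Within each part, every key occurs exactly once
for unit, keys in parts.items():
    assert len(keys) == len(set(keys))
```

Splitting on the attribute value is one way to obtain unique keys for this particular data; other data may call for a different partition.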
In [17]: from sdmx.model import DataSet
# Only the Observations with UNIT_MEASURE="PT"
In [18]: ds1 = DataSet(structured_by=dsd, obs=observations[:4])
In [19]: ds1
Out[19]: DataSet(annotations=[], action=None, attrib={}, valid_from=None, described_by=None, structured_by=<DataStructureDefinition CUSTOM_DSD>, obs=[Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=PT>}, series_key=None, dimension=<Key: TIME_DETAIL=2016, REF_AREA=1>, value=50, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=PT>}, series_key=None, dimension=<Key: TIME_DETAIL=2017, REF_AREA=1>, value=60, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=PT>}, series_key=None, dimension=<Key: TIME_DETAIL=2016, REF_AREA=2>, value=70, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=PT>}, series_key=None, dimension=<Key: TIME_DETAIL=2017, REF_AREA=2>, value=80, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set())], series={}, group={})
# Observations with UNIT_MEASURE="USD"
In [20]: ds2 = DataSet(structured_by=dsd, obs=observations[4:])
In [21]: ds2
Out[21]: DataSet(annotations=[], action=None, attrib={}, valid_from=None, described_by=None, structured_by=<DataStructureDefinition CUSTOM_DSD>, obs=[Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=USD>}, series_key=None, dimension=<Key: TIME_DETAIL=2016, REF_AREA=1>, value=5000, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=USD>}, series_key=None, dimension=<Key: TIME_DETAIL=2017, REF_AREA=1>, value=6000, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=USD>}, series_key=None, dimension=<Key: TIME_DETAIL=2016, REF_AREA=2>, value=7000, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set()), Observation(attached_attribute={'UNIT_MEASURE': <AttributeValue: UNIT_MEASURE=USD>}, series_key=None, dimension=<Key: TIME_DETAIL=2017, REF_AREA=2>, value=8000, value_for=<PrimaryMeasure OBS_VALUE>, group_keys=set())], series={}, group={})
The DSD is also connected to each data set.
Encapsulate in messages and write to file
SDMX files always contain complete messages with either data or structure.
To write the ds1 and ds2 objects to file, we need to enclose them in a message object.
An SDMX data message doesn’t refer to a DSD directly, but to a data flow definition (DFD), which in turn refers to the DSD. We create a DFD as well.
In [22]: from sdmx.model import DataflowDefinition
In [23]: from sdmx.message import DataMessage
# The DFD points to the DSD
In [24]: dfd = DataflowDefinition(id="CUSTOM_DFD", structure=dsd)
# The data message contains the data sets, and points to the data flow
In [25]: msg1 = DataMessage(data=[ds1, ds2], dataflow=dfd)
# Write in SDMX-ML (XML) format
In [26]: with open("data-message.xml", "wb") as f:
   ....:     f.write(sdmx.to_xml(msg1))
   ....:
We also write the DFD and DSD to file.
This step is not required: sdmx could infer these when reading data-message.xml.
However, the very purpose of the SDMX standard is to support good practice: being explicit and unambiguous about how data is structured and what it means.
In [27]: from sdmx.message import StructureMessage
# Structure messages can contain many instances of several kinds
# of structure objects. See the documentation.
In [28]: msg2 = StructureMessage(
   ....:     dataflow={dfd.id: dfd},
   ....:     structure={dsd.id: dsd},
   ....: )
   ....:
In [29]: with open("structure-message.xml", "wb") as f:
   ....:     f.write(sdmx.to_xml(msg2))
   ....:
Check the results
We read the data from the files just generated.
# Delete references to all the objects just created
In [30]: del msg1, msg2, ds1, ds2, dfd, dsd, observations
# Re-read from files
In [31]: msg3 = sdmx.read_sdmx("structure-message.xml")
In [32]: msg4 = sdmx.read_sdmx(
   ....:     "data-message.xml", dsd=msg3.structure["CUSTOM_DSD"]
   ....: )
   ....:
# Convert to a data frame, including attributes in a column
In [33]: dfs = sdmx.to_pandas(msg4, attributes="o")
In [34]: dfs
Out[34]:
[                      value UNIT_MEASURE
 TIME_DETAIL REF_AREA
 2016        1          50.0           PT
 2017        1          60.0           PT
 2016        2          70.0           PT
 2017        2          80.0           PT,
                       value UNIT_MEASURE
 TIME_DETAIL REF_AREA
 2016        1        5000.0          USD
 2017        1        6000.0          USD
 2016        2        7000.0          USD
 2017        2        8000.0          USD]
to_pandas() converts each data set in the message to a separate pandas object with a unique pandas.MultiIndex, so this call returns a list containing two data frames.
We can also combine these data frames into a single one, with a non-unique index, and then use pandas.DataFrame.reset_index() to recover the initial structure:
In [35]: pd.concat(dfs).reset_index()
Out[35]:
   TIME_DETAIL  REF_AREA   value UNIT_MEASURE
0         2016         1    50.0           PT
1         2017         1    60.0           PT
2         2016         2    70.0           PT
3         2017         2    80.0           PT
4         2016         1  5000.0          USD
5         2017         1  6000.0          USD
6         2016         2  7000.0          USD
7         2017         2  8000.0          USD
Note
Simplifying the process of authoring different kinds of SDMX objects and messages is a priority enhancement for sdmx.
Contributions are welcome; see Development.