Ten-line usage example
Suppose we want to analyze annual unemployment data for some European countries.
All we need to know in advance is the data provider: Eurostat.
sdmx makes it easy to inspect all data flows available from this provider. [1]
The data we want is in a data flow with the identifier ‘UNE_RT_A’.
The description of this data flow references a data structure definition (DSD) that happens to also have the ID ‘UNE_RT_A’. [2]
First we create a Client that we will use to make multiple queries to this provider’s SDMX-REST web service:
In [1]: import sdmx
In [2]: estat = sdmx.Client("ESTAT")
Next, we download a structure message containing the DSD and the other structural information that it references. Together, this structural metadata completely describes the data available through this data flow: the concepts, things measured, dimensions, lists of codes used to label each dimension, attributes, and so on:
In [3]: sm = estat.datastructure("UNE_RT_A", params=dict(references="descendants"))
In [4]: sm
Out[4]:
<sdmx.StructureMessage>
  <Header>
    id: 'DSD1734066108'
    prepared: '2024-12-13T05:01:48.808000+00:00'
    sender: <Agency ESTAT>
    source:
    test: False
  response: <Response [200]>
  Codelist (6): FREQ AGE UNIT SEX GEO OBS_FLAG
  ConceptScheme (1): UNE_RT_A
  DataStructureDefinition (1): UNE_RT_A
sm is a Python object of class StructureMessage.
We can explore some of the specific artifacts (for example, two code lists) using StructureMessage.get() to retrieve them and to_pandas() to convert them to pandas.Series:
In [5]: for cl in "AGE", "SEX":
   ...:     print(sdmx.to_pandas(sm.get(cl)))
   ...:
AGE
TOTAL Total
LFD Late foetal death
LFD1 Late foetal death (group 1)
LFD2 Late foetal death (group 2)
MN0 Zero minutes
...
AVG Average
NRP No response
NSP Not specified
OTH Other
UNK Unknown
Name: Age class, Length: 661, dtype: object
SEX
T Total
M Males
F Females
DIFF Absolute difference between males and females
NAP Not applicable
NRP No response
UNK Unknown
Name: Sex, dtype: object
Next, we download a data set containing a portion of the data in this data flow, structured by this DSD.
To obtain data only for Greece, Ireland and Spain, we use codes from the code list with the ID ‘GEO’ to specify a key for the dimension with the ID ‘geo’. [3]
We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned along the ‘TIME_PERIOD’ dimension.
The query returns a data message (a Python object of class DataMessage) containing the data set:
In [6]: dm = estat.data(
   ...:     "UNE_RT_A",
   ...:     key={"geo": "EL+ES+IE"},
   ...:     params={"startPeriod": "2014"},
   ...: )
   ...:
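To illustrate what the key argument expresses, here is a minimal sketch of how a per-dimension selection maps onto an SDMX-REST key string: one dot-separated slot per dimension, "+" between alternative codes, and an empty slot meaning "all codes". The helper build_key() is hypothetical and not part of sdmx, and the dimension order is an assumption taken from the index names shown later in this example; the library's own handling may differ in detail.

```python
def build_key(dimension_order, selection):
    """Join per-dimension code selections into an SDMX-REST key string.

    Hypothetical helper for illustration only; not part of the sdmx package.
    """
    unknown = set(selection) - set(dimension_order)
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    # One dot-separated slot per dimension; an empty slot selects all codes
    return ".".join(selection.get(dim, "") for dim in dimension_order)


# Assumed dimension order (the time dimension is filtered via startPeriod,
# so it is not part of the key)
dims = ["freq", "age", "unit", "sex", "geo"]

key = build_key(dims, {"geo": "EL+ES+IE"})
print(key)  # ....EL+ES+IE
```

Passing a dict to Client.data() rather than a hand-written string avoids having to count the empty slots yourself.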
We again use to_pandas() to convert the entire dm to a pandas.Series with a multi-level index (one level per dimension of the DSD).
Then we can use pandas’ built-in methods, like pandas.Series.xs(), to take a cross-section, selecting on the ‘age’ index level (=SDMX dimension):
In [7]: data = (
   ...:     sdmx.to_pandas(dm)
   ...:     .xs("Y15-74", level="age", drop_level=False)
   ...: )
   ...:
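The drop_level=False argument is what keeps the selected level in the index after the cross-section. The following sketch shows the difference on a small hand-built series; the variable names are illustrative, the index is simplified to three levels, and the four values are copied from the results shown at the end of this example (unit PC_ACT, sex T).

```python
import pandas as pd

# Simplified three-level index; values copied from the final output of
# this example, not retrieved from the web service
index = pd.MultiIndex.from_tuples(
    [
        ("Y15-74", "EL", "2022"),
        ("Y15-74", "EL", "2023"),
        ("Y15-74", "IE", "2022"),
        ("Y15-74", "IE", "2023"),
    ],
    names=["age", "geo", "TIME_PERIOD"],
)
toy = pd.Series([12.5, 11.1, 4.5, 4.3], index=index, name="value")

# Default behaviour: the level used for selection is removed
dropped = toy.xs("EL", level="geo")
# With drop_level=False the 'geo' level is retained
kept = toy.xs("EL", level="geo", drop_level=False)

print(list(dropped.index.names))  # ['age', 'TIME_PERIOD']
print(list(kept.index.names))  # ['age', 'geo', 'TIME_PERIOD']
```

Retaining the level keeps every SDMX dimension visible in the index, at the cost of one constant level.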
We further examine the retrieved data set in the familiar form of a pandas.Series.
For example, we can show the dimension names:
In [8]: data.index.names
Out[8]: FrozenList(['freq', 'age', 'unit', 'sex', 'geo', 'TIME_PERIOD'])
…and corresponding key values along these dimensions:
In [9]: data.index.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-29', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']])
Select some data of interest: show aggregate unemployment rates across ages (“Y15-74” on the ‘age’ dimension) and sexes (“T” on the ‘sex’ dimension), expressed as a percentage of active population (“PC_ACT” on the ‘unit’ dimension):
In [10]: data.loc[("A", "Y15-74", "PC_ACT", "T")]
Out[10]:
geo  TIME_PERIOD
EL   2014           26.6
     2015           25.0
     2016           23.9
     2017           21.8
     2018           19.7
     2019           17.9
     2020           17.6
     2021           14.7
     2022           12.5
     2023           11.1
ES   2014           24.5
     2015           22.1
     2016           19.6
     2017           17.2
     2018           15.3
     2019           14.1
     2020           15.5
     2021           14.9
     2022           13.0
     2023           12.2
IE   2014           11.9
     2015            9.9
     2016            8.4
     2017            6.7
     2018            5.8
     2019            5.0
     2020            5.9
     2021            6.2
     2022            4.5
     2023            4.3
Name: value, dtype: float64
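From here, ordinary pandas operations apply. As one possible next step, the sketch below pivots the ‘geo’ level into columns with Series.unstack() and computes the change between the first and last years. To stay self-contained it rebuilds a small series from the 2014 and 2023 values printed above instead of querying the web service; the variable names are illustrative.

```python
import pandas as pd

# 2014 and 2023 values copied from the output above
index = pd.MultiIndex.from_product(
    [["EL", "ES", "IE"], ["2014", "2023"]],
    names=["geo", "TIME_PERIOD"],
)
rates = pd.Series(
    [26.6, 11.1, 24.5, 12.2, 11.9, 4.3], index=index, name="value"
)

# Move the 'geo' level into columns: rows are years, columns are countries
wide = rates.unstack("geo")

# Change in the unemployment rate, in percentage points
change = (wide.loc["2023"] - wide.loc["2014"]).round(1)
print(change.to_dict())
```

With the full result of In [10], the same unstack("geo") call would yield one row per year from 2014 to 2023.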