Ten-line usage example
Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat. sdmx makes it easy to search the directory of dataflows and the complete structural metadata about the datasets available through the selected dataflow. (This example skips these steps; see the walkthrough.)
The data we want is in a dataflow with the identifier ‘UNE_RT_A’. This dataflow references a data structure definition (DSD) that also has the ID ‘UNE_RT_A’. The DSD, in turn, contains or references all the metadata describing the data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.
In [1]: import sdmx
In [2]: estat = sdmx.Client("ESTAT")
Download the structural metadata and display it:
In [3]: metadata = estat.datastructure("UNE_RT_A")
In [4]: metadata
Out[4]:
<sdmx.StructureMessage>
  <Header>
    id: 'DSD1667924407'
    prepared: '2022-11-08T16:20:07.160000+00:00'
    sender: <Agency ESTAT>
    source:
    test: False
  response: <Response [200]>
  Codelist (6): SEX OBS_FLAG UNIT AGE GEO FREQ
  ConceptScheme (1): UNE_RT_A
  DataStructureDefinition (1): UNE_RT_A
Explore the contents of some code lists:
In [5]: for cl in "AGE", "UNIT":
   ...:     print(sdmx.to_pandas(metadata.codelist[cl]))
   ...:
AGE
TOTAL Total
LFD Late foetal death
LFD1 Late foetal death (group 1)
LFD2 Late foetal death (group 2)
MN0 Zero minutes
...
AVG Average
NRP No response
NSP Not specified
OTH Other
UNK Unknown
Name: Age class, Length: 654, dtype: object
UNIT
TOTAL Total
NR Number
NR_HAB Number per inhabitant
THS Thousand
MIO Million
...
PD_PCH_SM_NAC Price index (implicit deflator), percentage ch...
CRC_MEUR Current replacement costs, million euro
CRC_MNAC Current replacement costs, million units of na...
PYR_MEUR Previous year replacement costs, million euro
PYR_MNAC Previous year replacement costs, million units...
Name: Unit of measure, Length: 694, dtype: object
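Because each code list converts to a pandas.Series indexed by code ID, ordinary pandas indexing can be used to look up or search labels. The following is a minimal sketch, not part of the original session: it uses a hand-built stand-in containing only the first five ‘UNIT’ entries printed above, so it runs without a network connection.

```python
import pandas as pd

# Hand-built stand-in for sdmx.to_pandas(metadata.codelist["UNIT"]),
# holding only the first five entries shown above (the real list has 694).
units = pd.Series(
    {
        "TOTAL": "Total",
        "NR": "Number",
        "NR_HAB": "Number per inhabitant",
        "THS": "Thousand",
        "MIO": "Million",
    },
    name="Unit of measure",
)
units.index.name = "UNIT"

# Look up the label for a single code.
label = units["THS"]

# Find the codes whose label contains "Number" (plain substring match).
matches = units[units.str.contains("Number", regex=False)]

print(label)
print(list(matches.index))
```

With the real Series returned by sdmx.to_pandas(), the same two operations should apply unchanged.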
Next we download a data set containing a subset of the data from this dataflow, structured by this DSD. To obtain data on Greece, Ireland and Spain only, we use codes from the code list with the ID ‘GEO’ to specify a key for the dimension with the ID ‘geo’ (note the difference: SDMX IDs are case-sensitive). We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned:
In [6]: resp = estat.data(
   ...:     "UNE_RT_A",
   ...:     key={"geo": "EL+ES+IE"},
   ...:     params={"startPeriod": "2007"},
   ...: )
   ...:
resp is now a DataMessage object.
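The key in the query above joins several codes for one dimension with ‘+’. When the codes are held in a Python list, the string can be assembled programmatically. This is a small sketch; build_key is a hypothetical helper written for illustration and is not part of sdmx.

```python
def build_key(**dimensions):
    """Join the codes for each dimension with '+'.

    Hypothetical helper: produces a dict in the same shape as the
    key={"geo": "EL+ES+IE"} argument used in the query above.
    """
    return {dim: "+".join(codes) for dim, codes in dimensions.items()}


# Dimension IDs are case-sensitive, so the keyword must be "geo", not "GEO".
key = build_key(geo=["EL", "ES", "IE"])
print(key)  # {'geo': 'EL+ES+IE'}
```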
We use the sdmx.to_pandas() function to convert it to a pandas.Series, then select on the ‘age’ dimension:
In [7]: data = (
   ...:     sdmx.to_pandas(resp)
   ...:     .xs("Y15-74", level="age", drop_level=False)
   ...: )
   ...:
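The drop_level=False argument keeps the ‘age’ level in the index after the cross-section is taken, so later selections can still refer to it. The sketch below shows the difference on a tiny hand-built Series; the values are placeholders chosen for illustration, not Eurostat observations.

```python
import pandas as pd

# A two-level index mimicking a reduced version of the real one.
index = pd.MultiIndex.from_product(
    [["Y15-24", "Y15-74"], ["EL", "IE"]],
    names=["age", "geo"],
)
# Placeholder values 0.0 to 3.0, NOT real observations.
s = pd.Series([0.0, 1.0, 2.0, 3.0], index=index, name="value")

# Keep the selected level in the result...
kept = s.xs("Y15-74", level="age", drop_level=False)
# ...or let pandas drop it (the default behaviour).
dropped = s.xs("Y15-74", level="age")

print(list(kept.index.names))
print(list(dropped.index.names))
```

Both results hold the same two values; only the shape of the index differs.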
We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:
In [8]: data.index.names
Out[8]: FrozenList(['freq', 'age', 'unit', 'sex', 'geo', 'TIME_PERIOD'])
…and corresponding key values along these dimensions:
In [9]: data.index.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-29', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']])
Select some data of interest: show aggregate unemployment rates across ages (“Y15-74” on the ‘age’ dimension) and sexes (“T” on the ‘sex’ dimension), expressed as a percentage of active population (“PC_ACT” on the ‘unit’ dimension):
In [10]: data.loc[("A", "Y15-74", "PC_ACT", "T")]
Out[10]:
geo TIME_PERIOD
EL 2009 9.8
2010 12.9
2011 18.1
2012 24.8
2013 27.8
2014 26.6
2015 25.0
2016 23.9
2017 21.8
2018 19.7
2019 17.9
2020 17.6
2021 14.7
ES 2009 17.9
2010 19.9
2011 21.4
2012 24.8
2013 26.1
2014 24.5
2015 22.1
2016 19.6
2017 17.2
2018 15.3
2019 14.1
2020 15.5
2021 14.8
IE 2009 12.6
2010 14.6
2011 15.4
2012 15.5
2013 13.8
2014 11.9
2015 9.9
2016 8.4
2017 6.7
2018 5.8
2019 5.0
2020 5.9
2021 6.2
Name: value, dtype: float64
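A long Series like the one above is often easier to compare across countries once it is pivoted into a table. The following sketch, which goes beyond the original session, rebuilds the 2019 to 2021 rows of the output above by hand and uses Series.unstack() to place one country in each column.

```python
import pandas as pd

# Rebuild the last three years of the output above by hand.
index = pd.MultiIndex.from_product(
    [["EL", "ES", "IE"], ["2019", "2020", "2021"]],
    names=["geo", "TIME_PERIOD"],
)
rates = pd.Series(
    [17.9, 17.6, 14.7, 14.1, 15.5, 14.8, 5.0, 5.9, 6.2],
    index=index,
    name="value",
)

# Move the 'geo' level from the index to the columns:
# rows are years, columns are countries.
table = rates.unstack("geo")
print(table)
```

The same call should work on the full result of In [10], since it has the same two index levels.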