Ten-line usage example
Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat.
sdmx makes it easy to search the directory of dataflows and to retrieve the complete structural metadata about the data sets available through a selected dataflow. (This example skips these steps; see the walkthrough.)
The data we want is in a dataflow with the identifier une_rt_a. This dataflow references a data structure definition (DSD) with the ID DSD_une_rt_a. The DSD, in turn, contains or references all the metadata describing data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.
In [1]: import sdmx
In [2]: estat = sdmx.Client('ESTAT')
Download the metadata and display a summary of it:
In [3]: metadata = estat.datastructure('DSD_une_rt_a')
In [4]: metadata
Out[4]:
<sdmx.StructureMessage>
<Header>
id: 'IDREF51927'
prepared: '2021-01-26T20:52:12.797000+00:00'
receiver: <Agency Unknown>
sender: <Agency Unknown>
source:
test: False
response: <Response [200]>
Codelist (7): CL_AGE CL_FREQ CL_GEO CL_OBS_FLAG CL_OBS_STATUS CL_SEX ...
ConceptScheme (1): CS_DSD_une_rt_a
DataStructureDefinition (1): DSD_une_rt_a
Explore the contents of some code lists:
In [5]: for cl in 'CL_AGE', 'CL_UNIT':
   ...:     print(sdmx.to_pandas(metadata.codelist[cl]))
   ...:
CL_AGE
Y15-24 From 15 to 24 years
Y15-74 From 15 to 74 years
Y20-64 From 20 to 64 years
Y25-54 From 25 to 54 years
Y25-74 From 25 to 74 years
Y55-74 From 55 to 74 years
Name: AGE, dtype: object
CL_UNIT
THS_PER Thousand persons
PC_POP Percentage of total population
PC_ACT Percentage of active population
Name: UNIT, dtype: object
Next we download a dataset. To obtain data on Greece, Ireland and Spain only, we use codes from the code list ‘CL_GEO’ to specify a key for the dimension named ‘GEO’. We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned:
In [6]: resp = estat.data(
   ...:     'une_rt_a',
   ...:     key={'GEO': 'EL+ES+IE'},
   ...:     params={'startPeriod': '2007'},
   ...: )
   ...:
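To make the roles of the key and the query parameter more concrete, here is a minimal sketch of how such a request could map onto the path and query string of an SDMX REST data query. It is not the sdmx package's own implementation: the helpers build_key and build_query are hypothetical, the dimension order is an assumption copied from the index names of the converted data set, and the exact URL a given data source expects may differ.

```python
from urllib.parse import urlencode

# Assumed dimension order, copied from the index names of the converted data.
# TIME_PERIOD is left out because it is constrained via startPeriod/endPeriod.
DIMENSIONS = ["FREQ", "AGE", "UNIT", "SEX", "GEO"]


def build_key(key, dimensions=DIMENSIONS):
    """Hypothetical helper: join per-dimension codes into a dot-separated key.

    Dimensions missing from ``key`` become empty strings, which act as
    wildcards; several codes for one dimension are combined with '+'.
    """
    unknown = set(key) - set(dimensions)
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    return ".".join(key.get(dim, "") for dim in dimensions)


def build_query(flow_id, key, params):
    """Hypothetical helper: assemble the path and query string of a request."""
    return f"data/{flow_id}/{build_key(key)}?{urlencode(params)}"


query = build_query("une_rt_a", {"GEO": "EL+ES+IE"}, {"startPeriod": "2007"})
print(query)
```

Only the GEO position of the key is filled in; the four leading dimensions stay empty, so every code along them is requested.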
resp is now a DataMessage object. We use the built-in to_pandas() function to convert it to a pandas.Series, then select on the AGE dimension:
In [7]: data = (sdmx.to_pandas(resp)
   ...:         .xs('Y15-74', level='AGE', drop_level=False))
   ...:
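The effect of drop_level=False is easy to miss, so the following sketch shows it offline on a small hand-built series. The series toy is an illustrative stand-in, not the data returned by Eurostat: it has only two index levels, its four values are copied from the output shown further down this page, and the cross-section is taken on GEO rather than AGE.

```python
import pandas as pd

# Toy stand-in with two index levels; values copied from the page's output.
index = pd.MultiIndex.from_tuples(
    [("EL", "2007"), ("EL", "2008"), ("ES", "2007"), ("ES", "2008")],
    names=["GEO", "TIME_PERIOD"],
)
toy = pd.Series([8.4, 7.8, 8.2, 11.3], index=index, name="value")

# drop_level=False keeps the selected level in the index of the result...
kept = toy.xs("EL", level="GEO", drop_level=False)
# ...whereas the default removes it.
dropped = toy.xs("EL", level="GEO")

print(list(kept.index.names))
print(list(dropped.index.names))
```

Keeping the level is what allows a later selection to name the AGE code explicitly in a .loc tuple, since AGE remains part of the index.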
We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:
In [8]: data.index.names
Out[8]: FrozenList(['FREQ', 'AGE', 'UNIT', 'SEX', 'GEO', 'TIME_PERIOD'])
…and corresponding key values along these dimensions:
In [9]: data.index.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']])
Select some data of interest: show aggregate unemployment rates across ages (‘Y15-74’ on the AGE dimension) and sexes (‘T’ on the SEX dimension), expressed as a percentage of active population (‘PC_ACT’ on the UNIT dimension):
In [10]: data.loc[('A', 'Y15-74', 'PC_ACT', 'T')]
Out[10]:
GEO TIME_PERIOD
EL 2007 8.4
2008 7.8
2009 9.6
2010 12.7
2011 17.9
2012 24.5
2013 27.5
2014 26.5
2015 24.9
2016 23.6
2017 21.5
2018 19.3
2019 17.3
ES 2007 8.2
2008 11.3
2009 17.9
2010 19.9
2011 21.4
2012 24.8
2013 26.1
2014 24.5
2015 22.1
2016 19.6
2017 17.2
2018 15.3
2019 14.1
IE 2007 5.0
2008 6.8
2009 12.6
2010 14.6
2011 15.4
2012 15.5
2013 13.8
2014 11.9
2015 10.0
2016 8.4
2017 6.7
2018 5.8
2019 5.0
Name: value, dtype: float64
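The partial-tuple selection used in In [10] can also be tried without network access. The sketch below is an illustration under stated assumptions: toy is a hand-built series with four index levels instead of six, and its values are the 2018 and 2019 figures copied from the output above. The final unstack() call is one possible next step for reading the result as a table, not something the example above requires.

```python
import pandas as pd

# Toy stand-in: four index levels instead of six; values copied from above.
index = pd.MultiIndex.from_product(
    [["A"], ["PC_ACT"], ["EL", "ES", "IE"], ["2018", "2019"]],
    names=["FREQ", "UNIT", "GEO", "TIME_PERIOD"],
)
toy = pd.Series([19.3, 17.3, 15.3, 14.1, 5.8, 5.0], index=index, name="value")

# A partial tuple fixes the leading levels; the rest stay in the index.
selected = toy.loc[("A", "PC_ACT")]

# Move GEO into the columns: one row per year, one column per country.
table = selected.unstack("GEO")
print(table)
```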