Ten-line usage example

Suppose we want to analyze annual unemployment data for some European countries. All we need to know in advance is the data provider: Eurostat.

sdmx makes it easy to search the directory of dataflows and to browse the complete structural metadata about the data sets available through a selected dataflow. (This example skips these steps; see the walkthrough.)

The data we want is in a dataflow with the identifier ‘UNE_RT_A’. This dataflow references a data structure definition (DSD) that also has the ID ‘UNE_RT_A’. The DSD, in turn, contains or references all the metadata describing the data sets available through this dataflow: the concepts, things measured, dimensions, and lists of codes used to label each dimension.

In [1]: import sdmx

In [2]: estat = sdmx.Client("ESTAT")

Download the structural metadata and display the resulting message:

In [3]: metadata = estat.datastructure("UNE_RT_A")

In [4]: metadata
Out[4]: 
<sdmx.StructureMessage>
  <Header>
    id: 'DSD1667924407'
    prepared: '2022-11-08T16:20:07.160000+00:00'
    sender: <Agency ESTAT>
    source: 
    test: False
  response: <Response [200]>
  Codelist (6): SEX OBS_FLAG UNIT AGE GEO FREQ
  ConceptScheme (1): UNE_RT_A
  DataStructureDefinition (1): UNE_RT_A

Explore the contents of some code lists:

In [5]: for cl in "AGE", "UNIT":
   ...:     print(sdmx.to_pandas(metadata.codelist[cl]))
   ...: 
AGE
TOTAL                          Total
LFD                Late foetal death
LFD1     Late foetal death (group 1)
LFD2     Late foetal death (group 2)
MN0                     Zero minutes
                    ...             
AVG                          Average
NRP                      No response
NSP                    Not specified
OTH                            Other
UNK                          Unknown
Name: Age class, Length: 654, dtype: object
UNIT
TOTAL                                                        Total
NR                                                          Number
NR_HAB                                       Number per inhabitant
THS                                                       Thousand
MIO                                                        Million
                                       ...                        
PD_PCH_SM_NAC    Price index (implicit deflator), percentage ch...
CRC_MEUR                   Current replacement costs, million euro
CRC_MNAC         Current replacement costs, million units of na...
PYR_MEUR             Previous year replacement costs, million euro
PYR_MNAC         Previous year replacement costs, million units...
Name: Unit of measure, Length: 694, dtype: object

Next we download a data set containing a subset of the data from this dataflow, structured by this DSD. To obtain data on Greece, Ireland and Spain only, we use codes from the code list with the ID ‘GEO’ to specify a key for the dimension with the ID ‘geo’ (note the difference in case: SDMX IDs are case-sensitive). We also use a query parameter, ‘startPeriod’, to limit the time range of the data returned:

In [6]: resp = estat.data(
   ...:     "UNE_RT_A",
   ...:     key={"geo": "EL+ES+IE"},
   ...:     params={"startPeriod": "2007"},
   ...: )
   ...: 
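
For readers curious about what a key looks like on the wire, here is a minimal sketch of the SDMX REST key syntax, in which dimension positions are separated by ‘.’, alternative codes are joined with ‘+’, and an empty position means "all values". The helper build_key is hypothetical and not part of sdmx, and the dimension order is an assumption made for illustration; in practice the order comes from the DSD.

```python
# Hypothetical helper (not part of the sdmx package): join per-dimension
# code selections into an SDMX REST-style key string.
def build_key(dimension_order, selection):
    parts = []
    for dim in dimension_order:
        # A dimension without a selection becomes an empty position,
        # which acts as a wildcard for that dimension.
        codes = selection.get(dim, [])
        parts.append("+".join(codes))
    return ".".join(parts)


# Assumed dimension order, for illustration only.
order = ["freq", "age", "unit", "sex", "geo"]
key = build_key(order, {"geo": ["EL", "ES", "IE"]})
print(key)  # ....EL+ES+IE
```

Passing a dict, as in the call above, avoids having to know the dimension order or count the separators by hand.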

resp is now a DataMessage object. We use the sdmx.to_pandas() function to convert it to a pandas.Series, then select on the ‘age’ dimension:

In [7]: data = (
   ...:     sdmx.to_pandas(resp)
   ...:     .xs("Y15-74", level="age", drop_level=False)
   ...: )
   ...: 
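
The drop_level=False argument is what keeps ‘age’ in the index after the selection. The following sketch shows the difference on a tiny hand-built Series; the values are made up for illustration and are not Eurostat data.

```python
import pandas as pd

# Made-up values on a two-level index; not real unemployment figures.
index = pd.MultiIndex.from_product(
    [["Y15-24", "Y15-74"], ["EL", "IE"]], names=["age", "geo"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="value")

# Keep the selected level in the result...
kept = s.xs("Y15-74", level="age", drop_level=False)
# ...or let pandas drop it (the default behaviour).
dropped = s.xs("Y15-74", level="age")

print(list(kept.index.names))     # ['age', 'geo']
print(list(dropped.index.names))  # ['geo']
```

Keeping the level means later selections can still address every dimension by position.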

We can now explore the data set as expressed in a familiar pandas object. First, show dimension names:

In [8]: data.index.names
Out[8]: FrozenList(['freq', 'age', 'unit', 'sex', 'geo', 'TIME_PERIOD'])

…and corresponding key values along these dimensions:

In [9]: data.index.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-29', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']])
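
Note that the output above still lists every ‘age’ code, even though only “Y15-74” was selected: a pandas MultiIndex keeps unused level values after slicing. The sketch below, again on made-up data rather than the Eurostat series, shows one way to trim them with MultiIndex.remove_unused_levels().

```python
import pandas as pd

# Made-up values on a two-level index; not real unemployment figures.
index = pd.MultiIndex.from_product(
    [["Y15-24", "Y15-74"], ["EL", "IE"]], names=["age", "geo"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="value")
subset = s.xs("Y15-74", level="age", drop_level=False)

# The sliced index still remembers both age codes.
print(list(subset.index.levels[0]))

# remove_unused_levels() returns a new index listing only codes in use.
trimmed = subset.index.remove_unused_levels()
print(list(trimmed.levels[0]))
```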

Select some data of interest: show aggregate unemployment rates across ages (“Y15-74” on the ‘age’ dimension) and sexes (“T” on the ‘sex’ dimension), expressed as a percentage of active population (“PC_ACT” on the ‘unit’ dimension):

In [10]: data.loc[("A", "Y15-74", "PC_ACT", "T")]
Out[10]: 
geo  TIME_PERIOD
EL   2009            9.8
     2010           12.9
     2011           18.1
     2012           24.8
     2013           27.8
     2014           26.6
     2015           25.0
     2016           23.9
     2017           21.8
     2018           19.7
     2019           17.9
     2020           17.6
     2021           14.7
ES   2009           17.9
     2010           19.9
     2011           21.4
     2012           24.8
     2013           26.1
     2014           24.5
     2015           22.1
     2016           19.6
     2017           17.2
     2018           15.3
     2019           14.1
     2020           15.5
     2021           14.8
IE   2009           12.6
     2010           14.6
     2011           15.4
     2012           15.5
     2013           13.8
     2014           11.9
     2015            9.9
     2016            8.4
     2017            6.7
     2018            5.8
     2019            5.0
     2020            5.9
     2021            6.2
Name: value, dtype: float64
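
A long Series like this is often easier to compare across countries once ‘geo’ is pivoted into columns. The following is a minimal sketch of that reshaping with partial-tuple selection followed by unstack(); the index and values are made up for illustration and are not the figures shown above.

```python
import pandas as pd

# Made-up values; the index mimics a few of the dimensions used above.
index = pd.MultiIndex.from_product(
    [["A"], ["PC_ACT"], ["EL", "IE"], ["2020", "2021"]],
    names=["freq", "unit", "geo", "TIME_PERIOD"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="value")

# Select on the leading levels, leaving 'geo' and 'TIME_PERIOD'.
rates = s.loc[("A", "PC_ACT")]

# Pivot 'geo' into columns: one row per period, one column per country.
table = rates.unstack("geo")
print(table)
```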