Ten-line usage example

Suppose we want to analyze annual unemployment data for some European countries.

All we need to know in advance is the data provider: Eurostat. sdmx makes it easy to inspect all data flows available from this provider. [1] The data we want is in a data flow with the identifier ‘UNE_RT_A’. The description of this data flow references a data structure definition (DSD) that happens to also have the ID ‘UNE_RT_A’. [2]

First we create a Client that we will use to make multiple queries to this provider’s SDMX-REST web service:

In [1]: import sdmx

In [2]: estat = sdmx.Client("ESTAT")

Next, we download a structure message containing the DSD and other structural information that it references. These include structural metadata that together completely describe the data available through this dataflow: the concepts, things measured, dimensions, lists of codes used to label each dimension, attributes, and so on:

In [3]: sm = estat.datastructure("UNE_RT_A", params=dict(references="descendants"))

In [4]: sm
Out[4]: 
<sdmx.StructureMessage>
  <Header>
    id: 'DSD1710478906'
    prepared: '2024-03-15T05:01:46.496000+00:00'
    sender: <Agency ESTAT>
    source: 
    test: False
  response: <Response [200]>
  Codelist (8): ESTAT:FREQ(3.2) ESTAT:AGE(7.0) ESTAT:SEX(1.10) ESTAT:OB...
  ConceptScheme (1): UNE_RT_A
  DataStructureDefinition (1): UNE_RT_A

sm is a Python object of class StructureMessage. We can explore some of the specific artifacts (for example, two code lists) by using StructureMessage.get() to retrieve them and to_pandas() to convert each one to a pandas.Series:

In [5]: for cl in "AGE", "SEX":
   ...:     print(sdmx.to_pandas(sm.get(cl)))
   ...: 
AGE
TOTAL                          Total
LFD                Late foetal death
LFD1     Late foetal death (group 1)
LFD2     Late foetal death (group 2)
MN0                     Zero minutes
                    ...             
AVG                          Average
NRP                      No response
NSP                    Not specified
OTH                            Other
UNK                          Unknown
Name: Age class, Length: 657, dtype: object
SEX
T                                               Total
M                                               Males
F                                             Females
DIFF    Absolute difference between males and females
NAP                                    Not applicable
NRP                                       No response
UNK                                           Unknown
Name: Sex, dtype: object
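
Because each converted code list is a pandas.Series indexed by code ID, it can also serve as a lookup table from codes to human-readable labels. The following is a minimal sketch of that idea. It rebuilds part of the ‘SEX’ list by hand so that it runs without a network connection; in the session above, the same kind of object would come from sdmx.to_pandas(sm.get("SEX")).

```python
import pandas as pd

# Hand-copied from the output above; stands in for sdmx.to_pandas(sm.get("SEX")).
sex = pd.Series({"T": "Total", "M": "Males", "F": "Females"}, name="Sex")
sex.index.name = "SEX"

# Look up the label for a single code.
print(sex["F"])

# Translate several codes at once, for example to label a table or a plot.
labels = pd.Index(["F", "M", "T"]).map(sex)
print(list(labels))
```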

Next, we download a data set containing a portion of the data in this data flow, structured by this DSD. To obtain data only for Greece, Ireland and Spain, we use codes from the code list with the ID ‘GEO’ (‘EL’, ‘IE’ and ‘ES’, respectively) to specify a key for the dimension with the ID ‘geo’. [3] We also use a query parameter, ‘startPeriod’, to limit the scope of the data returned along the ‘TIME_PERIOD’ dimension. The query returns a data message (a Python object of class DataMessage) containing the data set:

In [6]: dm = estat.data(
   ...:     "UNE_RT_A",
   ...:     key={"geo": "EL+ES+IE"},
   ...:     params={"startPeriod": "2014"},
   ...: )
   ...: 
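
For orientation: in an SDMX-REST data query, a key has one position per dimension, in the order given by the DSD. Positions are separated by periods, ‘+’ joins alternative codes, and an empty position means “any value”. The following minimal sketch shows how a dict like the one above could be expanded into such a key string. The helper make_key is hypothetical and is not part of sdmx, which performs this step internally. The dimension order is an assumption taken from the index names shown further on, with the time dimension left out because it is filtered through ‘startPeriod’ instead.

```python
def make_key(dimensions, selection):
    """Join codes into a dot-separated key, one position per dimension."""
    unknown = set(selection) - set(dimensions)
    if unknown:
        raise ValueError(f"not dimensions of this structure: {sorted(unknown)}")
    # An empty position places no restriction on that dimension.
    return ".".join(selection.get(dim, "") for dim in dimensions)


# Assumed order of the non-time dimensions of the 'UNE_RT_A' structure.
DIMENSIONS = ["freq", "age", "unit", "sex", "geo"]

key = make_key(DIMENSIONS, {"geo": "EL+ES+IE"})
print(key)
```

Passing a dimension ID that the structure does not define raises an error in this sketch, which is a cheap way to catch typos before a query is sent.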

We again use to_pandas() to convert the entire dm to a pandas.Series with a multi-level index (one level per dimension of the DSD). Then we can use pandas’ built-in methods, such as pandas.Series.xs(), to take a cross-section, selecting on the ‘age’ index level (which corresponds to the SDMX dimension with the same ID):

In [7]: data = (
   ...:     sdmx.to_pandas(dm)
   ...:     .xs("Y15-74", level="age", drop_level=False)
   ...: )
   ...: 
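
The drop_level=False argument matters here: by default, xs() removes the level it selects on, whereas the call above keeps ‘age’ in the index. The following sketch illustrates the difference on a tiny hand-built series; the index labels are borrowed from this example, but the values are placeholders and not real data.

```python
import pandas as pd

# A tiny stand-in for the converted data set; the values are placeholders.
index = pd.MultiIndex.from_product(
    [["Y15-24", "Y15-74"], ["EL", "IE"]], names=["age", "geo"]
)
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="value")

kept = series.xs("Y15-74", level="age", drop_level=False)
dropped = series.xs("Y15-74", level="age")  # drop_level defaults to True

print(list(kept.index.names))     # both levels remain
print(list(dropped.index.names))  # the 'age' level is gone
```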

We can further examine the retrieved data set in the familiar form of a pandas.Series. For example, we can show the dimension names:

In [8]: data.index.names
Out[8]: FrozenList(['freq', 'age', 'unit', 'sex', 'geo', 'TIME_PERIOD'])

…and corresponding key values along these dimensions:

In [9]: data.index.levels
Out[9]: FrozenList([['A'], ['Y15-24', 'Y15-29', 'Y15-74', 'Y20-64', 'Y25-54', 'Y25-74', 'Y55-74'], ['PC_ACT', 'PC_POP', 'THS_PER'], ['F', 'M', 'T'], ['EL', 'ES', 'IE'], ['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']])
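
Note that the output above lists seven codes on the ‘age’ level even though the cross-section kept only ‘Y15-74’. This is pandas behaviour rather than anything specific to SDMX: a MultiIndex retains its full set of level labels after a selection. The following sketch, again using a hand-built series with placeholder values, shows two ways to see only the labels that are actually present.

```python
import pandas as pd

# Placeholder data with two codes on the 'age' level.
index = pd.MultiIndex.from_product(
    [["Y15-24", "Y15-74"], ["EL", "IE"]], names=["age", "geo"]
)
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="value")
subset = series.xs("Y15-74", level="age", drop_level=False)

# .levels still reports every label of the original index.
stale = list(subset.index.levels[0])

# Option 1: read the labels that actually occur in the subset.
present = list(subset.index.get_level_values("age").unique())

# Option 2: rebuild the index without the unused labels.
trimmed = list(subset.index.remove_unused_levels().levels[0])

print(stale, present, trimmed)
```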

Finally, we select some data of interest: the aggregate unemployment rates across ages (“Y15-74” on the ‘age’ dimension) and sexes (“T” on the ‘sex’ dimension), expressed as a percentage of the active population (“PC_ACT” on the ‘unit’ dimension):

In [10]: data.loc[("A", "Y15-74", "PC_ACT", "T")]
Out[10]: 
geo  TIME_PERIOD
EL   2014           26.6
     2015           25.0
     2016           23.9
     2017           21.8
     2018           19.7
     2019           17.9
     2020           17.6
     2021           14.7
     2022           12.5
     2023           11.1
ES   2014           24.5
     2015           22.1
     2016           19.6
     2017           17.2
     2018           15.3
     2019           14.1
     2020           15.5
     2021           14.8
     2022           12.9
     2023           12.1
IE   2014           11.9
     2015            9.9
     2016            8.4
     2017            6.7
     2018            5.8
     2019            5.0
     2020            5.9
     2021            6.2
     2022            4.5
     2023            4.3
Name: value, dtype: float64
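
From here, ordinary pandas operations apply. As one possible next step, the sketch below pivots the result into a table with one column per country and computes the change between the first and last years. To stay runnable offline, it rebuilds only the 2014 and 2023 observations by hand from the output above; with the live session, the same calls could be applied to the selection itself.

```python
import pandas as pd

# First and last observations per country, copied from the output above.
rates = pd.Series(
    {
        ("EL", "2014"): 26.6, ("EL", "2023"): 11.1,
        ("ES", "2014"): 24.5, ("ES", "2023"): 12.1,
        ("IE", "2014"): 11.9, ("IE", "2023"): 4.3,
    },
    name="value",
)
rates.index.names = ["geo", "TIME_PERIOD"]

# One column per country, one row per year.
wide = rates.unstack("geo")

# Change in percentage points between the two years, rounded to one decimal.
change = (wide.loc["2023"] - wide.loc["2014"]).round(1)

print(wide)
print(change)
```

In this subset, the unemployment rate fell in all three countries, with the largest drop (15.5 percentage points) in Greece.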