Walkthrough¶
This page walks through an example sdmx
workflow, providing explanations of some SDMX concepts along the way.
See also Resources, HOWTOs for miscellaneous tasks, and follow links to the Glossary where some terms are explained.
SDMX workflow¶
Working with statistical data often includes some or all of the following steps.
sdmx
builds on SDMX features to make the steps straightforward:
- Choose a data provider.
sdmx
provides a built-in list of Data sources.
- Investigate what data is available.
Using
sdmx
, download the catalogue of data flows available from the data provider and select a data flow for further inspection.
- Understand what form the data comes in.
Using
sdmx
, download structure and metadata on the selected data flow and the data it contains, including the data structure definition, concepts, code lists and content constraints.
- Download the actual data.
Using
sdmx
, specify the needed portions of the data from the data flow by constructing a selection (‘key’) of series and a period/time range. Then, retrieve the data usingClient.get()
.
- Analyze or manipulate the data.
Convert to
pandas
types usingsdmx.to_pandas()
and use the result in further Python code and scripts.
Choose and connect to an SDMX web service¶
First, we instantiate a sdmx.Client
object, using the string ID of a data source recognized by sdmx
:
In [1]: import sdmx
In [2]: ecb = sdmx.Client("ECB")
The object ecb
is now ready to make multiple data and metadata queries to the European Central Bank’s web service.
To send requests to multiple web services, we could instantiate multiple Clients
.
Configure the HTTP connection¶
sdmx
builds on the widely-used requests
Python HTTP library.
To pre-configure all queries made by a Client
, we can pass any of the keyword arguments recognized by requests.request()
.
For example, a proxy server can be specified:
In [3]: ecb_via_proxy = sdmx.Client(
...: "ECB",
...: proxies={"http": "http://1.2.3.4:5678"}
...: )
...:
The session
attribute is a Session
object that can be used to inspect and modify configuration between queries:
In [4]: ecb_via_proxy.session.proxies
Out[4]: {'http': 'http://1.2.3.4:5678'}
For convenience, Session.timeout
stores the timeout in seconds for HTTP requests, and is passed automatically for all queries.
Cache HTTP responses and parsed objects¶
Added in version 0.3.0.
If requests_cache
is installed, it is used automatically by Session
.
To configure it, we can pass any of the arguments accepted by requests_cache.core.CachedSession
when creating a Client
.
For example, to force requests_cache
to use SQLite to store cached data with the fast_save
option, and expire cache entries after 10 minutes:
In [5]: ecb_with_cache = sdmx.Client(
...: "ECB",
...: backend="sqlite",
...: fast_save=True,
...: expire_after=600,
...: )
...:
Client
provides an optional, simple cache for retrieved and parsed Message
instances, where the cache key is the constructed query URL.
This cache is disabled by default; to activate it, supply use_cache=True to the constructor.
Obtain and explore metadata¶
This section illustrates how to download and explore metadata. Suppose we are looking for time-series on exchange rates, and we know that the European Central Bank provides a relevant data flow.
We could search the Internet for the dataflow ID or browse the ECB’s website.
However, we can also use sdmx
to retrieve metadata and get a complete overview of the data flows the ECB provides.
Get information about the source’s data flows¶
We use sdmx
to download the definitions for all data flows available from our chosen source.
We could call Client.get()
with [resource_type=]"dataflow"
as the first argument, but can also use a shorter alias:
In [6]: flow_msg = ecb.dataflow()
The query returns a Message
instance.
We can also see the URL that was queried and the response headers by accessing the Message.response
attribute:
In [7]: flow_msg.response.url
Out[7]: 'https://data-api.ecb.europa.eu/service/dataflow/ECB/all/latest'
In [8]: flow_msg.response.headers
Out[8]: {'Server': 'myracloud', 'Date': 'Wed, 15 Jan 2025 18:12:13 GMT', 'Content-Type': 'application/vnd.sdmx.structure+xml;version=2.1', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Expires': 'Wed, 15 Jan 2025 18:12:42 GMT', 'cache-control': 'max-age=30', 'ETag': '"myra-67ea9a13"', 'X-CDN': '1'}
All the content of the response—SDMX data and metadata objects—has been parsed and is accessible from flow_msg
.
Let’s find out what we have received:
In [9]: flow_msg
Out[9]:
<sdmx.StructureMessage>
<Header>
id: 'IREF647321'
prepared: '2025-01-15T18:12:12+00:00'
receiver: <Agency not_supplied>
sender: <Agency Unknown>
source:
test: False
response: <Response [200]>
DataflowDefinition (98): AME BKN BLS BNT BOP BPS BSI BSP CBD CBD2 CCP...
DataStructureDefinition (76): ECB_AME1 ECB_BKN1 ECB_BLS1 ECB_BOP_BNT ...
The string representation of the Message shows us a few things:
This is a Structure-, rather than DataMessage.
It contains 67
DataflowDefinition
objects. Because we didn’t specify an ID of a particular data flow, we received the definitions for all data flows available from the ECB web service.The first of these have ID attributes like ‘AME’, ‘BKN’, …
We could inspect these each individually using StructureMessage.dataflow
attribute, a DictLike
object that allows attribute- and index-style access:
In [10]: flow_msg.dataflow.BOP
Out[10]: <DataflowDefinition ECB:BOP(1.0): Euro Area Balance of Payments and International Investment Position Statistics>
Convert metadata to pandas.Series
¶
However, an easier way is to use sdmx.to_pandas()
to convert some of the information to a pandas.Series
:
In [11]: dataflows = sdmx.to_pandas(flow_msg.dataflow)
In [12]: dataflows.head()
Out[12]:
AME AMECO
BKN Banknotes statistics
BLS Bank Lending Survey Statistics
BNT Shipments of Euro Banknotes Statistics (ESCB)
BOP Euro Area Balance of Payments and Internationa...
dtype: object
In [13]: len(dataflows)
Out[13]: 98
to_pandas()
accepts most instances and Python collections of sdmx.model
objects, and we can use keyword arguments to control how each of these is handled.
See the method documentation for details.
As we are interested in exchange rate data, let’s use built-in Pandas methods to find an appropriate data flow:
In [14]: dataflows[dataflows.str.contains("exchange", case=False)]
Out[14]:
EXR Exchange Rates
FXI Foreign Exchange Statistics
SEE Securities exchange - Trading Statistics
dtype: object
We decide to look at ‘EXR’.
Some agencies, including ECB and INSEE, offer categorizations of data flows to help with this step. See this HOWTO entry.
Understand constraints¶
The CURRENCY
and CURRENCY_DENOM
dimensions of this DSD are both represented using the same CL_CURRENCY
code list.
In order to be reusable for as many data sets as possible, this code list is extensive and complete:
In [29]: len(exr_msg.codelist.CL_CURRENCY)
Out[29]: 369
However, the European Central Bank does not, in its ‘EXR’ data flow, commit to providing exchange rates between—for instance—the Congolese franc (‘CDF’) and Peruvian sol (‘PEN’).
In other words, the values of (CURRENCY
, CURRENCY_DENOM
) that we can expect to find in ‘EXR’ is much smaller than the 359 × 359 possible combinations of two values from CL_CURRENCY
.
How much smaller?
Let’s return to explore the ContentConstraint
that came with our metadata query:
In [30]: exr_msg.constraint.EXR_CONSTRAINTS
Out[30]: <ContentConstraint EXR_CONSTRAINTS: Constraints for the EXR dataflow.>
# Get the content 'region' included in the constraint
In [31]: cr = exr_msg.constraint.EXR_CONSTRAINTS.data_content_region[0]
# Get the valid members for two dimensions
In [32]: c1 = sdmx.to_pandas(cr.member["CURRENCY"].values)
# Convert list() to set()
In [33]: c1 = set(c1)
In [34]: len(c1)
Out[34]: 54
In [35]: c2 = sdmx.to_pandas(cr.member["CURRENCY_DENOM"].values)
In [36]: c2 = set(c2)
In [37]: len(c2)
Out[37]: 62
# Explore the contents
# Currencies that are valid for CURRENCY_DENOM, but not CURRENCY
In [38]: c2 - c1
Out[38]:
{'AED',
'ATS',
'BEF',
'CLP',
'COP',
'DEM',
'ESP',
'EUR',
'FIM',
'FRF',
'IEP',
'ITL',
'LUF',
'NLG',
'PEN',
'PTE',
'SAR',
'UAH'}
# The opposite:
In [39]: c1 - c2
Out[39]: {'E01', 'E02', 'E03', 'EGP', 'H00', 'H01', 'H02', 'H03', 'H37', 'H42'}
# Check certain contents
In [40]: {"CDF", "PEN"} < c1 | c2
Out[40]: False
In [41]: {"USD", "JPY"} < c1 & c2
Out[41]: True
We also see that ‘USD’ and ‘JPY’ are valid values along both dimensions.
Attribute names and allowed values can be obtained in a similar fashion.
Select and query data from a dataflow¶
Next, we will query for some data.
The step is simple: call Client.get()
with resource_type=”data” as the first argument, or the alias Client.data()
.
First, however, we describe some of the many options offered by SDMX and sdmx
for data queries.
Choose a data format¶
Web services offering SDMX-ML–formatted DataMessages
can return them in one of two formats:
- Generic data
use XML elements that explicitly identify whether values associated with an Observation are dimensions, or attributes.
For example, in the ‘EXR’ data flow, the XML content for the
CURRENCY_DENOM
dimension and for theOBS_STATUS
attribute are stored differently:<generic:Obs> <generic:ObsKey> <!-- NB. Other dimensions omitted. --> <generic:Value value="EUR" id="CURRENCY_DENOM"/> <!-- … --> </generic:ObsKey> <generic:ObsValue value="0.82363"/> <generic:Attributes> <!-- NB. Other attributes omitted. --> <generic:Value value="A" id="OBS_STATUS"/> <!-- … --> </generic:Attributes> </generic:Obs>
- Structure-specific data
use a more concise format:
<!-- NB. Other dimensions and attributes omitted: --> <Obs CURRENCY_DENOM="EUR" OBS_VALUE="0.82363" OBS_STATUS="A" />
This can result in much smaller messages. However, because this format does not distinguish dimensions and attributes, it cannot be properly parsed by
sdmx
without separately obtaining the data structure definition.
sdmx
adds appropriate HTTP headers for retrieving structure-specific data (see implementation notes).
In general, to minimize queries and message size:
First query for the DSD associated with a data flow.
When requesting data, pass the obtained object as the dsd= argument to
Client.get()
orClient.data()
.
This allows sdmx
to retrieve structure-specific data whenever possible.
It can also avoid an additional request when validating data query keys (below).
Construct a selection key for a query¶
SDMX web services can offer access to very large data flows. Queries for all the data in a data flow are not usually necessary, and in some cases servers will refuse to respond. By selecting a subset of data, performance is increased.
The SDMX REST API offers two ways to narrow a data request:
specify a key, i.e. values for 1 or more dimensions to be matched by returned Observations and SeriesKeys. The key is included as part of the URL constructed for the query. Using
sdmx
, a key is specified by the key= argument toClient.get()
.limit the time period, using the HTTP parameters ‘startPeriod’ and ‘endPeriod’. Using
sdmx
, these are specified using the params= argument toClient.get()
.
From the ECB’s dataflow on exchange rates, we specify the CURRENCY
dimension to contain either of the codes ‘USD’ or ‘JPY’.
The documentation for Client.get()
describes the multiple forms of the key argument and the validation applied.
The following are all equivalent:
In [42]: key = dict(CURRENCY=["USD", "JPY"])
In [43]: key = ".USD+JPY..."
We also set a start period to exclude older data:
In [44]: params = dict(startPeriod="2016")
Another way to validate a key against valid codes are series-key-only datasets, i.e. a dataset with all possible series keys where no series contains any observation.
sdmx
supports this validation method as well.
However, it is disabled by default.
Pass series_keys=True
to the Client method to validate a given key against a series-keys only dataset rather than the DSD.
Query data¶
Finally, we request the data in generic format:
In [45]: import sys
In [46]: ecb = sdmx.Client("ECB", backend="memory")
In [47]: data_msg = ecb.data("EXR", key=key, params=params)
# Generic data was returned
In [48]: data_msg.response.headers["content-type"]
Out[48]: 'application/vnd.sdmx.genericdata+xml;version=2.1'
# Number of bytes in the cached response
In [49]: bytes1 = sys.getsizeof(ecb.session.cache.responses.popitem()[1]._content)
In [50]: bytes1
Out[50]: 1210515
To demonstrate a query for a structure-specific data set, we pass the DSD obtained in the previous section:
In [51]: ss_msg = ecb.data("EXR", key=key, params=params, dsd=dsd)
# Structure-specific data was requested and returned
In [52]: ss_msg.response.request.headers["accept"]
Out[52]: 'application/vnd.sdmx.structurespecificdata+xml;version=2.1'
In [53]: ss_msg.response.headers["content-type"]
Out[53]: 'application/vnd.sdmx.structurespecificdata+xml;version=2.1'
# Number of bytes in the cached response
In [54]: bytes2 = sys.getsizeof(ecb.session.cache.responses.popitem()[1]._content)
In [55]: bytes2 / bytes1
Out[55]: 0.33991152525991003
The structure-specific message is a fraction of the size of the generic message.
In [56]: data = data_msg.data[0]
In [57]: type(data)
Out[57]: sdmx.model.v21.GenericDataSet
In [58]: len(data.series)
Out[58]: 16
In [59]: list(data.series.keys())[5]
Out[59]: <SeriesKey: FREQ=D, CURRENCY=USD, CURRENCY_DENOM=EUR, EXR_TYPE=SP00, EXR_SUFFIX=A>
In [60]: set(series_key.FREQ for series_key in data.series.keys())
Out[60]:
{<KeyValue: FREQ=A>,
<KeyValue: FREQ=D>,
<KeyValue: FREQ=H>,
<KeyValue: FREQ=M>,
<KeyValue: FREQ=Q>}
This dataset thus comprises 16 time series of several different period lengths.
We could have chosen to request only daily data in the first place by providing the value ‘D’ for the FREQ
dimension.
In the next section we will show how columns from a dataset can be selected through the information model when writing to a pandas
object.
Convert data to pandas¶
Select columns using the model API¶
As we want to write data to a pandas DataFrame rather than an iterator of pandas Series, we avoid mixing up different frequencies as pandas may raise an error when passed data with incompatible frequencies.
Therefore, we single out the series with daily data.
to_pandas()
accepts an optional iterable to select a subset of the series contained in the dataset.
Thus we can now generate our pandas DataFrame from daily exchange rate data only:
In [61]: import pandas as pd
In [62]: daily = [s for sk, s in data.series.items() if sk.FREQ == "D"]
In [63]: cur_df = pd.concat(sdmx.to_pandas(daily))
In [64]: cur_df.shape
Out[64]: (4630,)
In [65]: cur_df.tail()
Out[65]:
FREQ CURRENCY CURRENCY_DENOM EXR_TYPE EXR_SUFFIX TIME_PERIOD
D USD EUR SP00 A 2025-01-09 1.0305
2025-01-10 1.0304
2025-01-13 1.0198
2025-01-14 1.0245
2025-01-15 1.0300
Name: value, dtype: float64
Convert dimensions to pandas.DatetimeIndex
or PeriodIndex
¶
SDMX datasets often have a Dimension
with a name like TIME_PERIOD
.
To ease further processing of time-series data read from SDMX messages, write_dataset()
provides a datetime argument to convert these into pandas.DatetimeIndex
and PeriodIndex
classes.
For multi-dimensional datasets, write_dataset()
usually returns a pandas.Series
with a MultiIndex
that has one level for each dimension.
However, MultiIndex and DatetimeIndex/PeriodIndex are incompatible; it is not possible to use pandas’ date/time features for just one level of a MultiIndex (e.g. TIME_PERIOD
) while using other types for the other levels/dimensions (e.g. strings for CURRENCY
).
For this reason, when the datetime argument is used, write_dataset()
returns a DataFrame
: the DatetimeIndex/PeriodIndex is used along axis 0, and all other dimensions are collected in a MultiIndex on axis 1.
An example, using the same data flow as above:
In [66]: key = dict(CURRENCY_DENOM="EUR", FREQ="M", EXR_SUFFIX="A")
In [67]: params = dict(startPeriod="2019-01", endPeriod="2019-06")
In [68]: data = ecb.data("EXR", key=key, params=params).data[0]
Without date-time conversion, to_pandas()
produces a MultiIndex:
In [69]: sdmx.to_pandas(data)
Out[69]:
FREQ CURRENCY CURRENCY_DENOM EXR_TYPE EXR_SUFFIX TIME_PERIOD
M ARS EUR SP00 A 2019-01 42.736877
2019-02 43.523655
2019-03 46.479714
2019-04 48.520795
2019-05 50.155077
...
ZAR EUR SP00 A 2019-02 15.687945
2019-03 16.250743
2019-04 15.895875
2019-05 16.137123
2019-06 16.474880
Name: value, Length: 288, dtype: float64
With date-time conversion, it produces a DatetimeIndex:
In [70]: df1 = sdmx.to_pandas(data, datetime="TIME_PERIOD")
In [71]: df1.index
Out[71]:
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
'2019-05-01', '2019-06-01'],
dtype='datetime64[ns]', name='TIME_PERIOD', freq=None)
In [72]: df1
Out[72]:
FREQ M ...
CURRENCY ARS AUD BGN ... TWD USD ZAR
CURRENCY_DENOM EUR EUR EUR ... EUR EUR EUR
EXR_TYPE SP00 SP00 SP00 ... SP00 SP00 SP00
EXR_SUFFIX A A A ... A A A
TIME_PERIOD ...
2019-01-01 42.736877 1.597514 1.9558 ... 35.201205 1.141641 15.816950
2019-02-01 43.523655 1.589500 1.9558 ... 34.963125 1.135115 15.687945
2019-03-01 46.479714 1.595890 1.9558 ... 34.877605 1.130248 16.250743
2019-04-01 48.520795 1.580175 1.9558 ... 34.676925 1.123825 15.895875
2019-05-01 50.155077 1.611641 1.9558 ... 34.967468 1.118459 16.137123
2019-06-01 49.506670 1.626430 1.9558 ... 35.332025 1.129340 16.474880
[6 rows x 48 columns]
Use the advanced functionality to specify a dimension for the frequency of a PeriodIndex, and change the orientation so that the PeriodIndex is on the columns:
In [73]: df2 = sdmx.to_pandas(
....: data,
....: datetime=dict(dim="TIME_PERIOD", freq="FREQ", axis=1))
....:
In [74]: df2.columns
Out[74]: PeriodIndex(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06'], dtype='period[M]', name='TIME_PERIOD')
In [75]: df2
Out[75]:
TIME_PERIOD 2019-01 ... 2019-06
CURRENCY CURRENCY_DENOM EXR_TYPE EXR_SUFFIX ...
ARS EUR SP00 A 42.736877 ... 49.506670
AUD EUR SP00 A 1.597514 ... 1.626430
BGN EUR SP00 A 1.955800 ... 1.955800
BRL EUR SP00 A 4.269982 ... 4.360035
CAD EUR SP00 A 1.519614 ... 1.501140
CHF EUR SP00 A 1.129700 ... 1.116705
CNY EUR SP00 A 7.750350 ... 7.793685
CZK EUR SP00 A 25.650091 ... 25.604800
DKK EUR SP00 A 7.465736 ... 7.466945
DZD EUR SP00 A 135.144305 ... 134.583205
E01 EUR EN00 A 101.445100 ... 101.220400
ERC0 A 103.076800 ... 102.810600
ERP0 A 100.300000 ... 99.381200
E02 EUR EN00 A 98.692700 ... 98.642700
ERC0 A 94.410500 ... 94.006000
ERP0 A 93.586900 ... 93.125400
E03 EUR EN00 A 116.357600 ... 116.223200
ERC0 A 93.834200 ... 93.260200
GBP EUR SP00 A 0.886030 ... 0.891073
H37 EUR NN00 A 110.776532 ... 112.854049
NRC0 A 106.060787 ... 108.429782
H42 EUR NN00 A 111.614047 ... 112.946142
NRC0 A 106.075411 ... 108.118356
HKD EUR SP00 A 8.952745 ... 8.838280
HRK EUR SP00 A 7.428550 ... 7.407885
HUF EUR SP00 A 319.800455 ... 322.558500
IDR EUR SP00 A 16164.770000 ... 16060.272000
ILS EUR SP00 A 4.207545 ... 4.062430
INR EUR SP00 A 80.798273 ... 78.407800
ISK EUR SP00 A 136.659091 ... 140.820000
JPY EUR SP00 A 124.341364 ... 122.080500
KRW EUR SP00 A 1281.459091 ... 1325.280500
MAD EUR SP00 A 10.882377 ... 10.847505
MXN EUR SP00 A 21.898509 ... 21.782680
MYR EUR SP00 A 4.700118 ... 4.696800
NOK EUR SP00 A 9.763105 ... 9.746480
NZD EUR SP00 A 1.685018 ... 1.711855
PHP EUR SP00 A 59.882455 ... 58.425100
PLN EUR SP00 A 4.291595 ... 4.263505
RON EUR SP00 A 4.706182 ... 4.725005
RUB EUR SP00 A 76.305455 ... 72.402775
SEK EUR SP00 A 10.268541 ... 10.626295
SGD EUR SP00 A 1.548614 ... 1.539010
THB EUR SP00 A 36.318273 ... 35.138600
TRY EUR SP00 A 6.136482 ... 6.561920
TWD EUR SP00 A 35.201205 ... 35.332025
USD EUR SP00 A 1.141641 ... 1.129340
ZAR EUR SP00 A 15.816950 ... 16.474880
[48 rows x 6 columns]
Warning
For large datasets, parsing datetimes may reduce performance.
Work with files¶
Client.get()
accepts the optional keyword argument tofile.
If given, the response from the web service is written to the specified file, and the parse Message
returned.
Added in version 0.2.1.
read_sdmx()
can be used to load SDMX messages stored in local files:
# Use an example ('specimen') file from the test suite
from sdmx.testing import SpecimenCollection
specimen = SpecimenCollection("/path/to/sdmx-test-data")
# …with time-series exchange rate data from the EU Central Bank
with specimen("ECB_EXR/ng-ts.xml") as f:
sdmx.read_sdmx(f)
Handle errors¶
Message.response
carries the requests.Response.status_code
attribute;
in the successful queries above, the status code is 200
.
The SDMX web services guidelines explain the meaning of other codes.
In addition, if the SDMX server has encountered an error, it may return a Message with a footer containing explanatory notes.
sdmx
exposes footer content as Message.footer
and Footer.text
.
Note
sdmx
raises only HTTP errors with status code between 400 and 499.
Codes >= 500 do not raise an error as the SDMX web services guidelines define special meanings to those codes.
The caller must therefore raise an error if needed.