Plotting time series using Python's ggplot module

The ggplot2 library for the R programming language provides a facet_wrap function which is useful for visualising complex data sets. The following plots show mean monthly nitrogen dioxide levels from 2007 to 2016 for various cities:


As stated in the documentation for the Python ggplot module, ‘R is a little weird’. Fortunately, this module enables useful ggplot2 features to be used in Python. The following code gives a very similar output to the above:


Influence of wind speed on particulate matter


Wind has a large effect on air pollutants by either importing or diluting and dispersing them. Plots of mean concentrations of particulate matter (PM) fractions measured at three types of monitoring sites show common trends as well as notable differences:

Marylebone Rd and N. Kensington are urban roadside and urban background sites, respectively. Plots for the two sites show negative correlations between wind speed and fine PM. Levels at N. Kensington are more sensitive to wind speed and decline steeply until reaching background levels. Wind shows no significant effect on the coarse fraction.

At Middlesbrough, an urban industrial site, levels of fine PM follow a similar trend to N. Kensington. For coarse PM, there is no significant correlation until wind speeds exceed 14 m/s, above which there is a steep linear increase. This may reflect the resuspension of dust from a nearby steelworks site which is an important local source of PM10.

Analysis in Python

The FacetGrid class from the Seaborn library is useful for visualising the relationship between two variables where the relationship is conditioned by some other variable(s). It takes in a pandas DataFrame as the data source and draws multiple plots of the same relationship for different levels of a third variable. Different colours may be used to represent levels of another variable, as shown in a previous post:

To visualise the relationship between wind speed and the PM fractions, a DataFrame containing mean PM concentration values for each 1 m/s increment in wind speed can be created from the values returned by the groupby() method along with mean().
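The grouping step can be sketched as follows. This is a minimal illustration rather than the original analysis code: the column names (`ws`, `roadside_pm25`, `background_pm25`) and the values are hypothetical, and flooring the wind speed is just one way of forming 1 m/s increments.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data: wind speed (m/s) and two PM columns.
hourly = pd.DataFrame({
    "ws": [0.4, 0.9, 1.2, 1.7, 2.1, 2.8],
    "roadside_pm25": [30.0, 26.0, 22.0, 20.0, 15.0, 13.0],
    "background_pm25": [20.0, 18.0, 12.0, 10.0, 8.0, 6.0],
})

# Assign each reading to a 1 m/s wind speed bin (bin 0 covers 0.0-0.99 m/s).
ws_bin = np.floor(hourly["ws"]).astype(int).rename("ws_bin")

# Mean PM concentration in each bin; the result is indexed by ws_bin.
pm_by_ws = hourly.groupby(ws_bin)[["roadside_pm25", "background_pm25"]].mean()
print(pm_by_ws)
```

The result is a wide table with one row per wind speed bin and one column per measured series.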

To obtain a data structure which can be plotted using FacetGrid, a column containing values for the conditioning variable(s) is needed. In the DataFrame shown above the values for the conditioning variable are contained within the column labels. It therefore needs rearranging to create a table structure where there is a column providing the corresponding site type and PM fraction for each numerical value.

This can be achieved with the melt() method from the pandas library, which converts a DataFrame from a wide format into a long format. The columns set as measured variables (value_vars), or any columns not set as identification variables (id_vars), are “unpivoted” to the row axis. This leaves just two non-identifier columns, ‘variable’ and ‘value’.
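A small sketch of the wide-to-long conversion is given below. The column labels such as `roadside_fine` are assumptions made for illustration; the extra step of splitting the labels into `site_type` and `fraction` columns is one possible way to obtain separate conditioning variables.

```python
import pandas as pd

# Hypothetical wide table: one row per wind speed bin, one column per
# combination of site type and PM fraction.
wide = pd.DataFrame({
    "ws_bin": [0, 1, 2],
    "roadside_fine": [28.0, 21.0, 14.0],
    "roadside_coarse": [9.0, 9.5, 9.2],
    "background_fine": [19.0, 11.0, 7.0],
})

# Unpivot every column except ws_bin into 'variable'/'value' pairs.
long_df = wide.melt(id_vars="ws_bin")

# Split the former column labels into separate conditioning columns.
long_df[["site_type", "fraction"]] = long_df["variable"].str.split("_", expand=True)
print(long_df.head())
```

Each numerical value now sits in its own row alongside the site type and PM fraction it belongs to.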

This DataFrame can now be plotted. Plot titles are set automatically but can be specified, as shown in lines 3 to 6 of the following code:

Spatial and temporal variations in ozone levels

Ozone levels are generally lower in urban than in rural areas. The difference is shown by box plots summarising ozone levels for urban and rural background site types; the red line represents mean values.

The urban decrement is due to local scavenging of ozone by nitric oxide (NO) from motor vehicle exhaust. The complex relationship between ozone and nitrogen oxides (NOx) may underlie their widely differing seasonal trends:

Ozone levels are highly variable due to a large dependence on the weather. This can be seen in the large year-to-year variability in levels recorded at a rural and an urban background site:

Data from the background site Harwell shows a significant correlation between daily maximum temperatures and maximum ozone concentrations.

Plotting diurnal variations in temperature and ozone levels shows a high degree of similarity between them:

Any causal link with temperature is likely to only be partial, since temperature also relates to other variables such as atmospheric stability and sunlight. Since ozone formation is driven by UV radiation, episodes of high ozone levels typically occur in summer.

The following scatter plot shows the number of hours with ozone concentrations above 160 μg/m3 for rural background sites across the UK, as a function of distance along a north-westerly co-ordinate. It shows a trend of decreased occurrences with an increased north-westerly distance.

Data points are grouped by altitude, which is influential. Sites with an altitude of 100 to 200 metres show a significant negative correlation (R² = 0.96).

Analysis in Python

To calculate daily maximum values from hourly data points, the date column within the pandas DataFrame first needs to be converted into a datetime object and then set as the index. The pandas groupby() method is useful when analysing a pandas Series according to a certain category. In this case the Series of interest are ‘ozone’ and ‘temp’. These are grouped by day; the day values could be specified by passing a Series as an argument to groupby(), but here pandas.TimeGrouper generates them from the datetime index.

Different time periods could be specified, e.g. using ‘M’ for monthly data. Applying the max() method to the columns of interest returns Series containing daily maximum values, indexed by day.
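The daily grouping might look like the sketch below, which uses hypothetical values. Note that pandas.TimeGrouper was removed in pandas 1.0, so this example substitutes pd.Grouper(freq="D"), which performs the same time-based grouping in current versions.

```python
import pandas as pd

# Hypothetical hourly readings spanning two days.
raw = pd.DataFrame({
    "date": ["2015-07-01 09:00", "2015-07-01 15:00",
             "2015-07-02 09:00", "2015-07-02 15:00"],
    "ozone": [40.0, 95.0, 55.0, 120.0],
    "temp": [14.2, 22.6, 16.0, 25.9],
})

# Convert the date strings to datetimes and use them as the index.
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d %H:%M")
raw = raw.set_index("date")

# Group by calendar day; pd.Grouper stands in for pandas.TimeGrouper here.
daily = raw.groupby(pd.Grouper(freq="D"))
o3_max = daily["ozone"].max()
t_max = daily["temp"].max()
print(o3_max)
```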

To obtain max. ozone concentrations for each 1°C increment in daily max. temperature, a new DataFrame must first be created from these Series (after setting ‘Tmax’ values as integers). This allows the max. ozone values for each day to be grouped by daily max. temperature; using the max() method on the ‘O3max’ column returns a Series containing the max. ozone values for each daily max. temperature between 10 and 27°C. This data can then be plotted using a library such as matplotlib.
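As a rough sketch of this step, with made-up daily maxima standing in for the real Series:

```python
import pandas as pd

# Hypothetical daily maxima (stand-ins for the Series described above).
days = pd.date_range("2015-07-01", periods=5, freq="D")
o3_max = pd.Series([95.0, 120.0, 88.0, 101.0, 110.0], index=days)
t_max = pd.Series([22.6, 25.9, 22.1, 25.2, 24.4], index=days)

# Combine into one DataFrame, truncating Tmax to whole degrees.
daily_max = pd.DataFrame({"O3max": o3_max, "Tmax": t_max.astype(int)})

# Highest daily maximum ozone value observed at each whole-degree Tmax.
o3_by_temp = daily_max.groupby("Tmax")["O3max"].max()
print(o3_by_temp)
```

Converting with astype(int) truncates rather than rounds, so 25.9°C falls in the 25°C group; rounding first would be an equally reasonable choice.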

Alternatively, the agg() method could have been used, allowing a range of other statistical functions to be applied:
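A brief sketch of the agg() approach, again using hypothetical values and the ‘Tmax’ and ‘O3max’ column names from above:

```python
import pandas as pd

# Hypothetical daily maxima, with Tmax already converted to integers.
daily_max = pd.DataFrame({
    "Tmax": [22, 25, 22, 25, 24],
    "O3max": [95.0, 120.0, 88.0, 101.0, 110.0],
})

# Several statistics per temperature group in one call.
stats = daily_max.groupby("Tmax")["O3max"].agg(["max", "mean", "count"])
print(stats)
```

The result has one row per temperature and one column per statistic.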

Urban enhancement of PM2.5 levels

Monitoring sites located in urban areas show a general increment in PM2.5 levels over rural sites. Due to local emissions, urban traffic sites record the highest values. The following box plots, with red lines as mean values, summarise hourly PM2.5 measurements between 2011 and 2015 for 34 urban background sites and 18 urban traffic sites across the UK:

Harwell, Oxfordshire, was the only rural background site within England which provided PM2.5 data over the same time period. The mean value at this site was 10.8 μg/m3. The urban background mean was 12.3 μg/m3.

Harwell is within the south-east of England, where regional background levels are higher than the rest of the UK. Although the one mean value cannot be used to represent the regional background concentration, it still indicates that regional background levels are the dominant contributor to urban background levels. The mean value of the roadside sites was 14.3 µg/m3, an increment of 2 µg/m3 over the urban background.

Data for the pairs of urban traffic and urban background sites within close proximity of each other is summarised as follows:

The site pairs showing the largest differences between them were the Glasgow Centre and Glasgow Kerbside sites. These sites are situated close to each other, but the kerbside site is on a frequently-congested road with built-up surroundings forming a street canyon, whereas the background site is within a pedestrianised area with open surroundings. The large difference between the London sites is likely due to similar reasons.

Analysis in Python

Seaborn box plots show distributions with respect to categories. To use its functions, data must be presented in one of two forms. The first is as a list of vectors, as contained within the table structure obtained using pandas to read csv files downloaded from the DEFRA website. The pandas DataFrame obtained can be plotted using the boxplot function in Seaborn.

The other data structure that can be plotted is a 2D array, where one vector contains quantitative data and the other contains categorical data. This form was used in the first of the above two box plots, where data from the different monitoring sites was categorised as either urban background or urban traffic.

It involved reading csv files, one for each of the two different site types, and then extracting data into an N-dimensional array object in NumPy. A 1D array could then be created by the numpy.ravel() function and assigned as a column in a new DataFrame. A second column containing the categorical data could similarly be created from a NumPy array, using a list comprehension to create a list of values for the type of monitoring site.

After combining the two DataFrames using the pandas concat() function, the data is in the correct shape to create the box plots.

The ‘showmeans’ argument adds mean values and the ‘meanline’ argument draws the mean as a line instead of a cross. To help preserve a sensible scale and improve the clarity of the plot, ‘showfliers=False’ can be passed to remove outlying data points.
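The reshaping described above can be sketched as follows. The arrays, the `to_long` helper and the column names `pm25` and `site_type` are hypothetical, chosen only to illustrate the flatten, label and concatenate steps.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM2.5 values: rows are hours, columns are sites.
background = np.array([[10.0, 12.0], [11.0, 13.0]])
traffic = np.array([[14.0, 16.0, 15.0], [15.0, 17.0, 18.0]])

def to_long(values, site_type):
    """Flatten a 2D array and label every value with its site type."""
    flat = np.ravel(values)
    labels = [site_type for _ in flat]  # list comprehension for the category
    return pd.DataFrame({"pm25": flat, "site_type": labels})

# Stack the two labelled tables into one long DataFrame.
combined = pd.concat(
    [to_long(background, "Urban background"), to_long(traffic, "Urban traffic")],
    ignore_index=True,
)
print(combined)
```

The combined table has one quantitative column and one categorical column, the layout that can be passed to a box plot function with the category on one axis.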

Particulate matter pollution episodes in spring

Weather conditions underlie the high particulate matter (PM) episodes which can occur in the early spring months in the UK. Such episodes may arise from the build-up of local emissions due to poor dispersion, or from easterly airflows which import air pollution from Europe.

Except for Bonfire Night, the highest daily PM measurements at the various monitoring sites tend to occur in March and April. This is shown by the mean monthly PM2.5 levels for a rural background site, where PM levels reflect regional pollution sources:

A typical spring PM episode occurred in March 2014. The event can be visualised in a time series plot of data from sites such as Leeds, which recorded its highest daily PM2.5 levels in recent years:

Wind rose plots indicate that these high PM2.5 levels were largely a result of weather conditions involving moderate easterly winds:

This suggests that imported air pollution from continental Europe can make a significant contribution to UK air pollution levels.


Nitrogen oxide pollution episodes in winter

Levels of nitrogen oxides (NOx) vary widely across the different seasons and are highest in the winter months. Episodes of high NOx concentrations may arise during the winter when the ground is cold and winds are light, causing emissions to be trapped near the ground. This effect is shown by a 3D scatter plot:

In early to mid December 2013, many sites in London recorded some of their highest values of recent years. Time series plots of concentrations alongside wind speed and temperature also clearly show how poor dispersion conditions can underlie high NOx pollution events in winter:

Analysis in Python

As previously described, the csv files contain strings for missing values which need to be replaced and the datatypes changed to numeric. In order to plot the time series graph, the date and time values need to be converted to datetime values using the pandas to_datetime() function. This requires changing the values of 24:00 to 00:00 using the replace() method.

Looking at the rows for the midnight time points, the date for this hour reads one day behind. This is a problem when plotting time series and needs correcting, which is done by adding a timedelta of one day.

This date change can be applied specifically to the midnight time points by creating a subset of the data and then recombining it using concat(). Applying the interpolate() method then fills in NaN entries. The default method is linear interpolation, which draws a straight line across the missing data points, but other methods can be specified.
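The date correction and interpolation might be sketched as below. The column names `Date`, `Time` and `nox`, and the values, are hypothetical; the real csv layout may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the last hour of each day is written as 24:00 and
# carries that day's date; one NOx reading is missing.
df = pd.DataFrame({
    "Date": ["2013-12-10", "2013-12-10", "2013-12-11", "2013-12-11"],
    "Time": ["23:00", "24:00", "01:00", "02:00"],
    "nox": [310.0, np.nan, 290.0, 280.0],
})

# Remember the midnight rows, then make the time strings parseable.
is_midnight = df["Time"] == "24:00"
df["Time"] = df["Time"].replace("24:00", "00:00")
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%Y-%m-%d %H:%M")

# Shift only the midnight subset forward one day, then recombine.
midnight = df[is_midnight].copy()
midnight["datetime"] = midnight["datetime"] + pd.Timedelta(days=1)
fixed = pd.concat([df[~is_midnight], midnight]).sort_values("datetime")
fixed = fixed.set_index("datetime")

# Linear interpolation (the default) fills the NaN between its neighbours.
fixed["nox"] = fixed["nox"].interpolate()
print(fixed["nox"])
```

With the default linear method the missing values are treated as equally spaced; passing method="time" would instead weight by the datetime index.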

In this example, the time series data containing the pandas datetime format is plotted using the Bokeh library:

Seasonal variations in air pollution

Air pollution levels vary across the seasons due to changing weather conditions and emission levels. The general trends can be visualised by average measurements taken at a rural background monitoring site:

Levels of particulate matter and nitrogen oxides (NOx) tend to peak in winter and early spring. These trends reflect emissions from winter heating and the relatively poor dispersion conditions.

NOx levels are an indicator of urban air quality since they closely relate to local traffic emissions. Discerning how individual variables impact air quality can be difficult, due to seasonal variability etc. Segmenting NOx data into monthly subsets helps reveal the impact that weather has:

The plots show that all occurrences of high NOx levels are in winter months and are associated with lower temperatures and wind speeds. This is likely attributable to poor dispersion conditions.

Wind Direction and Particulate Air Pollution

Particulate matter (PM) levels are strongly influenced by wind direction. A clear trend can be shown by plotting this variable against PM concentrations, averaged according to their associated wind directions:
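One way of averaging PM concentrations by wind direction is sketched below. The data are invented and the choice of four 90-degree sectors is an assumption made for brevity; finer sectors would work in the same way.

```python
import pandas as pd

# Hypothetical hourly readings: wind direction in degrees and PM2.5.
obs = pd.DataFrame({
    "wd": [10, 80, 95, 100, 185, 260, 275, 350],
    "pm25": [9.0, 21.0, 24.0, 26.0, 10.0, 7.0, 8.0, 8.0],
})

# Map each direction to one of four 90-degree sectors centred on N, E, S, W.
sector_names = ["N", "E", "S", "W"]
sector_index = ((obs["wd"] + 45) % 360) // 90
obs["sector"] = sector_index.map(lambda i: sector_names[int(i)])

# Mean PM2.5 for each sector.
pm_by_sector = obs.groupby("sector")["pm25"].mean()
print(pm_by_sector)
```

Adding 45 degrees before dividing centres each sector on its compass point, so that 350° and 10° both count as northerly.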

The graph shows a marked increase in average PM levels when the wind is easterly. These findings are consistent with published literature which describes how long-range transport of PM2.5 from continental Europe can significantly contribute to PM levels and underlie many air pollution episodes.


Long term trends in air pollution

Time series plots can be considered as being the ‘bread and butter’ of air pollution analysis. They are useful for visualising long term trends in pollution levels, thereby enabling the evaluation of control measures and highlighting areas for improvement.

An example of this is in visualising changes in the increments in particulate matter (PM) concentrations over regional background levels at roadside and urban background sites in London:

This plot shows general declines in PM levels at Marylebone Road, where levels have long been problematic. This suggests that policies to reduce road traffic emissions at this site have had some degree of success.

Carbon monoxide (CO) is another air pollutant for which reductions in road traffic emissions have been achieved.

The UK’s national emissions inventory shows a large reduction in the contribution of road traffic to overall CO emissions, likely attributable to improvements in catalytic converters. This data is in accordance with a time series plot for Marylebone Road. The graph also shows concentrations of NOx, another air pollutant associated with road traffic:

It shows that similar reductions in NOx concentrations have not been achieved. Hourly measured NOx concentrations show the continued occurrence of excessively high levels:

Local industry and PM10 levels

PM10 emissions from certain industrial processes can lead to high pollution levels in nearby population areas. Plotting air pollutant concentrations against wind speed and direction can help attribute levels to nearby emission sources.

The town of Scunthorpe has a major steelworks located to the east. The coloured scatter plot of PM10 levels measured at Scunthorpe shows a cluster of high daily mean PM10 values with wind directions of between 90° and 135° (east to south-east), and wind speeds of 2-6 metres per second. Winds originating from the opposite direction are associated with low PM10 levels, especially at higher wind speeds.

PM2.5 / PM10 Ratios

PM2.5/PM10 ratios vary according to local emission sources. Port Talbot, which has a major steelworks, has among the lowest ratios of all monitoring sites across the country.

The site shows a relatively weak correlation (R value of 0.62) between PM2.5 and PM10 in comparison to a typical urban background site (R value of 0.97 for N. Kensington).

Segregating the Port Talbot data by month of year helps explain the weak correlation. The data points which have a characteristically low PM2.5/PM10 ratio reflect the influence of the steelworks on overall PM levels. However, in certain months with above-average background PM2.5 levels, background PM dominates over local emission sources, giving a large number of data points that follow a trendline typical of most monitoring sites.
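For reference, the ratio and the R value can be computed with pandas along the following lines. The numbers are invented and do not reproduce the Port Talbot or N. Kensington figures; corr() returns the Pearson correlation coefficient by default.

```python
import pandas as pd

# Hypothetical daily mean concentrations (µg/m3) at a single site.
daily = pd.DataFrame({
    "pm25": [8.0, 10.0, 12.0, 9.0, 11.0],
    "pm10": [14.0, 30.0, 20.0, 40.0, 18.0],
})

# PM2.5/PM10 ratio for each day.
daily["ratio"] = daily["pm25"] / daily["pm10"]

# Pearson correlation coefficient (R) between the two fractions.
r = daily["pm25"].corr(daily["pm10"])
print(daily["ratio"].round(2).tolist(), round(r, 2))
```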

Data points with high PM10 levels and low PM2.5/PM10 values are likely attributable to emissions by local industry, and plotting the data points with wind direction and wind speed can provide additional evidence. Such analysis could be used to monitor emissions by local industry.