As more and more common the high frequency (like daily) data becomes avaiable, statistical analysis of event(event statistics, e.g. the counts of specific extreme event) gets more and more popular than value statistics (e.g, mean and standard deviation). This blog shows how to (1) extract extreme events in time series, and (2) how to count events within a particular period.
1. Definition of events
Positive (negative) events are identified as those days at which the values are above (below) the thresholds. If the values of several consecutive days are above their corresponding shresholds, they are regarded as one single event, but with a duration of several days. Such thresholds can be quantile 90 (10) for extreme positive (negative) events, or mean plus (minus) one standard deviation for positive (negative) events. The quantile, mean and standard deviation should be calculated on each day of the year (so called climatological daily quantile, mean or std. can be easily implemented with cdo command: cdo -ydaypctl, -ydaymean, -ydaystd, python details see below). Thus, we would have 365 different thresholds. See Fig.1 for details.
Fig1.How to get threshold for an event. Here shows the procedure to get the threshold for each day of year with grand-ensemble (simulation data) data. For each day, the threshold is got from a data with size of (black bold line outlined data). For the observational data, because there is only one realisation, for each day, one can get the threshold from data of 7 days (window) ahead and after that day, and all the years, i.e., get the quantile or standard deviation from a data with size of . This can be easily implemented by DataArray.rolling function in python.
Fig2. Definition of events. Thresholds are calculated for each day of year. Here shows DJF only.
2. Extraction of events
We show here the procedure to extract positive (negative) events of NAO and EA index gotten from ERA5.
Consecutivly above or below the threshold. This is the most complicated part of the method. What we want to do here is to find the groups of data records in which their values are all above (below) the threshold. For example, we have a dataset below, we want the records inside the red square be identified as events.
Fig3. consecutive records. Source here.
Although xarray are powerful to process spatial-temporal data, we don't find a good method to extract extreme events. We solve the problem with pandas following a great tutorial here. xarray DataArray can be easily converted to pandas DataFrame by dataFrame=dataArray.to_dataframe().reset_index().
# pandas
defgetevents(df,column): ''' function to return events in series of df for one year.
**Arguments**
*df* A data frame, one of the column should be 'time'. *colum* based on values of which column to compare with the threshold.
**Return** A data frame of Events, columns is ['start_time','duration','sum','mean'] '''
# detect consecutive values > 0. details please see link above. G = df[df[column] > 0].groupby((df[column] <= 0).cumsum()) events = list()
# if no events identified if G.ngroups==0: Events = pd.DataFrame(columns= ['start_time','duration','sum','mean']) events.append(Events)
Our example time series is complicated in two aspects: data series are DJF only, which means data are not temporally continuous, and data in Dec of last year, and Jan and Feb in this year should be seen together as a independent time period, so we can not use groupby(year) directly, solution can be found below; There are two series in the data, which can be easily popularized to spatio-temporal data which contains hundreds of time series.
Extract events by year As stated before, our data is only for DJF, so we extract events sepertely for each year. In order to extract data properly, we offset the time series one month further with the function pd.DateOffset(months=1).
# xarray
deftimeprocess(ts): ts_time = ts.time
# roll forward 1 month to make the data into the same year ts_forward_time = pd.to_datetime(ts_time.values)+pd.DateOffset(months = 1) # +1 month ts_forward = ts.copy() ts_forward['time'] = ts_forward_time
# exclude the first and the last year where only two months of day avaiable. ts_out = ts_forward.sel(time = slice('1979','2020')) return ts_out
Multi time series Because the return from function getevents() is a DataFrame, we can not use the applyfunction of pandas directly, we just use the for loop and pd.concat(). So the function can be easily adapted to spatial data.
# xarray and pandas defevent_extract(ts,extype):
# *extype:posEx;negEx # extreme value extraction
# get threshold of one specific data from multi-years and 7 days before and after that day. ts_roll = ts.rolling(time = 15,center = True).construct(window_dim = 'win')