Calculating frame and series statistics
The Stats type contains functions for fast calculation of statistics over
series and frames, as well as over moving and expanding windows in a series.
The standard statistical functions available in the Stats type
are overloaded and can be applied to both data frames and series. More advanced
functionality is available only for series (but can be applied to frame columns
easily using the Frame.getNumericCols function).
Series and frame statistics
In this section, we look at calculating simple statistics over data frames and
series. An important aspect is the handling of missing values, so we demonstrate that
using a data set about air quality that contains missing values. The following
snippet loads AirQuality.csv and shows the values in the Ozone column:
| Keys | 0 | 1 | 2 | 3 | 4 | ... | 150 | 151 | 152 |
|---|---|---|---|---|---|---|---|---|---|
| Values | N/A | 36 | 12 | 18 | N/A | ... | 14 | 18 | 20 |
Series statistics
Given a series ozone, we can use a number of Stats functions to calculate
statistics. The following example creates a series (indexed by strings) that
stores the mean, extremes, and median of the input series:
| Keys | Mean | Max | Min | Median |
|---|---|---|---|---|
| Values | 42 | 168 | 1 | 31 |
To make the output simpler, we round the value of the mean (although the result is a floating point number). Note that the value is calculated from the available values in the series. All of the statistical functions skip over missing values in the input series.
As the above example demonstrates, Stats.max and Stats.min return option<float>
rather than just float. The result value is None when the series contains no values.
This makes it possible to use the functions not just on floating point numbers, but
also on series of integers and other types. Other statistical functions such as
Stats.mean return nan when no values are available.
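The behavior described above can be sketched in plain Python. This is not Deedle code: the list `ozone_head` (with `None` standing in for a missing value) and the helpers `present`, `mean`, and `maximum` are hypothetical names used only to illustrate the rules "skip missing values", "extremes return nothing on empty input", and "mean returns NaN on empty input".

```python
import math

# Hypothetical stand-in for a series: None marks a missing value.
ozone_head = [None, 36.0, 12.0, 18.0, None]

def present(values):
    """Return only the non-missing values."""
    return [v for v in values if v is not None]

def mean(values):
    """Mean over the available values; NaN when nothing is available."""
    vals = present(values)
    return sum(vals) / len(vals) if vals else math.nan

def maximum(values):
    """Largest available value; None when nothing is available."""
    vals = present(values)
    return max(vals) if vals else None

print(mean(ozone_head))       # 22.0
print(maximum(ozone_head))    # 36.0
print(maximum([None, None]))  # None
print(mean([None, None]))     # nan
```

Returning `None` from `maximum` mirrors the option result: it works for any comparable value type, whereas NaN only exists for floating point numbers.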
Frame statistics
Functions such as Stats.mean can be called on series, but also on entire data frames.
In that case, they calculate the statistics for each column of a data frame and return
Series<'C, float> where 'C is the column key of the original frame.
In the following snippet, we calculate the minimum, maximum, mean, and standard deviation
of all columns of the air data set and build a frame that shows each statistic (a series)
in its own column:
|  | Min | Max | Mean | +/- |
|---|---|---|---|---|
| Ozone | 1 | 168 | 42.14 | 33.13 |
| Solar.R | 7 | 334 | 185.93 | 90.06 |
| Wind | 1.7 | 20.7 | 9.96 | 3.52 |
| Temp | 56 | 97 | 77.88 | 9.47 |
| Month | 5 | 9 | 6.99 | 1.42 |
| Day | 1 | 31 | 15.8 | 8.86 |
Missing values are handled in the same way as when calculating statistics of a series and are skipped. If this is not desirable, you can use functions from the Series module for working with missing values to treat missing values in different ways.
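As a rough illustration of "one statistic per column", the following plain-Python sketch treats a frame as a dictionary of columns. It is not Deedle code; `frame`, `column_stats`, and the sample values are made up for the example, and the sample standard deviation (dividing by n-1) is an assumption about what the "+/-" column reports.

```python
import statistics

# Hypothetical two-column frame; None marks a missing value.
frame = {
    "Ozone": [None, 36.0, 12.0, 18.0],
    "Wind": [7.4, 8.0, 12.6, 11.5],
}

def column_stats(frame):
    """Compute Min, Max, Mean and sample standard deviation per column."""
    result = {}
    for name, column in frame.items():
        vals = [v for v in column if v is not None]  # skip missing values
        result[name] = {
            "Min": min(vals),
            "Max": max(vals),
            "Mean": statistics.fmean(vals),
            "+/-": statistics.stdev(vals),
        }
    return result

stats = column_stats(frame)
print(stats["Ozone"]["Mean"])            # 22.0
print(round(stats["Ozone"]["+/-"], 2))   # 12.49
```

The sketch assumes every column has at least two available values; `statistics.stdev` raises an error otherwise.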
The Stats type provides basic statistical functionality such as mean, standard
deviation, and variance, but also more advanced functions including skewness and kurtosis.
You can find a complete list in the Series statistics
and Frame statistics sections of the API reference.
Moving window statistics
The Stats type provides an efficient implementation of moving window statistics. The
implementation uses an online algorithm so that it does not have to re-calculate the
statistics for each window separately, but instead updates the value as it iterates over
the input (and so this is faster than using Series.window).
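The difference between recalculating every window and updating a running value can be sketched as follows. This is a generic illustration in plain Python, not the algorithm Deedle actually uses; `moving_sum_naive` and `moving_sum_online` are hypothetical names, and the input is assumed to contain no missing values.

```python
from collections import deque

def moving_sum_naive(values, size):
    """Recompute each complete window from scratch: O(n * size)."""
    return [sum(values[i - size + 1 : i + 1])
            for i in range(size - 1, len(values))]

def moving_sum_online(values, size):
    """Keep one running total and adjust it at each step: O(n)."""
    window = deque()
    total = 0.0
    out = []
    for v in values:
        window.append(v)
        total += v                      # value entering the window
        if len(window) > size:
            total -= window.popleft()   # value leaving the window
        if len(window) == size:
            out.append(total)
    return out

data = [36.0, 12.0, 18.0, 28.0, 23.0, 19.0]
print(moving_sum_online(data, 3))  # [66.0, 58.0, 69.0, 70.0]
```

Both functions return the same sums; the online version does a constant amount of work per element regardless of the window size.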
The moving window function names are prefixed with the word moving and calculate moving
statistics over a window of a fixed length. The following example calculates means over a
moving window of length 3:
| Keys | 0 | 1 | 2 | 3 | 4 | ... | 150 | 151 | 152 |
|---|---|---|---|---|---|---|---|---|---|
| Values | N/A | N/A | 24 | 22 | 15 | ... | 22 | 16 | 17.3333 |
The keys of the resulting series are the same as the keys of the input series. Statistical moving functions (count, sum, mean, variance, standard deviation, skewness and kurtosis) over a window of size n always mark the first n-1 values as missing (i.e. they only perform the calculation over complete windows). This explains why the value associated with the key 1 is N/A. For the key 2, the window covers the keys 0 to 2 and the mean is calculated from all available values in it, which is (36+12)/2 = 24.
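The boundary rule for the statistical moving functions can be reproduced with a small plain-Python sketch. It is not Deedle code: `moving_mean` is a hypothetical helper, `None` stands in for N/A, and returning `None` for a complete window that holds only missing values is an assumption made here.

```python
def moving_mean(values, size):
    """Mean over complete windows only; missing values inside a window are skipped."""
    out = []
    for i in range(len(values)):
        if i < size - 1:
            out.append(None)  # incomplete window at the start of the series
            continue
        window = [v for v in values[i - size + 1 : i + 1] if v is not None]
        out.append(sum(window) / len(window) if window else None)
    return out

# First five Ozone values from the table above.
ozone_head = [None, 36.0, 12.0, 18.0, None]
print(moving_mean(ozone_head, 3))  # [None, None, 24.0, 22.0, 15.0]
```

The result matches the first five values of the moving mean shown above: N/A, N/A, 24, 22, 15.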
The boundary behavior of the functions that calculate minimum and maximum over a moving window differs. Rather than returning N/A for the first n-1 values, they return the extreme value over a smaller window:
| Keys | 0 | 1 | 2 | 3 | 4 | ... | 150 | 151 | 152 |
|---|---|---|---|---|---|---|---|---|---|
| Values | N/A | 36 | 12 | 12 | 12 | ... | 14 | 14 | 14 |
Here, the first value is missing, because the one-element window containing just the first value contains only missing values. However, a value is available for the key 1, because the two-element window (starting from the beginning of the series) contains two elements, one of which (36) is available.
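For comparison with the complete-window rule, a moving minimum that accepts smaller windows at the start of the series can be sketched like this. Again this is plain Python rather than Deedle code, and `moving_min` is a hypothetical name.

```python
def moving_min(values, size):
    """Minimum over a window that is allowed to be shorter at the start."""
    out = []
    for i in range(len(values)):
        start = max(0, i - size + 1)  # shrink the window at the boundary
        window = [v for v in values[start : i + 1] if v is not None]
        out.append(min(window) if window else None)
    return out

# First five Ozone values from the table above.
ozone_head = [None, 36.0, 12.0, 18.0, None]
print(moving_min(ozone_head, 3))  # [None, 36.0, 12.0, 12.0, 12.0]
```

Only the first result is missing, because the one-element window at key 0 holds nothing but a missing value.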
Remarks
The windowing functions in the Stats type support efficient calculations over fixed-size
windows specified by the size of the window. They also provide one fixed boundary behavior.
If you need more complex windowing behavior (such as windows based on the distance between keys),
different handling of boundaries, or chunking (calculation over adjacent chunks), you can use
chunking and windowing functions from the Series module such as Series.windowSizeInto or
Series.chunkSizeInto. For more information, see the Grouping, windowing and
chunking section in the API reference.
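The distinction between windowing and chunking is easy to see on a small list. The following plain-Python sketch is only an illustration of the two ideas: `windows` and `chunks` are hypothetical helpers, not the Series functions named above, and keeping the incomplete trailing chunk is a choice made for this example.

```python
def windows(values, size):
    """Overlapping windows: each one starts one element after the previous."""
    return [values[i : i + size] for i in range(len(values) - size + 1)]

def chunks(values, size):
    """Adjacent, non-overlapping chunks; the last one may be shorter."""
    return [values[i : i + size] for i in range(0, len(values), size)]

data = [1, 2, 3, 4, 5]
print(windows(data, 2))  # [[1, 2], [2, 3], [3, 4], [4, 5]]
print(chunks(data, 2))   # [[1, 2], [3, 4], [5]]
```

Any aggregation (for example `sum` or `max`) can then be applied to each window or chunk.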
Expanding windows
An expanding window starts as a single-element window at the beginning
of a series and expands as it moves over the series. For time-series data ordered by time,
this gives you statistics calculated over all previously known observations.
In other words, the statistics are calculated for all values up to the current key and the
result is attached to the key at the end of the window. The expanding window functions are
prefixed with expanding.
The following example demonstrates how to calculate expanding mean and expanding standard deviation over the Ozone series. The resulting series has the same keys as the input series. Here, we align the two series using a frame, so that we can easily see the results aligned:
|  | Ozone | Mean | +/- |
|---|---|---|---|
| 0 | N/A | N/A | N/A |
| 1 | 36 | 36 | N/A |
| 2 | 12 | 24 | 16.97 |
| 3 | 18 | 22 | 12.49 |
| 4 | N/A | 22 | 12.49 |
| 5 | 28 | 23.5 | 10.63 |
| 6 | 23 | 23.4 | 9.21 |
| 7 | 19 | 22.67 | 8.43 |
| ... | ... | ... | ... |
| 149 | N/A | 42.8 | 33.32 |
| 150 | 14 | 42.55 | 33.28 |
| 151 | 18 | 42.33 | 33.21 |
| 152 | 20 | 42.14 | 33.13 |
As the example illustrates, expanding window statistics typically return a series that starts
with some missing values. Here, the first mean is missing (because the one-element window contains
no values) and the first two standard deviations are missing (stdDev is defined only for two
or more values). The only exception is expandingSum, because the sum of no elements is zero.
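The rules for where expanding statistics start can be checked with a short plain-Python sketch. It is not Deedle code and not an online algorithm (it recomputes over all values seen so far); `expanding` and its `min_count` parameter are hypothetical, and `None` stands in for N/A.

```python
import statistics

def expanding(values, stat, min_count):
    """Apply `stat` to all available values seen so far, at every position."""
    seen, out = [], []
    for v in values:
        if v is not None:  # missing values never enter the window
            seen.append(v)
        out.append(stat(seen) if len(seen) >= min_count else None)
    return out

# First six Ozone values from the table above.
ozone_head = [None, 36.0, 12.0, 18.0, None, 28.0]
means = expanding(ozone_head, statistics.fmean, 1)  # needs one value
devs = expanding(ozone_head, statistics.stdev, 2)   # needs two values
sums = expanding(ozone_head, sum, 0)                # defined for no values

print(means)  # [None, 36.0, 24.0, 22.0, 22.0, 23.5]
print([d if d is None else round(d, 2) for d in devs])
# [None, None, 16.97, 12.49, 12.49, 10.63]
print(sums)   # [0, 36.0, 48.0, 66.0, 66.0, 94.0]
```

The means and standard deviations agree with the first rows of the table above, and the sum is the only statistic that has a value at the first key.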
Multi-level indexed statistics
For a series with a multi-level (hierarchical) index, the functions prefixed with level provide
a way to apply a statistical operation on a single level of the index. A series with a multi-level
index can be created directly by using a tuple (such as 'K1 * 'K2) as the key, or it can
be produced by a grouping operation such as Frame.groupRowsBy.
For example, you can create a two-level index that represents time-series data with month as the first part of the key and day as the second part of the key. Then you can use multi-level statistical functions to calculate means (and other statistics) for each month separately.
The following example demonstrates the idea. The air data set contains data for each
day between May and September. We can create a frame with a two-level row key using
Frame.indexRowsUsing and returning a tuple as the index.
The type of the byMonth value is Frame<string * int, string>, meaning that the row index
has two levels. To make the output a little nicer, we use the GetMonthName function to
turn the first level of the index into a string representing the month name.
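The idea of a two-level row key can be illustrated in plain Python by keying rows with a (month name, day) tuple. This is only a sketch: `rows` and `by_month` are hypothetical, the Ozone value for June is made up, and `calendar.month_name` from the Python standard library stands in for GetMonthName.

```python
import calendar

# Hypothetical rows with numeric Month and Day columns; None marks a missing value.
rows = [
    {"Month": 5, "Day": 1, "Ozone": None},
    {"Month": 5, "Day": 2, "Ozone": 36.0},
    {"Month": 6, "Day": 1, "Ozone": 20.0},
]

# Re-index the rows by a (month name, day) tuple, i.e. a two-level key.
by_month = {(calendar.month_name[r["Month"]], r["Day"]): r for r in rows}

print(list(by_month))  # [('May', 1), ('May', 2), ('June', 1)]
```

The first element of each key plays the role of the first index level, which the level-based statistics group on.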
We can now access individual columns and calculate statistics over the
first level (individual months) using functions prefixed with level:
| Keys | May | June | July | August | September |
|---|---|---|---|---|---|
| Values | 22.92 | 29.4444 | 59.1154 | 59.9615 | 31.4483 |
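Conceptually, a level-based mean groups the values by one part of the key and averages each group while skipping missing values. The following plain-Python sketch shows that idea; `level_mean` and the sample `series` are hypothetical and the June values are made up, so the numbers do not correspond to the table above.

```python
def level_mean(series, level):
    """Group values by `level(key)` and average the available values per group."""
    groups = {}
    for key, value in series.items():
        bucket = groups.setdefault(level(key), [])
        if value is not None:  # skip missing values
            bucket.append(value)
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

# Hypothetical series keyed by (month, day); None marks a missing value.
series = {
    ("May", 1): None,
    ("May", 2): 36.0,
    ("May", 3): 12.0,
    ("June", 1): 20.0,
    ("June", 2): None,
    ("June", 3): 30.0,
}

print(level_mean(series, lambda key: key[0]))  # {'May': 24.0, 'June': 25.0}
```

Passing a function that picks the first tuple element selects the first index level; picking the second element would instead average over days across months.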
Currently, the Stats type does not include a function that would let you apply multi-level
statistical functions on entire data frames, but this can easily be implemented using the
Frame.getNumericCols function and Series.mapValues:
|  | May | June | July | August | September |
|---|---|---|---|---|---|
| Ozone | 22.92 | 29.4444 | 59.1154 | 59.9615 | 31.4483 |
| Solar.R | 181.2963 | 190.1667 | 216.4839 | 171.8571 | 167.4333 |
| Wind | 11.6226 | 10.2667 | 8.9419 | 8.7935 | 10.18 |
| Temp | 65.5484 | 79.1 | 83.9032 | 83.9677 | 76.9 |
If we used Frame.getNumericCols directly, we would also calculate the mean of the "Day" and
"Month" columns, which does not make much sense in this example. For that reason, the snippet
first calls sliceCols to get only relevant columns.
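The pattern "select the relevant columns, then apply a level-based statistic to each column" can be sketched in plain Python as follows. This is an illustration only, not Deedle code: `frame`, `relevant`, and `level_mean` are hypothetical, and the values are made up.

```python
def level_mean(series, level):
    """Group values by `level(key)` and average the available values per group."""
    groups = {}
    for key, value in series.items():
        bucket = groups.setdefault(level(key), [])
        if value is not None:  # skip missing values
            bucket.append(value)
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

# Hypothetical frame: column name -> series keyed by (month, day).
frame = {
    "Ozone": {("May", 1): None, ("May", 2): 36.0, ("June", 1): 20.0},
    "Temp": {("May", 1): 67.0, ("May", 2): 72.0, ("June", 1): 78.0},
    "Day": {("May", 1): 1.0, ("May", 2): 2.0, ("June", 1): 1.0},
}

relevant = ["Ozone", "Temp"]  # drop "Day", whose mean is not meaningful
monthly = {name: level_mean(frame[name], lambda key: key[0]) for name in relevant}

print(monthly)
# {'Ozone': {'May': 36.0, 'June': 20.0}, 'Temp': {'May': 69.5, 'June': 78.0}}
```

Each column becomes one row of monthly means, which corresponds to the layout of the table above.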