Data frame features
In this section, we look at various features of the F# data frame library (using both
Series
and Frame
types and modules). Feel free to jump to the section you are interested
in, but note that some sections refer back to values built in "Creating & loading".
You can also get this page as an F# script file from GitHub and run the samples interactively.
Creating frames & loading data
Loading and saving CSV files
The easiest way to get data into data frame is to use a CSV file. The Frame.ReadCsv
function exposes this functionality:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: |
// Assuming 'root' is a directory containing the file let titanic = Frame.ReadCsv(root + "Titanic.csv") // Read data and set the index column & order rows let msft = Frame.ReadCsv(root + "msft.csv") |> Frame.indexRowsDate "Date" |> Frame.orderRows // Specify column separator let air = Frame.ReadCsv(root + "AirQuality.csv", separators=";") |
The ReadCsv
method has a number of optional arguments that you can use to control
the loading. It supports both CSV files, TSV files and other formats. If the file name
ends with tsv
, the Tab is used automatically, but you can set separator
explicitly.
The following parameters can be used:
path
- Specifies a file name or an web location of the resource.inferTypes
- Specifies whether the method should attempt to infer types of columns automatically (set this tofalse
if you want to specify schema)inferRows
- IfinferTypes=true
, this parameter specifies the number of rows to use for type inference. The default value is 0, meaninig all rows.schema
- A string that specifies CSV schema. See the documentation for information about the schema format.separators
- A string that specifies one or more (single character) separators that are used to separate columns in the CSV file. Use for example";"
to parse semicolon separated files.culture
- Specifies the name of the culture that is used when parsing values in the CSV file (such as"en-US"
). The default is invariant culture.
The parameters are the same as those used by the CSV type provider in F# Data, so you can find additional documentation there.
Once you have a data frame, you can also save it to a CSV file using the
SaveCsv
method. For example:
1: 2: 3: 4: |
// Save CSV with semicolon separator air.SaveCsv(Path.GetTempFileName(), separator=';') // Save as CSV and include row key as "Date" column msft.SaveCsv(Path.GetTempFileName(), keyNames=["Date"], separator='\t') |
By default, the SaveCsv
method does not include the key from the data frame. This can be
overriden by calling SaveCsv
with the optional argument includeRowKeys=true
, or with an
additional argument keyNames
(demonstrated above) which sets the headers for the key columns(s)
in the CSV file. Usually, there is just a single row key, but there may be multiple when
hierarchical indexing is used.
Loading F# records or .NET objects
If you have another .NET or F# components that returns data as a sequence of F# records,
C# anonymous types or other .NET objects, you can use Frame.ofRecords
to turn them
into a data frame. Assume we have:
1: 2: 3: 4: 5: 6: 7: 8: |
type Person = { Name:string; Age:int; Countries:string list; } let peopleRecds = [ { Name = "Joe"; Age = 51; Countries = [ "UK"; "US"; "UK"] } { Name = "Tomas"; Age = 28; Countries = [ "CZ"; "UK"; "US"; "CZ" ] } { Name = "Eve"; Age = 2; Countries = [ "FR" ] } { Name = "Suzanne"; Age = 15; Countries = [ "US" ] } ] |
Now we can easily create a data frame that contains three columns
(Name
, Age
and Countries
) containing data of the same type as
the properties of Person
:
1: 2: 3: 4: |
// Turn the list of records into data frame let peopleList = Frame.ofRecords peopleRecds // Use the 'Name' column as a key (of type string) let people = peopleList |> Frame.indexRowsString "Name" |
Note that this does not perform any conversion on the column data. Numerical series
can be accessed using the ?
operator. For other types, we need to explicitly call
GetSeries
with the right type arguments:
1: 2: |
people?Age people.GetSeries<string list>("Countries") |
F# Data providers
In general, you can use any data source that exposes data as series of tuples. This means that we can easily load data using, for example, the World Bank type provider from F# Data library.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: |
// Connect to the World Bank let wb = WorldBankData.GetDataContext() /// Given a region, load GDP in current US$ and return data as /// a frame with two-level column key (region and country name) let loadRegion (region:WorldBankData.ServiceTypes.Region) = [ for country in region.Countries -> // Create two-level column key using tuple (region.Name, country.Name) => // Create series from tuples returned by WorldBank Series.ofObservations country.Indicators.``GDP (current US$)`` ] |> frame |
To make data manipulation more convenient, we read country information per region and create data frame with a hierarchical index (for more information, see the advanced indexing section). Now we can easily read data for OECD and Euro area:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: |
// Load Euro and OECD regions let eu = loadRegion wb.Regions.``Euro area`` let oecd = loadRegion wb.Regions.``OECD members`` // Join and convert to in billions of USD let world = eu.Join(oecd) / 1e9 val world : Frame<int,(string * string)> = Euro area OECD members Austria Estonia (...) Canada Chile (...) 1960 -> 6.592 <missing> 41.093 4.2117 1961 -> 7.311 <missing> 40.767 4.7053 : ... 2011 -> 417.6 22.15 1777.7 250.99 2012 -> 399.6 21.85 1821.4 268.18 |
The loaded data look something like the sample above. As you can see, the columns are grouped by the region and some data are not available.
Expanding objects in columns
It is possible to create data frames that contain other .NET objects as members in a series. This might be useful, for example, when you get multiple data sources producing objects and you want to align or join them before working with them. However, working with frames that contain complex .NET objects is less conveninet.
For this reason, the data frame supports expansion. Given a data frame with some object
in a column, you can use Frame.expandCols
to create a new frame that contains properties
of the object as new columns. For example:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: |
// Create frame with single column 'People' let peopleNested = [ "People" => Series.ofValues peopleRecds ] |> frame // Expand the 'People' column peopleNested |> Frame.expandCols ["People"] val it : Frame<int,string> = People.Name People.Age People.Countries 0 -> Joe 51 [UK; US; UK] 1 -> Tomas 28 [CZ; UK; US; ... ] 2 -> Eve 2 [FR] 3 -> Suzanne 15 [US] |
As you can see, the operation generates columns based on the properties of the original column type and generates new names by prefixing the property names with the name of the original column.
Aside from properties of .NET objects, the expansion can also handle values of type
IDictionary<K, V>
and series that contain nested series with string
keys
(i.e. Series<string, T>
). If you have more complex structure, you can use
Frame.expandAllCols
to expand columns to a specified level recursively:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Series that contains dictionaries, containing tuples let tuples = [ dict ["A", box 1; "C", box (2, 3)] dict ["B", box 1; "C", box (3, 4)] ] |> Series.ofValues // Expand dictionary keys (level 1) and tuple items (level 2) frame ["Tuples" => tuples] |> Frame.expandAllCols 2 |
Here, the resulting data frame will have 4 columns including
Tuples.A
and Tuples.B
(for the first keys) and Tuples.C.Item1
together with Tuples.C.Item2
representing the two items of the tuple
nested in a dictionary.
Manipulating data frames
The series type Series<K, V>
represents a series with keys of type K
and values
of type V
. This means that when working with series, the type of values is known
statically. When working with data frames, this is not the case - a frame is represented
as Frame<R, C>
where R
and C
are the types of row and column indices, respectively
(typically, R
will be an int
or DateTime
and C
will be string
representing
different column/series names.
A frame can contain heterogeneous data. One column may contain integers, another may contain floating point values and yet another can contain strings, dates or other objects like lists of strings. This information is not captured statically - and so when working with frames, you may need to specify the type explicitly, for example, when reading a series from a frame.
Getting data from a frame
We'll use the data frame people
which contains three columns - Name
of type string
,
Age
of type int
and Countries
of type string list
(we created it from F# records
in the previous section):
1: 2: 3: 4: 5: |
Name Age Countries Joe -> Joe 51 [UK; US; UK] Tomas -> Tomas 28 [CZ; UK; US; ... ] Eve -> Eve 2 [FR] Suzanne -> Suzanne 15 [US] |
To get a column (series) from a frame df
, you can use operations that are exposed directly
by the data frame, or you can use df.Columns
which returns all columns of the frame as a
series of series.
1: 2: 3: 4: 5: 6: 7: |
// Get the 'Age' column as a series of 'float' values // (the '?' operator converts values automatically) people?Age // Get the 'Name' column as a series of 'string' values people.GetSeries<string>("Name") // Get all frame columns as a series of series people.Columns |
A series s
of type Series<string, V>
supports the question mark operator s?Foo
to get
a value of type V
associated with the key Foo
. For other key types, you can sue the Get
method. Note that, unlike with frames, there is no implicit conversion:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Get Series<string, float> let numAges = people?Age // Get value using question mark numAges?Tomas // Get value using 'Get' method numAges.Get("Tomas") // Returns missing when key is not found numAges.TryGet("Fridrich") |
The question mark operator and Get
method can be used on the Columns
property of data frame.
The return type of df?Columns
is ColumnSeries<string, string>
which is just a thin wrapper
over Series<C, ObjectSeries<R>>
. This means that you get back a series indexed by column names
where the values are ObjectSeries<R>
representing individual columns. The type
ObjectSeries<R>
is a thin wrapper over Series<R, obj>
which adds several functions
for getting the values as values of specified type.
In our case, the returned values are individual columns represented as ObjectSeries<string>
:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: |
// Get column as an object series people.Columns?Age people.Columns?Countries val it : ObjectSeries<string> = Joe -> [UK; US; UK] Tomas -> [CZ; UK; US; ... ] Eve -> [FR] Suzanne -> [US] // Get column & try get column using members people.Columns.Get("Name") people.Columns.TryGet("CreditCard") // Get column at a specified offset people.Columns.GetAt(0) // Get column as object series and convert it // to a typed Series<string, string> people.Columns?Name.As<string>() // Try converting column to Series<string, int> people.Columns?Name.TryAs<int>() |
The type ObjectSeries<string>
has a few methods in addition to ordinary Series<K, V>
type.
On the lines 18 and 20, we use As<T>
and TryAs<T>
that can be used to convert object series
to a series with statically known type of values. The expression on line 18 is equivalent to
people.GetSeries<string>("Name")
, but it is not specific to frame columns - you can use the
same approach to work with frame rows (using people.Rows
) if your data set has rows of
homogeneous types.
Another case where you'll need to work with ObjectSeries<T>
is when mapping over rows:
1: 2: 3: |
// Iterate over rows and get the length of country list people.Rows |> Series.mapValues (fun row -> row.GetAs<string list>("Countries").Length) |
The rows that you get as a result of people.Rows
are heterogeneous (they contain values
of different types), so we cannot use row.As<T>()
to convert all values of the series
to some type. Instead, we use GetAs<T>(...)
which is similar to Get(...)
but converts
the value to a given type. You could also achieve the same thing by writing row?Countries
and then casting the result to string list
, but the GetAs
method provides a more convenient
syntax.
Adding rows and columns
The series type is immutable and so it is not possible to add new values to a series or
change the values stored in an existing series. However, you can use operations that return
a new series as the result such as Append
.
1: 2: 3: 4: |
// Create series with more value let more = series [ "John" => 48.0 ] // Create a new, concatenated series people?Age.Append(more) |
Data frame allows a very limited form of mutation. It is possible to add new series (as a column) to an existing data frame, drop a series or replace a series. However, individual series are still immutable.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: |
// Calculate age + 1 for all people let add1 = people?Age |> Series.mapValues ((+) 1.0) // Add as a new series to the frame people?AgePlusOne <- add1 // Add new series from a list of values people?Siblings <- [0; 2; 1; 3] // Replace existing series with new values // (Equivalent to people?Siblings <- ...) people.ReplaceSeries("Siblings", [3; 2; 1; 0]) |
Finally, it is also possible to append one data frame or a single row to an existing data
frame. The operation is immutable, so the result is a new data frame with the added
rows. To create a new row for the data frame, we can use standard ways of constructing
series from key-value pairs, or we can use the SeriesBuilder
type:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: |
// Create new object series with values for required columns let newRow = [ "Name" => box "Jim"; "Age" => box 51; "Countries" => box ["US"]; "Siblings" => box 5 ] |> series // Create a new data frame, containing the new series people.Append("Jim", newRow) // Another option is to use mutable SeriesBuilder let otherRow = SeriesBuilder<string>() otherRow?Name <- "Jim" otherRow?Age <- 51 otherRow?Countries <- ["US"] otherRow?Siblings <- 5 // The Series property returns the built series people.Append("Jim", otherRow.Series) |
Advanced slicing and lookup
Given a series, we have a number of options for getting one or more values or observations (keys and an associated values) from the series. First, let's look at different lookup operations that are available on any (even unordered series).
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Sample series with different keys & values let nums = series [ 1 => 10.0; 2 => 20.0 ] let strs = series [ "en" => "Hi"; "cz" => "Ahoj" ] // Lookup values using keys nums.[1] strs.["en"] // Supported when key is string strs?en |
For more examples, we use the Age
column from earlier data set as example:
1: 2: 3: 4: 5: 6: 7: |
// Get an unordered sample series let ages = people?Age // Returns value for a given key ages.["Tomas"] // Returns series with two keys from the source ages.[ ["Tomas"; "Joe"] ] |
The Series
module provides another set of useful functions (many of those
are also available as members, for example via ages.TryGet
):
1: 2: 3: 4: 5: 6: 7: |
// Fails when key is not present ages |> Series.get "John" // Returns 'None' when key is not present ages |> Series.tryGet "John" // Returns series with missing value for 'John' // (equivalent to 'ages.[ ["Tomas"; "John"] ]') ages |> Series.getAll [ "Tomas"; "John" ] |
We can also obtain all data from the series. The data frame library uses the term observations for all key-value pairs
1: 2: 3: 4: 5: 6: |
// Get all observations as a sequence of 'KeyValuePair' ages.Observations // Get all observations as a sequence of tuples ages |> Series.observations // Get all observations, with 'None' for missing values ages |> Series.observationsAll |
The previous examples were always looking for an exact key. If we have an ordered series, we can search for a nearest available key and we can also perform slicing. We use MSFT stock prices from earlier example:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Get series with opening prices let opens = msft?Open // Fails. The key is not available in the series opens.[DateTime(2013, 1, 1)] // Works. Find value for the nearest greater key opens.Get(DateTime(2013, 1, 1), Lookup.NearestGreater) // Works. Find value for the nearest smaler key opens.Get(DateTime(2013, 1, 1), Lookup.NearestSmaller) |
When using instance members, we can use Get
which has an overload taking
Lookup
. The same functionality is exposed using Series.lookup
. We can
also obtain values for a sequence of keys:
1: 2: 3: 4: 5: 6: |
// Find value for the nearest greater key opens |> Series.lookup (DateTime(2013, 1, 1)) Lookup.NearestGreater // Get first price for each month in 2012 let dates = [ for m in 1 .. 12 -> DateTime(2012, m, 1) ] opens |> Series.lookupAll dates Lookup.NearestGreater |
With ordered series, we can use slicing to get a sub-range of a series:
1: 2: 3: 4: 5: 6: 7: |
opens.[DateTime(2013, 1, 1) .. DateTime(2013, 1, 31)] val it : Series<DateTime,float> = 1/2/2013 -> 27.25 1/3/2013 -> 27.63 ... 1/30/2013 -> 28.01 1/31/2013 -> 27.79 |
The slicing works even if the keys are not available in the series. The lookup automatically uses nearest greater lower bound and nearest smaller upper bound (here, we have no value for January 1).
Several other options - discussed in a later section - are available when using hierarchical (or multi-level) indices. But first, we need to look at grouping.
Grouping data
Grouping of data can be performed on both unordered and ordered series and frames. For ordered series, more options (such as floating window or grouping of consecutive elements) are available - these can be found in the time series tutorial. There are essentially two options:
You can group series of any values and get a series of series (representing individual groups). The result can easily be turned into a data frame using
Frame.ofColumns
orFrame.ofRows
, but this is not done automatically.You can group a frame rows using values in a specified column, or using a function. The result is a frame with multi-level (hierarchical) index. Hierarchical indexing is discussed later.
Keep in mind that you can easily get a series of rows or a series of columns from a frame
using df.Rows
and df.Columns
, so the first option is also useful on data frames.
Grouping series
In the following sample, we use the data frame people
loaded from F# records in
an earlier section. Let's first get the data:
1: 2: 3: 4: 5: 6: |
let travels = people.GetSeries<string list>("Countries") val travels : Series<string,string list> = Joe -> [UK; US; UK] Tomas -> [CZ; UK; US; ... ] Eve -> [FR] Suzanne -> [US] |
Now we can group the elements using both key (e.g. length of a name) and using the value (e.g. the number of visited countries):
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Group by name length (ignoring visited countries) travels |> Series.groupBy (fun k v -> k.Length) // Group by visited countries (people visited/not visited US) travels |> Series.groupBy (fun k v -> List.exists ((=) "US") v) // Group by name length and get number of values in each group travels |> Series.groupInto (fun k v -> k.Length) (fun len people -> Series.countKeys people) |
The groupBy
function returns a series of series (series with new keys, containing
series with all values for a given new key). You can than transform the values using
Series.mapValues
. However, if you want to avoid allocating all intermediate series,
you can also use Series.groupInto
which takes projection function as a second argument.
In the above examples, we count the number of keys in each group.
As a final example, let's say that we want to build a data frame that contains individual people (as rows), all countries that appear in someone's travel list (as columns). The frame contains the number of visits to each country by each person:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: |
travels |> Series.mapValues (Seq.countBy id >> series) |> Frame.ofRows |> Frame.fillMissingWith 0 val it : Frame<string,string> = CZ FR UK US Joe -> 0 0 2 1 Tomas -> 2 0 1 1 Eve -> 0 1 0 0 Suzanne -> 0 0 0 1 |
The problem can be solved just using Series.mapValues
, together with standard F#
Seq
functions. We iterate over all rows (people and their countries). For each
country list, we generate a series that contains individual countries and the count
of visits (this is done by composing Seq.countBy
and a function series
to build
a series of observations). Then we turn the result to a data frame and fill missing
values with the constant zero (see a section about handling missing values).
Grouping data frames
So far, we worked with series and series of series (which can be turned into data frames
using Frame.ofRows
and Frame.ofColumns
). Next, we look at working with data frames.
Assume we loaded Titanic data set that is also used on the project home page. First, let's look at basic grouping (also used in the home page demo):
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: |
// Group using column 'Sex' of type 'string' titanic |> Frame.groupRowsByString "Sex" // Grouping using column converted to 'decimal' let byDecimal : Frame<decimal * _, _> = titanic |> Frame.groupRowsBy "Fare" // This is easier using member syntax titanic.GroupRowsBy<decimal>("Fare") // Group using calculated value - length of name titanic |> Frame.groupRowsUsing (fun k row -> row.GetAs<string>("Name").Length) |
When working with frames, you can group data using both rows and columns. For most
functions there is groupRows
and groupCols
equivalent.
The easiest functions to use are Frame.groupRowsByXyz
where Xyz
specifies the
type of the column that we're using for grouping. For example, we can easily group
rows using the "Sex" column.
When using less common type, you need to specify the type of the column. You can
see this on lines 5 and 9 where we use decimal
as the key. Finally, you can also
specify key selector as a function. The function gets the original key and the row
as a value of ObjectSeries<K>
. The type has various members for getting individual
values (columns) such as GetAs
which allows us to get a column of a specified type.
Grouping by single key
A grouped data frame uses multi-level index. This means that the index is a tuple of keys that represent multiple levels. For example:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
titanic |> Frame.groupRowsByString "Sex" val it : Frame<(string * int),string> = Survive Name female 2 -> True Heikkinen, Miss. Laina 11 -> True Bonnell, Miss. Elizabeth 19 -> True Masselmani, Mrs. Fatima ... ... male 870 -> False Balkic, Mr. Cerin 878 -> False Laleff, Mr. Kristo |
As you can see, the pretty printer understands multi-level indices and
outputs the first level (sex) followed by the second level (passanger id).
You can turn frame with two-level index into a series of data frames
(and vice versa) using Frame.unnest
and Frame.nest
:
1: 2: 3: 4: 5: |
let bySex = titanic |> Frame.groupRowsByString "Sex" // Returns series with two frames as values let bySex1 = bySex |> Frame.nest // Converts unstacked data back to a single frame let bySex2 = bySex |> Frame.nest |> Frame.unnest |
Grouping by multiple keys
Finally, we can also apply grouping operation repeatedly to group data using multiple keys (and get a frame indexed by more than 2 levels). For example, we can group passangers by their class and port where they embarked:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Group by passanger class and port let byClassAndPort = titanic |> Frame.groupRowsByInt "Pclass" |> Frame.groupRowsByString "Embarked" |> Frame.mapRowKeys Pair.flatten3 // Get just the Age series with the same row index let ageByClassAndPort = byClassAndPort?Age |
If you look at the type of byClassAndPort
, you can see that it is
Frame<(string * int * int),string>
. The row key is a tripple consisting
of port identifier (string), passanger class (int between 1 and 3) and the
passanger id. The multi-level indexing is preserved when we get a single
series from the frame.
As our last example, we look at various ways of aggregating the groups:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: |
// Get average ages in each group byClassAndPort?Age |> Series.meanLevel Pair.get1And2Of3 // Averages for all numeric columns byClassAndPort |> Frame.meanLevel Pair.get1And2Of3 // Count number of survivors in each group byClassAndPort.GetSeries<bool>("Survived") |> Series.applyLevel Pair.get1And2Of3 (Series.values >> Seq.countBy id >> series) |> Frame.ofRows |
The last snippet is more interesting. We get the "Survived" column (which
contains Boolean values) and we aggregate each group using a specified function.
The function is composed from three components - it first gets the values in the
group, counts them (to get a number of true
and false
values) and then creates
a series with the results. The result looks as the following table (some values
were omitted):
1: 2: 3: 4: 5: 6: 7: |
True False C 1 -> 59 26 2 -> 9 8 3 -> 25 41 S 1 -> 74 53 2 -> 76 88 3 -> 67 286 |
Hierarchical indexing
For some data sets, the index is not a simple sequence of keys, but instead a more complex hierarchy. This can be captured using hierarchical indices. They also provide a convenient way of dealing with multi-dimensional data. The most common source of multi-level indices is grouping (the previous section has a number of examples).
Lookup in the World Bank data set
In this section, we start by looking at the World Bank data set from earlier. It is a data frame with two-level hierarchy of columns, where the first level is the name of region and the second level is the name of country.
Basic lookup can be performed using slicing operators. The following are only available in F# 3.1:
1: 2: 3: 4: 5: 6: |
// Get all countries in Euro area world.Columns.["Euro area", *] // Get Belgium data from Euro area group world.Columns.[("Euro area", "Belgium")] // Belgium is returned twice - from both Euro and OECD world.Columns.[*, "Belgium"] |
In F# 3.0, you can use a family of helper functions LookupXOfY
as follows:
1: 2: 3: 4: |
// Get all countries in Euro area world.Columns.[Lookup1Of2 "Euro area"] // Belgium is returned twice - from both Euro and OECD world.Columns.[Lookup2Of2 "Belgium"] |
The lookup operations always return data frame of the same type as the original frame. This means that even if you select one sub-group, you get back a frame with the same multi-level hierarchy of keys. This can be easily changed using projection on keys:
1: 2: 3: 4: |
// Drop the first level of keys (and get just countries) let euro = world.Columns.["Euro area", *] |> Frame.mapColKeys snd |
Grouping and aggregating World Bank data
Hierarchical keys are often created as a result of grouping. For example, we can group the rows (representing individual years) in the Euro zone data set by decades (for more information about grouping see also grouping section in this document).
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: |
let decades = euro |> Frame.groupRowsUsing (fun k _ -> sprintf "%d0s" (k / 10)) val decades : Frame<(string * int),string> = Austria Estonia ... 1960s 1960 -> 6.592 <missing> 1961 -> 7.311 <missing> ... 2010s 2010 -> 376.8 18.84 2011 -> 417.6 22.15 2012 -> 399.6 21.85 |
Now that we have a data frame with hierarchical index, we can select data in a single group, such as 1990s. The result is a data frame of the same type. We can also multiply the values, to get original GDP in USD (rather than billions):
1:
|
decades.Rows.["1990s", *] * 1e9 |
The Frame
and Series
modules provide a number of functions for aggregating the
groups. We can access a specific country and aggregate GDP for a country, or we can
apply aggregation to the entire data set:
1: 2: 3: 4: 5: 6: 7: 8: |
// Calculate means per decades for Slovakia decades?``Slovak Republic`` |> Series.meanLevel fst // Calculate means per decateds for all countries decades |> Frame.meanLevel fst // Calculate means per decades in USD decades * 1.0e9 |> Frame.meanLevel fst |
So far, we were working with data frames that only had one hierarchical index. However, it is perfectly possible to have hierarchical index for both rows and columns. The following snippet groups countries by their average GDP (in addition to grouping rows by decades):
1: 2: 3: |
// Group countries by comparing average GDP with $500bn let byGDP = decades |> Frame.groupColsUsing (fun k v -> v.As<float>() |> Series.mean > 500.0) |
You can see (by hovering over byGDP
) that the two hierarchies are captured in the type.
The column key is bool * string
(rich? and name) and the row key is string * int
(decade, year). This creates two groups of columns. One containing France, Germany and
Italy and the other containing remaining countries.
The aggregations are only (directly) supported on rows, but we can use Frame.transpose
to switch between rows and columns. The following calculates the mean for each group
of countries and for each decade:
1: 2: 3: 4: 5: 6: 7: 8: 9: |
// Mean by decade and then mean by country group byGDP |> Frame.meanLevel fst |> Frame.transpose |> Frame.meanLevel fst val it : Frame<bool,string> = 1960s 1970s ... 2010s False -> 10.45 34.37 306.89 True -> 82.74 335.71 2719.26 |
Handling missing values
THe support for missing values is built-in, which means that any series or frame can
contain missing values. When constructing series or frames from data, certain values
are automatically treated as "missing values". This includes Double.NaN
, null
values
for reference types and for nullable types:
1: 2: 3: 4: 5: 6: 7: 8: |
Series.ofValues [ Double.NaN; 1.0; 3.14 ] val it : Series<int,float> = 0 -> <missing> 1 -> 1 2 -> 3.14 let nulls = [ Nullable(1); Nullable(); Nullable(3) ] |> Series.ofValues |
Missing values are automatically skipped when performing statistical computations such
as Series.mean
. They are also ignored by projections and filtering, including
Series.mapValues
. When you want to handle missing values, you can use Series.mapAll
that gets the value as option<T>
(we use sample data set from earlier section):
1: 2: 3: 4: 5: 6: |
// Get column with missing values let ozone = air?Ozone // Replace missing values with zeros ozone |> Series.mapAll (fun k v -> match v with None -> Some 0.0 | v -> v) |
In practice, you will not need to use Series.mapAll
very often, because the
series module provides functions that fill missing values more easily:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: |
// Fill missing values with constant ozone |> Series.fillMissingWith 0.0 // Available values are copied in backward // direction to fill missing values ozone |> Series.fillMissing Direction.Backward // Available values are propagated forward // (if the first value is missing, it is not filled!) ozone |> Series.fillMissing Direction.Forward // Fill values and drop those that could not be filled ozone |> Series.fillMissing Direction.Forward |> Series.dropMissing |
Various other strategies for handling missing values are not currently directly
supported by the library, but can be easily added using Series.fillMissingUsing
.
It takes a function and calls it on all missing values. If we have an interpolation
function, then we can pass it to fillMissingUsing
and perform any interpolation
needed.
For example, the following snippet gets the previous and next values and averages them (if they are available) or returns one of them (or zero if there are no values at all):
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: |
// Fill missing values using interpolation function ozone |> Series.fillMissingUsing (fun k -> // Get previous and next values let prev = ozone.TryGet(k, Lookup.NearestSmaller) let next = ozone.TryGet(k, Lookup.NearestGreater) // Pattern match to check which values were available match prev, next with | OptionalValue.Present(p), OptionalValue.Present(n) -> (p + n) / 2.0 | OptionalValue.Present(v), _ | _, OptionalValue.Present(v) -> v | _ -> 0.0) |
Full name: Features.titanic
module Frame
from Deedle
--------------------
type Frame =
static member CreateEmpty : unit -> Frame<'R,'C> (requires equality and equality)
static member FromColumns : cols:Series<'TColKey,Series<'TRowKey,'V>> -> Frame<'TRowKey,'TColKey> (requires equality and equality)
static member FromColumns : cols:Series<'TColKey,ObjectSeries<'TRowKey>> -> Frame<'TRowKey,'TColKey> (requires equality and equality)
static member FromColumns : columns:seq<KeyValuePair<'ColKey,ObjectSeries<'RowKey>>> -> Frame<'RowKey,'ColKey> (requires equality and equality)
static member FromColumns : columns:seq<KeyValuePair<'ColKey,Series<'RowKey,'V>>> -> Frame<'RowKey,'ColKey> (requires equality and equality)
static member FromColumns : rows:seq<Series<'ColKey,'V>> -> Frame<'ColKey,int> (requires equality)
static member FromRecords : values:seq<'T> -> Frame<int,string>
static member FromRecords : series:Series<'K,'R> -> Frame<'K,string> (requires equality)
static member FromRowKeys : keys:seq<'K> -> Frame<'K,string> (requires equality)
static member FromRows : rows:Series<'TColKey,Series<'TRowKey,'V>> -> Frame<'TColKey,'TRowKey> (requires equality and equality)
...
Full name: Deedle.Frame
--------------------
type Frame<'TRowKey,'TColumnKey (requires equality and equality)> =
interface IDynamicMetaObjectProvider
interface INotifyCollectionChanged
interface IFsiFormattable
interface IFrame
new : names:seq<'TColumnKey> * columns:seq<ISeries<'TRowKey>> -> Frame<'TRowKey,'TColumnKey>
private new : rowIndex:IIndex<'TRowKey> * columnIndex:IIndex<'TColumnKey> * data:IVector<IVector> -> Frame<'TRowKey,'TColumnKey>
member AddSeries : column:'TColumnKey * series:ISeries<'TRowKey> -> unit
member AddSeries : column:'TColumnKey * series:seq<'V> -> unit
member AddSeries : column:'TColumnKey * series:ISeries<'TRowKey> * lookup:Lookup -> unit
member AddSeries : column:'TColumnKey * series:seq<'V> * lookup:Lookup -> unit
...
Full name: Deedle.Frame<_,_>
--------------------
new : names:seq<'TColumnKey> * columns:seq<ISeries<'TRowKey>> -> Frame<'TRowKey,'TColumnKey>
static member Frame.ReadCsv : stream:Stream * ?hasHeaders:bool * ?inferTypes:bool * ?inferRows:int * ?schema:string * ?separators:string * ?culture:string * ?maxRows:int -> Frame<int,string>
Full name: Features.root
Full name: Features.msft
Full name: Deedle.Frame.indexRowsDate
Full name: Deedle.Frame.orderRows
Full name: Features.air
static member FrameExtensions.SaveCsv : frame:Frame<'R,'C> * path:string * keyNames:seq<string> * separator:char * culture:Globalization.CultureInfo -> unit (requires equality and equality)
static member FrameExtensions.SaveCsv : frame:Frame<'R,'C> * stream:Stream * includeRowKeys:bool * keyNames:seq<string> * separator:char * culture:Globalization.CultureInfo -> unit (requires equality and equality)
static member FrameExtensions.SaveCsv : frame:Frame<'R,'C> * path:string * includeRowKeys:bool * keyNames:seq<string> * separator:char * culture:Globalization.CultureInfo -> unit (requires equality and equality)
member Frame.SaveCsv : stream:Stream * ?includeRowKeys:bool * ?keyNames:seq<string> * ?separator:char * ?culture:Globalization.CultureInfo -> unit
member Frame.SaveCsv : path:string * ?includeRowKeys:bool * ?keyNames:seq<string> * ?separator:char * ?culture:Globalization.CultureInfo -> unit
static val DirectorySeparatorChar : char
static val AltDirectorySeparatorChar : char
static val VolumeSeparatorChar : char
static val InvalidPathChars : char[]
static val PathSeparator : char
static member ChangeExtension : path:string * extension:string -> string
static member Combine : params paths:string[] -> string + 3 overloads
static member GetDirectoryName : path:string -> string
static member GetExtension : path:string -> string
static member GetFileName : path:string -> string
...
Full name: System.IO.Path
{Name: string;
Age: int;
Countries: string list;}
Full name: Features.Person
val string : value:'T -> string
Full name: Microsoft.FSharp.Core.Operators.string
--------------------
type string = String
Full name: Microsoft.FSharp.Core.string
val int : value:'T -> int (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.int
--------------------
type int = int32
Full name: Microsoft.FSharp.Core.int
--------------------
type int<'Measure> = int
Full name: Microsoft.FSharp.Core.int<_>
Full name: Microsoft.FSharp.Collections.list<_>
Full name: Features.peopleRecds
Full name: Features.peopleList
static member Frame.ofRecords : values:seq<'T> -> Frame<int,string>
Full name: Features.people
Full name: Deedle.Frame.indexRowsString
member Frame.GetSeries : column:'TColumnKey * lookup:Lookup -> Series<'TRowKey,'R>
Full name: Features.wb
static member GetDataContext : unit -> WorldBankDataService
nested type ServiceTypes
Full name: FSharp.Data.WorldBankData
Full name: Features.loadRegion
Given a region, load GDP in current US$ and return data as
a frame with two-level column key (region and country name)
nested type Countries
nested type Country
nested type Indicators
nested type IndicatorsDescriptions
nested type Region
nested type Regions
nested type Topic
nested type Topics
nested type WorldBankDataService
Full name: FSharp.Data.WorldBankData.ServiceTypes
<summary>Contains the types that describe the data service</summary>
inherit Region
member Countries : Countries
member Indicators : Indicators
member Name : string
member RegionCode : string
Full name: FSharp.Data.WorldBankData.ServiceTypes.Region
<summary>The indicators for the region</summary>
module Series
from Deedle
--------------------
type Series =
static member ofNullables : values:seq<Nullable<'a0>> -> Series<int,'a0> (requires default constructor and value type and 'a0 :> ValueType)
static member ofObservations : observations:seq<'a0 * 'a1> -> Series<'a0,'a1> (requires equality)
static member ofOptionalObservations : observations:seq<'K * OptionalValue<'a1>> -> Series<'K,'a1> (requires equality)
static member ofValues : values:seq<'a0> -> Series<int,'a0>
Full name: Deedle.FSharpSeriesExtensions.Series
--------------------
type Series<'K,'V (requires equality)> =
interface IFsiFormattable
interface ISeries<'K>
new : pairs:seq<KeyValuePair<'K,'V>> -> Series<'K,'V>
new : keys:seq<'K> * values:seq<'V> -> Series<'K,'V>
new : index:IIndex<'K> * vector:IVector<'V> * vectorBuilder:IVectorBuilder * indexBuilder:IIndexBuilder -> Series<'K,'V>
member After : lowerExclusive:'K -> Series<'K,'V>
member Aggregate : aggregation:Aggregation<'K> * observationSelector:Func<DataSegment<Series<'K,'V>>,KeyValuePair<'TNewKey,OptionalValue<'R>>> -> Series<'TNewKey,'R> (requires equality)
member Aggregate : aggregation:Aggregation<'K> * keySelector:Func<DataSegment<Series<'K,'V>>,'TNewKey> * valueSelector:Func<DataSegment<Series<'K,'V>>,OptionalValue<'R>> -> Series<'TNewKey,'R> (requires equality)
member Append : otherSeries:Series<'K,'V> -> Series<'K,'V>
member AsyncMaterialize : unit -> Async<Series<'K,'V>>
...
Full name: Deedle.Series<_,_>
--------------------
new : pairs:seq<Collections.Generic.KeyValuePair<'K,'V>> -> Series<'K,'V>
new : keys:seq<'K> * values:seq<'V> -> Series<'K,'V>
new : index:Indices.IIndex<'K> * vector:IVector<'V> * vectorBuilder:Vectors.IVectorBuilder * indexBuilder:Indices.IIndexBuilder -> Series<'K,'V>
<summary>The indicators for the country</summary>
Full name: Deedle.FSharpFrameExtensions.frame
Full name: Features.eu
Full name: Features.oecd
Full name: Features.world
member Frame.Join : colKey:'TColumnKey * series:Series<'TRowKey,'V> -> Frame<'TRowKey,'TColumnKey>
member Frame.Join : otherFrame:Frame<'TRowKey,'TColumnKey> * kind:JoinKind -> Frame<'TRowKey,'TColumnKey>
member Frame.Join : colKey:'TColumnKey * series:Series<'TRowKey,'V> * kind:JoinKind -> Frame<'TRowKey,'TColumnKey>
member Frame.Join : otherFrame:Frame<'TRowKey,'TColumnKey> * kind:JoinKind * lookup:Lookup -> Frame<'TRowKey,'TColumnKey>
member Frame.Join : colKey:'TColumnKey * series:Series<'TRowKey,'V> * kind:JoinKind * lookup:Lookup -> Frame<'TRowKey,'TColumnKey>
Full name: Features.peopleNested
Full name: Deedle.Frame.expandCols
Full name: Features.tuples
Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.dict
Full name: Microsoft.FSharp.Core.Operators.box
Full name: Deedle.Frame.expandAllCols
Full name: Features.numAges
member Series.Get : key:'K * lookup:Lookup -> 'V
member Series.TryGet : key:'K * lookup:Lookup -> OptionalValue<'V>
Full name: Deedle.Series.mapValues
Full name: Features.more
Full name: Deedle.FSharpSeriesExtensions.series
Full name: Features.add1
member Frame.ReplaceSeries : column:'TColumnKey * series:ISeries<'TRowKey> -> unit
member Frame.ReplaceSeries : column:'TColumnKey * data:seq<'V> * lookup:Lookup -> unit
member Frame.ReplaceSeries : column:'TColumnKey * series:ISeries<'TRowKey> * lookup:Lookup -> unit
Full name: Features.newRow
static member FrameExtensions.Append : frame:Frame<'TRowKey,'TColumnKey> * rowKey:'TRowKey * row:ISeries<'TColumnKey> -> Frame<'TRowKey,'TColumnKey> (requires equality and equality)
Full name: Features.otherRow
type SeriesBuilder<'K (requires equality)> =
inherit SeriesBuilder<'K,obj>
new : unit -> SeriesBuilder<'K>
Full name: Deedle.SeriesBuilder<_>
--------------------
type SeriesBuilder<'K,'V (requires equality and equality)> =
interface IDynamicMetaObjectProvider
interface IDictionary<'K,'V>
interface seq<KeyValuePair<'K,'V>>
interface IEnumerable
new : unit -> SeriesBuilder<'K,'V>
member Add : key:'K * value:'V -> unit
member Series : Series<'K,'V>
static member ( ?<- ) : builder:SeriesBuilder<string,'V> * name:string * value:'V -> unit
Full name: Deedle.SeriesBuilder<_,_>
--------------------
new : unit -> SeriesBuilder<'K>
--------------------
new : unit -> SeriesBuilder<'K,'V>
Full name: Features.nums
Full name: Features.strs
Full name: Features.ages
Full name: Deedle.Series.get
Full name: Deedle.Series.tryGet
Full name: Deedle.Series.getAll
Full name: Deedle.Series.observations
Full name: Deedle.Series.observationsAll
Full name: Features.opens
type DateTime =
struct
new : ticks:int64 -> DateTime + 10 overloads
member Add : value:TimeSpan -> DateTime
member AddDays : value:float -> DateTime
member AddHours : value:float -> DateTime
member AddMilliseconds : value:float -> DateTime
member AddMinutes : value:float -> DateTime
member AddMonths : months:int -> DateTime
member AddSeconds : value:float -> DateTime
member AddTicks : value:int64 -> DateTime
member AddYears : value:int -> DateTime
...
end
Full name: System.DateTime
--------------------
DateTime()
(+0 other overloads)
DateTime(ticks: int64) : unit
(+0 other overloads)
DateTime(ticks: int64, kind: DateTimeKind) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, calendar: Globalization.Calendar) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, kind: DateTimeKind) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, calendar: Globalization.Calendar) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, millisecond: int) : unit
(+0 other overloads)
DateTime(year: int, month: int, day: int, hour: int, minute: int, second: int, millisecond: int, kind: DateTimeKind) : unit
(+0 other overloads)
| Exact = 0
| NearestGreater = 1
| NearestSmaller = 2
Full name: Deedle.Lookup
Full name: Deedle.Series.lookup
Full name: Features.dates
Full name: Deedle.Series.lookupAll
Full name: Features.travels
Full name: Deedle.Series.groupBy
module List
from Microsoft.FSharp.Collections
--------------------
type List<'T> =
| ( [] )
| ( :: ) of Head: 'T * Tail: 'T list
interface IEnumerable
interface IEnumerable<'T>
member Head : 'T
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
static member Cons : head:'T * tail:'T list -> 'T list
static member Empty : 'T list
Full name: Microsoft.FSharp.Collections.List<_>
Full name: Microsoft.FSharp.Collections.List.exists
Full name: Deedle.Series.groupInto
Full name: Deedle.Series.countKeys
from Microsoft.FSharp.Collections
Full name: Microsoft.FSharp.Collections.Seq.countBy
Full name: Microsoft.FSharp.Core.Operators.id
static member Frame.ofRows : rows:Series<'a0,#ISeries<'a2>> -> Frame<'a0,'a2> (requires equality and equality)
Full name: Deedle.Frame.fillMissingWith
Full name: Deedle.Frame.groupRowsByString
Full name: Features.byDecimal
val decimal : value:'T -> decimal (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.decimal
--------------------
type decimal = Decimal
Full name: Microsoft.FSharp.Core.decimal
--------------------
type decimal<'Measure> = decimal
Full name: Microsoft.FSharp.Core.decimal<_>
Full name: Deedle.Frame.groupRowsBy
Full name: Deedle.Frame.groupRowsUsing
Full name: Features.bySex
Full name: Features.bySex1
Full name: Deedle.Frame.nest
Full name: Features.bySex2
Full name: Deedle.Frame.unnest
Full name: Features.byClassAndPort
Full name: Deedle.Frame.groupRowsByInt
Full name: Deedle.Frame.mapRowKeys
from Deedle
Full name: Deedle.Pair.flatten3
Full name: Features.ageByClassAndPort
Full name: Deedle.Series.meanLevel
Full name: Deedle.Pair.get1And2Of3
Full name: Deedle.Frame.meanLevel
Full name: Microsoft.FSharp.Core.bool
Full name: Deedle.Series.applyLevel
Full name: Deedle.Series.values
Full name: Deedle.MultiKeyExtensions.Lookup1Of2
Full name: Deedle.MultiKeyExtensions.Lookup2Of2
Full name: Features.euro
Full name: Deedle.Frame.mapColKeys
Full name: Microsoft.FSharp.Core.Operators.snd
Full name: Features.decades
Full name: Microsoft.FSharp.Core.ExtraTopLevelOperators.sprintf
Full name: Microsoft.FSharp.Core.Operators.fst
Full name: Features.byGDP
Full name: Deedle.Frame.groupColsUsing
val float : value:'T -> float (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.float
--------------------
type float = Double
Full name: Microsoft.FSharp.Core.float
--------------------
type float<'Measure> = float
Full name: Microsoft.FSharp.Core.float<_>
Full name: Deedle.Series.mean
Full name: Deedle.Frame.transpose
struct
member CompareTo : value:obj -> int + 1 overload
member Equals : obj:obj -> bool + 1 overload
member GetHashCode : unit -> int
member GetTypeCode : unit -> TypeCode
member ToString : unit -> string + 3 overloads
static val MinValue : float
static val MaxValue : float
static val Epsilon : float
static val NegativeInfinity : float
static val PositiveInfinity : float
...
end
Full name: System.Double
Full name: Features.nulls
type Nullable =
static member Compare<'T> : n1:Nullable<'T> * n2:Nullable<'T> -> int
static member Equals<'T> : n1:Nullable<'T> * n2:Nullable<'T> -> bool
static member GetUnderlyingType : nullableType:Type -> Type
Full name: System.Nullable
--------------------
type Nullable<'T (requires default constructor and value type and 'T :> ValueType)> =
struct
new : value:'T -> Nullable<'T>
member Equals : other:obj -> bool
member GetHashCode : unit -> int
member GetValueOrDefault : unit -> 'T + 1 overload
member HasValue : bool
member ToString : unit -> string
member Value : 'T
end
Full name: System.Nullable<_>
--------------------
Nullable()
Nullable(value: 'T) : unit
Full name: Features.ozone
Full name: Deedle.Series.mapAll
Full name: Deedle.Series.fillMissingWith
Full name: Deedle.Series.fillMissing
| Backward = 0
| Forward = 1
Full name: Deedle.Direction
Full name: Deedle.Series.dropMissing
Full name: Deedle.Series.fillMissingUsing
module OptionalValue
from Deedle
--------------------
type OptionalValue =
static member Create : v:'a0 -> OptionalValue<'a0>
static member Empty : unit -> OptionalValue<'T>
static member OfNullable : v:Nullable<'T> -> OptionalValue<'T> (requires default constructor and value type and 'T :> ValueType)
Full name: Deedle.OptionalValue
--------------------
type OptionalValue<'T> =
struct
new : value:'T -> OptionalValue<'T>
private new : hasValue:bool * value:'T -> OptionalValue<'T>
override Equals : y:obj -> bool
override GetHashCode : unit -> int
override ToString : unit -> string
member HasValue : bool
member Value : 'T
member ValueOrDefault : 'T
static member Missing : OptionalValue<'T>
end
Full name: Deedle.OptionalValue<_>
--------------------
OptionalValue()
new : value:'T -> OptionalValue<'T>
Full name: Deedle.OptionalValue.( |Missing|Present| )