Introduction to data.table
data.table is an extension of R's data.frame for working with large datasets. At the expense of a more complex syntax, data.table provides an efficient framework for filtering and manipulating large, multi-table, in-memory datasets. Major features include:
- Fast import of csv files via fread().
- Fast and flexible joining separate data.tables via keys.
- Support for assignment by reference.
Instead of re-hashing an introduction here, I would suggest reading the intro vignette in the data.table documentation.
data.table notes
Joins
Joins as Subsetting
When first using data.table, I kept referring to the this guide to remember how to do inner joins and outer joins. However, the main idea behind joins in data.table is to treat them as a type of subsetting, and if you embrace this idea, it will become clearer. Let's assume we have two data.tables, X and Y, and for simplicity's sake, we will assume that we have called setkey on both of them.
X[Y, nomatch=NA]
nomatch=NA is the default, so this would be the same as leaving it off
Take the subset of X, where X's key column's values match Y's key column's values. If there is a set of key values in Y with no corresponding value match in X (like a KeyError in Python), fill in that row's X's columns with NA. This is the same as Y outer join X.
X[Y, nomatch=0]
Take the subset of X, where X's key columns values match Y's key columns values. If there is a set of key values in Y with no corresponding value match in X, do not include the Y or X row. This is the same as Y inner join X (or X inner join Y).
Non-equi joins
By default, the type of join you typically do is an "equi-join" which just means that you join on key columns with values that are equal. Sometimes this is not ideal, especially if you are doing a join with something other than an auto-incremeted primary key (or other surrogate key). In other words, if the key you are joining on has some kind of natural meaning (like a timestamp), it might make sense to join on all rows with a time key that is greater or less than your primary key. This could end up saving you a large cartesian join followed by a subsetting. In fact, this might not be just a matter of convenience- if your data is big enough, you may not have enough memory for a full cartesian join.
The data.table documention refers to non-equi joins and rolling joins (described below) as distinct concepts, but in reality, rolling joins are just one type of non-equi join.
Rolling join
A rolling join allows you to join two tables with keys that are not an exact match, but are within a certain threshold of one another. That is, you can join on a "rolling" window of values. For example, if the primary key for X is 1, and the primary key value for Y is 2, a rolling join could be set up to connect these two rows. This is especially useful for joining two tables on timestamps. Perhaps you want to join table A with the key in table B that comes as close as possiible (or within certain amount of time after) the primary key in table A. If you are using multi-field keys, the last field that you specify is always used as the rolling field.
Other non-equi joins
There is another syntax, using the "on" keyword, to do arbitrary non-equi joins using the > , <, <=, and >= operators (as well as, I believe the range keyword). This is not well documented, but is discussed around page 6 of the official data.table documentation. I have tested this, and it works well and appears to be quite fast. One note is that the values in the comparisons must be column names- you cannot specify arbitrary functions inside of the on
vector. This means you likely need to make some temporary columns on your tables.
Note that on
keyword technically defines data.table secondary indices.
Self joins
There is nothing special about a self join, it is just joining a table to itself. You might want to join rows in the same table to one another, and you can do that. Just be aware, as with any join that has two tables with conflicting column names, data.table will rename the conflicting column names of the table you index by (in the i position) by prefacing the column names with "i.".
.EACHI joins
.EACHI
is a value that you can set the by
keyword to that changes the way a your J clause works in a join. It used to be the default behavior when you were merging two tables while using an aggregate value (like .N) in your J clause, but now you have to specify that you want this behavior (called by-without-by). The best explanation for this is this answer on Stack Overflow. In general, I have found this useful for situations where you would typically need to perform a cartesian join, followed by a grouping with code in J to further select. Or, similarly, using nested for loops to examine all the rows in one table for each row in the another table . X[Y, ,by=.EACHI] effectively lets your perform your J clause (and return a row or rows) for each row in Y, along with all the rows in X that match Y's key. This may not be the most efficient approach, especially with the availability of non-equi joins described in the prior section.
Also, see this Good explanation of .EACHI
Common Tasks
Creating an Index element/primary key
Find the largest/smallest item of a group
Fill/copy down a column
If you need to copy/fill down to fill in NA's between defined values, there is a method based on grouping by the cumsum of the is.NA value of the column. See here. This is can be pretty useful for performing "lookbacks" on values on rows in a table. It is also pretty trivial to do individual fill downs for subgroups in the table by creating an extra colum on the table containing the value of cumsum calculation after grouping by the variable of concern.
Grouping commands with {}
From Advanced data table- this is apparently not really special to data.table, but in the J position of data.table you can group commands, including creating temporary variables. The last line of the block should be a list containing the variables you want to actually return from the selection.
DT[,{t=1:3; a = a[t]; list(one_var=a, one_var=2);}]
Special symbols and functions
(Also see this reference)
.N
Returns the number of rows in the data.table, or when using a group, the number of rows in the group. Can now be used in the I position, for example, to sample 20 rows from your table
DT[sample(.N, 20)]
.I
This is a vector containing the index of the rows in the data.table from 1:.N. Among other things, this can be used to add a primary key to a table with:
DT[,pk:=.I]
When using a group by, it is a vector containing the index in the original row in the datable for each row in the current group. This can used to get the entire row corresponding to the maximum value of a particular column in each group:
DT[DT[,.I[which.max(time)], by=event_name]$V1]
Note that here V1
is the name of the unspecified column created by selecting in J without specifying a column name.
.SD
A data.table equal to the elements in the current group (Subset Data.Table). Can be quite slow when used.
J()/CJ()/SJ()
J()
is an alias forlist()
or.()
inside in the I/subset section inside of a data.table. It does nothing else and is generally deprecated.CJ()
/SJ()
are convenience functions that can be used inside the I section of a data.table or outside to do a cross or sorted join, respectively. I have not yet used them. See here.
get function / ():=
If you need to subset or select in J on a column name, but the column name you want to filter by is contained in a string variable, you can use the get
function.
var_col="temperature";DT[get(var_col)==50]
If you are updating or adding a column in J, and the column name is in a variable:
DT[i, (var_col) := val]
You could do something similar by setting with
to FALSE:
DT[i, var_col := val, with=FALSE]
set function
If you want to update the values of a data.table column in a for
loop, it can be inefficient to use the :=
syntax repeatedly. This function makes it faster for you. I've only seen small speed ups, but could be useful. Generally for loops are not recommended but apparently for some tasks for
loops and set are recommended over using .SD. See http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/
with keyword
This tells data.table whether you want the variable names used in the J position to resolve to column names or to be looked up in the surrounding scope as variables. Default is TRUE (use column names).
Miscellaneous Tips
- If you are only grouping by a single column name, you can omit the
by=
, for example,DT[,.N,colname]
- The
keyby
keyword does a group followed by a setkey on the resulting data.table. - For comparing to values in prior or subsequent rows, see shift/lag keywords
- Update mutliple columns by reference at once using
:=()
, for example,DT[,':='(new1 = sum(colB), new2 = sum(colC))]
. This is referred to as the functional form. - Using variable names intead of columns, for example,
colvector = c("colname1", "colname2"); DT[,(colvector) := .(val1, val2)]
. You can also update multiple columns at once by referring to the column names as string literals in a vector on the left-hand side,DT[, c("temperature","anothercol"):=.(3,4)]
. See here for more examples.
Gotchas
Indexing data.table by a vector of integers vs matching on an integer key
Be aware that normally if you pass a vector of values into the I position, this will typically return the rows where the key is equal to one of those values. However, a vector of integers is reserved for returning the rows with those indexes. If you want to return rows based on the value of an integer key, you have to embed the vector in a list
DT[.(c(1,3,4))]
Note that this is just the syntax for selecting by multiple keys (indexing by a list of vectors) except we only supply one vector in the list.
Index by a boolean vector (like data.frame)
With a standard data.frame, the main way to select rows with specific values is to pass a vector of booleans into the first position.
DF[DF$A==5]
You can basically do the same thing with data.table:
s=(DT2$A==5|DT2$B==5); DT1[s]
This is useful when you want to subset by some calculation you have done elsewhere. Keep in mind you can do DT[DT$A==5]
, but this is generally not done since DT
is redundant.
DT[BOOLEAN COLUMN==TRUE] vs DT[BOOLEAN_COLUMN]
The first one works, the second will not. When I is a single variable name, data.table actually looks in the surround scope for the variable (this is what it does in the above example, DT1[s]
for the s
variable. This means you either have to directly compare to a boolean value, or use other syntax tricks
Other useful data.table links
Thanks for reading, please leave any comments or corrections.