For extensibility and reusability, our data module is organized as a data flow that transforms raw data into the model input.
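The overall data flow proceeds from raw data to atomic files, then to a Dataset, and finally to a DataLoader whose batches are fed to the model.
The details of each stage are as follows: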
Dataset: Built mainly on pandas.DataFrame, the primary data
structure of the Pandas library. During the transformation step
from atomic files to the Dataset class, we provide a series of
preprocessing functions commonly needed in recommender systems, such as
k-core data filtering and missing value imputation. Detailed in [ API ].
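The two preprocessing steps named above can be illustrated directly on a pandas.DataFrame. The following is a minimal sketch, not the library's implementation: the helper `k_core_filter`, the column names `user_id`, `item_id`, and `rating`, and the choice of mean imputation are all assumptions made for illustration.

```python
import pandas as pd


def k_core_filter(df, user_col="user_id", item_col="item_id", k=2):
    """Hypothetical helper: repeatedly drop rows whose user or item has
    fewer than k interactions, until every remaining one has at least k."""
    while True:
        # Per-row interaction counts of that row's user and item.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df.reset_index(drop=True)
        # Removing rows can push other users/items below k, so loop again.
        df = df[keep]


# Toy interaction table (illustrative data only).
inter = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "item_id": ["i1", "i2", "i1", "i2", "i3"],
    "rating": [5.0, None, 3.0, 4.0, 2.0],
})

# 2-core filtering removes u3 and i3, which each appear only once.
filtered = k_core_filter(inter, k=2)

# Missing value imputation: fill the missing rating with the column mean
# (one possible strategy, assumed here).
filtered["rating"] = filtered["rating"].fillna(filtered["rating"].mean())
```

The filter is iterative because removing a sparse user can leave an item with too few interactions, and vice versa, so a single pass is not enough in general.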
DataLoader: Built mainly on a general internal data
structure implemented by our library, called Interaction. Interaction is
the internal data structure that is fed into the recommendation algorithms.
It is implemented as a new abstract data type based on Python's dict, a
key-value indexed data structure. The keys correspond to input features, which can be
conveniently referenced by feature name when writing
recommendation algorithms, and the values correspond to tensors (implemented by
torch.Tensor), which are used for updates and computation in the learning
algorithms. Specifically, the value for a given key stores all the corresponding tensor data
in a batch or mini-batch. Detailed in [ API ].
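To make the key-to-tensor layout concrete, the sketch below builds a small dict-backed container in the spirit of Interaction. It is a simplified, hypothetical stand-in: the class name `MiniInteraction`, its methods, and the feature names are assumptions for illustration, and the library's actual Interaction class may differ.

```python
import torch


class MiniInteraction:
    """Hypothetical, simplified stand-in for a dict-based batch container.

    Keys are feature names; each value is a torch.Tensor whose first
    dimension is the batch dimension.
    """

    def __init__(self, data):
        lengths = {len(v) for v in data.values()}
        if len(lengths) > 1:
            raise ValueError("all feature tensors must share the batch size")
        self.data = dict(data)

    def __getitem__(self, key):
        # A string key returns the whole batch tensor for that feature.
        if isinstance(key, str):
            return self.data[key]
        # Any other index (e.g. a slice) selects rows from every feature.
        return MiniInteraction({k: v[key] for k, v in self.data.items()})

    def __len__(self):
        return len(next(iter(self.data.values()))) if self.data else 0

    def to(self, device):
        # Move every feature tensor to the given device.
        return MiniInteraction({k: v.to(device) for k, v in self.data.items()})


# One mini-batch of four interactions (illustrative values only).
batch = MiniInteraction({
    "user_id": torch.tensor([1, 2, 3, 4]),
    "item_id": torch.tensor([10, 20, 30, 40]),
    "rating": torch.tensor([5.0, 3.0, 4.0, 1.0]),
})

users = batch["user_id"]  # tensor holding all user ids in the batch
first_two = batch[:2]     # sub-batch containing the first two rows
```

Indexing by feature name keeps model code readable, since an algorithm can request `batch["user_id"]` without knowing the column order of the underlying data.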