Data Processing Pipeline

The overall data flow proceeds through four stages: Raw Input, Atomic Files, Dataset, and DataLoader.

The details of each stage are as follows:

Raw Input: The unprocessed input dataset. Detailed in [ Dataset List ].
Atomic Files: Basic components that characterize the input of various recommendation tasks, as proposed by RecBole. Detailed in [ Atomic Files ].
Dataset: Built mainly on pandas.DataFrame, the primary data structure of the pandas library. During the transformation from atomic files to the Dataset class, we provide many functions that support common preprocessing steps in recommender systems, such as k-core data filtering and missing value imputation. Detailed in [ API ].
DataLoader: Built mainly on Interaction, a general internal data structure implemented by our library. Interaction is the internal data structure that is fed into the recommendation algorithms. It is implemented as a new abstract data type based on Python's dict, a key-value indexed data structure. The keys correspond to features from the input, which can be conveniently referenced by feature name when writing recommendation algorithms; the values correspond to tensors (implemented by torch.Tensor), which are used for updates and computation in the learning algorithms. Specifically, the value for a given key stores all the corresponding tensor data in a batch or mini-batch. Detailed in [ API ].
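
To make the preprocessing mentioned under Dataset more concrete, the following is a minimal sketch of k-core filtering and missing value imputation on a plain pandas.DataFrame. It is not the library's implementation: the function name k_core_filter, the column names user_id, item_id and rating, and the choice of mean imputation are all assumptions made for illustration.

```python
import pandas as pd


def k_core_filter(df, user_col="user_id", item_col="item_id", k=2):
    """Hypothetical helper: repeatedly drop interactions whose user or item
    has fewer than k interactions, until every remaining one has at least k."""
    while True:
        # Number of interactions of each row's user and item in the current frame.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df.reset_index(drop=True)
        # Dropping rows can push other users/items below k, so loop again.
        df = df[keep]


# Toy interaction table (assumed layout, not a real dataset).
interactions = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
        "item_id": ["i1", "i2", "i1", "i2", "i1", "i3"],
        "rating": [5.0, None, 3.0, 4.0, 2.0, 1.0],
    }
)

filtered = k_core_filter(interactions, k=2)

# Mean imputation is one simple strategy among several; it is assumed here.
filtered = filtered.assign(
    rating=filtered["rating"].fillna(filtered["rating"].mean())
)
print(filtered)
```

In this toy table, item i3 has a single interaction, so its row is dropped first. That leaves user u3 with one interaction, which is dropped on the next pass. This cascading effect is why the filter is written as a loop rather than a single pass.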
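
The idea behind Interaction, a dict-like container whose keys are feature names and whose values are batched tensors, can be sketched as follows. This is a simplified stand-in written for illustration, not the library's Interaction class: the name MiniInteraction, its methods, and the feature names are assumptions.

```python
import torch


class MiniInteraction:
    """Hypothetical, simplified dict-backed batch container.

    Keys are feature names; each value is a torch.Tensor whose first
    dimension is the batch dimension.
    """

    def __init__(self, fields):
        lengths = {len(t) for t in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must share the same batch size")
        self.fields = dict(fields)
        self.length = lengths.pop()

    def __getitem__(self, key):
        # A string key returns the whole tensor for that feature.
        if isinstance(key, str):
            return self.fields[key]
        # A slice selects the same rows from every feature tensor.
        if isinstance(key, slice):
            return MiniInteraction({n: t[key] for n, t in self.fields.items()})
        raise TypeError("key must be a feature name or a slice")

    def __len__(self):
        return self.length


batch = MiniInteraction(
    {
        "user_id": torch.tensor([1, 2, 3]),
        "item_id": torch.tensor([10, 20, 30]),
        "rating": torch.tensor([5.0, 3.0, 4.0]),
    }
)

# A model can refer to features by name instead of by column position.
users = batch["user_id"]
first_two = batch[:2]
print(users, len(first_two))
```

Keeping all features of a batch in one object lets a recommendation algorithm request only the fields it needs by name, while row slicing keeps the fields aligned with each other.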