Data management and data quality are well-studied subjects. However, integrating existing solutions with data acquisition in clinical settings and with ML-based data analysis remains a surprisingly large challenge. One likely reason is that these tasks (i.e., data acquisition, data management and curation, data quality maintenance, and ML) require different skills, are typically assigned to different organizational units, and are performed largely independently by different individuals. As a result, considerable effort is wasted on manual data collection, manual data loading and exporting, and manual data reformatting before the data can be used for ML studies.
Unfortunately, none of the existing approaches or systems supports the entire data management tool chain as a complete and integrated solution [Perez-Pozuelo et al. 2020], [Du et al. 2020]. Many ad hoc or partial solutions exist, often built on commercial off-the-shelf systems, and many of them raise privacy issues. These partial solutions do not meet the data management needs and do not provide a satisfactory data basis for ML analysis.
We need to support the entire data management tool chain: data acquisition; data storage and curation, including cleaning, filtering, and anonymisation (i.e., the entire Extract-Transform-Load (ETL) process); data quality improvement; and data modelling and integration to build the data warehouse. This provides a suitable data basis for ML-based data analysis and the required input both for applications for sleep-related breathing disorders and for ML research and development.
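The ETL part of this tool chain can be illustrated with a minimal sketch. It is not the project's implementation: the record fields (`patient_id`, `timestamp`, `spo2`), the plausibility range, the table name `measurement`, and the helper names are all hypothetical, and SQLite stands in for whatever database the warehouse would actually use. Note also that a salted hash only pseudonymises the identifier; real anonymisation would need further measures.

```python
import hashlib
import sqlite3

# Hypothetical raw records; field names and values are illustrative only.
RAW_RECORDS = [
    {"patient_id": "P001", "timestamp": "2020-01-01T22:00:00", "spo2": 97.0},
    {"patient_id": "P001", "timestamp": "2020-01-01T22:00:01", "spo2": None},   # missing value
    {"patient_id": "P002", "timestamp": "2020-01-01T22:00:00", "spo2": 140.0},  # implausible value
    {"patient_id": "P002", "timestamp": "2020-01-01T22:00:01", "spo2": 95.5},
]

# Assumption: the salt is a secret that is stored outside the warehouse.
SALT = "replace-with-a-secret-salt"


def pseudonymise(patient_id: str) -> str:
    """Replace the identifier with a truncated salted SHA-256 digest."""
    digest = hashlib.sha256((SALT + patient_id).encode("utf-8")).hexdigest()
    return digest[:16]


def transform(records):
    """Clean, filter, and pseudonymise the extracted records."""
    for rec in records:
        value = rec["spo2"]
        if value is None:                 # cleaning: drop missing samples
            continue
        if not 0.0 <= value <= 100.0:     # filtering: drop implausible percentages
            continue
        yield (pseudonymise(rec["patient_id"]), rec["timestamp"], value)


def load(conn: sqlite3.Connection, rows) -> None:
    """Load the transformed rows into a (hypothetical) measurement table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurement ("
        "subject_key TEXT NOT NULL, recorded_at TEXT NOT NULL, spo2 REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(records) -> sqlite3.Connection:
    """Run extract (here: an in-memory list), transform, and load."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(records))
    return conn
```

Calling `run_etl(RAW_RECORDS)` returns a connection whose `measurement` table holds only the two plausible samples, keyed by pseudonyms instead of the original patient identifiers.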
Since the different ML techniques have different data modelling requirements (they need different data formats as input), it is best to use a data warehouse with multi-dimensional data structures for the data management. This allows us to use views to extract the data in the format and structure most suitable for each ML technique, i.e., to manage (retrieve and model) the data with SQL queries rather than with scripts (see Figure 1). Furthermore, different data quality improvement approaches can be applied to the “raw” data, and the results can be managed as different versions. The data resulting from the ML analysis will in turn be integrated into the data warehouse (so that data are not spread across various CSV files). Design and implementation of this solution will be based on a data warehouse solution we developed to store training data (time-series data) and injury data (similar to health records) from professional football players, which is used to predict the players' risk of injury with ML [Theron 2020].
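The idea of serving different ML techniques from one warehouse through views can be sketched as follows. This is an illustrative assumption, not the design from [Theron 2020]: the star schema (`dim_subject`, `dim_signal`, `fact_measurement`), the `version` column used to keep raw and cleaned data side by side, the signal names, and the two views are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical star schema: two dimension tables and one fact table.
SCHEMA = """
CREATE TABLE dim_subject (subject_id INTEGER PRIMARY KEY, age_group TEXT);
CREATE TABLE dim_signal  (signal_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_measurement (
    subject_id INTEGER REFERENCES dim_subject(subject_id),
    signal_id  INTEGER REFERENCES dim_signal(signal_id),
    t          INTEGER,   -- sample index within a recording
    value      REAL,
    version    TEXT       -- e.g. 'raw' or 'cleaned'
);

-- Long format: one row per sample, e.g. for sequence models.
CREATE VIEW v_long AS
SELECT f.subject_id, s.name AS signal, f.t, f.value
FROM fact_measurement AS f
JOIN dim_signal AS s ON s.signal_id = f.signal_id
WHERE f.version = 'cleaned';

-- Wide format: one row per subject and time step, one column per signal,
-- e.g. for techniques that expect a feature table.
CREATE VIEW v_wide AS
SELECT f.subject_id, f.t,
       MAX(CASE WHEN s.name = 'spo2'    THEN f.value END) AS spo2,
       MAX(CASE WHEN s.name = 'airflow' THEN f.value END) AS airflow
FROM fact_measurement AS f
JOIN dim_signal AS s ON s.signal_id = f.signal_id
WHERE f.version = 'cleaned'
GROUP BY f.subject_id, f.t;
"""


def build_warehouse() -> sqlite3.Connection:
    """Create the schema and load a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dim_subject VALUES (1, 'adult')")
    conn.executemany(
        "INSERT INTO dim_signal VALUES (?, ?)", [(1, "spo2"), (2, "airflow")]
    )
    conn.executemany(
        "INSERT INTO fact_measurement VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, 0, 250.0, "raw"),      # implausible raw sample, kept as its own version
            (1, 1, 0, 97.0, "cleaned"),
            (1, 2, 0, 0.4, "cleaned"),
            (1, 1, 1, 96.0, "cleaned"),
            (1, 2, 1, 0.5, "cleaned"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_warehouse()
    for row in conn.execute("SELECT * FROM v_wide ORDER BY subject_id, t"):
        print(row)
```

Both views read from the same fact table, so a new input format for another ML technique only requires another view rather than another export script, and a different cleaning approach can be stored under its own `version` label.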
Figure 1: Data warehouse solution to support the entire value chain of data from acquisition to analysis
References
[Perez-Pozuelo et al. 2020] Perez-Pozuelo I, Zhai B, Palotti J, Mall R, Aupetit M, Garcia-Gomez JM: The future of sleep health: a data-driven revolution in sleep science and medicine, NPJ Digital Medicine, 2020, 3(1):1-15
[Du et al. 2020] Du M, Liu N, Hu X: Techniques for interpretable machine learning, Communications of the ACM, January 2020, 63(1):68-77
[Theron 2020] Theron G: The use of Data Mining for Predicting Injuries in Professional Football Players, Master Thesis, Department of Informatics, University of Oslo, May 2020
Knowledge of data management is required for this work, which will be embedded in the Respire project together with other MSc theses: https://www.mn.uio.no/ifi/forskning/prosjekter/respire/index.html
Contact: Vera Goebel (goebel@ifi.uio.no) and Thomas Plagemann (plageman@ifi.uio.no)