Declarative vs procedural data manipulation languages – state of the art analysis

Is it better to manipulate data in a procedural language (e.g. Python with the Pandas library), or a declarative language, like SQL or SPARQL?

Data transformation is a crucial and time-consuming phase of data analytics projects. A common way to define the transformations needed to produce data suitable for machine learning algorithms is to use a data manipulation language (DML). A DML is a computer programming language used for adding (inserting), deleting, and modifying (updating) data stored in files or databases. DMLs fall into two main groups: declarative and procedural. A very popular declarative DML is SQL (Structured Query Language), which is commonly used to query and manipulate relational databases. Similarly, SPARQL is the counterpart of SQL for RDF (graph) databases. With the adoption of Python in many data analytics projects, the Pandas library has become one of the most successful examples of a procedural DML. The aim of this thesis is to investigate and compare the most promising DMLs currently available, both declarative and procedural. The results of this analysis will be highly relevant for data scientists and data engineers.
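To make the contrast concrete, the sketch below expresses the same aggregation twice: once as a single SQL statement (run through Python's built-in sqlite3 module) and once as a sequence of Pandas calls. The table name `sales`, its columns, and the sample rows are hypothetical and invented for illustration only; this is a minimal example of the two styles, not a benchmark or a recommended methodology for the thesis.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data, invented for this illustration.
rows = [("oslo", 10.0), ("bergen", 4.0), ("oslo", 6.0), ("bergen", 8.0)]

# Declarative style: one SQL statement says WHAT result is wanted;
# the database engine decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
conn.close()

# Pandas style: the transformation is written as explicit steps
# (build the frame, group, select a column, sum, sort).
df = pd.DataFrame(rows, columns=["city", "amount"])
totals = df.groupby("city")["amount"].sum().sort_index()
pandas_result = [(city, float(total)) for city, total in totals.items()]

print(sql_result)
print(pandas_result)
```

Both variants produce the same per-city totals. The difference lies in who chooses the execution strategy: the query engine in the SQL case, the programmer's sequence of calls in the Pandas case.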

Published Oct. 11, 2021 15:12 - Last modified Oct. 11, 2021 15:12

Supervisor(s)

Scope (credits)

60