Hands-On Machine Learning on Google Cloud Platform

Data structuring

Every day, all over the world, human activities generate large volumes of data. Because this data comes from sources of different kinds, it is initially unstructured and must be organized before it is ready for use. The unstructured information collected is therefore processed according to specific requirements and then stored as structured data. Data structures come in many forms, from basic to advanced and complex, and their use is essential to the process of structuring data.

Data structuring consists of a set of linear or nonlinear operations performed on apparently random, unstructured input data. These operations analyze the nature of the data and its importance. Based on the results of the analysis, the system divides the data into broad categories of information and either stores it or sends it on for further analysis. This additional analysis can subdivide the data into nested subcategories. During the analysis, some data may also be judged useless and discarded.
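
The categorize-or-discard step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input strings, the two categories (temperature and user), and the categorize helper are all hypothetical and do not come from a real system.

```python
from collections import defaultdict

# Hypothetical raw inputs: free-form strings arriving from different sources.
raw_items = [
    "temperature=21.5",
    "user: alice",
    "",
    "temperature=19.0",
    "???",
    "user: bob",
]

def categorize(item):
    """Return (category, value), or None when the item is considered useless."""
    if item.startswith("temperature="):
        return "temperature", float(item.split("=", 1)[1])
    if item.startswith("user:"):
        return "user", item.split(":", 1)[1].strip()
    return None  # empty or unrecognized input

structured = defaultdict(list)  # category -> list of cleaned values
discarded = []                  # items judged useless during the analysis

for item in raw_items:
    result = categorize(item)
    if result is None:
        discarded.append(item)
    else:
        category, value = result
        structured[category].append(value)

print(dict(structured))
print(discarded)
```

A real pipeline would replace the string-prefix rules with whatever analysis suits the data, but the overall shape (analyze, assign a category, store or discard) stays the same.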

The result of this process is structured data, which can be analyzed further or used directly to extract previously unknown information. The cycle of structuring and processing data rests on this shift from unstructured data to useful information, and its success often determines the importance of data in a given field of application.

Data structuring is a methodology for organizing and archiving data so that it can be accessed and modified efficiently. In particular, a data structure consists of a collection of data values, the relationships between them, and the functions or operations that can be applied to the data, as shown in the following diagram:

Over time, data has been organized in different ways, starting from very basic structures, such as the arrays commonly used in programming languages, all the way to modern data structures that can take complex forms. Modern data structures include databases of different types that support a wide range of processing and extended operations, which allow easy manipulation, categorization, and sorting of data in many different ways.

Relational databases are the preferred data structure for many people because they have been widely used for many years. The term database refers to the set of data used in a specific information system, whether business, scientific, administrative, or of some other type. A database consists of two different types of information, belonging to distinct levels of abstraction:

  • Data, which represents the entities of the system to be modeled. The properties of these entities are described in terms of values (numeric, alphanumeric, and so on). The pieces of data are also grouped or classified into categories based on their common structure (for example, books, authors, and so on).
  • Structures (metadata), which describe the common characteristics of various categories of data, such as names and types of property values.
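
The two levels can be seen side by side with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database; the books table and its columns are hypothetical. A SELECT returns the data, while SQLite's PRAGMA table_info returns the metadata (column names and declared types).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")  # hypothetical table
conn.execute("INSERT INTO books VALUES ('Example Title', 2018)")

# Data level: the actual values describing an entity.
data = conn.execute("SELECT title, year FROM books").fetchall()

# Metadata level: PRAGMA table_info yields one row per column,
# laid out as (cid, name, type, notnull, dflt_value, pk).
metadata = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(books)")]

print(data)      # values stored in the table
print(metadata)  # names and declared types of the properties
```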

A database must represent the different aspects of reality: in addition to the actual data, it must also represent the relationships between the data, that is, the logical connections among the various categories. For example, the association that binds each author to their books, and vice versa, must be represented. The database must also meet the following requirements:

  • Data must be organized with minimal redundancy, that is, it must not be duplicated unnecessarily. This condition avoids wasting storage resources and, above all, removes the burden of managing multiple copies. If the information relating to a category of data is duplicated, an update applied to one copy but not to the others can compromise the consistency and reliability of all the data.
  • Data must be usable by multiple users at the same time. This requirement follows from the previous point: users (or categories of users) should not each work on their own copy of the data. Instead, there must be a single version of the data that all users can access, which means that each type of user needs a specific view of the data and specific access rights. Techniques are also needed to prevent conflicts when several users work on the same data simultaneously.
  • Data must be permanent. This implies not only the use of mass storage, but also the application of techniques that preserve the data if any component of the system malfunctions.
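
As a rough illustration of the author/book relationship and of the minimal-redundancy requirement, the following sketch uses Python's sqlite3 module with an in-memory database. The authors and books tables, their columns, and the sample names are assumptions made for the example. Each author's name is stored once, and books refer to it through author_id, so a single update is visible from every related book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The author's name is stored once; books only hold a reference to it.
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'A. Writer');
    INSERT INTO books VALUES (1, 'First Book', 1), (2, 'Second Book', 1);
""")

# One update in one place: there are no duplicate copies to keep in sync.
conn.execute("UPDATE authors SET name = 'Ann Writer' WHERE author_id = 1")

# The relationship is followed with a join from books to authors.
rows = conn.execute("""
    SELECT books.title, authors.name
    FROM books JOIN authors ON books.author_id = authors.author_id
    ORDER BY books.book_id
""").fetchall()
print(rows)
```

Had the author's name been copied into every book row, the same change would have required updating each copy, with the risk of missing one.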

The table is the fundamental data structure of a relational database. Tables represent the entities and relationships of the conceptual schema. A table consists of records (rows or tuples) and fields (columns or attributes):

  • Each record represents an instance (or occurrence) of the entity/relationship
  • Each field represents an attribute of the entity/relationship

For each field, a domain (data type) is identified: alphanumeric, numeric, date, Boolean, and so on.
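
A small sketch of records, fields, and domains, again using sqlite3 with an in-memory database. The books table and its columns are hypothetical. Note one SQLite-specific assumption: it has no dedicated date or Boolean storage types, so the example stores dates as ISO-8601 text and Booleans as 0 or 1; other database systems provide native types for these domains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows reading fields by column name

conn.execute("""
    CREATE TABLE books (
        title     TEXT,     -- alphanumeric domain
        pages     INTEGER,  -- numeric domain
        published TEXT,     -- date, stored here as ISO-8601 text
        in_print  INTEGER   -- Boolean, stored here as 0 or 1
    )
""")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [("First Book", 320, "2018-04-30", 1), ("Second Book", 150, "2020-01-15", 0)],
)

records = conn.execute("SELECT * FROM books ORDER BY title").fetchall()
field_names = records[0].keys()

print(len(records))         # number of records (rows)
print(field_names)          # the fields (columns) of the table
print(records[0]["pages"])  # one field of one record
```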

The set of fields whose values uniquely identify a record within a table is called a primary key. When the primary key consists of only one field, it is called a key field. The following diagram shows an example of a primary key in a database:

When no key field can be found among the attributes of an entity, an auto-incrementing numeric ID field (a counter) is defined.
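
The following sketch shows an auto-incrementing key field in SQLite through Python's sqlite3 module. The authors table and the sample names are hypothetical. The database assigns the ID values itself, and an attempt to reuse an existing key value is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- counter used as key field
        name      TEXT NOT NULL
    )
""")

# No author_id is supplied: the database generates it.
first_id = conn.execute("INSERT INTO authors (name) VALUES ('A. Writer')").lastrowid
second_id = conn.execute("INSERT INTO authors (name) VALUES ('B. Writer')").lastrowid

# A primary key must identify a record uniquely, so a duplicate is refused.
try:
    conn.execute(
        "INSERT INTO authors (author_id, name) VALUES (?, 'C. Writer')", (first_id,)
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(first_id, second_id, duplicate_rejected)
```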

Referential integrity is a set of rules of the relational model that guarantees data integrity when tables are related to one another through foreign keys. These rules are used to validate associations between tables and to prevent errors when linked data is inserted, deleted, or modified.
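
A minimal sketch of referential integrity in SQLite, using hypothetical authors and books tables. One SQLite-specific detail: foreign key enforcement is off by default and has to be enabled per connection with PRAGMA foreign_keys = ON. Other database systems typically enforce declared foreign keys without such a switch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER NOT NULL REFERENCES authors(author_id)  -- foreign key
    );
    INSERT INTO authors VALUES (1, 'A. Writer');
    INSERT INTO books VALUES (1, 'First Book', 1);
""")

def is_rejected(sql):
    """Run a statement and report whether an integrity rule blocked it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Inserting a book that points to a nonexistent author is refused.
orphan_insert_rejected = is_rejected("INSERT INTO books VALUES (2, 'Orphan', 99)")

# Deleting an author who still has books is refused as well.
parent_delete_rejected = is_rejected("DELETE FROM authors WHERE author_id = 1")

print(orphan_insert_rejected, parent_delete_rejected)
```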

Indexes play an important role in a database. An index is a data structure designed to reduce data search times. Fields in a table that are used in searches or join operations can be indexed. In the absence of an index, the search for a field value proceeds sequentially through the records in the table. The database typically generates indexes automatically for fields defined as keys.
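
The difference between a sequential search and an indexed search can be observed with SQLite's EXPLAIN QUERY PLAN statement. The sketch below assumes an in-memory database; the books table, the index name idx_books_title, and the query_plan helper are hypothetical. The exact wording of the plan varies between SQLite versions, so the example prints it rather than relying on a specific string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [(f"Book {i}",) for i in range(1000)],
)

def query_plan(sql):
    """Return the human-readable plan; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT book_id FROM books WHERE title = 'Book 500'"

plan_before = query_plan(query)  # no index on title: the table is scanned
conn.execute("CREATE INDEX idx_books_title ON books (title)")
plan_after = query_plan(query)   # the planner can now search via the index

print(plan_before)
print(plan_after)
```

An index speeds up reads at the cost of extra storage and slightly slower writes, so it is usually reserved for fields that are actually searched or joined on.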