Extracting information from data has long been studied in statistics. In computer engineering, 'data mining' is the process of extracting meaningful information from data using machine learning. In general, information is easier to extract from data that share a common category and similar characteristics. How the data are created (i.e., collected) is therefore crucial for effective data analysis.
Figure 1: Data Mining (Designed by macrovector / Freepik(http://www.freepik.com))
Data for Machine Learning
Data creation is a vital step in developing machine learning models. If a problem occurs during data collection, the model may not train properly. The process therefore requires reliable data, meaning data created in the same environment by expert groups. If data with unclear sources and many errors are used for training, the model will still train, but it may make wrong predictions. In addition, data collected locally within a common category are better suited to model training than data collected widely with diverse characteristics. To make use of widely collected data, one option is to group them into clusters through unsupervised learning and use each cluster as data. When collecting data in the materials field, it is advisable to treat a material’s structure, direction, and composition as restrictions.
Data to be used for machine learning must be standardized during preprocessing. Data standardization here means converting data values into a specific range so that training is easier. In general, values are rescaled to the range from 0 to 1 (often called min-max normalization). Otherwise, training may get stuck locally, or in extreme cases may not proceed at all. Furthermore, as the number of properties (columns) grows, the curse of dimensionality can degrade model performance. This problem can be mitigated by running a correlation analysis between properties and removing those that are highly correlated, which reduces the dimension. Lastly, influential properties can be selected by examining the association between each property and the results.
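The two preprocessing steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the property names (`density`, `mass`, `band_gap`), the toy values, and the 0.95 correlation threshold are hypothetical and not taken from any real data set.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical toy data: three properties measured on five samples.
raw = {
    "density": [2.0, 4.0, 6.0, 8.0, 10.0],
    "mass": [20.0, 40.0, 60.0, 80.0, 100.0],  # perfectly correlated with density
    "band_gap": [1.1, 0.3, 2.5, 0.9, 1.7],
}
scaled = {name: min_max_scale(values) for name, values in raw.items()}
reduced = drop_correlated(scaled)
print(sorted(reduced))
```

In this toy case `mass` is dropped because it duplicates the information in `density`. The threshold is a judgment call that depends on the data.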
Handling data for preprocessing requires knowledge of data structures and types. By structure, data can be divided into structured and unstructured data. Structured data can be arranged in rows and columns like a table. Unstructured data have no such structure; text is one example. Unstructured data must be converted into a structured form before analysis. Structured data can be further divided into vector, matrix, array, data frame, and list according to the structure type. Vectors, matrices, and arrays hold values of a single type and differ only in dimension, while data frames and lists may hold values of different types. Here, a value type means, for example, a number or a character. Social engineering mainly uses table-like data frames in which each property can have a different value type. In the materials field, skilled data frame handling is needed in order to use varied data.
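As a small illustration of turning unstructured text into a row-and-column structure, the sketch below builds a word-count table from two short texts. The example sentences are hypothetical, and word counting is only one simple way to structure text.

```python
from collections import Counter

# Hypothetical unstructured input: short free-text descriptions.
documents = [
    "oxide thin film with high thermal stability",
    "thin film oxide for thermoelectric use",
]

# Count words per document, then fix a shared column order (the vocabulary).
counts = [Counter(doc.split()) for doc in documents]
vocabulary = sorted(set(word for c in counts for word in c))

# Each row is one document, each column one word: a structured table.
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```

Every row now has the same columns, so the texts can be compared and analyzed like any other table.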
Training and Test Data for Model Development
The data for model development should be separated into training data and test data so that the two do not overlap. Training data are used for model training, while test data are used for model evaluation. In general, data are divided in a 7:3 or 8:2 ratio, and the result values should be evenly distributed across both sets. When data labels are uneven, a 9:1 ratio is sometimes used. With imbalanced data, values with a low ratio are insufficiently learned. It is therefore common to apply balancing when imbalanced data are split into training data and test data. Over-sampling and under-sampling make the label counts less lopsided. Over-sampling increases the number of samples in a label that has few samples. Under-sampling decreases the number of samples in a label that has many samples, down to the count of the smaller label. If the result values are continuous rather than categorical like a label, they can be sorted in order and then divided into training data and test data so that neither set is lopsided.
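The split and the balancing step can be sketched as follows. This is a minimal illustration, not a definitive procedure: the function names, the 90/10 toy labels, and the choice to over-sample only the training set (so that duplicated samples never reach the test set) are assumptions made here.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split so that each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_ratio))
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test

def random_oversample(pairs, seed=0):
    """Duplicate minority-label samples until every label matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in pairs:
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced += [(s, label) for s in group + extra]
    return balanced

# Hypothetical imbalanced data: 90 samples of label "A", 10 of label "B".
samples = list(range(100))
labels = ["A"] * 90 + ["B"] * 10

train, test = stratified_split(samples, labels, test_ratio=0.2)
train_balanced = random_oversample(train)
print(len(train), len(test), len(train_balanced))
```

Under-sampling would follow the same pattern, except that the larger label is randomly cut down to the size of the smaller one instead of the smaller one being duplicated.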
Figure 2: Train/test data, Sampling 
Batch Method for Model Training
In model training, the batch method determines how the error between predicted and actual values is applied in each update. There are three types: stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. First, stochastic gradient descent calculates the error for a single entity and applies it to the update. Because updates are made frequently, convergence may be faster. However, the large number of updates is time consuming, and the method may run into local minima problems. Second, batch gradient descent calculates the error for every entity, averages it over all N entities, and applies the average to the update. It requires fewer updates and converges more stably than stochastic gradient descent. However, each update becomes more expensive because all errors must be computed. Third, in mini-batch gradient descent, the most widely used type, the training data are divided into batches of a given batch size, and the average error of each batch is applied to the update. A small batch size speeds up convergence but may introduce noise into training. A large batch size estimates the error slope (gradient) from more data but slows convergence. Therefore, a suitable batch method should be selected depending on the data and the experimental environment.
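The three types differ only in how many entities contribute to each update, which the sketch below makes explicit through a single `batch_size` parameter. It fits a simple line by gradient descent on the mean squared error. The toy data, learning rate, and epoch count are assumptions chosen for illustration.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=1000, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    batch_size = 1        -> stochastic gradient descent
    batch_size = len(xs)  -> batch gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient of the squared error over the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]

results = {}
for size in (1, 4, len(xs)):
    results[size] = gradient_descent(xs, ys, batch_size=size)
    w, b = results[size]
    print(f"batch_size={size:2d}  w={w:.3f}  b={b:.3f}")
```

With 20 entities, a batch size of 1 makes 20 updates per pass over the data, a batch size of 4 makes 5, and a batch size of 20 makes a single update.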
Material Application Case with Unstructured Data
Let us look at a research case that applies unstructured data in the materials field. The study predicts thermoelectric materials using text from a large number of papers. It collects about 3.3 million abstracts from the scientific literature, applies natural language processing, and uses a vocabulary of about 0.5 million words. The words are embedded into 200-dimensional vectors and classified by material application field. The study thus took a vast amount of unstructured data and, through data handling, generated structured data by quantifying the relationships between words that appear together in the literature. In my view, it produced meaningful results because it selected materials whose vectors lie close to that of the application field.
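The idea of selecting materials by vector closeness can be sketched as a ranking over word vectors. The sketch below assumes cosine similarity as the closeness measure, which is one common choice for word embeddings. The 3-dimensional vectors and the names `material_A` to `material_C` are made up for illustration and do not come from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings; real ones would have 200 dimensions
# and come from a trained model, not be typed in by hand.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "material_A": [0.8, 0.2, 0.1],
    "material_B": [0.1, 0.9, 0.3],
    "material_C": [0.5, 0.5, 0.5],
}

query = embeddings["thermoelectric"]
candidates = [name for name in embeddings if name != "thermoelectric"]

# Rank candidates from most to least similar to the application word.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

The candidates at the top of the ranking are the ones whose vectors point in nearly the same direction as the application word.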
Figure 3: Word2vec skip-gram and analogies 
Data Preparation for Research on Materials
To develop machine learning models effectively, local data rather than massive data should be used for training. In model development, attention often goes to the choice of algorithm and parameter optimization, or to increasing the amount of data, in order to reduce errors and avoid poor model performance. However, problems in the data themselves, starting with how the source data are created, must also be considered. Notably, local and reliable data are needed, rather than broad and massive data. In summary, data are as important as machine learning algorithms for efficient AI.
Jeong Rae Kim
Korea Institute of Science and Technology, Computational Science Research Center, Post-Doctor