Analysis of Earthquake Activity in Indonesia by Clustering Method

Abstract: Indonesia lies where three large tectonic plates meet, namely the Indo-Australian, Eurasian and Pacific plates, and is therefore classified as earthquake-prone: 11,660 earthquake events were recorded in the Meteorology, Climatology and Geophysics Agency (BMKG) database in 2019. The purpose of this study is to classify the distribution of earthquakes in Indonesia in 2019 based on magnitude, depth, and position. The study uses clustering based on the K-Means algorithm, with the DBSCAN algorithm as a comparison. The results show that K-Means performs better on the earthquake data, with a silhouette index of 0.837, compared with 0.730 for DBSCAN.


Introduction
Indonesia is located at the junction of the Pacific, Eurasian, and Indo-Australian tectonic plates, which continue to move actively, so it is prone to earthquakes caused by the release of seismic waves in the rocks of the earth's crust (Halim & Widodo, 2017; Kurniati et al., 2021; Sari et al., 2012). Another trigger for earthquakes in Indonesia is the activity of the volcanoes surrounding the Indonesian archipelago (Murdiaty et al., 2020). The Meteorology, Climatology and Geophysics Agency has recorded as many as 400 earthquakes per month, and in 2019 a total of 11,660 earthquakes were recorded on the Earthquake Repo site (Kurniati et al., 2021).
An earthquake is a shaking of the earth's surface caused by a sudden release of energy that creates seismic waves in the rocks of the earth's crust (Bahri & May, 2019). Earthquake events are recorded by location (latitude and longitude), depth, and strength (magnitude) (Akbar et al., 2018). As of 2019, technology could not yet predict exactly where and when an earthquake would occur, although earthquake-prone locations and earthquake impacts have been mapped based on the strength recorded by seismographs (Bahri & May, 2019).
The large number of earthquakes recorded every year can serve as the basis for analyzing earthquake distribution with data mining, a method for processing large-scale data to obtain new information that is easy to understand (Reviantika et al., 2020). Data such as the location of the earthquake, its depth, and its strength have been used as data mining objects in many relevant studies (Ismail, 2021; Reviantika et al., 2020).
Earthquakes have been analyzed with various methods to determine their distribution, the areas prone to them, and their impact. One approach is area classification: Ismail (2021) states that earthquake areas can be classified using a random forest algorithm based on earthquake events described by coordinates (latitude, longitude), depth, and magnitude (the seismic energy of the earthquake), with an accuracy of 99.97%. Other methods used in earthquake analysis are K-Means clustering (Reviantika et al., 2020; Murdiaty et al., 2020) and Business Intelligence methods (Akbar et al., 2018).
K-Means is an algorithm in the clustering family of data mining methods (Reviantika et al., 2020). Analysis with K-Means groups the earthquake distribution into classes, so this study analyzes the distribution of earthquakes in Indonesia in 2019 based on earthquake location, depth, and strength. In addition, this study compares clustering using the K-Means algorithm with DBSCAN in terms of the silhouette index. The silhouette index is a statistical measure used to choose the optimal number of clusters, and it can be displayed graphically to show how accurately each object is placed in its cluster (Nicolaus et al., 2016).

Experimental
This study uses real-time earthquake data for Indonesia in 2019 obtained from the Meteorology, Climatology and Geophysics Agency (BMKG) database. The data consist of latitude, longitude, magnitude, and depth. The analysis uses the K-Means and DBSCAN algorithms. The stages of the research are as follows: the data are prepared in .csv form and checked for missing values, so that records that carry no information are avoided; the number of clusters is then determined, and the centroids are determined and visualized (Figure 1).
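The preparation stage described above (reading the .csv data and checking for missing values) can be sketched as follows. This is a minimal illustration using only the Python standard library; the column names `lat`, `lon`, `depth` and `mag` and the sample rows are assumptions made for the example, not the actual BMKG export format.

```python
import csv
import io

# Hypothetical column names; the real BMKG export may use different headers.
FEATURES = ("lat", "lon", "depth", "mag")

def load_features(handle):
    """Read the four attributes and separate complete rows from incomplete ones."""
    complete, incomplete = [], []
    for row in csv.DictReader(handle):
        try:
            complete.append(tuple(float(row[name]) for name in FEATURES))
        except (KeyError, TypeError, ValueError):
            # A missing column, an empty cell, or non-numeric text ends up here.
            incomplete.append(row)
    return complete, incomplete

# Made-up sample rows, used only to exercise the function.
sample = io.StringIO(
    "lat,lon,depth,mag\n"
    "-8.2,115.1,10,4.5\n"
    "-2.1,,35,3.9\n"
    "1.4,126.4,120,5.1\n"
)
rows, skipped = load_features(sample)
print(len(rows), "complete rows;", len(skipped), "rows with missing values")
```

With a real file, `open("data.csv", newline="")` (file name hypothetical) would take the place of the `io.StringIO` sample. Whether incomplete rows are dropped or filled in by imputation is a separate choice.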

Results and Discussion
Clustering is an unsupervised learning method, in the partition-clustering category, that groups data according to their level of similarity (Humairah & Rasyidah, 2020). This method is used because it is efficient: redundant variables can be removed using correlation, and no target variable is required. The basic principle of clustering is to maximize the similarity between members of one cluster and minimize the similarity between members of different clusters. Clustering can also group data based on the level of similarity and the level of accuracy (Kurniati et al., 2021; Syakur et al., 2018). The distance between data points is determined using a distance equation; the formulas for the distance between two points in one, two, and three dimensions are given in equations (2) to (4), respectively (Siregar, 2018).
In this study, the data were obtained from the BMKG database for 2019, with a total of 11,660 earthquake records. The first five records are shown in Table 1. From Table 1, feature selection is carried out, keeping only the latitude, longitude, magnitude, and depth attributes as the data to be analyzed. The data are then cleaned using imputation, so that missing values do not affect the machine learning process, as shown in Figure 2.
Imputation is used to estimate a parameter of a data distribution and remains widely used when new tests are evaluated (Dempster & Rubin, 1997). The estimation method is an alternative to least squares that maximizes the likelihood (or log-likelihood) function. Given the probability (likelihood) function L of the linear model, the maximum likelihood method estimates the parameters $\beta$ and $\sigma$ by finding the values of $\beta_0$, $\beta_1$ and $\sigma$ that maximize L (Dempster & Rubin, 1997).
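The distance between two points mentioned earlier in this section can be written out explicitly. The block below assumes the usual Euclidean distance used with K-Means; the coordinate symbols $x$, $y$, $z$ and the subscripts 1 and 2 for the two points are chosen here for illustration.

```latex
\begin{align*}
\text{one dimension:} \quad
  d &= \sqrt{(x_2 - x_1)^2} = |x_2 - x_1| \\
\text{two dimensions:} \quad
  d &= \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \\
\text{three dimensions:} \quad
  d &= \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}
\end{align*}
```

The same pattern extends to the four attributes used in this study by adding one squared difference per attribute under the square root.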

Figure 2. Data cleaning results
Based on Figure 2, feature selection and data cleaning were carried out successfully, as indicated by the solid black block for each attribute. The next step is to calculate the correlation between attributes, as shown in Figure 3, in order to determine the relationships between the variables. The highest correlation in Figure 3 is 0.3, which shows that the correlations are still weak, so the data are normalized. Normalization rescales the data so that the values fall within the expected range. Normalization is done using equation (6).
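The correlation and normalization steps can be illustrated with a short sketch. Two assumptions are made here: the correlation is taken to be the Pearson coefficient, and the normalization is taken to be min-max scaling to the range [0, 1], since the form of equation (6) is not shown in the text. The sample values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def min_max(values):
    """Assumed normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

# Made-up depth (km) and magnitude values, for illustration only.
depth = [10.0, 35.0, 120.0, 300.0, 600.0]
mag = [4.5, 3.9, 5.1, 4.8, 5.6]

r = pearson(depth, mag)
scaled_depth = min_max(depth)
print("correlation:", round(r, 3))
print("scaled depth:", [round(v, 3) for v in scaled_depth])
```

Because min-max scaling is a linear rescaling, it changes the range of each attribute but leaves the Pearson coefficients unchanged. Its practical effect is to keep an attribute with a large range, such as depth in kilometres, from dominating the distance calculation.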
A 3-D visualization of the latitude, longitude, magnitude, and depth data is presented in Figure 4, with Figure 4a for latitude and Figure 4b for longitude. Figure 4 shows that the earthquake data fall into three groups. After the distribution has been identified, the initial centroid values are chosen at random. These are used to compute the distance matrix, after which objects are grouped and cluster membership is assigned according to the minimum distance to a centroid. The process is then iterated to obtain new, better centroid positions, as shown in Figure 5. Comparing Figure 4 and Figure 5 shows that earthquake activity in Indonesia in 2019 was most common in cluster 1, at depths of 0 km to 90 km, and in cluster 2, at depths of 90 km to 300 km. Fewer events were recorded in cluster 3, at depths of 300 km to 700 km. This is consistent with the results of the conference held by ITB in 2021, presented in Figure 6.
In general, earthquakes occurred in almost all parts of Indonesia. They are generated mainly by faults and by collisions in the earth's subduction zones. Faults cause discontinuities in the rock, so that a shift occurs. The larger the area of the rock shift, the greater the resulting magnitude.
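The clustering procedure described in this section (random initial centroids, assignment of each object to the nearest centroid, and repeated centroid updates) can be sketched as follows. This is a plain standard-library illustration of K-Means, not the implementation used in the study; the function name `kmeans`, the fixed random seed, and the sample (depth, magnitude) pairs are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means: random initial centroids, then assign and update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn at random from the data
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # centroids no longer move: converged
        centroids = updated
    return centroids, labels

# Made-up (depth in km, magnitude) pairs, for illustration only.
points = [
    (10.0, 4.2), (35.0, 3.8), (60.0, 5.0),
    (120.0, 4.6), (180.0, 5.3), (250.0, 4.1),
    (410.0, 5.5), (520.0, 4.9), (640.0, 5.8),
]
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

The grouping found depends on the random initial centroids, which is why implementations commonly repeat the procedure from several starting points and keep the best result. In this unscaled sample the depth values dominate the distance, which is one reason to normalize the attributes first.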
The distribution of earthquake data in 2019 belongs to a phase with its own characteristics, because large earthquakes averaging approximately 7.3 Mw had been confirmed before 2019. These include the 2004 Aceh earthquake at 9.2 Mw (Meltzner et al., 2006), the 2006 Yogyakarta earthquake at 6.2 Mw (Sarah & Soebowo, 2013), the Lombok earthquake series, with 6.2 Mw in 2016, an earthquake of scale II-III in 2017, and 7.0 Mw in 2018 (Kencanawati et al., 2020), and the 2018 Palu earthquake at 7.5 Mw (Mason et al., 2021). These events are therefore used as a factor in determining the distribution of the 2019 earthquakes.
The clustering of the earthquake distribution yields the centroid values presented in Table 2. When plotted on a map, the average centroid lies on the Eurasian plate, marked with a blue circle in Figure 7.

Figure 7. Earthquake centroid position based on clusterization results
The percentage of variance explained in the earthquake distribution is expressed as a function of the number of clusters. The first clusters add a great deal of information, but at some point the gain drops and the curve forms an angle (the elbow); see Figure 8. This is consistent with the data plots generated for each attribute. The number of clusters obtained from the break in the curve using the Elbow method in Figure 8 is k = 3, which is the optimal number of clusters for the 2019 earthquake distribution data (Bhoowalia & Kumar, 2014; Marutho et al., 2018). The resulting cluster assignments for the first 10 records are shown in Table 3.
The silhouette coefficient is a statistical measure used to choose the optimal number of clusters (Nicolaus et al., 2016). It can be computed for both the K-Means algorithm and the DBSCAN algorithm, as shown in Figure 9 and Figure 10. Based on the analysis, the silhouette value is 0.867 for the K-Means algorithm and 0.730 for the DBSCAN algorithm. The value for K-Means is thus higher than that for DBSCAN, although both are large. Therefore, k = 3 is the most suitable number of clusters for the earthquake distribution.
Overall, testing the 2019 Indonesian earthquake data using clustering and the Elbow method was appropriate. The K-Means clustering process uses the Elbow method to determine the best value of k. The resulting clusters are labeled to make it easier to delineate each cluster area according to the characteristics of each attribute. Performance was tested on the 11,660 Earthquake Repo records. The Sum of Square Error shows its largest decrease at k = 3, as can be seen in Figure 8.
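The Sum of Square Error used by the Elbow method can be stated explicitly. The block below gives the standard definition, with notation chosen here for illustration: $C_j$ is the set of records assigned to cluster $j$, $\mu_j$ is its centroid, and $x_i$ is the attribute vector of record $i$.

```latex
\[
\mathrm{SSE}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

The SSE generally decreases as $k$ grows, so the Elbow method looks for the value of $k$ after which the decrease becomes markedly smaller rather than for the minimum itself.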
This test evaluates the performance for each number of clusters within the range of values considered in the Elbow method. Figures 5 and 7 show that earthquake magnitudes are spread across the depth range of each cluster. The result is also supported by the silhouette index of the K-Means algorithm, compared with that of the DBSCAN algorithm.
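The silhouette index used for this comparison can be computed for any cluster labelling, whether it comes from K-Means or from DBSCAN. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The sketch below is a standard-library illustration under the assumption of Euclidean distance; the function name `silhouette_index` and the sample values are made up for the example.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so only the divisor excludes p).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up one-dimensional values with an obvious two-group structure.
values = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_index(values, [0, 0, 1, 1])  # groups match the structure
poor = silhouette_index(values, [0, 1, 0, 1])  # groups cut across it
print(round(good, 3), round(poor, 3))
```

Values close to 1 indicate that points sit well inside their own cluster, and values near or below 0 indicate overlapping or misassigned clusters. How the noise points produced by DBSCAN are treated (excluded, or counted as a cluster of their own) is not specified in the text and affects the resulting value.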

Conclusion
A total of 11,660 earthquake events were recorded in Indonesia in 2019, according to BMKG data from January 1, 2019 to December 31, 2019. The distribution of earthquakes was analyzed based on four attributes, namely latitude, longitude, depth, and magnitude. The analysis found that the correlation between attributes is weak, so the data were normalized. This study also presents a 3-D visualization of the earthquake distribution and the centroid areas obtained with the K-Means algorithm for k = 3. The cluster areas are distinguished by colors assigned according to the minimum distance from each earthquake point to a centroid. Based on the average centroid of each attribute, the center of the 2019 earthquake distribution lies on the Eurasian plate. The choice of k = 3 is also supported by the Elbow method, which indicates it as the appropriate value among those tried. The clustering results also show that K-Means performs better on the earthquake data, with a silhouette index of 0.837, compared with 0.730 for the DBSCAN algorithm.