Classification of Tweets for Video Streaming Services’ Content Recommendation on Twitter

Streaming services are popular platforms frequently visited by internet users. However, the abundance of content can overwhelm users, prompting them to seek recommendations from other people. Some users look for content to watch with the help of Twitter. However, the search results often include irrelevant tweets whose text is unrelated to the content available on streaming platforms. This study addresses the classification of relevant and irrelevant tweets for streaming services’ content recommendation using random forests and a Convolutional Neural Network (CNN). The results show that the CNN performed better on the test set, with a higher accuracy of 94%, but was slower in running time than the random forest. The two categories of tweets showed distinctive characteristics. Finally, based on the resulting classification, users can identify the words to use and the words to avoid when searching on Twitter.


Introduction
Streaming services have been on the rise recently. Each month, 59% of internet users aged 16 to 64 who own a technology device watch television content via streaming platforms [1]. As of September 2020, eleven video streaming services were operating in Indonesia [2]. Two of the most popular streaming services in Indonesia are Netflix and Disney+, which offer various content such as movies, series, and animation.
However, because of the vast collection offered on these platforms, some subscribers cannot decide what content they want to watch. This leads them to ask their colleagues for recommendations, and some end up browsing for recommendations from strangers on social media such as Twitter. In Indonesia, Twitter is one of the most visited platforms, with monthly traffic of more than 90 million and 56% of internet users actively using the platform for social media activity [1].
When looking for a recommendation, people usually type the name of the platform and the type of content they are interested in into the Twitter search box.
They type keywords such as "Netflix movie", "Netflix series", or "Disney movie recommendation", and the results are tweets containing titles of movies that are usually popular in their region. However, the results often include nonrelevant items, such as tweets offering subscriptions to the streaming platform. Such tweets encourage users to subscribe to a certain platform, but they are unwelcome to people who expect relevant results (a movie synopsis or a movie title recommendation).
People subscribe to a particular streaming service for various reasons, such as option availability, social trends, and subscription fees [3]. Because few Indonesians have a suitable method for paying the subscription, there has been a rise in Twitter accounts offering to help users subscribe to these services. However, these accounts can be distracting for people looking for content recommendations on Twitter, because many of the tweets matching the search keywords are nonrelevant.
This study aims to identify the tweets that are relevant for finding content recommendations for the streaming services Disney and Netflix on Twitter.
The tweets are classified into relevant and nonrelevant tweets. Several methods are commonly used for classification problems, with random forest and convolutional neural network (CNN) among the most popular. Random forest is an ensemble of multiple decision trees, while CNN is a neural network built on convolutional layers. Previously, random forest was used in research on movie sales prediction in Korea, which yielded an analysis of the factors related to a movie's success [4]. Another study showed that random forest outperformed naïve Bayes in sentiment classification for movie recommendation [5].
Although CNN is more commonly associated with deep learning for image analysis, it showed better results than the Backpropagation Neural Network (BNN) in classifying the sentiment of tweets about the government of Surabaya [6].
In this study, both random forest and CNN are used to classify relevant and irrelevant tweets for streaming services' content recommendation on Twitter. A clear distinction is expected between tweets from the two categories. The characteristics of the tweets will serve as a reference for future Twitter searches.

2.1.
Streaming service provider. Streaming is the process of listening to or watching audio or video from the internet without needing to download the content [7]. A streaming service provider is a system offering online streaming access, usually by subscription, to content such as movies, series, or animations. Subscribers can play the content on their devices, such as a computer, phone, or smart TV.

2.2.
Text mining. Text mining is the process of extracting information from unstructured data in documents. It is essentially similar to data mining; the main difference lies in the preprocessing stage, where the unstructured data must be transformed into a more structured format [8]. The general preprocessing steps are removing line breaks, links, numbers, usernames, punctuation, and stopwords, followed by case folding and tokenizing. The extracted words are then weighted using Term Frequency-Inverse Document Frequency (TF-IDF) weighting, in which t, d, n, and N denote the term, the document, the number of terms, and the number of documents, respectively.
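The preprocessing and weighting steps described above can be illustrated with a minimal sketch. It assumes one common TF-IDF variant, tf-idf(t, d) = (count of t in d / number of terms in d) × log(N / number of documents containing t); the exact variant used in this study may differ. The function names `preprocess` and `tf_idf` and the short stopword list are hypothetical and serve only as illustration.

```python
import math
import re
import string
from collections import Counter

# Hypothetical stand-in stopword list, used only for illustration.
STOPWORDS = {"di", "yg", "yang", "ada", "dan", "ini", "itu"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet):
    """Clean one tweet and return its list of tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # remove links
    text = re.sub(r"@\w+", " ", text)                     # remove usernames
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = text.translate(PUNCT_TABLE)                    # remove punctuation
    text = text.lower()                                   # case folding
    tokens = text.split()                                 # tokenizing (also drops line breaks)
    return [tok for tok in tokens if tok not in STOPWORDS]


def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


tweets = [
    "Rekomendasi film Netflix yang seru: https://example.com @teman",
    "Jual akun Netflix murah 1 bulan",
]
docs = [preprocess(t) for t in tweets]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy example the term "netflix" occurs in both tweets, so its inverse document frequency is log(2/2) = 0 and it receives zero weight, while terms unique to one tweet receive positive weights.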

2.3.
Random forest. Random forest is a supervised learning method built as an ensemble of decision trees [10]. The trees in a random forest are grown with random inputs and features. The method is effective for prediction problems and for improving model accuracy. For classification, the random forest algorithm is as follows.
1. Repeat the following steps K times:
a. Draw a bootstrap sample of size n from the N training data.
b. Train a decision tree on the bootstrap sample using a random subset of the M available features.
c. Predict the test set with the tree from the previous step.
2. Produce the final prediction by combining the K classification results using majority vote.
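The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in this study: each tree is replaced by a depth-one tree (a decision stump) to keep the example short, and the function names, the toy feature values, and the two hypothetical features (weights for the terms "rekomendasi" and "jual") are invented for the example.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def train_forest(rows, labels, n_trees, n_features, seed=0):
    """Step 1: grow n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]  # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)         # random feature subset
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Step 2: combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data: two hypothetical term weights per tweet.
X = [[0.9, 0.0], [0.8, 0.1], [0.7, 0.0], [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]]
y = ["relevant"] * 3 + ["nonrelevant"] * 3
forest = train_forest(X, y, n_trees=25, n_features=1)
print(predict_forest(forest, [0.9, 0.0]))
print(predict_forest(forest, [0.0, 0.9]))
```

A full random forest would grow deeper trees and usually draw a fresh feature subset at every split; the stump version keeps only the three ingredients named in the algorithm: bootstrap sampling, random feature selection, and majority voting.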

2.4.
Convolutional neural network (CNN). CNN is a method commonly used for image analysis. Its advantage over other neural network-based methods is that it transforms the data into input that is easier to process. In text mining, the input takes the form of a matrix built from a sentence [11]. In general, a CNN consists of a feature extraction stage and a classification stage. Feature extraction transforms the complex input into a simpler representation, using convolution layers and pooling layers, the latter reducing the dimension of the parameters. Popular activation functions include the sigmoid function, the rectified linear unit (ReLU), and the parametric ReLU.
In the classification stage, the output of the previous steps is processed in fully connected layers. The outputs are stored in an N-dimensional vector containing the probabilities of the N classes [12]. The training process is governed by the batch size and the number of epochs, with the logistic sigmoid as the activation function in the dense layer.
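To make the two stages concrete, the following sketch runs a single forward pass of a toy text CNN in plain Python: a convolution over a sentence matrix with ReLU, max pooling, and a dense unit with a sigmoid output. All weights, dimensions, and function names are hypothetical; the sketch shows the data flow only and is not the architecture trained in this study.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d(sentence, kernel, bias=0.0):
    """Slide a kernel (width x embedding_dim) down the sentence matrix and apply ReLU."""
    width = len(kernel)
    feature_map = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        total = bias + sum(
            w * x
            for kernel_row, word_vector in zip(kernel, window)
            for w, x in zip(kernel_row, word_vector)
        )
        feature_map.append(relu(total))
    return feature_map


def forward(sentence, kernels, dense_weights, dense_bias):
    """Feature extraction (convolution + max pooling), then classification (dense + sigmoid)."""
    pooled = [max(conv1d(sentence, kernel)) for kernel in kernels]  # one value per kernel
    score = dense_bias + sum(w * p for w, p in zip(dense_weights, pooled))
    return sigmoid(score)


# A sentence of four words, each represented by a 2-dimensional vector.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two kernels, each spanning two consecutive words.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
probability = forward(sentence, kernels, dense_weights=[1.0, -1.0], dense_bias=-1.0)
print(conv1d(sentence, kernels[0]))  # feature map of the first kernel
print(probability)                   # probability of the positive class
```

Max pooling reduces each feature map, whose length depends on the sentence length, to a single number, which is what lets the dense layer accept sentences of different lengths.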

2.5.
Evaluation measure. Accuracy, precision, recall, and F-measure are used to evaluate the overall performance of a classifier. They are calculated from the confusion matrix shown in Table 1, in which the rows give the actual class and the columns give the predicted class [13]. Accuracy evaluates the model by estimating the probability that the class label is predicted correctly overall [14].
Precision is the fraction of instances predicted as relevant that are actually relevant, while recall is the fraction of actually relevant instances that are correctly classified. The F-measure is the harmonic mean of precision and recall. All the metrics used in this study follow the definitions in [15].
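As an illustration, the four measures can be computed directly from the cells of a two-class confusion matrix. The sketch below assumes the usual notation TP, FP, FN, and TN (true positives, false positives, false negatives, true negatives) and the standard definitions: accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure = 2 · precision · recall / (precision + recall). The function name and the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to exercise the formulas.
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```

With these made-up counts, accuracy is 85/100 and precision is 40/50, while recall is 40/45; the F-measure lies between precision and recall because it is their harmonic mean.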

Results and Discussion
The data were collected on December 12th, 2020, filtered by language ID (Indonesian), using a total of 20 keywords built from the names of the service providers, limited to Netflix and Disney, and their combinations with terms such as drama, series, and film.
Figure 1 shows the frequent words in each category. Figure 1(a) contains many unrelated terms, and many different words appear at the same size, so no obvious key terms stand out in the cloud. There are 65,031 words listed from both categories. The top ten words are shown in Figure 2, from which it can be inferred that words with no significant meaning lead in frequency, such as di, appearing 1,130 times, yg and yang, appearing over 600 times, and ada, appearing over 500 times. The presence of these words suggests that data preprocessing is needed for the tweets, and those words were included in the list of removed words.
The first stage of preprocessing involved removing links, digits, usernames, punctuation, and emoji, and concluded with case folding. Afterward, the Sastrawi library was used to remove the stopwords from the sentences. The stopwords were obtained from the package.
Tuning the hyperparameters to find the best parameters for the analysis is required [17].
To find the best parameters for the random forest, an initial check of the parameters was carried out using Random Search. Several parameters were considered in this study: the number of trees, the number of features, the number of levels, the minimum number of samples required to split a node, and the resampling method for the data. Each parameter was given several candidate values to be explored in Random Search. From the pool of available parameters listed in Table 3, half of all possible combinations, 240 parameter combinations, were sampled and evaluated. Based on the evaluation metrics in Table 4, the best result was obtained using n_estimator of 700 and min samples split of 5, with an accuracy of 85.62%. The F1 score for this combination was also better than that of the others. These parameters were applied to the test set, resulting in an accuracy of 83.823%. The random forest was generally able to classify the tweets based on their relevance to the content recommendation of streaming services on Twitter.
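The sampling idea behind Random Search can be sketched as follows. The grid values below are hypothetical (the actual choices are those listed in Table 3) and were picked only so that the grid has 480 combinations, half of which is 240. The scoring function is a placeholder for the cross-validated accuracy of a random forest and does not reproduce any measurement from this study.

```python
import itertools
import random

# Hypothetical search space: 5 * 2 * 6 * 4 * 2 = 480 combinations.
PARAM_GRID = {
    "n_trees": [100, 300, 500, 700, 900],
    "n_features": ["sqrt", "log2"],
    "n_levels": [10, 20, 30, 40, 50, None],
    "min_samples_split": [2, 5, 10, 15],
    "bootstrap": [True, False],
}


def random_search(param_grid, evaluate, fraction=0.5, seed=0):
    """Evaluate a random fraction of all combinations and return the best one."""
    names = list(param_grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*param_grid.values())
    ]
    rng = random.Random(seed)
    sampled = rng.sample(combos, int(len(combos) * fraction))  # sampling without replacement
    scored = [(evaluate(params), params) for params in sampled]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_score, best_params, len(sampled)


def placeholder_score(params):
    # Stand-in for a real evaluation; in practice this would train a random
    # forest with `params` and return its cross-validated accuracy.
    return 1.0 / (1.0 + abs(params["n_trees"] - 500))


best_score, best_params, n_evaluated = random_search(PARAM_GRID, placeholder_score)
print(n_evaluated, best_params)
```

Compared with an exhaustive grid search, this halves the number of model fits at the cost of possibly missing the best combination, which is the usual trade-off of randomized search.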

Convolutional Neural Network.
After obtaining the bag-of-words, the words were transformed into feature vectors. The Word2vec technique was used so that each word becomes a vector of weights representing its characteristics. Afterward, an embedding layer was constructed to hold those vectors. At this point, the data were ready for training. The training data comprised 80% of the overall data. To find the best model, two activation functions were compared and evaluated with k-fold cross-validation (CV).
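The data partitioning described above, an 80% training share followed by k-fold cross-validation on that share, can be sketched with index lists alone. The helper names `split_train_test` and `k_fold` and the sample count of 100 are hypothetical; the sketch shows only how the folds are formed, not how the models were trained.

```python
import random


def split_train_test(n_samples, train_fraction=0.8, seed=0):
    """Shuffle the sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint parts
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = split_train_test(100)
for fold_number, (train, validation) in enumerate(k_fold(train_idx, k=10), start=1):
    # A model would be fitted on `train` and scored on `validation` here.
    print(fold_number, len(train), len(validation))
```

Each training index serves as validation data exactly once, so averaging the k validation scores gives an estimate of performance that does not depend on a single split, while the held-out test indices remain untouched until the final evaluation.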
The results are shown in Table 5. Table 5 shows that the best model used ReLU as its activation function and was evaluated using 10-fold CV, as indicated by its accuracy (72.95%), which is higher than that of the others. The model summary is shown in Table 6. The best model was then applied to the test data and achieved an accuracy of 94%, which was higher than the accuracy on the training set. The CNN was thus able to classify the tweets, but the gap indicates that the model was underfitting. The model summary of CNN on the testing data is presented in Table 7.

Conclusion
This research concluded that, on the test set, CNN achieved higher accuracy than random forest in classifying the tweets but was slower in running time.
The results suggest that relevant and nonrelevant tweets about streaming services' content recommendation on Twitter are indeed distinct categories. By observing the resulting word cloud, Twitter users can get a general idea of which words to include in a search query and which to avoid when looking for content recommendations on Twitter.
Future research should consider other combinations of CNN parameters, such as learning rate, number of epochs, and batch size, to prevent an underfit or overfit model. It is also important to examine other random forest parameter combinations to obtain better results.