This is the second post in my series on Data Mining and Analysis on Twitter, and an extension of my previous post where I presented the results of clustering Twitter users. In this post, I will present an analysis of how to (try to) predict future mentions on Twitter using different similarity measures as well as the cluster information that we obtained earlier.
First, let us explore the reasons why users might mention each other on Twitter. We begin with the most obvious link that might result in a mention: a social connection between the users (either i follows j or j follows i). If we consider Twitter to be the only mode of communication between the users, then a user will mention another user only when one of them sees the other's tweets. If we also ignore the possibility of a user searching for a keyword or topic and then mentioning a random user who writes something interesting about it, we can assume with very good accuracy that mentions occur on Twitter only when two users are somehow connected. Another factor is whether the two users have mentioned each other in the past. This tends to indicate that they have been having a conversation and therefore share a good relationship on Twitter, which makes them more likely to mention each other again in the future. A further reason for a user to mention another user is that he finds the other user's tweets interesting. This can be captured by the content similarity of the users' tweets and their hash-tags; by using it, we implicitly assume that a user also posts about what he is interested in.
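To make these notions concrete, here is a minimal sketch of how the hash-tag and tweet content similarities between a pair of users could be computed. The exact measures are not spelled out in this post, so the Jaccard and cosine formulations below, and the function names, are illustrative assumptions rather than the actual definitions used.

```python
from collections import Counter
import math

def hashtag_similarity(tags_i, tags_j):
    """Jaccard similarity between the sets of hash-tags used by two users
    (an illustrative choice; the real measure may differ)."""
    set_i, set_j = set(tags_i), set(tags_j)
    if not set_i or not set_j:
        return 0.0
    return len(set_i & set_j) / len(set_i | set_j)

def tweet_content_similarity(tweets_i, tweets_j):
    """Cosine similarity between term-frequency vectors built from all tweets
    of each user (again, an illustrative stand-in for the real measure)."""
    tf_i = Counter(w.lower() for t in tweets_i for w in t.split())
    tf_j = Counter(w.lower() for t in tweets_j for w in t.split())
    dot = sum(tf_i[w] * tf_j[w] for w in set(tf_i) & set(tf_j))
    norm_i = math.sqrt(sum(v * v for v in tf_i.values()))
    norm_j = math.sqrt(sum(v * v for v in tf_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0
```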
Therefore, we now have a reasonable model that correlates future user mentions with the past data we have collected. Building a good model is the first step in future link prediction, a topic of great interest in the research community. However, we limit our scope in this section to establishing a good relationship between mentions and the other similarity measures, which is a starting point towards the more complex link prediction problem. We now present the correlations between our similarity measures (tweet content similarity, user mention similarity and hash-tag similarity), calculated on past data, and the mentions in the future.
The tables below show the complete correlation matrices for the feature vectors obtained for the users that have a social connection between them. For every pair of users with a social connection, we build a feature vector by placing the mentions between the pair in column 1, the hash-tag similarity between them in column 2, the tweet content similarity in column 3, and a weighted combination of these three similarity measures in column 4.
We assign a weight of 5 to the hash-tag similarity measure, as it is one of the factors that should correspond most strongly with future mentions between users on Twitter. The main reason for assigning a high weight here is that hash-tags can be tracked easily by users on Twitter through the search feature or the trending topics. As we discussed earlier, past mentions are another important factor for future mentions, so we assign a weight of 2 to that feature. Tweet content similarity is a feature that users cannot easily track on Twitter, and therefore we assign it the minimum weight.
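As a concrete illustration, a minimal sketch of the weighted combination is given below. The hash-tag and mention weights (5 and 2) come from the text; the tweet content weight of 1 and the normalisation by the total weight are assumptions of this sketch.

```python
# Weights as described above; the exact value of the minimum weight for tweet
# content similarity is not stated, so 1 is assumed here.
W_HASH, W_MENTION, W_TWEET = 5, 2, 1

def combined_similarity(mention_sim, hash_sim, tweet_sim):
    """Weighted combination of the three similarity measures for a user pair.
    Normalising by the total weight is also an assumption of this sketch."""
    total = W_HASH + W_MENTION + W_TWEET
    return (W_MENTION * mention_sim + W_HASH * hash_sim + W_TWEET * tweet_sim) / total
```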
Column 5 contains the classes that we want to compare these features with. These classes are obtained from the mentions in the subsequent week: class ‘1’ for a pair of users that have a mention relationship between them (either i mentions j or j mentions i) and class ‘0’ otherwise. We then find the correlations between the different columns of this matrix. The tables below summarize the correlation coefficients for the different features; the most interesting part of each table is the last column, which shows the correlation between the feature vectors and the future mentions.
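The construction of this matrix and the correlation computation can be sketched as follows, using numpy and the combined_similarity helper from above. The input containers pairs, sims_past and mentions_future are hypothetical names for the data prepared in the earlier steps.

```python
import numpy as np

def correlation_matrix(pairs, sims_past, mentions_future):
    """Build the 5-column matrix (mention, hash, tweet, combined, class) for
    all socially connected pairs and return its Pearson correlation matrix.

    pairs           : list of (i, j) user pairs with a social connection
    sims_past       : dict mapping (i, j) -> (mention_sim, hash_sim, tweet_sim)
                      computed from the past window
    mentions_future : set of (i, j) pairs with a mention in the following week
    (all three containers are hypothetical inputs for this sketch)
    """
    rows = []
    for i, j in pairs:
        mention_sim, hash_sim, tweet_sim = sims_past[(i, j)]
        combined = combined_similarity(mention_sim, hash_sim, tweet_sim)
        label = 1 if (i, j) in mentions_future or (j, i) in mentions_future else 0
        rows.append([mention_sim, hash_sim, tweet_sim, combined, label])
    X = np.array(rows, dtype=float)
    # np.corrcoef treats each row as a variable, so pass the transpose;
    # the result is the 5x5 matrix reported in the tables below.
    return np.corrcoef(X.T)
```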
Table 1 shows the correlations between the feature vectors obtained from the data for week 1 and the mentions occurring in week 2 (where the week boundaries are the same as described here), for the complete set of 11,273 users. As discussed earlier, the most important feature, i.e. past user mentions, has a high correlation with the mention data in the next week. This is as expected, because users who mention each other often are more likely to have a mention relationship in the future than users who have never mentioned each other. The next highest correlation with future mentions among the individual feature vectors is the hash-tag similarity. This is also in line with our expectations, since users who focus on the same keywords are more likely to follow up with a conversation on Twitter, which leads to mentions between them. Therefore, hash-tags are also quite good for modelling future mentions on Twitter. The lowest correlation with future mentions is obtained for the tweet content similarity feature vector. A likely reason is that it is harder to track tweets on Twitter by their content than by hash-tags or screen-names, so tweets that carry one of those features are more likely to start a conversation between two users. The highlighted figure in Table 1 corresponds to the correlation between our combined similarity measure and the future mentions. We see that it is slightly better than the individual features, as it combines information from the two informative sources (the past mentions and the hash-tag similarity).
Table 1: Correlation matrix for features and future mentions for feature vectors for week 1 as compared to mentions in week 2
| W1/W2 | mention | hash | tweet | combined | class |
| --- | --- | --- | --- | --- | --- |
| mention | 1 | 0.0528 | 0.003 | 0.919 | 0.1656 |
| hash | 0.0528 | 1 | 0.0031 | 0.4422 | 0.0565 |
| tweet | 0.003 | 0.0031 | 1 | 0.0134 | 0.0272 |
| combined | 0.919 | 0.4422 | 0.0134 | 1 | **0.1713** |
| class | 0.1656 | 0.0565 | 0.0272 | 0.1713 | 1 |
Table 2 shows the results for the setting where the data from weeks 1 and 2 is compared to the mentions in week 3. We can observe some improvement in the results for the combined data of two weeks as compared to only the first week's data. From this, we can infer that we can improve our results by increasing the amount of data used to compute the feature vectors (one way of pooling the data from several weeks is sketched after Table 2). Note that here the combination of all the feature vectors does not perform as well as the mention feature individually. This is because of the low correlations of the tweet and hash-tag features, which act as noise in this example and therefore lead to a combined feature vector that is worse than the mention feature vector alone.
Table 2: Correlation matrix for features and future mentions for feature vectors for week 1 and week 2 as compared to mentions in week 3
| W12/W3 | mention | hash | tweet | combined | class |
| --- | --- | --- | --- | --- | --- |
| mention | 1 | 0.0976 | 0.0003 | 0.8526 | 0.2186 |
| hash | 0.0976 | 1 | -0.0009 | 0.6033 | 0.0635 |
| tweet | 0.0003 | -0.0009 | 1 | 0.008 | 0.0359 |
| combined | 0.8526 | 0.6033 | 0.008 | 1 | 0.2088 |
| class | 0.2186 | 0.0635 | 0.0359 | 0.2088 | 1 |
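As promised above, here is one way the two weeks of data could be pooled before recomputing the similarities. How the data was actually combined is not stated in the text, so concatenating the raw per-user activity is an assumption, as is the layout of the hypothetical weekly_data container.

```python
from collections import Counter

def pool_weeks(weekly_data, weeks):
    """Merge per-user activity over the given weeks before recomputing the
    similarity measures: tweets and hash-tags are concatenated and mention
    counts are summed. weekly_data maps week -> {user: {'tweets': [...],
    'hashtags': [...], 'mentions': Counter()}} -- a hypothetical layout."""
    pooled = {}
    for week in weeks:
        for user, activity in weekly_data[week].items():
            entry = pooled.setdefault(
                user, {'tweets': [], 'hashtags': [], 'mentions': Counter()})
            entry['tweets'].extend(activity['tweets'])
            entry['hashtags'].extend(activity['hashtags'])
            entry['mentions'].update(activity['mentions'])
    return pooled
```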
However, the results in Table 3, which shows the correlation matrix for feature vectors from weeks 1, 2 and 3 compared to the mentions in week 4, again show an improvement in the correlation for the combined feature vector. An interesting point to note here is that the tweet content similarity feature vector has a negative correlation with the future mentions in week 4, which means it does not correspond well to the mentions in the next week. Even so, the combined feature vector still performs better than past mentions alone, so we can say that combining the feature vectors can lead to better results.
Table 3: Correlation matrix for features and future mentions for feature vectors for week 1, 2 and 3 as compared to mentions in week 4
| W123/W4 | mention | hash | tweet | combined | class |
| --- | --- | --- | --- | --- | --- |
| mention | 1 | 0.1428 | 0.0219 | 0.8912 | 0.1906 |
| hash | 0.1428 | 1 | 0.0193 | 0.5761 | 0.0861 |
| tweet | 0.0219 | 0.0193 | 1 | 0.0343 | -0.006 |
| combined | 0.8912 | 0.5761 | 0.0343 | 1 | 0.1968 |
| class | 0.1906 | 0.0861 | -0.006 | 0.1968 | 1 |
The final correlation results that we show for the future mentions are obtained by calculating the correlation for only a single cluster of users. This group corresponds to the users placed in the same cluster by the Fast Modularity Maximization algorithm applied to the users' social connections. Table 4 shows the correlation matrix for the feature vectors of the users that belong to cluster 1. We can observe that the results are much better than the previous results for the complete group of 11,273 users. This is because mentions are more likely to occur within these communities of users than across them. We can also see that only one week's data here surpasses the previous results that used even more data. Another important observation is that the low correlation of tweets does not add much noise in this case, as the correlations for the other two features are quite high compared to the previous cases. This is another indication of how the cluster results obtained for the group of users can be used to establish relationships between different features on Twitter. (A sketch of this cluster-restriction step follows Table 4.)
Table 4: Correlation matrix for features and future mentions for feature vectors for week 1 as compared to mentions in week 2 calculated for only the users in cluster 1 (as obtained in section 4.3)
| W1/W2 | mention | hash | tweet | combined | class |
| --- | --- | --- | --- | --- | --- |
| mention | 1 | 0.0343 | -0.0062 | 0.7492 | 0.1616 |
| hash | 0.0343 | 1 | -0.0049 | 0.6876 | 0.2192 |
| tweet | -0.0062 | -0.0049 | 1 | -0.0001 | -0.0116 |
| combined | 0.7492 | 0.6876 | -0.0001 | 1 | 0.2625 |
| class | 0.1616 | 0.2192 | -0.0116 | 0.2625 | 1 |
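For completeness, here is a minimal sketch of the cluster-restriction step: filter the socially connected pairs down to those whose users both lie in cluster 1, then rerun the same correlation computation. The cluster_of mapping is a hypothetical container for the Fast Modularity Maximization assignments.

```python
def cluster_pairs(pairs, cluster_of, cluster_id=1):
    """Keep only the socially connected pairs where both users fall in the
    given cluster. cluster_of is a hypothetical dict mapping a user to the
    cluster id assigned by the Fast Modularity Maximization step."""
    return [(i, j) for i, j in pairs
            if cluster_of.get(i) == cluster_id and cluster_of.get(j) == cluster_id]

# The correlation_matrix() sketch shown earlier can then be run on the
# filtered pairs to produce a matrix of the form shown in Table 4.
```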