A Machine Learning Approach to Cluster Destination Image on Instagram

by Joanne Yu & Roman Egger

Full paper published by: Veronika Arefieva, Roman Egger, and Joanne Yu, Tourism Management, https://doi.org/10.1016/j.tourman.2021.104318

Open Access: download the full paper here

Whereas the image of a destination used to be shaped mainly by tourism marketers, visual-centered social media platforms such as Instagram have dramatically changed the tourism industry and the way we travel. As a holiday planning tool, Instagram photos not only hint at how tourists associate with and interpret the destination but also reproduce symbols with socially constructed meanings.

But wait… if you do a simple search for any destination on Instagram, you will usually find thousands, if not millions, of pictures posted by countless users. In our research, we provide a solution for how the destination image can be investigated in such a data-rich environment.

Destination management organizations (DMOs) face the problem that they only know the destination image as they perceive it themselves. They promote their country’s attractions without knowing how tourists communicate about the destination, or what they communicate about. We were therefore interested in the question of what perceived image tourists have of Austria. The aim was to carry out topic analyses based on a clustering of images. In this way, and in combination with the geodata from the Instagram posts, we could see which themes are in vogue, and where in Austria. DMOs can then use this information to identify “white spots”, i.e. themes that are of interest to tourists but are not advertised by the destination simply because nobody was aware of them until now.

In our study, a total of 101,870 posts based on the hashtags #visitaustria, #feelaustria (these are the official hashtags from the destination management organization), and “#travel + #austria” were analysed. The data was preprocessed and all images were labelled using the Google Cloud Vision API, which gave us a textual description of each image. Based on this text and the metadata from each post, we applied geoanalytics and three machine learning techniques to evaluate how tourists associate their experiences on Instagram based on pictorial content: 1) k-means clustering based on a document-term matrix, 2) a correlation explanation (CorEx) topic model based on a document-term matrix, and 3) k-means clustering based on Doc2Vec vectors. Detailed descriptions of each machine learning model follow.
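As a brief aside before the models, the hashtag selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic (either official hashtag, or #travel and #austria together). The function name and the assumption that each post carries a plain list of hashtag strings are hypothetical and not taken from the paper.

```python
def matches_study_hashtags(hashtags):
    """Return True if a post's hashtags satisfy the selection rule.

    Assumption: `hashtags` is a list of strings such as "#VisitAustria";
    matching is case-insensitive and ignores the leading "#".
    """
    tags = {tag.lower().lstrip("#") for tag in hashtags}
    has_official_tag = bool(tags & {"visitaustria", "feelaustria"})
    has_travel_and_austria = {"travel", "austria"} <= tags
    return has_official_tag or has_travel_and_austria


# Hypothetical example posts, each reduced to its hashtag list.
posts = [
    ["#VisitAustria", "#alps"],
    ["#travel", "#austria", "#lake"],
    ["#travel", "#italy"],
]
selected = [tags for tags in posts if matches_study_hashtags(tags)]
```

Here `selected` keeps the first two hashtag lists and drops the third, because #travel on its own is not enough.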

Model 1: K-means clustering based on document-term matrix

A document-term matrix describes the frequency of terms occurring in a collection of documents. Here, the values of the matrix are tf-idf weights, which indicate how important a word is to a document within the collection. The resulting matrix was then clustered with the k-means algorithm. Based on the silhouette score, the best model returned 15 clusters. To provide a better cluster separation, the authors tested hierarchical approaches with the following model configurations: a single-layer model with 15 clusters, a hierarchical 2-layer model with 58 and 15 clusters, respectively, and a hierarchical 3-layer model with 58, 28 and 14 clusters. The number of clusters on each layer was chosen using silhouette and inertia scores.
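To make the steps above more concrete, here is a minimal, standard-library-only Python sketch of the idea: build a tf-idf document-term matrix from image labels, cluster it with k-means, and pick the number of clusters by silhouette score. It is not the authors' implementation. The toy label lists, the tf-idf variant (relative term frequency times log(N/df)), and the deterministic farthest-first initialisation are all assumptions made for illustration; a real study would more likely rely on a library such as scikit-learn.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a tf-idf document-term matrix (one row per non-empty document)."""
    vocab = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed variant: relative term frequency * log(N / document frequency).
        rows.append([counts[t] / len(doc) * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100):
    """Plain k-means with farthest-first initialisation (an assumption, chosen
    to keep the sketch deterministic)."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            updated.append([sum(col) / len(members) for col in zip(*members)] if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(rows, labels):
    """Mean silhouette score; points in singleton clusters score 0."""
    scores = []
    for i, row in enumerate(rows):
        by_cluster = {}
        for j, other in enumerate(rows):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.sqrt(sq_dist(row, other)))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical image labels, one list per photo.
docs = [
    ["mountain", "snow", "sky"],
    ["mountain", "snow", "alps"],
    ["mountain", "sky", "alps"],
    ["building", "city", "architecture"],
    ["city", "architecture", "street"],
    ["building", "city", "street"],
]
vocab, rows = tfidf_matrix(docs)
scores = {k: silhouette(rows, kmeans(rows, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
labels = kmeans(rows, best_k)
```

On this toy data the silhouette score favours two clusters, separating the mountain photos from the city photos. With real data the same loop would simply run over a wider range of candidate cluster numbers.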

Model 2: CorEx topic model based on document-term matrix

The CorEx algorithm maximizes the informativeness of the text data and allows the anchoring of words. The authors first constructed several single-layer and hierarchical models with a maximum of three layers. To enable manual comparison of the results with the k-means models, the top layer of each CorEx model contained 15 clusters. To that end, the authors implemented one single-layer model with 15 clusters and a hierarchical 3-layer model with 58, 36 and 15 clusters on each layer, respectively. An appropriate number of clusters for each layer was then selected based on total correlation.
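Total correlation, the quantity mentioned above, measures how much a group of variables depend on each other: it is the sum of the individual entropies minus the joint entropy. The following Python sketch computes it empirically for a small binary document-term matrix. It is not the CorEx algorithm itself, which searches for latent topics that explain as much of this dependence as possible. It only illustrates the underlying measure, and the toy matrices are invented.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def total_correlation(rows):
    """Sum of per-column entropies minus the joint entropy of the rows.

    `rows` is a binary document-term matrix: one row per document,
    one column per word (1 = word present, 0 = absent).
    """
    columns = list(zip(*rows))
    marginal = sum(entropy(col) for col in columns)
    joint = entropy([tuple(row) for row in rows])
    return marginal - joint


# Two words that always appear together share one bit of information.
co_occurring = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Two words that appear independently share none.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

tc_co_occurring = total_correlation(co_occurring)
tc_independent = total_correlation(independent)
```

Words that tend to occur in the same photos (for example labels such as "sunset" and "dusk") yield a high total correlation, which is the kind of structure a topic is meant to capture.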

Model 3: K-means clustering based on Doc2Vec vectors

A Doc2Vec model was used to produce 100-dimensional vectors for each photo. The model was built on a tourism domain-specific corpus created from 3.6 million reviews of tourism sights worldwide. Next, three k-means models for 10, 15 and 20 clusters were tested using cosine distance to compute similarity. However, the authors reported that the cluster separation was unsatisfactory, resulting in several repetitive topics. One potential reason is that the cluster descriptions were derived from the most frequent words in the documents belonging to each cluster, so they were not well separated. On a more general level, another issue behind the poor performance is that Doc2Vec vectors need intensive, time-consuming parameter tuning.
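The role of cosine distance in this model can be illustrated with a short Python sketch. It assumes the document vectors already exist (tiny 3-dimensional toy vectors stand in for the 100-dimensional Doc2Vec vectors) and shows only the distance measure and one assignment step. How exactly cosine distance was combined with k-means in the study is not stated here, so the normalisation trick below is one common option rather than the authors' method.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def assign_by_cosine(vectors, centroids):
    """Assignment step of k-means using cosine distance to each centroid."""
    return [
        min(range(len(centroids)), key=lambda c: cosine_distance(v, centroids[c]))
        for v in vectors
    ]


# Toy stand-ins for document vectors and two cluster centroids.
vectors = [[1.0, 0.1, 0.0], [2.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 3.0, 0.5]]
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
assignments = assign_by_cosine(vectors, centroids)

# For unit-length vectors, squared Euclidean distance equals 2 * cosine distance,
# so ordinary k-means on normalised vectors ranks neighbours the same way.
u, w = normalize(vectors[0]), normalize(vectors[2])
sq_euclidean = sum((x - y) ** 2 for x, y in zip(u, w))
```

In this example the first two vectors land in one cluster and the last two in the other, even though their lengths differ considerably, because only the direction of a vector matters for cosine distance.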

The Take-away

According to the authors, “k-means clustering based on document-term matrix offers the most satisfying results whereas using Doc2Vec techniques is the least recommended approach for clustering text data.” The use of three machine learning approaches provides a guideline for tourism organisations to uncover the destination image presented on Instagram and to optimise their marketing content. Because they build on “ready-to-use” frameworks (i.e., graphical images on Instagram), the presented models are already generalised algorithms that could be applied to any kind of data in textual form.

The authors conclude that the highest concentration of the photographs was in Austria’s most popular cities such as Vienna, Hallstatt, and Salzburg, as well as around national and nature parks, and around the Alps region. Another more specific example can be seen in the figure below. For pictures clustered as “atmospheric moods”, the findings show that most of the photos were taken in Vienna and the lake district close to Salzburg. Relevant visual contents of “atmospheric moods” photographs include sunset, afterglow, and dusk.

A dashboard allows users to search for certain topics, like “atmospheric moods” in the figure above. A map then shows where these topics occur, naturally in the mountains (e.g. clouds, sunset) and in the lake regions (e.g. fog, sunset).

In summary, data-driven approaches allow a better understanding of the market by identifying white spots and optimising marketing communications. Recognising the importance of visual content displayed on Instagram would allow marketers to further identify new attractions that the majority of tourists might not have been aware of before. By involving tourists in marketing content generation, the methodological techniques presented in this study optimise cost and time effectiveness for destination marketers.

How to cite: Arefieva, V., Egger, R., & Yu, J. (2021). A machine learning approach to cluster destination image on Instagram. Tourism Management, 85, 104318.

