When a Picture Is Worth More Than Words | by Yuanpei Cao | The Airbnb Tech Blog | Dec, 2022

How Airbnb uses visual attributes to enhance the Guest and Host experience

By Yuanpei Cao, Bill Ulammandakh, Hao Wang, and Tony Hwang

On Airbnb, our hosts share unique listings all over the world. There are hundreds of millions of accompanying listing photos on Airbnb. Listing photos contain crucial information about style and design aesthetics that are difficult to convey in words or a fixed list of amenities. Accordingly, multiple teams at Airbnb are now leveraging computer vision to extract and incorporate intangibles from our rich visual data to help guests easily find listings that suit their preferences.

In previous blog posts (“WIDeText: A Multimodal Deep Learning Framework”, “Categorizing Listing Photos at Airbnb”, and “Amenity Detection and Beyond — New Frontiers of Computer Vision at Airbnb”), we explored how we use computer vision for room categorization and amenity detection to map listing photos to a taxonomy of discrete concepts. This post goes beyond discrete categories to show how Airbnb leverages image aesthetics and embeddings to optimize various product surfaces, including ad content, listing presentation, and listing recommendations.

Attractive photos are as vital as price, reviews, and description during a guest’s Airbnb search journey. To quantify the “attractiveness” of photos, we developed a deep learning-based image aesthetics assessment pipeline. The underlying model is a deep convolutional neural network (CNN) trained on human-labeled image aesthetic rating distributions. Each photo was rated on a scale from 1 to 5 by hundreds of photographers based on their personal aesthetic judgment (the higher the rating, the better the aesthetic). Unlike traditional classification approaches that sort photos into low-, medium-, and high-quality categories, the model uses the Earth Mover’s Distance (EMD) as its loss function to predict the full distribution of photographers’ ratings.

Figure 1. The model that predicts the image aesthetics distribution is CNN-based and trained with the EMD loss function. Suppose the ground truth label of a photo is: 10% of raters give a rating of 1, 10% give 2, 20% give 3, 30% give 4, and 30% give 5. The corresponding distribution the model should predict is [0.1, 0.1, 0.2, 0.3, 0.3].
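To make the loss concrete, here is a minimal sketch of an EMD computation over ordered rating buckets, using the distribution from Figure 1. It assumes the common formulation that compares cumulative distributions with an exponent r = 2; the post does not state which variant the model uses, and the function names and the two example predictions are illustrative.

```python
from itertools import accumulate


def emd_loss(target, predicted, r=2):
    """EMD between two distributions over the same ordered rating buckets.

    Assumption: normalized r-norm of the difference between the two CDFs.
    """
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same number of buckets")
    cdf_target = accumulate(target)
    cdf_predicted = accumulate(predicted)
    total = sum(abs(t - p) ** r for t, p in zip(cdf_target, cdf_predicted))
    return (total / len(target)) ** (1 / r)


def mean_rating(distribution, lowest_rating=1):
    """Expected rating under a distribution over consecutive integer ratings."""
    return sum((lowest_rating + i) * p for i, p in enumerate(distribution))


# Ground-truth distribution from Figure 1 (ratings 1 through 5).
target = [0.1, 0.1, 0.2, 0.3, 0.3]
# Hypothetical predictions: one moves a little mass between nearby ratings,
# the other mirrors the distribution toward the low ratings.
close_prediction = [0.1, 0.1, 0.3, 0.3, 0.2]
distant_prediction = [0.3, 0.3, 0.2, 0.1, 0.1]

loss_close = emd_loss(target, close_prediction)
loss_distant = emd_loss(target, distant_prediction)
```

Because the loss works on cumulative distributions, a prediction that misplaces mass onto nearby ratings is penalized less than one that misplaces it onto distant ratings. The mean of the predicted distribution gives a single score per photo that can be compared against a threshold.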

The predicted mean rating correlates strongly with image resolution, with listing booking probability, and with the distribution of photos from high-end Airbnb listings. Rating thresholds are set per use case, such as ad photo recommendation on social media and photo order suggestion in the listing onboarding process.

Figure 2. Examples of Airbnb listing photos with aesthetics scores above the 90th percentile

Airbnb uses advertising on social media to attract new customers and inspire our community. The social media platform chooses which ads to run based on millions of Airbnb-provided listing photos.

Figure 3. Airbnb Ads displayed on Facebook

Since a visually appealing Airbnb photo can effectively attract users to the platform and considerably increase the ad’s click-through rate (CTR), we used the image aesthetic score and room categorization to select the most attractive Airbnb photos of the living room, bedroom, kitchen, and exterior view. The criterion for “good quality” listing photos was set to the top 50 percent of aesthetic scores (at or above the 50th percentile) and tuned based on an internal manual aesthetic evaluation of 1,000 randomly selected listing cover photos. We ran an A/B test for this use case and found that ad candidates with a higher aesthetic score generated a substantially higher CTR and booking rate.
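The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the `Photo` record, the room type labels, the nearest-rank percentile method, and the sample scores are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    photo_id: str
    room_type: str          # output of a room categorization model
    aesthetic_score: float  # predicted mean aesthetic rating


# Room types named in the post as ad candidates (label strings are assumed).
AD_ROOM_TYPES = {"living room", "bedroom", "kitchen", "exterior"}


def percentile_threshold(scores, pct):
    """Nearest-rank percentile: smallest score with at least pct% of scores at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def select_ad_candidates(photos, pct=50):
    """Keep photos of eligible room types whose score reaches the percentile cut."""
    threshold = percentile_threshold([p.aesthetic_score for p in photos], pct)
    eligible = [
        p for p in photos
        if p.room_type in AD_ROOM_TYPES and p.aesthetic_score >= threshold
    ]
    return sorted(eligible, key=lambda p: p.aesthetic_score, reverse=True)


photos = [
    Photo("p1", "living room", 4.5),
    Photo("p2", "bedroom", 2.1),
    Photo("p3", "kitchen", 3.8),
    Photo("p4", "bathroom", 4.2),
    Photo("p5", "exterior", 3.0),
    Photo("p6", "bedroom", 2.9),
]
candidates = select_ad_candidates(photos, pct=50)
```

Here the threshold is computed over all photos before the room type filter is applied, so a high-scoring bathroom photo still raises the bar even though it is not itself eligible. Computing the threshold per room type would be an equally plausible design.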

Figure 4. Pre-selected Airbnb Creative Ads through image aesthetics and room type filters

When posting a new listing on Airbnb, hosts upload numerous photos. Optimally arranging these photos to highlight a home can be time-consuming and challenging. A host may also be uncertain about the ideal arrangement for their images because the work requires making trade-offs between photo attractiveness, photo diversity, and content relevance to guests. More specifically, the first five photos are the most important for listing success as they are the most frequently viewed and crucial to forming the initial guest impression. Accordingly, we developed an automated photo ranking algorithm that selects and orders the first five photos of a home leveraging two visual signals: home design evaluation and room categorization.

Home design evaluation estimates how well a home is designed from an interior design and architecture perspective. The CNN-based home design evaluation model is trained on Airbnb Plus and Luxe qualification data that assess the aesthetic appeal of each photo’s home design. Airbnb Plus and Luxe listings have passed strict home design evaluation criteria, so the data from their qualification process is well suited for use as training labels for a home design evaluation model. The photos are then classified into room types, such as living room, bedroom, and bathroom, by the room categorization model. Finally, an algorithm makes trade-offs between home design attractiveness, photo relevance, and photo diversity to maximize the booking probability of a home. Below is an example of how a new photo order is suggested. The photo auto-rank feature was launched in the Host listing onboarding product in 2021, leading to significant lifts in new listing creation and booking success.

Figure 5. An example of the original photo order (top) uploaded by an Airbnb Host and the auto-suggested order (bottom) calculated by the proposed algorithm
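The post does not describe how the ranking algorithm balances its three signals. One simple way to illustrate the trade-off is a greedy selection that multiplies a design score by a room relevance weight and discounts room types that have already been picked. Everything in this sketch is hypothetical: the weights, the penalty factor, and the scores are made up for illustration.

```python
from collections import Counter

# Hypothetical relevance weights per room type (not from the post).
ROOM_RELEVANCE = {
    "living room": 1.0,
    "bedroom": 0.9,
    "kitchen": 0.8,
    "bathroom": 0.6,
}


def photo_gain(photo, times_room_seen, repeat_penalty):
    """Attractiveness x relevance, discounted once per earlier pick of the same room type."""
    _, room_type, design_score = photo
    relevance = ROOM_RELEVANCE.get(room_type, 0.5)
    return design_score * relevance * repeat_penalty ** times_room_seen


def suggest_order(photos, k=5, repeat_penalty=0.5):
    """Greedily pick up to k photos given as (photo_id, room_type, design_score) tuples."""
    remaining = list(photos)
    seen = Counter()
    ordered_ids = []
    while remaining and len(ordered_ids) < k:
        best = max(
            remaining,
            key=lambda p: photo_gain(p, seen[p[1]], repeat_penalty),
        )
        ordered_ids.append(best[0])
        seen[best[1]] += 1
        remaining.remove(best)
    return ordered_ids


uploaded = [
    ("a", "living room", 0.90),
    ("b", "living room", 0.85),
    ("c", "bedroom", 0.80),
    ("d", "kitchen", 0.70),
    ("e", "bathroom", 0.90),
    ("f", "bedroom", 0.50),
]
first_five = suggest_order(uploaded)
```

In this example the second living room photo drops to fifth place despite its high design score, because the diversity discount favors showing a bedroom, kitchen, and bathroom first.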

Beyond aesthetics, photos also capture the general appearance and content. To efficiently represent this information, we encode and compress photos into image embeddings using computer vision models. Image embeddings are compact vector representations of images that represent visual features. These embeddings can be compared against each other with a distance metric that represents similarity in that feature space.

Figure 6. Image embeddings can be compared by distance metrics like cosine similarity to represent their similarity in the encoded latent space
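As a small illustration of the comparison in Figure 6, cosine similarity between two embedding vectors can be computed as below. The vectors are toy two- and three-dimensional examples; real image embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy embeddings: the first two point in almost the same direction.
photo_a = [0.9, 0.1, 0.3]
photo_b = [0.8, 0.2, 0.3]
photo_c = [-0.1, 0.9, -0.4]

similar = cosine_similarity(photo_a, photo_b)
dissimilar = cosine_similarity(photo_a, photo_c)
```

Cosine similarity ignores vector length and compares direction only, so scaling an embedding does not change its similarity to other embeddings.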

The features learned by the encoder are directly influenced by the training image data distribution and training objectives. Our labeled room type and amenity classification data allows us to train models on this data distribution to produce semantically meaningful embeddings for listing photo similarity use cases. However, as the quantity and diversity of images on Airbnb grow, it becomes increasingly untenable to rely solely on manually labeled data and supervised training techniques. Consequently, we are currently exploring self-supervised contrastive training to improve our image embedding models. This form of training does not require image labels; instead, it bootstraps contrastive learning with synthetically generated positive and negative pairs. Our image embedding models can then learn key visual features from listing photos without manual supervision.

Figure 7. Introducing random image transformations to synthetically create positive and negative pairs helps refine our image encoders without additional labeling.
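The post does not name the contrastive objective being explored. A widely used choice for this kind of training is an InfoNCE-style loss, sketched here as an assumption: two transformed views of the same photo form the positive pair, views of other photos act as negatives, and the temperature value is illustrative.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates.

    The loss is low when the anchor is closer to its positive than to any negative.
    """
    pos = math.exp(_cosine(anchor, positive) / temperature)
    neg = sum(math.exp(_cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings of transformed views (values are made up).
anchor_view = [1.0, 0.0]          # photo A, crop 1
same_photo_view = [0.9, 0.1]      # photo A, crop 2 with color jitter
other_photo_views = [[0.0, 1.0], [-1.0, 0.2]]  # crops of photos B and C

# An encoder that keeps views of the same photo close gets a low loss ...
loss_good_encoder = info_nce_loss(anchor_view, same_photo_view, other_photo_views)
# ... while one that maps the positive far away gets a high loss.
loss_bad_encoder = info_nce_loss(anchor_view, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
```

Minimizing this loss pulls embeddings of different views of the same photo together and pushes embeddings of different photos apart, which is how the encoder can learn visual features without labels.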

It is often impractical to compute exhaustive pairwise embedding similarity, even within focused subsets of millions of items. To support real-time search use cases, such as (near) duplicate photo detection and visual similarity search, we instead perform an approximate nearest neighbor (ANN) search. This functionality is largely enabled by an efficient embedding index preprocessing and construction algorithm called Hierarchical Navigable Small World (HNSW). HNSW builds a hierarchical proximity graph structure that greatly constrains the search space at query time. We scale this horizontally with AWS OpenSearch, where each node contains its own HNSW embedding graphs and Lucene-backed indices that are hydrated periodically and can be queried in parallel. To add real-time embedding ANN search, we have implemented the following index hydration and index search design patterns enabled by existing Airbnb internal platforms.

To hydrate an embedding index on a periodic basis, all relevant embeddings computed by Bighead, Airbnb’s end-to-end machine learning platform, are aggregated and persisted into a Hive table. The encoder models producing the embeddings are deployed for both online inference and offline batch processing. Then, the incremental embedding update is synced to the embedding index on AWS OpenSearch through Airflow, our data pipeline orchestration service.

Figure 8. Index hydration data pathway
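A sketch of what the index side of this pathway might look like, assuming the OpenSearch k-NN plugin's `knn_vector` field type and the newline-delimited bulk request format. The index name, field name, engine choice, and HNSW parameter values are assumptions rather than Airbnb's configuration, and supported engine and space type combinations vary by OpenSearch version.

```python
import json


def knn_index_body(dimension, m=16, ef_construction=128):
    """Index definition with an HNSW-backed vector field named 'embedding' (name assumed)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",  # assumption; check the docs for your version
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }


def bulk_upsert_payload(index_name, rows):
    """Build a bulk request body from (photo_id, embedding) rows.

    Each document takes two lines: an action line and the document itself.
    The 'index' action creates the document or replaces an existing one,
    which suits an incremental sync keyed by photo id.
    """
    lines = []
    for photo_id, embedding in rows:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": photo_id}}))
        lines.append(json.dumps({"embedding": embedding}))
    return "\n".join(lines) + "\n"  # the bulk format requires a trailing newline


index_body = knn_index_body(dimension=4)
# Hypothetical incremental batch, e.g. rows read from the embeddings table.
payload = bulk_upsert_payload(
    "listing-photo-embeddings",
    [("photo-1", [0.1, 0.2, 0.3, 0.4]), ("photo-2", [0.4, 0.3, 0.2, 0.1])],
)
```

In a scheduled pipeline, a task would send `payload` to the cluster's bulk endpoint for each batch of new or changed embeddings. The HNSW parameters `m` and `ef_construction` trade index build time and memory against recall.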

To perform image search, a client service will first verify whether the image’s embedding exists in the OpenSearch index cache to avoid recomputing embeddings unnecessarily. If the embedding is already there, the OpenSearch cluster can return approximate nearest neighbor results to the client without further processing. If there is a cache miss, Bighead is called to compute the image embedding, followed by a request to query the OpenSearch cluster for approximate nearest neighbors.

Figure 9. Image similarity search for a previously unseen image
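The lookup flow above can be sketched with injected callables standing in for the real services. The function and parameter names are hypothetical, the query body assumes the OpenSearch k-NN plugin's `knn` query shape with a vector field named `embedding`, and the stand-ins below only simulate the index lookup, the embedding service, and the search call.

```python
def knn_query_body(vector, k=10):
    """Approximate k-nearest-neighbor query against a vector field named 'embedding'."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}


def find_similar(image_id, fetch_indexed_embedding, compute_embedding, ann_search, k=10):
    """Reuse the indexed embedding when present; otherwise compute it, then search."""
    embedding = fetch_indexed_embedding(image_id)
    if embedding is None:  # cache miss: the image has not been indexed yet
        embedding = compute_embedding(image_id)
    return ann_search(knn_query_body(embedding, k))


# In-memory stand-ins for the index lookup, the embedding service, and the search call.
indexed_embeddings = {"photo-1": [0.1, 0.9]}
compute_calls = []


def fake_compute_embedding(image_id):
    compute_calls.append(image_id)
    return [0.5, 0.5]


def fake_ann_search(query_body):
    return query_body  # echo the request so the example can inspect it


hit = find_similar("photo-1", indexed_embeddings.get, fake_compute_embedding, fake_ann_search, k=3)
miss = find_similar("photo-2", indexed_embeddings.get, fake_compute_embedding, fake_ann_search, k=3)
```

Only the previously unseen image triggers an embedding computation, which keeps the common case to a single search round trip.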

Following this embedding search framework, we are scaling real-time visual search in current production flows and upcoming releases.

Airbnb Categories help our guests discover unique getaways. Some examples are “Amazing views”, “Historical homes”, and “Creative spaces”. These categories do not always share common amenities or discrete attributes, as they often represent an inspirational concept. We are exploring automatic category expansion by identifying similar listings based on their photos, which do capture design aesthetics.

Figure 10. Listing photos from the “Creative spaces” category
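Since this work is described as exploratory, the following is only one possible way to frame category expansion: average the embeddings of listings already in a category into a centroid, then rank other listings by cosine similarity to it. The centroid approach, the similarity threshold, and the toy vectors are assumptions for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def expand_category(seed_embeddings, candidate_embeddings, min_similarity=0.9):
    """Return (listing_id, similarity) pairs close to the category centroid, best first."""
    centroid = mean_vector(seed_embeddings)
    scored = [
        (listing_id, cosine_similarity(centroid, embedding))
        for listing_id, embedding in candidate_embeddings.items()
    ]
    matches = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy listing-level embeddings (for example, an average over each listing's photos).
creative_spaces_seeds = [[1.0, 0.1], [0.9, 0.2]]
candidates = {"x": [1.0, 0.15], "y": [0.1, 1.0], "z": [0.8, 0.3]}

suggestions = expand_category(creative_spaces_seeds, candidates)
```

A single centroid is a coarse summary of a category. For categories with several distinct visual styles, comparing candidates to their nearest seed listings could work better.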

In the 2022 Summer Release, Airbnb introduced rebooking assistance to offer guests a smooth experience from Community Support ambassadors when a Host cancels on short notice. For the purpose of recommending comparable listings throughout the rebooking process, a two-tower reservation and listing embedding model ranks candidate listings, updated on a daily basis. As future work, we can consider augmenting the listing representation with image embeddings and enabling real-time search.

Figure 11. An example of a landing page that recommends similar listings to guests and Community Support ambassadors during rebooking assistance.

Photos contain aesthetic and style-related signals that are difficult to express in words or map to discrete attributes. Airbnb is increasingly leveraging these visual attributes to help our hosts highlight the unique character of their listings and to assist our guests in discovering listings that match their preferences.

Interested in working at Airbnb? Check out our open roles.

Thanks to Teng Wang, Regina Wu, Nan Li, Do-kyum Kim, Tiantian Zhang, Xiaohan Zeng, Mia Zhao, Wayne Zhang, Elaine Liu, Floria Wan, David Staub, Tong Jiang, Cheng Wan, Guillaume Guy, Wei Luo, Hanchen Su, Fan Wu, Pei Xiong, Aaron Yin, Jie Tang, Lifan Yang, Lu Zhang, Mihajlo Grbovic, Alejandro Virrueta, Brennan Polley, Jing Xia, Fanchen Kong, William Zhao, Caroline Leung, Meng Yu, Shijing Yao, Reid Andersen, Xianjun Zhang, Yuqi Zheng, Dapeng Li, and Juchuan Ma for the product collaborations. Thanks also to Jenny Chen, Surashree Kulkarni, and Lauren Mackevich for editing.

Thanks to Ari Balogh, Tina Su, Andy Yasutake, Joy Zhang, Kelvin Xiong, Raj Rajagopal, and Zhong Ren for their leadership support in building computer vision products at Airbnb.