For many, architecture blends functionality with creative design. Our lives are shaped by how we interact with buildings and the spaces they create. Since the inception of the discipline, the creative work has been done by human minds. That is no longer entirely the case: AI has already begun to design the world we live in. In 2019, Nvidia's developer blog featured ArchiGAN, a generative adversarial network by Stanislas Chaillou that can design a building's floor plans and propose new space optimizations. Although computers are unlikely to replace architects completely, their ability to design floor plans and blueprints allows for projects that human architects might never have considered on their own.
Having read some of the cutting-edge research in the field of neural networks and architecture, I was particularly interested in the intersection of convolutional neural networks and architectural image data. For this project, I wanted to explore whether neural networks could classify the architectural styles of buildings using image data. In particular, I wanted to create a system that could classify the style of a building from a picture that anyone could take. Of course, there are many different styles of architecture and they all have many variations, so I chose a dataset that held a distinct variety of categorical styles as well as variation within each style. Ideally, I hoped to create some sort of application that would allow a user to take an image of any building they come across and have the software tell them the architectural style of that building.
The data I used comes from this Kaggle set. The dataset contains 10,103 images across 25 architectural styles, scraped from Google and other sources. It includes contemporary styles alongside ancient and archaeological ones. As such, I felt it would work well for training a convolutional neural network to classify a wide variety of images of buildings. The dataset came with minimal metadata, however: it contains only the images, each sorted into a folder for the style it belongs to.
What is a convolutional neural network?
For this type of project, I decided to use a convolutional neural network as the model. A convolutional neural network (CNN) is a neural network architecture that accounts for spatial adjacency in the data. This solves a problem that plain feedforward networks face when classifying images, because each data point is interpreted alongside its neighbors. A convolutional layer slides a set of small filters over every location in the image and produces a feature map for each filter, which lets the network learn specific local patterns wherever they appear. This makes CNNs effective at classifying images in a way that other networks are not.
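To make the filter idea concrete, here is a minimal NumPy sketch of a single filter sliding over a tiny image. The 5 x 5 image and the hand-written edge filter are toy values of my own; in a real CNN the filter weights are learned, and frameworks like Keras run this far more efficiently.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value is computed from a pixel together with its neighbors
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 5x5 image: dark on the left, bright on the right (a vertical edge)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter (illustrative only; a CNN learns its filters)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is large where the filter sits across the dark-to-bright edge and zero where the patch is uniform, which is the sense in which a filter "detects" a pattern.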
1. Creating a CNN
I used the Keras framework to build the convolutional network. Considering how complex these images of buildings are, I based my model on a proven CNN architecture, DenseNet121. The number 121 refers to the number of layers in the model. To save time, I initialized the model with ImageNet weights. However, I set the last 100 layers to be trainable, which allowed their weights to be fine-tuned to my specific data as the model trained.
I loaded the model without a top and added my own dense layers for classification. The final layer has 25 nodes, one for each architectural style. To prevent overfitting, I used fairly heavy dropout of 0.5. The layers I settled on were the result of much trial and error: at first, the model had too many nodes and was too powerful for this data, causing it to grossly overfit.
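The setup described in this section could look roughly like the sketch below. It is not my exact code: the global average pooling step, the single 256-unit dense layer, and the `build_model` helper name are assumptions made for illustration. The pieces taken from the description are the headless DenseNet121 base, the ImageNet weights, the last 100 layers left trainable, the 0.5 dropout, and the 25-way output.

```python
from tensorflow import keras

NUM_CLASSES = 25              # one output node per architectural style
INPUT_SHAPE = (128, 128, 3)   # 128 x 128 RGB images

def build_model(weights="imagenet", dense_units=256):
    """Build a DenseNet121-based classifier (dense_units is a hypothetical size)."""
    # DenseNet121 without its original ImageNet classification head ("no top")
    base = keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=INPUT_SHAPE)

    # Freeze every layer except the last 100, which stay trainable for fine-tuning
    for layer in base.layers[:-100]:
        layer.trainable = False

    inputs = keras.Input(shape=INPUT_SHAPE)
    x = base(inputs)
    x = keras.layers.GlobalAveragePooling2D()(x)   # assumed way of flattening the features
    x = keras.layers.Dense(dense_units, activation="relu")(x)
    x = keras.layers.Dropout(0.5)(x)               # heavy dropout against overfitting
    outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Calling `build_model()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random weights, which is handy for a quick shape check.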
2. Data Pre-Processing & Data Augmentation
The images in the dataset came in all shapes and sizes, but the model requires every image to be the same size, so I resized them. Due to file size and RAM constraints, each image was resized to 128 x 128 pixels in RGB color. While many other image classification projects use grayscale, there is a lot of information about buildings hidden in their color scheme, so I decided to keep the color. Each pixel holds 3 values from 0 to 255 denoting its color, and I normalized each of these values to the range -1 to 1.
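The normalization step is simple arithmetic, sketched here with NumPy. The random batch is only a stand-in for images that have already been resized to 128 x 128 with an image library, and `normalize_pixels` is a name I made up for this example.

```python
import numpy as np

def normalize_pixels(images_uint8):
    """Map 8-bit pixel values from [0, 255] to [-1.0, 1.0]."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

# Stand-in for a batch of already-resized 128x128 RGB images
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 128, 128, 3), dtype=np.uint8)

normalized = normalize_pixels(batch)
print(normalized.dtype, normalized.min(), normalized.max())
```

Dividing by 127.5 and subtracting 1 sends 0 to -1 and 255 to 1, so black and white sit symmetrically around zero.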
Also, while there are 10,103 images in the dataset, they are spread over 25 categories, leaving around 400 images per category. To improve the results of training, I mirrored each image, doubling the amount of data available.
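Mirroring amounts to reversing the width axis of each image array. A minimal sketch, assuming the images are stacked in a NumPy array of shape (N, height, width, channels); the `add_mirrored` helper and the tiny toy arrays are illustrative only.

```python
import numpy as np

def add_mirrored(images, labels):
    """Return the images plus left-right mirrored copies, with labels duplicated to match."""
    mirrored = images[:, :, ::-1, :]   # reverse the width axis of (N, H, W, C)
    return np.concatenate([images, mirrored]), np.concatenate([labels, labels])

# Two tiny 2x3 "images" with 3 channels, just to show the shapes
images = np.arange(2 * 2 * 3 * 3).reshape(2, 2, 3, 3)
labels = np.array([0, 1])

aug_images, aug_labels = add_mirrored(images, labels)
print(aug_images.shape, aug_labels)
```

A mirrored building is still a plausible building in the same style, which is why a horizontal flip is a safe augmentation here, whereas a vertical flip would not be.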
At this point, I split the data 80:20 for training and testing and saved the resulting NumPy arrays to disk so that I would not have to process the images again, as this was a lengthy, CPU-intensive process.
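A shuffled 80:20 split with a cached copy on disk might be sketched as follows. The `split_80_20` helper, the fixed seed, the temporary file path, and the use of a single compressed `.npz` file are all assumptions for illustration rather than a record of what I ran.

```python
import os
import tempfile
import numpy as np

def split_80_20(images, labels, seed=0):
    """Shuffle once, then put the first 80% in train and the rest in test."""
    order = np.random.default_rng(seed).permutation(len(images))
    cut = int(0.8 * len(images))
    train_idx, test_idx = order[:cut], order[cut:]
    return images[train_idx], images[test_idx], labels[train_idx], labels[test_idx]

# Tiny stand-in arrays; the real ones would hold the processed images and labels
images = np.zeros((10, 128, 128, 3), dtype=np.float32)
labels = np.arange(10)
x_train, x_test, y_train, y_test = split_80_20(images, labels)

# Cache the split so the slow preprocessing only has to run once
cache_path = os.path.join(tempfile.gettempdir(), "style_split_demo.npz")  # hypothetical path
np.savez_compressed(cache_path, x_train=x_train, x_test=x_test,
                    y_train=y_train, y_test=y_test)

# Later runs can load the arrays instead of reprocessing the images
with np.load(cache_path) as cached:
    reloaded_y_test = cached["y_test"]
```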
Furthermore, the dataset only contained images, so after processing I had to tag each one with its correct classification and save the labels in a separate matrix. With all the data ready, all that was left was to train my model.
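Since the style of each image is implied by its folder, the label matrix can be derived from the folder names. Here is a small sketch of one way to do it, producing both integer class indices and a one-hot matrix. The folder names below are placeholders and `encode_labels` is a hypothetical helper.

```python
import numpy as np

def encode_labels(folder_names):
    """Turn one folder name per image into class indices and a one-hot label matrix."""
    classes = sorted(set(folder_names))
    class_to_index = {name: i for i, name in enumerate(classes)}
    indices = np.array([class_to_index[name] for name in folder_names])
    one_hot = np.eye(len(classes), dtype=np.float32)[indices]
    return classes, indices, one_hot

# Placeholder folder names, one entry per image
folders = ["Gothic", "Art Deco", "Gothic", "Bauhaus"]

classes, indices, one_hot = encode_labels(folders)
print(classes)
print(indices)
print(one_hot.shape)
```

Each row of `one_hot` has a single 1 in the column of its class, which is the format a softmax output layer with categorical cross-entropy expects.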
3. Training the Model
With all the preprocessing done, I compiled the model and trained it on the training data. The number of epochs was determined after much trial and error. Training ran for only 3 epochs because after that, training accuracy kept rising while validation accuracy began to fall. In other words, the network began to overfit around this point, memorizing individual data points rather than learning the overall trends.
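I picked the stopping point by hand, but the same rule can be written down as a few lines of code. The sketch below shows the logic only; the validation accuracies are made-up toy numbers, not my training log, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_accuracy, patience=1):
    """Return the 1-based epoch with the best validation accuracy,
    stopping once it has failed to improve for `patience` epochs in a row."""
    best, best_at, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best, best_at, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation accuracy has stopped improving
    return best_at

# Toy numbers for illustration only
toy_val_accuracy = [0.55, 0.66, 0.70, 0.68, 0.65]
print(best_epoch(toy_val_accuracy))
```

Keras offers the same behavior through its `EarlyStopping` callback, which can monitor validation accuracy and restore the best weights automatically.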
After training, the model predicts what style a building is with 71% accuracy and a loss of 0.984. That's not super accurate, but given that there are 25 categories (random guessing would be right only about 4% of the time), it does quite well.
However, this only considers whether the network's single most likely prediction is correct. If we look at its top three predictions, accuracy jumps to 93%. And if we look at its top-5 accuracy, it goes to 1.0! That's 100% accurate! This is all on new data that the model has not seen before, making this network a great success. While getting the top 5 most likely predictions is not super useful to a user, it is proof that the model is actually learning something from the data. A discerning eye could probably choose the correct result from the top 5 predictions, but even a top-5 result could be confusing to someone without much experience in architecture.
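Top-k accuracy counts a prediction as correct if the true class appears anywhere among the k highest-scoring classes. A minimal NumPy sketch, using a toy 4-class probability table of my own rather than the model's real outputs:

```python
import numpy as np

def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of samples whose true class is among the k most probable classes."""
    # Indices of the k highest-scoring classes for each sample
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = np.any(top_k == np.asarray(true_labels)[:, None], axis=1)
    return float(np.mean(hits))

# Toy predictions: 4 samples, 4 classes (illustrative values only)
probs = np.array([
    [0.60, 0.25, 0.10, 0.05],
    [0.15, 0.50, 0.30, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.15, 0.20, 0.60],
])
true_labels = np.array([0, 2, 3, 3])

for k in (1, 2, 4):
    print(k, top_k_accuracy(probs, true_labels, k))
```

With k equal to the number of classes the score is always 1.0, so top-k numbers are most informative when k is small relative to the 25 categories.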
If we want to delve further into the accuracy of the model's classifications, we can look at the confusion matrix to see what the model routinely missed.
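For readers who have not met one before, a confusion matrix is just a table of counts: rows are the true classes, columns are the predicted classes. A small sketch with three toy classes (the labels are made up for illustration):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

# Toy labels for 6 images across 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
```

Correct predictions land on the diagonal, so large off-diagonal entries show exactly which pairs of styles the model confuses.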
An F1 score is the harmonic mean of the model's precision and recall, and thus is a good benchmark of performance.
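Per-class precision, recall, and F1 can all be read off a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below uses a toy 3-class matrix, not my model's results; libraries such as scikit-learn provide the same numbers ready-made.

```python
import numpy as np

def per_class_f1(cm):
    """Return (precision, recall, f1) arrays from a confusion matrix
    with true classes as rows and predicted classes as columns."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted_totals = cm.sum(axis=0)   # how often each class was predicted
    actual_totals = cm.sum(axis=1)      # how often each class actually occurred
    zeros = np.zeros_like(true_pos)
    precision = np.divide(true_pos, predicted_totals, out=zeros.copy(),
                          where=predicted_totals > 0)
    recall = np.divide(true_pos, actual_totals, out=zeros.copy(),
                       where=actual_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    return precision, recall, f1

# Toy confusion matrix for 3 classes (illustrative values only)
toy_cm = [[1, 1, 0],
          [0, 2, 0],
          [1, 0, 1]]

precision, recall, f1 = per_class_f1(toy_cm)
print(np.round(f1, 3))
```

Because F1 is a harmonic mean, a class scores well only when both precision and recall are high, which makes it harder to game than accuracy alone.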
By looking at the F1 score of each category, we see that the model struggles with category 20 (post-modern) but excels with category 1 (novelty). Overall, the model does a pretty good job, but there are clear biases in its predictions. It can classify certain styles with 95% accuracy but others with only 32% accuracy.
This may be partly because certain styles are much more distinct than others. For example, Ancient Egyptian architecture is much more distinct from the rest of the dataset than Post-Modern. However, I think these results are promising enough to declare that convolutional neural networks are well suited to classifying architecture from image data. Although this project does not revolutionize our understanding of how neural networks can be applied to architecture, it is clear that this is not the end of their union.
One improvement would be to use larger images. Due to processing and storage constraints, the largest image size I could use was 128 x 128 pixels. This meant a lot of finer detail was lost, which could have hampered the model. Likewise, further cleaning of the data could improve performance, as the data is relatively messy. Many images contain other elements besides the building, and adding bounding boxes could help focus the model. I also forcibly resized every image to 128 x 128 regardless of its original proportions, which could have disproportionately hurt accuracy on images of skyscrapers or other buildings with extreme aspect ratios.
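One common way to avoid squashing tall or wide buildings, which I did not try in this project, is to pad each image to a square before resizing it. A minimal sketch of the padding step, assuming images are NumPy arrays of shape (height, width, channels); `pad_to_square` and the black fill color are my own choices for illustration.

```python
import numpy as np

def pad_to_square(image, fill=0):
    """Center the image on a square canvas so a later resize keeps its proportions."""
    h, w = image.shape[:2]
    size = max(h, w)
    top = (size - h) // 2
    left = (size - w) // 2
    padded = np.full((size, size) + image.shape[2:], fill, dtype=image.dtype)
    padded[top:top + h, left:left + w] = image
    return padded

# Stand-in for a tall skyscraper photo: 300 pixels high, 100 wide
tall = np.ones((300, 100, 3), dtype=np.uint8)
square = pad_to_square(tall)
print(square.shape)
```

The trade-off is that part of the 128 x 128 input is then spent on empty padding, so the building itself ends up at a lower effective resolution.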
I also explored generating new images from this dataset using both a variational autoencoder and a generative adversarial network. Unfortunately, neither worked particularly well with this dataset. Visualizing the 2D latent space showed little variation between the different categories with the way I had processed them. I'm sure that image generation could be achieved, but not within the timeframe that I had for this project.
Given the modest success of this project, I also plan to expand it by creating an interface and further tweaking the model so that anyone can upload an image and receive a prediction. Adding more data and categories would be ideal as well.
This project was done for my Artificial Neural Networks and Deep Learning Course. Find my code on GitHub here.