Detection of AI-generated images and videos

With the growing number of AI tools and libraries, it is now easier than ever to generate fake content. Until recently, the production of fake content was limited by the lack of sophisticated tools and by complex, time-consuming manual processes. These limitations are rapidly being overcome by large training datasets and by advances in machine learning and computer vision that eliminate manual editing steps, resulting in an enormous amount of fake content. As these content generators multiply, it has become necessary to develop tools that detect AI-generated content, so as to avoid repetition and to enhance problem-solving skills.

In our day-to-day life, we may have come across news, articles and videos that seem fake. People who are not well versed in technology might believe such content and share it across social media. Fake content may look convincing, but spreading it is not good practice, so validation is necessary, and this is where machine learning comes to the rescue. In 2014, machine learning researcher Ian Goodfellow and his colleagues introduced the idea of generative adversarial networks, or GANs. Let’s explore this term first. “Generative” means the model outputs things like images rather than simple predictions. “Adversarial networks” means two neural networks competing with each other: one (the generator) tries to fool the other into thinking its outputs are real, while the other (the discriminator) tries to tell real from fake. The images produced around that time were easy for us to tell apart from real ones, but the latest GAN-generated images are much more difficult to identify.
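The competition between the two networks can be made concrete with the loss functions commonly used to train a GAN. The following is a minimal sketch in plain Python, not code from the original GAN paper: the function names and the toy probabilities are assumptions chosen for illustration. The discriminator is scored with binary cross-entropy, and the generator with the widely used "non-saturating" loss, which is low when the discriminator is fooled.

```python
import math

EPS = 1e-12  # small constant to avoid log(0)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator.

    d_real: discriminator outputs (probability of "real") on real images.
    d_fake: discriminator outputs (probability of "real") on generated images.
    """
    real_term = sum(-math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: small when fakes are rated as real."""
    return sum(-math.log(p + EPS) for p in d_fake) / len(d_fake)

# Toy probabilities (made up for illustration).
# A discriminator that separates real from fake well has a low loss ...
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# ... while one that outputs 0.5 everywhere has been fooled.
fooled = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident < fooled)
print(generator_loss([0.1, 0.05]) > generator_loss([0.9, 0.95]))
```

During training the two losses pull in opposite directions: lowering the generator loss pushes the discriminator outputs on fake images towards 1, which in turn raises the discriminator loss.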

Cons of AI-generated art and images: One concern is that AI-generated art could replace human artists, leading to job losses and a decline in the values that traditional art represents. AI could also easily be used to create unethical and offensive images that harm the norms and values of human traditions and cultures. Image generators can be misused in more than one malicious way, for example by superimposing a person's face onto an offensive image, which might spread rumours in society and lead to political problems. While image generation has its pros, its cons cannot be ignored. It is therefore very important to check the authenticity of an image and to be aware of what you are watching and sharing.

Why employ Machine Learning and Deep Learning?

In the initial phase, around the time when GANs were first introduced, it was easy for humans to tell AI-generated images from authentic ones, but with advancing technologies and algorithms this has become very hard for humans to do. Hence, machine learning algorithms are used to obtain more accurate results. Even though these algorithms are not perfectly accurate, it is still worthwhile to employ them to get an idea of the process and its complexity. And of course, if the generators can improve, why not the detectors?

Models and methodologies: There are many solutions to a single problem, and different works suggest different approaches. One approach is as follows. To train a machine learning model to classify an image as AI-generated or real, the first step is to select appropriate datasets. We need high-quality real images and high-quality AI-generated images. Various datasets on the internet can supply the real images, but a matching dataset of AI-generated images is not readily available, so we could generate images using a model pre-trained on CAHQ images.

After assembling the datasets, the next step is to preprocess the images. For preprocessing we could use high-pass filters, co-occurrence matrices, and color transformations.

Based on existing approaches, we could select Xception and ForensicTransfer as state-of-the-art model architectures for CNN-generated image detection. Xception is a deep CNN with depth-wise separable convolutions, inspired by Inception modules, that has shown good performance in multiple image forgery detection tasks, on both regular and compressed images. ForensicTransfer (FT) is a CNN-based encoder-decoder architecture that learns to encode the properties of fake and real images in latent space. It outperforms several other methods when combined with high-pass filtering of the images, or when transfer learning is used for few-shot adaptation to unknown classes. An image is classified as real if the real partition in latent space is more active than the fake partition, and as fake otherwise.

Other works suggest using a computer-vision-based recognition framework, i.e. object detection and visual question answering. Another suggests the Dual Attention Fake Detection Fine-tuning Network (DA-FDFtNet), a fine-tuning neural-network-based architecture for fake face image detection. DA-FDFtNet combines Fine-Tune Transformer (FTT) and channel attention modules with a pre-trained Convolutional Neural Network (CNN) as a backbone, and MobileNet block V3 (MBblockV3), to distinguish real from fake images.
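Two of the preprocessing steps and the latent-partition decision rule mentioned above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the referenced papers: the 3x3 mean-subtraction filter is just one simple choice of high-pass filter, the function names are hypothetical, and measuring how "active" a partition is by its mean absolute activation is an assumption made for this sketch.

```python
def high_pass_residual(image):
    """Subtract the local 3x3 mean from each pixel, keeping fine detail.

    image: 2D list of grey values. Border pixels average over the
    neighbours that exist.
    """
    h, w = len(image), len(image[0])
    residual = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            residual[y][x] = image[y][x] - sum(window) / len(window)
    return residual

def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count how often grey level a is followed by level b at offset (dx, dy).

    image must contain integers in range(levels).
    """
    h, w = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[image[y][x]][image[ny][nx]] += 1
    return counts

def classify_by_partition(latent, n_real):
    """Label an image by whichever latent partition is more active.

    Assumption: the first n_real entries form the "real" partition and
    the rest form the "fake" partition; ties are labelled "fake".
    """
    real_part, fake_part = latent[:n_real], latent[n_real:]
    real_act = sum(abs(v) for v in real_part) / len(real_part)
    fake_act = sum(abs(v) for v in fake_part) / len(fake_part)
    return "real" if real_act > fake_act else "fake"

# Toy inputs (made up for illustration).
residual_flat = high_pass_residual([[5, 5, 5], [5, 5, 5], [5, 5, 5]])
glcm = cooccurrence_matrix([[0, 1, 0], [1, 0, 1]], levels=2)
label = classify_by_partition([0.9, 0.8, 0.1, 0.2], n_real=2)
print(residual_flat, glcm, label)
```

A perfectly flat image has no high-frequency content, so its residual is zero everywhere. The co-occurrence matrix summarises which grey levels sit next to each other, which is the kind of low-level statistic a detector can be trained on instead of raw pixels.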

Evaluation: There are various ways to evaluate the performance of detection methods. Let’s discuss some of them. In the easiest setup, the test images are created by the same generative model as the training images and from the same data distribution, and they are not manipulated any further. This setup gives an upper bound on the performance of a detection method, but it does not correspond to a real-world scenario. In the next setup, the test images are generated by one or more models different from those used for the training set. In another, the data used to generate the training images differs from the data used to generate the test images, while the generative model may be the same or different. Finally, when images are uploaded to and downloaded from the internet, they are likely to undergo several types of post-processing, such as compression and resampling. Images could also be manipulated deliberately to make them harder to detect, for example by adding blur or noise. We could therefore select different post-processing techniques, such as JPEG compression and Gaussian blurring, and evaluate how different amounts of post-processing influence the detection of AI-generated images.
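The post-processing evaluation can be sketched as a loop that blurs the test images with increasing strength and records the detector's accuracy at each level. The code below is a minimal plain-Python illustration, not the evaluation code from the referenced papers. The `toy_detector` and the two test images are hypothetical stand-ins, and only Gaussian blurring is covered; JPEG compression would need an image library and is left out.

```python
import math

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    w = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), w - 1)]
                 for i, k in enumerate(kernel))
             for x in range(w)]
            for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: blur the rows, then the columns."""
    if sigma <= 0:
        return [[float(v) for v in row] for row in image]
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def accuracy_under_blur(detector, images, labels, sigmas):
    """Return {sigma: accuracy} for a detector applied to blurred images."""
    results = {}
    for sigma in sigmas:
        predictions = [detector(gaussian_blur(img, sigma)) for img in images]
        correct = sum(p == t for p, t in zip(predictions, labels))
        results[sigma] = correct / len(labels)
    return results

def toy_detector(image):
    """Hypothetical stand-in: calls an image 'fake' if neighbouring pixels jump sharply."""
    jumps = [abs(row[x + 1] - row[x])
             for row in image for x in range(len(row) - 1)]
    return "fake" if max(jumps) > 50 else "real"

# Made-up test set: a sharp checkerboard labelled "fake", a smooth ramp labelled "real".
checkerboard = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
smooth = [[10 * x for x in range(8)] for _ in range(8)]
scores = accuracy_under_blur(toy_detector, [checkerboard, smooth],
                             ["fake", "real"], sigmas=[0, 1.0, 2.0])
print(scores)
```

In this toy setting the detector relies entirely on sharp edges, so blurring removes the cue it depends on and accuracy drops. A real detector can be probed the same way by swapping in its prediction function and a labelled test set.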

Conclusion: According to a recently published paper, state-of-the-art (SOTA) models may be less able than people to recognize and interpret AI-generated images. This is concerning because machine learning models are increasingly being trained on synthetic data, and a model is not told whether its data is ‘real’ or not. Ten SOTA models were tested on datasets generated by the image synthesis frameworks DALL-E 2 and Midjourney. The best one achieved only 60% and 80% top-5 accuracy on the two types of test, whereas on ImageNet, which consists of non-synthetic, real-world data, models could achieve 91% and 99% respectively in the same categories, and human performance is notably higher. Machine learning services can be beneficial when used in the right manner, but machine learning algorithms can also be exploited when they are not.
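For readers unfamiliar with the metric quoted above, top-k accuracy counts a prediction as correct when the true label is among the k classes the model scores highest. The sketch below is a plain-Python illustration with made-up scores. The function name and the data are assumptions, not taken from the cited paper.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of samples whose true label is among the k best-scored classes.

    scores: one dict per sample, mapping class name -> model score.
    """
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Made-up scores for two samples over three classes.
toy_scores = [
    {"cat": 0.7, "dog": 0.2, "car": 0.1},
    {"cat": 0.5, "dog": 0.1, "car": 0.4},
]
toy_truth = ["cat", "car"]

top1 = top_k_accuracy(toy_scores, toy_truth, k=1)
top2 = top_k_accuracy(toy_scores, toy_truth, k=2)
print(top1, top2)
```

The second sample is missed at k=1 because "cat" outranks the true label "car", but it counts as a hit at k=2, which is why top-5 figures are always at least as high as top-1 figures.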

References:

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w39/Hulzebosch_Detecting_CNN-Generated_Facial_Images_in_Real-World_Scenarios_CVPRW_2020_paper.pdf
https://arxiv.org/pdf/2112.12001v1.pdf
https://www.unite.ai/deep-learning-models-might-struggle-to-recognize-ai-generated-images/