Deepfake Video Detection Using Long Short-Term Memory

Abhijit Jadhav
Apr 14, 2022 · 6 min read

The growing computation power has made deep learning algorithms so powerful that creating indistinguishable synthesized human videos, popularly called deepfakes, has become very simple. Scenarios where these realistic face-swapped deepfakes are used to create political distress, fake terrorism events, revenge porn, or blackmail are easy to envision. This post proposes a simple and robust way to automatically detect replacement and reenactment deepfakes. We are trying to use Artificial Intelligence to fight Artificial Intelligence. Our system uses a ResNext convolutional neural network to extract frame-level features, and these features are then used to train a Long Short-Term Memory (LSTM) based recurrent neural network to classify whether the video has been subjected to any kind of manipulation, i.e. whether it is a deepfake or a real video. To emulate real-time scenarios and make the model perform better on real-time data, we evaluate our method on a large, balanced, mixed dataset prepared by combining various available datasets such as FaceForensics++, the Deepfake Detection Challenge (DFDC), and Celeb-DF. We also show how our system achieves competitive results using a very simple and robust approach.

Implementation of the System

In this system, we have trained our PyTorch deepfake detection model on an equal number of real and fake videos in order to avoid bias in the model. The system architecture of the model is shown in the figure.

System Architecture

In the development phase, we took a dataset, preprocessed it, and created a new processed dataset that includes only the face-cropped videos.

A. Dataset

To make the model efficient for real-time prediction, we gathered data from different available datasets such as FaceForensics++, the Deepfake Detection Challenge (DFDC), and Celeb-DF. We then mixed these datasets to create our own new dataset for accurate, real-time detection of different kinds of videos. To avoid training bias, the dataset consists of 50% real and 50% fake videos.

The DFDC dataset contains some audio-altered videos; since audio deepfakes are out of scope for this post, we preprocessed the DFDC dataset and removed the audio-altered videos by running a Python script.
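The filtering script itself is not shown in the post. Below is a minimal sketch of one way it could work, assuming the per-folder metadata.json in DFDC maps each fake video to its source original: decode the audio track of the fake and of its original with ffmpeg and drop the fake if the two streams differ. The folder name and the audio-hash heuristic are assumptions, not the authors' actual script.

```python
import json
import hashlib
import subprocess
from pathlib import Path

DFDC_DIR = Path("dfdc_train_part_0")   # hypothetical folder: videos + metadata.json

def audio_hash(video_path: Path) -> str:
    """Decode the audio track to WAV on stdout with ffmpeg and hash it."""
    result = subprocess.run(
        ["ffmpeg", "-loglevel", "quiet", "-i", str(video_path), "-vn", "-f", "wav", "pipe:1"],
        capture_output=True, check=True,
    )
    return hashlib.md5(result.stdout).hexdigest()

metadata = json.loads((DFDC_DIR / "metadata.json").read_text())

kept = []
for name, info in metadata.items():
    if info["label"] == "FAKE" and info.get("original"):
        # Drop fakes whose audio differs from the source video's audio.
        if audio_hash(DFDC_DIR / name) != audio_hash(DFDC_DIR / info["original"]):
            continue
    kept.append(name)

print(f"Keeping {len(kept)} of {len(metadata)} videos")
```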

After preprocessing the DFDC dataset, we took 1,500 real and 1,500 fake videos from DFDC, 1,000 real and 1,000 fake videos from the FaceForensics++ (FF) dataset, and 500 real and 500 fake videos from the Celeb-DF dataset. This makes our total dataset 3,000 real and 3,000 fake videos, 6,000 videos in all.

Dataset Creation
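As a small illustration, the balanced file lists could be assembled as in the sketch below; the counts per source come from the numbers above, while the directory layout and folder names are hypothetical.

```python
import random
from pathlib import Path

# Hypothetical layout: processed/<source>/<real|fake>/*.mp4
SOURCES = {
    "dfdc":  {"real": 1500, "fake": 1500},
    "ff":    {"real": 1000, "fake": 1000},
    "celeb": {"real": 500,  "fake": 500},
}

videos, labels = [], []
for source, counts in SOURCES.items():
    for label_name, count in counts.items():
        files = sorted(Path(f"processed/{source}/{label_name}").glob("*.mp4"))
        videos += random.sample(files, count)            # equal real/fake per source
        labels += [1 if label_name == "fake" else 0] * count

print(len(videos), "videos,", sum(labels), "fake")
```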

B. Data Preprocessing

In this step, the videos are preprocessed and all unrequired content and noise are removed. Only the required portion of the video, i.e. the face, is detected and cropped.

The first step in preprocessing is to split the video into frames. The face is then detected in each frame and the frame is cropped around the face. The cropped frames are combined back into a new video. This process is repeated for every video, which results in a processed dataset containing face-only videos. Frames that do not contain a face are ignored during preprocessing.

To maintain uniformity in the number of frames, we selected a threshold value based on the mean frame count of the videos. Another reason for selecting a threshold is limited computation power: a 10-second video at 30 frames per second (fps) has 300 frames, and it is computationally very difficult to process 300 frames at once in our experimental environment. So, based on the Graphics Processing Unit (GPU) power available in the experimental environment, we selected 150 frames as the threshold. While saving frames to the new dataset, we saved only the first 150 frames of each video. To make proper use of the Long Short-Term Memory (LSTM), we took the frames in sequential order, i.e. the first 150 frames, rather than sampling them randomly. The newly created video is saved at a frame rate of 30 fps and a resolution of 112 x 112.

Preprocessing
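The preprocessing script is not reproduced in this post; the sketch below shows the idea using OpenCV and the face_recognition library (the choice of detector is an assumption, any face detector would do): read frames in order, crop each detected face, resize to 112 x 112, and write the first 150 face frames to a new 30 fps video.

```python
import cv2
import face_recognition

FRAME_LIMIT = 150           # threshold based on mean frame count / GPU budget
OUT_SIZE = (112, 112)
OUT_FPS = 30

def create_face_video(src_path: str, dst_path: str) -> None:
    """Convert a raw video into a face-only video of at most FRAME_LIMIT frames."""
    capture = cv2.VideoCapture(src_path)
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), OUT_FPS, OUT_SIZE)
    saved = 0
    while saved < FRAME_LIMIT:
        ok, frame = capture.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        faces = face_recognition.face_locations(rgb)     # boxes as (top, right, bottom, left)
        if not faces:                                    # frames without a face are ignored
            continue
        top, right, bottom, left = faces[0]
        writer.write(cv2.resize(frame[top:bottom, left:right], OUT_SIZE))
        saved += 1
    capture.release()
    writer.release()

create_face_video("raw/video_001.mp4", "processed/video_001.mp4")
```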

C. Dataset Split

The dataset is split into train and test sets with a ratio of 70% train videos (4,200) and 30% test videos (1,800). The split is balanced, i.e. 50% real and 50% fake videos in each split.
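A sketch of this split using scikit-learn's stratified train_test_split, continuing from the hypothetical `videos`/`labels` lists sketched earlier:

```python
from sklearn.model_selection import train_test_split

# `videos` and `labels` are the balanced lists of face-cropped video paths and 0/1 labels.
train_videos, test_videos, train_labels, test_labels = train_test_split(
    videos, labels,
    test_size=0.30,        # 70% train (4,200) / 30% test (1,800)
    stratify=labels,       # keeps the 50/50 real-to-fake ratio in each split
    random_state=42,
)
```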

D. Model

Our model is a combination of a CNN and an RNN. We use a pre-trained ResNext CNN model to extract frame-level features, and based on these extracted features an LSTM network is trained to classify the video as deepfake or pristine.

Using the DataLoader on the training split, the videos and their labels are loaded and fed into the model for training. ResNext: Instead of writing the code from scratch, we used a pretrained ResNext model for feature extraction. ResNext is a residual CNN optimized for high performance with deeper networks. For our experiments we used the resnext50_32x4d model, i.e. a 50-layer ResNext with a cardinality of 32 and a bottleneck width of 4.
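The loader code is not shown in the post; here is a minimal sketch of a PyTorch Dataset that reads a face-cropped video, stacks a fixed number of frames into a tensor of shape (seq_len, 3, 112, 112), and returns it with its label. The ImageNet normalization values and the default sequence length are assumptions.

```python
import cv2
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class VideoDataset(Dataset):
    """Loads face-cropped videos and their labels for training/testing."""

    def __init__(self, video_paths, labels, seq_length=60):
        self.video_paths = video_paths
        self.labels = labels
        self.seq_length = seq_length
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((112, 112)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        capture = cv2.VideoCapture(str(self.video_paths[idx]))
        frames = []
        while len(frames) < self.seq_length:
            ok, frame = capture.read()
            if not ok:
                break
            frames.append(self.transform(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        capture.release()
        return torch.stack(frames), self.labels[idx]

# Batch size of 4, as described in the model section below.
train_loader = DataLoader(VideoDataset(train_videos, train_labels), batch_size=4, shuffle=True)
```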

Next, we fine-tune the network by adding the extra required layers and selecting a proper learning rate so that the gradient descent of the model converges properly. The 2048-dimensional feature vectors produced after the last pooling layer of ResNext are used as the sequential input to the LSTM.

LSTM for Sequence Processing: The 2048-dimensional feature vectors are fed as input to the LSTM. We use a single LSTM layer with a 2048-dimensional latent space, 2048 hidden units, and a dropout probability of 0.4, which is sufficient to achieve our objective. The LSTM processes the frames sequentially so that a temporal analysis of the video can be made, comparing the frame at time t with the frame at time t-n, where n can be any number of frames before t.

The model also uses the LeakyReLU activation function. A linear layer with 2048 input features and 2 output features makes the model capable of learning the average rate of correlation between the input and output. An adaptive average pooling layer with output size 1 is used in the model, which gives the target output size of the image in the form H x W. A sequential layer is used for the sequential processing of the frames, and a batch size of 4 is used for batch training. Finally, a SoftMax layer is used to obtain the model's confidence during prediction.

Overview of our model
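Putting the pieces together, here is a minimal PyTorch sketch of the architecture as described: a pretrained ResNext-50 (32x4d) backbone whose 2048-dimensional pooled features feed a single LSTM layer, followed by LeakyReLU, a dropout of 0.4, and a linear layer with 2 outputs. The exact layer ordering and the choice to classify from the last LSTM time step are my reading of the description, not the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepfakeDetector(nn.Module):
    def __init__(self, num_classes=2, latent_dim=2048, hidden_dim=2048, lstm_layers=1):
        super().__init__()
        backbone = models.resnext50_32x4d(pretrained=True)
        # Keep everything up to the last convolutional block; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.avgpool = nn.AdaptiveAvgPool2d(1)            # 2048-d feature vector per frame
        self.lstm = nn.LSTM(latent_dim, hidden_dim, lstm_layers, batch_first=True)
        self.relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, 3, 112, 112)
        batch_size, seq_len, c, h, w = x.shape
        x = x.view(batch_size * seq_len, c, h, w)
        features = self.avgpool(self.backbone(x)).view(batch_size, seq_len, -1)
        lstm_out, _ = self.lstm(features)                  # temporal analysis across frames
        out = self.relu(lstm_out[:, -1, :])                # last time step summarizes the clip
        return self.classifier(self.dropout(out))
```

At inference time, torch.softmax over the two logits gives the confidence score mentioned above; during training the softmax is folded into the cross-entropy loss described in the next section.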

E. Hyperparameter tuning

To enable an adaptive learning rate, the Adam [21] optimizer is used with the model parameters. The learning rate is tuned to 1e-5 (0.00001) to achieve a better global minimum of gradient descent, and the weight decay used is 1e-3.

Since this is a classification problem, the cross-entropy loss is used to calculate the loss.
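A condensed training-loop sketch with the stated hyperparameters (Adam, learning rate 1e-5, weight decay 1e-3, cross-entropy loss), reusing the model and loader sketched above; the epoch count is an assumption.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DeepfakeDetector().to(device)                       # model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):                                     # epoch count is an assumption
    model.train()
    for frames, labels in train_loader:                     # loader sketched earlier
        frames, labels = frames.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```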

Results

We evaluated our algorithm on sequence lengths of 10, 20, 40, 60, 80, and 100.

Results

The above image shows the results achieved by the model on our dataset. The accuracy reported in the image is the test accuracy.

As we can observe from the results, the accuracy of the model increases as the sequence length increases.

Based on our results, we can say that our model is able to predict whether a video is deepfake or real from just 10 frames, i.e. less than 1 second of a 30 fps video, with a decent accuracy of 84%.

I hope you have liked our research on deepfake detection using LSTM. We have a prototype project hosted on GitHub here. Soon I will be back with a step-by-step installation guide for the project.

Deepfake detection using Deep Learning on GitHub

Abhijit Jadhav

Full Stack Java Developer and AI enthusiast who loves to build scalable applications with the latest tech stack