  • Matt Harvey
  • five-video-classification-methods
  • Issues
  • #141
Closed
Issue created Dec 27, 2019 by Sayak Paul@sayakpaul

Suggestions on reshaping the input data for action recognition

First of all, thank you very much for your hard work. I am trying to build an activity recognition model on a subset of the UCF101 dataset (I am using the top 20 activity labels).

So far, I have used a pre-trained VGG16 network to extract features from the individual frames extracted from the videos. The final shape I got from the VGG16 network is (20501, 7, 7, 512) (for the train set). I now want to pass these extracted features to an LSTM-based network, and I am a bit confused about how I should reshape them.

How many time steps should I pass in, and how many features per time step?
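For what it's worth, here is a minimal sketch of the usual reshaping options. The frame counts below are hypothetical (20501 does not divide evenly, so a fixed per-video sequence length must be chosen when sampling frames); the two common choices are flattening each 7×7×512 map into 25088 features per time step, or global-average-pooling it down to 512:

```python
import numpy as np

# Hypothetical numbers for illustration only: assume each video was sampled
# to a fixed number of frames so the feature array divides evenly.
seq_len = 20          # frames sampled per video (assumption)
num_videos = 100      # hypothetical video count
feats = np.random.rand(num_videos * seq_len, 7, 7, 512).astype("float32")

# Option 1: flatten each 7x7x512 feature map -> 25088 features per time step.
# LSTM input shape would then be (seq_len, 25088).
flat = feats.reshape(num_videos, seq_len, 7 * 7 * 512)

# Option 2: global average pooling over the 7x7 spatial grid -> 512 features
# per time step. LSTM input shape would then be (seq_len, 512).
pooled = feats.mean(axis=(1, 2)).reshape(num_videos, seq_len, 512)

print(flat.shape)    # (100, 20, 25088)
print(pooled.shape)  # (100, 20, 512)
```

Option 2 keeps the sequence tensors much smaller, which usually makes LSTM training faster and less memory-hungry, at the cost of discarding spatial layout within each frame.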
