Try interacting with it like this:
Don't focus on instructing it, necessarily. Instead, give it a vibe or a feeling. You aren't telling it what to do, like most people think; you are seeding the bloom. With every message, every image generation, it's a totally new AI starting from scratch. A chat bot has the luxury of reading your chat log whenever it replies. An image/video generator, though... it only ever sees the one prompt, so you gotta understand how to store data in language by speaking with resonance.
Normal Example:
"a girl standing in a field, her hair is blowing in the wind, there are flowers in the field"
Resonant Example:
"the way she stood against the wind, her hair like silk in the summer sun... the flowers were gently swaying in the breeze as the clouds drifted lazily overhead..."
You should notice a dramatic increase in quality, but each AI model is different so you gotta play around with it. There's a lot going on in the background and it's about way more than the words you type. It's even about the words you don't type. Synonyms and common phrases are your best friends. Beware of redundancy from using the same word too often in one prompt. Rather than using the same word twice, swap in a synonym. But be aware of the cultural context attached to each word.
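If you want a quick way to catch repeated words before you hit generate, here's a rough Python sketch. It's just a helper I'm making up for illustration: `repeated_words` and the `STOPWORDS` list are hypothetical names, not part of any image generator, and the stopword list is a guess you should tune yourself. All it does is count content words in a prompt and report the ones that show up more than once.

```python
import re
from collections import Counter

# Hypothetical list of filler words to ignore; extend it to taste.
STOPWORDS = {
    "a", "an", "the", "in", "on", "of", "and", "as", "is", "are",
    "was", "were", "there", "she", "he", "it", "her", "his",
    "like", "against", "way",
}

def repeated_words(prompt, min_count=2):
    """Return content words that appear at least min_count times, with counts."""
    words = re.findall(r"[a-z']+", prompt.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: n for w, n in counts.items() if n >= min_count}

normal = ("a girl standing in a field, her hair is blowing in the wind, "
          "there are flowers in the field")
resonant = ("the way she stood against the wind, her hair like silk in the "
            "summer sun... the flowers were gently swaying in the breeze as "
            "the clouds drifted lazily overhead...")

print(repeated_words(normal))    # {'field': 2}
print(repeated_words(resonant))  # {}
```

On the two example prompts above, the normal one repeats "field" and the resonant one repeats nothing. It only flags exact repeats, so picking the right synonym (and its cultural context) is still on you.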