Intermediate · 45 min · Module 6 of 7

AI Image, Video & Audio

Midjourney V7, GPT Image, Stable Diffusion, FLUX, Runway, Sora 2, ElevenLabs v3, and creative workflows.

AI-generated images, video, and audio have gone from novelty to professional-grade in under three years. As of early 2026, you can generate photorealistic images in seconds, create video clips from text descriptions, and clone voices with remarkable fidelity. In this module, you'll learn the major tools, how to use them effectively, and the creative workflows that are reshaping content creation.

AI Image Generation: The Landscape in 2026

The AI image generation space has matured significantly. Multiple tools now produce photorealistic, highly controllable images, each with distinct strengths. Here's the current state of play:

  • Midjourney (V7, April 2025). Strengths: artistic quality, aesthetic control, and a rebuilt architecture with a web-based editor. Access: web app at midjourney.com; subscription plans start at $10/mo.
  • GPT Image 1.5 (December 2025; replaced DALL-E). Strengths: text rendering, instruction following, ChatGPT integration; ranked #1 on the LM Arena image leaderboard. Access: built into ChatGPT (Plus, Team, Enterprise).
  • FLUX.2 (Pro / Flex / Dev variants). Strengths: speed, quality, and open-weight options; from Black Forest Labs, founded by the creators of Stable Diffusion. Access: API, plus integrations in many third-party tools.
  • Stable Diffusion (SD 3.5). Strengths: open source, local generation, fine-tuning, and full control over the pipeline. Access: free (open source); runs locally or via cloud services.

What Happened to DALL-E?
OpenAI's DALL-E, one of the original AI image generators, has been deprecated. It was replaced by GPT Image 1.5 in December 2025, which is natively integrated into ChatGPT. GPT Image 1.5 represents a significant leap in quality, especially for text rendering within images and following complex instructions. It quickly rose to #1 on the LM Arena image generation leaderboard.

How to Write Effective Image Prompts

The quality of your AI-generated images depends heavily on how you describe what you want. Good prompting is a skill that improves with practice. Here's a framework for writing effective prompts:

The Prompt Formula

A strong image prompt typically includes these elements, roughly in this order:

  1. Subject: what is the main subject? Be specific: "a golden retriever puppy," not just "a dog."
  2. Action / pose: what is the subject doing? "Running through a field," "sitting at a desk," "looking directly at camera."
  3. Setting / background: where is the scene? "In a sunlit forest," "on a busy Tokyo street," "against a clean white background."
  4. Style / medium: what visual style? "Photorealistic," "watercolor painting," "3D render," "flat vector illustration."
  5. Lighting / mood: "soft golden hour light," "dramatic chiaroscuro," "neon-lit cyberpunk," "bright and airy."
  6. Technical details: camera angle, lens type, aspect ratio: "shot from below," "35mm lens," "wide angle," "16:9 aspect ratio."
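Assembled in order, the formula yields a single comma-separated prompt. A minimal sketch of that assembly; the helper and its field names are illustrative, not any tool's API:

```python
# Illustrative helper: join the six prompt elements in the recommended
# order, skipping any element left blank.
def build_prompt(subject, action, setting, style, lighting, technical=""):
    parts = [subject, action, setting, style, lighting, technical]
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    subject="a golden retriever puppy",
    action="running through a field",
    setting="in a sunlit forest",
    style="photorealistic",
    lighting="soft golden hour light",
    technical="35mm lens, 16:9 aspect ratio",
)
# The result reads: "a golden retriever puppy, running through a field, ..."
```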

Iteration Is Key
Rarely does the first prompt produce exactly what you want. Treat image generation as an iterative process: generate, evaluate, refine the prompt, and regenerate. Most professionals go through 5-15 iterations to get the perfect image. Use features like Midjourney's "vary" and "remix" modes or ChatGPT's ability to edit specific areas of an image to refine your results.

Common Prompt Mistakes

  • Too vague: "A cool landscape" — what kind? Where? What time of day? What style?
  • Too long and contradictory: Extremely long prompts with conflicting instructions confuse the model
  • Ignoring negative prompts: Specify what you don't want when applicable (Midjourney uses --no, Stable Diffusion uses negative prompts)
  • Wrong aspect ratio: Always specify the aspect ratio for your intended use (social media, website header, print, etc.)
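In Midjourney, negative prompts and aspect ratios are expressed as trailing flags (`--no` and `--ar`), while Stable Diffusion takes a separate negative-prompt field. A minimal sketch of appending Midjourney-style flags; the helper itself is hypothetical:

```python
# Hypothetical helper: append Midjourney-style trailing flags to a prompt.
# `--no` excludes unwanted elements; `--ar` sets the aspect ratio.
def with_midjourney_flags(prompt, no=None, ar=None):
    if no:
        prompt += f" --no {no}"
    if ar:
        prompt += f" --ar {ar}"
    return prompt

p = with_midjourney_flags(
    "a minimalist product shot of a ceramic mug",
    no="text, watermark",
    ar="16:9",
)
# p == "a minimalist product shot of a ceramic mug --no text, watermark --ar 16:9"
```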

AI Video Generation

AI video generation has made remarkable progress. While we're not yet at the point of generating feature films, the current generation of tools can produce short clips (typically 5-30 seconds) with impressive visual quality and coherence.

  • Runway (Gen-4.5). Strengths: top-rated quality (1247 Elo), consistent motion, professional-grade output; also launched GWM-1 (General World Model). Notes: web-based; subscription plans available.
  • Sora (Sora 2, September 2025). Strengths: high-quality text-to-video, strong scene understanding. Notes: limited availability, primarily the US and Canada.
  • Google Veo (Veo 3 / 3.1). Strengths: native audio generation alongside video, strong coherence, Google ecosystem integration. Notes: available through Google AI tools.
  • Pika (current 2025 release). Strengths: accessible interface, creative effects, image-to-video conversion. Notes: web-based; 120K+ monthly active users.

Practical Video Generation Workflows

Here's how professionals are using AI video tools in real workflows today:

  • Social media content: Generate eye-catching short clips for Instagram Reels, TikTok, or YouTube Shorts from text descriptions
  • Product visualization: Create product showcase videos without expensive shoots — describe the product, setting, and camera movement
  • Concept videos: Quickly visualize ideas for client pitches or internal presentations before investing in production
  • B-roll and stock footage: Generate specific B-roll clips instead of searching stock footage libraries
  • Storyboarding: Use AI to generate visual storyboard frames, then refine the best ones into video clips

Runway Gen-4.5 and World Models
Runway's Gen-4.5 currently leads video generation quality rankings with an Elo score of 1247. Beyond their generation model, Runway also launched GWM-1 (General World Model), which aims to understand and simulate real-world physics and dynamics. This represents a shift from pure video generation toward AI systems that truly understand how the physical world works — a foundation for even more realistic and controllable video generation in the future.

AI Voice and Audio

AI audio has become remarkably capable, with voice synthesis that's often indistinguishable from real human speech. The two standout tools in this space serve very different use cases:

ElevenLabs

ElevenLabs, now at $330M in annual recurring revenue, is the leading AI voice platform; its latest text-to-speech model is Eleven v3. The technology produces natural, expressive speech with fine-grained control over emotion, pacing, and style.

  • Eleven v3 TTS: Latest text-to-speech model with improved naturalness and expressiveness
  • ElevenAgents: Voice-powered AI agents for phone calls, customer service, and interactive applications
  • ElevenCreative: Tools for creative audio projects including audiobooks and character voices
  • Voice cloning: Create a digital copy of any voice from a short audio sample (with consent)
  • 29+ languages: Generate speech in dozens of languages with natural accents

NotebookLM Audio Overviews

Google's NotebookLM offers a unique "Audio Overview" feature that transforms documents, articles, and research papers into engaging podcast-style audio discussions between two AI hosts.

  • Document-to-podcast: Upload any document and get a natural-sounding discussion about its contents
  • Research synthesis: Upload multiple sources and NotebookLM synthesizes them into a coherent audio overview
  • Learning tool: Convert dense material into an accessible listening format for commutes or exercise
  • Free to use: Available at notebooklm.google.com with a Google account

Audio Use Cases

  • Audiobook narration: Convert written content into professional narrated audio using ElevenLabs
  • Podcast production: Generate intro/outro voiceovers, or use NotebookLM to create discussion-format content
  • Multilingual content: Translate and voice your content in dozens of languages without hiring voice actors
  • Accessibility: Make written content accessible to visually impaired users with natural-sounding narration
  • Prototyping: Test voice interfaces, IVR systems, or voice assistant scripts before investing in professional recording
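Most of these use cases involve long-form text, and TTS APIs generally cap the characters accepted per request, so narration pipelines split the text at sentence boundaries first. A minimal sketch, assuming an illustrative 2,500-character limit rather than any specific provider's quota:

```python
import re

def chunk_for_tts(text, max_chars=2500):
    """Split text into chunks of at most max_chars, breaking at sentence
    boundaries. A single sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then sent as its own synthesis request, and the resulting audio segments are concatenated in order.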

Practical Creative Workflows

The real power of AI creative tools comes from combining them in workflows. Here are some practical multi-tool creative workflows:

Social Media Content Pipeline

Step 1: Write your message using ChatGPT or Claude. Step 2: Generate supporting images with Midjourney V7 or GPT Image 1.5. Step 3: Create a short video clip with Runway Gen-4.5 for Reels/TikTok. Step 4: Add a voiceover with ElevenLabs if needed. Result: A complete multi-format content package from a single idea.

Product Marketing Visuals

Step 1: Photograph your product on a plain background. Step 2: Use GPT Image 1.5 to place it in lifestyle scenes (on a desk, in a kitchen, outdoors). Step 3: Generate a product showcase video with Runway or Pika. Step 4: Create platform-specific sizes and formats. Result: A full suite of marketing visuals without a photo shoot.

Educational Content Creation

Step 1: Write educational content with AI assistance. Step 2: Generate diagrams and illustrations with Midjourney or GPT Image 1.5. Step 3: Upload the content to NotebookLM for an audio overview version. Step 4: Create short explainer video clips for key concepts. Result: Multi-format educational content (text, visual, audio, video) from one source.
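All three workflows share the same shape: a chain of generation steps where each step consumes the previous step's output. A minimal orchestration sketch with stub functions standing in for real tool calls; every name below is hypothetical:

```python
# Stubs standing in for real tool calls (ChatGPT/Claude, Midjourney/GPT
# Image, Runway/Pika). Each returns a tagged string so the flow is visible.
def write_copy(idea):
    return f"Copy for: {idea}"

def generate_image(copy):
    return f"image({copy})"

def generate_video(image):
    return f"video({image})"

def run_pipeline(seed, steps):
    """Feed the seed through each step in order; return the final artifact."""
    artifact = seed
    for step in steps:
        artifact = step(artifact)
    return artifact

result = run_pipeline("new product launch",
                      [write_copy, generate_image, generate_video])
```

Adding a stage (say, a voiceover step) just means inserting another function into the list.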

Copyright and Ethical Considerations

AI-generated creative content exists in an evolving legal and ethical landscape. Here's what you need to know as of 2026:

Copyright Status

The legal landscape around AI-generated content copyright varies by jurisdiction and remains unsettled. In the US, the Copyright Office has indicated that purely AI-generated images without substantial human creative input may not be copyrightable. However, works where AI is used as a tool with significant human direction may qualify. Always check the latest guidance for your jurisdiction.

Ethical Usage

  • Consent for voice cloning: Never clone someone's voice without their explicit permission. ElevenLabs and other platforms have consent verification processes.
  • Deepfake awareness: Do not create realistic images or videos of real people in misleading scenarios. Many platforms prohibit this in their terms of service.
  • Disclosure: When AI-generated content could be mistaken for real photography or footage, consider disclosing its AI origin.
  • Artist impact: Be aware of the ongoing debate about AI training on artists' work. Some tools (like Adobe Firefly) train only on licensed content.

Commercial Usage Rights
Each platform has different terms regarding commercial use of generated content. Midjourney requires a paid plan for commercial use. ChatGPT grants usage rights to generated images. Stable Diffusion depends on the specific model license. FLUX.2 Pro has commercial terms that differ from the open Dev variant. Always review the terms of service before using AI-generated content commercially.

Key Takeaways

  1. The AI image landscape in 2026 is led by Midjourney V7, GPT Image 1.5 (which replaced DALL-E), FLUX.2, and Stable Diffusion 3.5, each with distinct strengths.
  2. Effective image prompts follow a formula: subject, action, setting, style, lighting, and technical details. Iteration is essential; expect 5-15 refinement cycles.
  3. AI video generation is maturing rapidly: Runway Gen-4.5 leads in quality, with Sora 2, Google Veo 3, and Pika as strong alternatives for different use cases.
  4. ElevenLabs Eleven v3 is the leading voice synthesis platform, while NotebookLM offers free document-to-podcast conversion for learning and content creation.
  5. The most powerful creative workflows combine multiple AI tools: text generation, image creation, video production, and voice synthesis working together.
  6. Copyright law for AI-generated content is still evolving. Check platform terms for commercial use rights, and always obtain consent before cloning voices.
