AI Image, Video & Audio
Midjourney V7, GPT Image, Stable Diffusion, FLUX, Runway, Sora 2, ElevenLabs v3, and creative workflows.
AI-generated images, video, and audio have gone from novelty to professional-grade in under three years. As of early 2026, you can generate photorealistic images in seconds, create video clips from text descriptions, and clone voices with remarkable fidelity. In this module, you'll learn the major tools, how to use them effectively, and the creative workflows that are reshaping content creation.
AI Image Generation: The Landscape in 2026
The AI image generation space has matured significantly. Multiple tools now produce photorealistic, highly controllable images, each with distinct strengths. Here's the current state of play:
| Tool | Latest Version | Strengths | Access |
|---|---|---|---|
| Midjourney | V7 (April 2025) | Artistic quality, aesthetic control, rebuilt architecture with web-based editor | Web app at midjourney.com, subscription plans starting at $10/mo |
| GPT Image 1.5 | Dec 2025 (replaced DALL-E) | Text rendering, instruction following, integrated into ChatGPT. Ranked #1 on LM Arena image leaderboard | Built into ChatGPT (Plus, Team, Enterprise) |
| FLUX.2 | Pro / Flex / Dev variants | Speed, quality, open-weight options. From Black Forest Labs (founded by Stable Diffusion creators) | API access, integrated into many third-party tools |
| Stable Diffusion | SD 3.5 | Open weights, local generation, fine-tuning, full control over the pipeline | Free to download (Stability AI Community License); runs locally or via cloud services |
How to Write Effective Image Prompts
The quality of your AI-generated images depends heavily on how you describe what you want. Good prompting is a skill that improves with practice. Here's a framework for writing effective prompts:
The Prompt Formula
A strong image prompt typically includes these elements, roughly in this order:
- Subject: What is the main subject? Be specific: "a golden retriever puppy," not just "a dog."
- Action: What is the subject doing? "Running through a field," "sitting at a desk," "looking directly at camera."
- Setting: Where is the scene? "In a sunlit forest," "on a busy Tokyo street," "against a clean white background."
- Style: What visual style? "Photorealistic," "watercolor painting," "3D render," "flat vector illustration."
- Lighting: What light and mood? "Soft golden hour light," "dramatic chiaroscuro," "neon-lit cyberpunk," "bright and airy."
- Technical details: Camera angle, lens type, aspect ratio: "shot from below," "35mm lens," "wide angle," "16:9 aspect ratio."
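The formula maps naturally onto a small template. This is a minimal sketch, assuming you want a single comma-separated prompt string; the class name `ImagePrompt` and the comma separators are our own conventions, not a requirement of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """The six elements of the prompt formula, in the suggested order."""
    subject: str
    action: str = ""
    setting: str = ""
    style: str = ""
    lighting: str = ""
    technical: str = ""

    def render(self) -> str:
        # Keep formula order and skip any element left empty
        parts = [self.subject, self.action, self.setting,
                 self.style, self.lighting, self.technical]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ImagePrompt(
    subject="a golden retriever puppy",
    action="looking directly at camera",
    setting="in a sunlit forest",
    style="photorealistic",
    lighting="soft golden hour light",
    technical="shot from below, 35mm lens",
)
print(prompt.render())
```

Keeping the elements as separate fields makes it easy to change one of them (say, lighting) while holding the rest fixed as you iterate.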
Common Prompt Mistakes
- Too vague: "A cool landscape" — what kind? Where? What time of day? What style?
- Too long and contradictory: Extremely long prompts with conflicting instructions confuse the model
- Ignoring negative prompts: Specify what you don't want when applicable (Midjourney uses the --no parameter; Stable Diffusion uses a separate negative prompt)
- Wrong aspect ratio: Always specify the aspect ratio for your intended use (social media, website header, print, etc.)
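Midjourney takes the aspect ratio and exclusions as parameters appended to the prompt text (`--ar` and `--no`), while Stable Diffusion front ends usually take the negative prompt as a separate field. The helpers below are a hypothetical sketch of both shapes; the function names are illustrative, and the dictionary keys mirror the `prompt` / `negative_prompt` arguments used by Hugging Face diffusers pipelines.

```python
from typing import Dict, Optional, Sequence


def midjourney_prompt(description: str, aspect_ratio: str = "1:1",
                      exclude: Optional[Sequence[str]] = None) -> str:
    """Append Midjourney-style --ar and --no parameters to a prompt."""
    w, h = (int(x) for x in aspect_ratio.split(":"))  # expects "W:H"
    if w <= 0 or h <= 0:
        raise ValueError("aspect ratio terms must be positive")
    parts = [description.strip(), f"--ar {w}:{h}"]
    if exclude:
        # --no takes a comma-separated list of things to leave out
        parts.append("--no " + ", ".join(exclude))
    return " ".join(parts)


def stable_diffusion_inputs(description: str,
                            exclude: Sequence[str] = ()) -> Dict[str, str]:
    """Stable Diffusion takes the exclusions as a separate negative prompt."""
    return {"prompt": description.strip(),
            "negative_prompt": ", ".join(exclude)}


print(midjourney_prompt("a busy Tokyo street at night, neon-lit cyberpunk",
                        "16:9", ["text", "watermark"]))
```

Writing `--ar 1:1` explicitly is a style choice; Midjourney defaults to a square image when `--ar` is omitted.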
AI Video Generation
AI video generation has made remarkable progress. While we're not yet at the point of generating feature films, the current generation of tools can produce short clips (typically 5-30 seconds) with impressive visual quality and coherence.
| Tool | Latest Version | Strengths | Notes |
|---|---|---|---|
| Runway | Gen-4.5 | Top-rated quality (1247 Elo), consistent motion, professional-grade output. Also launched GWM-1 (General World Model) | Web-based, subscription plans available |
| Sora | Sora 2 (Sept 2025) | High-quality text-to-video, strong scene understanding | Limited availability — primarily US and Canada |
| Google Veo | Veo 3 / 3.1 | Native audio generation with video, strong coherence, Google ecosystem integration | Available through Google AI tools |
| Pika | Current (2025) | Accessible interface, creative effects, image-to-video conversion | Web-based, 120K+ monthly active users |
Practical Video Generation Workflows
Here's how professionals are using AI video tools in real workflows today:
- Social media content: Generate eye-catching short clips for Instagram Reels, TikTok, or YouTube Shorts from text descriptions
- Product visualization: Create product showcase videos without expensive shoots — describe the product, setting, and camera movement
- Concept videos: Quickly visualize ideas for client pitches or internal presentations before investing in production
- B-roll and stock footage: Generate specific B-roll clips instead of searching stock footage libraries
- Storyboarding: Use AI to generate visual storyboard frames, then refine the best ones into video clips
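For product visualization and storyboarding, it can help to write each shot as structured fields (subject, setting, camera movement) and render them into prompts. This is a hypothetical sketch: `Shot` and `plan_storyboard` are illustrative names, and clip length is kept out of the prompt text on the assumption that your video tool sets duration as a separate option.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Shot:
    """One storyboard frame: what to show and how the camera moves."""
    subject: str
    setting: str
    camera_movement: str
    seconds: int = 5  # clip length, assumed to be a separate tool setting

    def to_prompt(self) -> str:
        return f"{self.subject} {self.setting}, {self.camera_movement}"


def plan_storyboard(shots: List[Shot]) -> Tuple[List[str], int]:
    """Return one text-to-video prompt per shot plus total runtime in seconds."""
    prompts = [shot.to_prompt() for shot in shots]
    total_seconds = sum(shot.seconds for shot in shots)
    return prompts, total_seconds


shots = [
    Shot("a matte-black water bottle", "on a wet rock by a mountain stream",
         "slow dolly-in"),
    Shot("the same bottle", "in a hiker's backpack side pocket",
         "handheld tracking shot", seconds=8),
]
prompts, runtime = plan_storyboard(shots)
for p in prompts:
    print(p)
print("total seconds:", runtime)
```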
AI Voice and Audio
AI audio has become remarkably capable, with voice synthesis that's often indistinguishable from real human speech. The two standout tools in this space serve very different use cases:
ElevenLabs
ElevenLabs ($330M ARR) is the leading AI voice platform, and its current text-to-speech model is Eleven v3. Its technology produces natural, expressive speech with fine-grained control over emotion, pacing, and style.
- Eleven v3 TTS: Latest text-to-speech model with improved naturalness and expressiveness
- ElevenAgents: Voice-powered AI agents for phone calls, customer service, and interactive applications
- ElevenCreative: Tools for creative audio projects including audiobooks and character voices
- Voice cloning: Create a digital copy of any voice from a short audio sample (with consent)
- 29+ languages: Generate speech in dozens of languages with natural accents
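If you script narration rather than using the web app, ElevenLabs exposes text-to-speech over a REST API. The sketch below only builds the request with Python's standard library and does not send it. The endpoint path and `xi-api-key` header reflect our understanding of ElevenLabs' public API; the `eleven_v3` model ID and the placeholder voice ID and key are assumptions to check against the current API reference.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_v3") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech POST request."""
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "xi-api-key": api_key,             # account API key
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",            # ask for MP3 audio back
        },
        method="POST",
    )


req = build_tts_request("Welcome to chapter one.",
                        voice_id="YOUR_VOICE_ID", api_key="YOUR_API_KEY")
# Sending it requires a real key and voice ID, for example:
# with urllib.request.urlopen(req) as resp, open("chapter1.mp3", "wb") as f:
#     f.write(resp.read())
```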
NotebookLM Audio Overviews
Google's NotebookLM offers a unique "Audio Overview" feature that transforms documents, articles, and research papers into engaging podcast-style audio discussions between two AI hosts.
- Document-to-podcast: Upload any document and get a natural-sounding discussion about its contents
- Research synthesis: Upload multiple sources and NotebookLM synthesizes them into a coherent audio overview
- Learning tool: Convert dense material into an accessible listening format for commutes or exercise
- Free to use: Available at notebooklm.google.com with a Google account
Audio Use Cases
- Audiobook narration: Convert written content into professional narrated audio using ElevenLabs
- Podcast production: Generate intro/outro voiceovers, or use NotebookLM to create discussion-format content
- Multilingual content: Translate and voice your content in dozens of languages without hiring voice actors
- Accessibility: Make written content accessible to visually impaired users with natural-sounding narration
- Prototyping: Test voice interfaces, IVR systems, or voice assistant scripts before investing in professional recording
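Long-form uses such as audiobook narration usually mean sending text to a TTS service in pieces, since requests tend to have a character limit. A minimal sketch, assuming a hypothetical limit of 2,500 characters per request (check your provider's actual limit): split at sentence boundaries and pack sentences into chunks without exceeding the limit.

```python
import re
from typing import List


def chunk_for_tts(text: str, max_chars: int = 2500) -> List[str]:
    """Split text into chunks of at most max_chars, breaking between sentences."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the limit is hard-split as a fallback
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_for_tts("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Splitting between sentences rather than at a fixed character count avoids cutting words mid-phrase, which tends to produce audible glitches at chunk boundaries.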
Practical Creative Workflows
The real power of AI creative tools comes from combining them in workflows. Here are some practical multi-tool creative workflows:
Social Media Content Pipeline
- Step 1: Write your message using ChatGPT or Claude.
- Step 2: Generate supporting images with Midjourney V7 or GPT Image 1.5.
- Step 3: Create a short video clip with Runway Gen-4.5 for Reels/TikTok.
- Step 4: Add a voiceover with ElevenLabs if needed.
- Result: A complete multi-format content package from a single idea.
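This pipeline is essentially a chain in which each step feeds the next. A hypothetical sketch of that structure: each tool is passed in as a plain function, so real API clients could be plugged in later. The function names and the lambda stand-ins in the usage example are placeholders, not real integrations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContentPackage:
    idea: str
    text: str
    image: str
    video: str
    voiceover: Optional[str] = None


def run_social_pipeline(
    idea: str,
    write: Callable[[str], str],
    make_image: Callable[[str], str],
    make_video: Callable[[str, str], str],
    make_voiceover: Optional[Callable[[str], str]] = None,
) -> ContentPackage:
    """Chain the four steps; each callable stands in for one tool."""
    text = write(idea)                # Step 1: write the message
    image = make_image(text)          # Step 2: supporting image
    video = make_video(image, text)   # Step 3: short clip from image + copy
    voice = make_voiceover(text) if make_voiceover else None  # Step 4: optional
    return ContentPackage(idea, text, image, video, voice)


# Stand-ins for real tool calls (hypothetical file names)
package = run_social_pipeline(
    "our spring menu launch",
    write=lambda idea: f"Post copy about {idea}",
    make_image=lambda text: "hero.png",
    make_video=lambda image, text: "reel.mp4",
)
print(package)
```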
Product Marketing Visuals
- Step 1: Photograph your product on a plain background.
- Step 2: Use GPT Image 1.5 to place it in lifestyle scenes (on a desk, in a kitchen, outdoors).
- Step 3: Generate a product showcase video with Runway or Pika.
- Step 4: Create platform-specific sizes and formats.
- Result: A full suite of marketing visuals without a photo shoot.
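Creating platform-specific sizes is mostly arithmetic: pick an aspect ratio per platform and derive pixel dimensions. This sketch assumes a 1080-pixel short edge, which yields widely used sizes such as 1080x1920 for vertical video; the platform-to-ratio mapping is illustrative, so confirm each platform's current specs.

```python
from typing import Dict, Tuple

# Illustrative platform-to-aspect-ratio mapping; confirm current platform specs
PLATFORM_RATIOS: Dict[str, Tuple[int, int]] = {
    "vertical_video": (9, 16),   # Reels, TikTok, YouTube Shorts
    "square_post": (1, 1),
    "portrait_post": (4, 5),
    "landscape_banner": (16, 9),
}


def size_for(ratio_w: int, ratio_h: int, short_edge: int = 1080) -> Tuple[int, int]:
    """Pixel (width, height) for an aspect ratio, given the shorter side."""
    if ratio_w <= ratio_h:   # portrait or square: width is the short edge
        return short_edge, round(short_edge * ratio_h / ratio_w)
    # landscape: height is the short edge
    return round(short_edge * ratio_w / ratio_h), short_edge


for name, (w, h) in PLATFORM_RATIOS.items():
    print(name, size_for(w, h))
```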
Educational Content Creation
- Step 1: Write educational content with AI assistance.
- Step 2: Generate diagrams and illustrations with Midjourney or GPT Image 1.5.
- Step 3: Upload the content to NotebookLM for an audio overview version.
- Step 4: Create short explainer video clips for key concepts.
- Result: Multi-format educational content (text, visual, audio, video) from one source.
Copyright and Ethical Considerations
AI-generated creative content exists in an evolving legal and ethical landscape. Here's what you need to know as of 2026:
Copyright Status
The legal landscape around AI-generated content copyright varies by jurisdiction and remains unsettled. In the US, the Copyright Office has indicated that purely AI-generated images without substantial human creative input may not be copyrightable. However, works where AI is used as a tool with significant human direction may qualify. Always check the latest guidance for your jurisdiction.
Ethical Usage
- Consent for voice cloning: Never clone someone's voice without their explicit permission. ElevenLabs and other platforms have consent verification processes.
- Deepfake awareness: Do not create realistic images or videos of real people in misleading scenarios. Many platforms prohibit this in their terms of service.
- Disclosure: When AI-generated content could be mistaken for real photography or footage, consider disclosing its AI origin.
- Artist impact: Be aware of the ongoing debate about AI training on artists' work. Some tools, such as Adobe Firefly, are trained only on licensed and public-domain content.
Recommended Resources
- Midjourney (Midjourney, Inc.): Leading AI image generator, now on V7 with a rebuilt web editor. Known for exceptional artistic quality and aesthetic control.
- Runway (Runway AI, Inc.): Top-rated AI video generation platform. Gen-4.5 leads quality rankings. Also offers image generation, editing, and world model research.
- ElevenLabs: Industry-leading AI voice platform with Eleven v3 TTS, voice cloning, and multilingual support. Powers audiobooks, podcasts, and voice agents.
- NotebookLM (Google): Free tool that transforms documents into podcast-style audio discussions. Upload research papers, articles, or notes and get engaging audio overviews.
- "AI Art & Image Generation — Complete Beginner Guide" (Matt Wolfe): Matt Wolfe covers the latest AI tools and workflows with practical, beginner-friendly tutorials on image, video, and audio generation.
Key Takeaways
- The AI image landscape in 2026 is led by Midjourney V7, GPT Image 1.5 (which replaced DALL-E), FLUX.2, and Stable Diffusion 3.5, each with distinct strengths.
- Effective image prompts follow a formula: subject, action, setting, style, lighting, and technical details. Iteration is essential; expect 5-15 refinement cycles.
- AI video generation is maturing rapidly: Runway Gen-4.5 leads in quality, with Sora 2, Google Veo 3, and Pika as strong alternatives for different use cases.
- ElevenLabs, now on Eleven v3, is the leading voice synthesis platform, while NotebookLM offers free document-to-podcast conversion for learning and content creation.
- The most powerful creative workflows combine multiple AI tools: text generation, image creation, video production, and voice synthesis working together.
- Copyright law for AI-generated content is still evolving. Check platform terms for commercial use rights, and always obtain consent before cloning voices.