AI Creative Glossary: 85+ Terms for AI Creators
Complete reference glossary for AI art, music, and video creation. Definitions for LoRA, CFG scale, diffusion models, prompt engineering, and 85+ more terms.
The AI creative space moves fast, and the terminology can be overwhelming. LoRA, CFG scale, latent space, embeddings, inference: if you're hearing these terms in tutorials or community forums but don't fully understand them, you're not alone. This glossary is your decoder ring for the AI creative world, covering 85+ essential terms across image, music, video, and text generation.
Browse the terms alphabetically, or use your browser's find function to jump straight to a definition. Each term includes a plain-English explanation and context about where it's used. Bookmark this page; you'll refer back to it constantly.
A
AI Art
Visual artwork created using artificial intelligence tools, typically through text-to-image models like Midjourney, DALL-E, or Stable Diffusion. AI art generation works by training neural networks on millions of images, then using text prompts to guide the creation of new images. Controversial in some art communities but rapidly gaining mainstream acceptance.
Aesthetic Score
A numerical rating (typically 1-10) that some AI models use to evaluate the visual appeal of generated images. Based on training data that includes human aesthetic preferences. Higher scores generally correlate with more "pleasing" compositions, but aesthetic is subjective. Some Stable Diffusion models allow you to target specific aesthetic score ranges.
AIVA
AI music composition tool specializing in orchestral and soundtrack music. AIVA (Artificial Intelligence Virtual Artist) generates original compositions based on genre, mood, and style preferences. Popular for film scoring, video game music, and content creators needing royalty-free tracks. Offers more fine-tuned control than Suno but less vocal capability.
Aspect Ratio
The proportional relationship between width and height of an image or video, expressed as two numbers (e.g., 16:9, 4:3, 1:1). Critical for AI image generation because different ratios suit different purposes: 16:9 for landscape/YouTube, 9:16 for vertical/TikTok, 1:1 for Instagram. In Midjourney, set with --ar. In DALL-E, describe in natural language. In Stable Diffusion, set explicit width/height values.
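For tools that want explicit width/height values, the ratio has to be turned into pixel dimensions. The sketch below is one way to do that; the helper name `dims_for_ratio`, the one-megapixel default budget, and snapping to multiples of 64 are illustrative assumptions, not settings taken from any particular tool.

```python
import math

def dims_for_ratio(ratio_w: int, ratio_h: int,
                   target_pixels: int = 1024 * 1024,
                   multiple: int = 64) -> tuple[int, int]:
    """Return (width, height) near target_pixels with the requested ratio.

    target_pixels and multiple are assumed defaults for illustration.
    """
    # Ideal (unrounded) sides for the requested area and ratio.
    ideal_h = math.sqrt(target_pixels * ratio_h / ratio_w)
    ideal_w = ideal_h * ratio_w / ratio_h

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(ideal_w), snap(ideal_h)
```

Because of the snapping, the result only approximates the requested ratio: `dims_for_ratio(16, 9)` gives a 1344 by 768 canvas, which is close to 16:9 but not exact.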
AUTOMATIC1111
The most popular web interface for running Stable Diffusion locally. Often called "A1111" or "SD WebUI." Provides a full-featured UI for text-to-image, image-to-image, inpainting, and more. Requires installation and a decent GPU but gives you complete control and unlimited generations. Alternative to ComfyUI and commercial services.
B
Batch Generation
Creating multiple images, audio tracks, or outputs in a single operation. Useful for testing variations or producing volume. Most tools support batch generation: Midjourney creates 4 images per prompt by default, Stable Diffusion lets you specify batch count, Suno can generate multiple songs at once. Be aware of credit consumption when batching on paid plans.
Bokeh
The aesthetic quality of out-of-focus areas in photography, characterized by soft, blurred backgrounds that make subjects pop. In AI image generation, requesting "bokeh" or "shallow depth of field" helps create this effect. Particularly effective for portrait prompts. Quality varies by model: photorealistic models handle bokeh better than illustration-focused ones.
C
CFG Scale (Classifier-Free Guidance Scale)
Controls how closely an AI model follows your prompt versus generating freely. Higher CFG values (12-20) force strict prompt adherence but can cause oversaturation and artifacts. Lower values (1-5) give the model creative freedom but may stray from your intent. The sweet spot for most use cases is 7-12. Midjourney does not expose a CFG setting; its --stylize parameter is only a loose analogue (lower stylize values follow the prompt more literally). Critical parameter in Stable Diffusion.
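Under the hood, classifier-free guidance combines two noise predictions at every step: one made with your prompt and one made without it. A minimal sketch of that combination follows, using plain lists in place of real tensors; the function name `apply_cfg` is made up for illustration.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Blend unconditional and prompt-conditioned predictions.

    scale = 0 ignores the prompt, scale = 1 uses the conditioned
    prediction as-is, and larger values push further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

Values above 1 extrapolate past the conditioned prediction, which is one way to see why very high settings can produce oversaturated or artifact-heavy images.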
CLIP Skip
In Stable Diffusion, determines which layer of the CLIP text encoder to stop at when processing your prompt. CLIP Skip 1 (default) uses the final layer for maximum prompt fidelity. CLIP Skip 2 uses an earlier layer, often producing more artistic, less literal interpretations. Anime and illustration models often recommend CLIP Skip 2. Technical setting most beginners can ignore.
ComfyUI
A node-based interface for Stable Diffusion that lets you build complex generation workflows by connecting visual nodes. More powerful than AUTOMATIC1111 for advanced users, but steeper learning curve. Ideal for automating multi-step processes like upscaling + face restoration + style transfer. Preferred by professionals for its flexibility and customization.
Concept Art
Visual representations created during early creative development, typically for games, films, or products. AI tools excel at rapid concept art iteration, letting you explore dozens of design directions in minutes. Prompting for "concept art" typically yields painterly, detailed illustrations with clear subject focus. Popular style descriptor across all image generation tools.
Conditioning
The process of influencing AI generation using additional information beyond the text prompt, such as reference images, masks, depth maps, or control signals. ControlNet is the most common conditioning system. Think of it as giving the AI extra "context clues" about what you want. Powerful for maintaining consistency or precise control over composition.
ControlNet
An extension for Stable Diffusion that allows precise control over image composition using reference images. You provide a "control image" (edge map, depth map, pose skeleton, etc.) and ControlNet forces the generated image to match that structure while following your text prompt. Game-changer for consistency and complex compositions. Essentially a way to tell the AI "make this, but with this exact layout."
CLIP
OpenAI's "Contrastive Language-Image Pre-training" model that understands connections between text and images. The "translator" that converts your text prompt into a format the image generation model can understand. CLIP's training on 400 million image-text pairs is why AI models "know" what concepts look like. All major image generators use some form of CLIP or similar text encoding.
D
DALL-E
OpenAI's text-to-image AI model. DALL-E 3 (current version, available via ChatGPT) excels at following complex prompts and generating text within images. Known for safety guardrails and content policy enforcement. Prefers natural language prompts over keyword lists. Integrated with ChatGPT Plus ($20/mo) or available via API. Generally more "literal" than Midjourney's artistic interpretation.
Denoising Strength
In image-to-image generation (img2img), controls how much the AI alters the input image. Value from 0 to 1: 0.3 makes subtle changes (recolor, refine), 0.7 significantly transforms (change style, composition), 0.9+ nearly ignores the input. The key parameter for balancing "keep the original structure" vs "creative reinterpretation." Start around 0.5-0.7 and adjust.
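One common convention (assumed here; tools differ in the details) is that denoising strength decides how many of the requested sampling steps actually run on the input image. The helper name `img2img_steps` is hypothetical.

```python
def img2img_steps(total_steps: int, strength: float) -> int:
    """Number of denoising steps applied to the input image.

    Assumes the convention steps_run = floor(total_steps * strength).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(total_steps * strength), total_steps)
```

Under this convention, a strength of 0.5 with 30 steps runs only 15 of them, so img2img at low strength is also faster than a full generation.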
Deep Dream
Google's early neural network visualization technique that creates surreal, psychedelic images by amplifying patterns the network detects. Characterized by recursive, fractal-like patterns and lots of "eyes" and "dog faces." Mostly a historical curiosity now, but influenced modern AI art aesthetics. Predates modern diffusion models.
Diffusion Model
The underlying architecture powering most modern AI image generators. Works by gradually adding noise to images during training, then learning to reverse that process. Generation starts with pure noise and progressively "de-noises" it into a coherent image matching your prompt. Stable Diffusion, Midjourney, and DALL-E all use variants of diffusion models. More technically: DDPM (Denoising Diffusion Probabilistic Models).
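The "adding noise" half of training can be sketched in a few lines. This is a toy version on a list of numbers rather than an image, using the standard DDPM mixing rule; `add_noise` and its `alpha_bar` argument (the fraction of the original signal that survives, between 0 and 1) are names chosen for illustration.

```python
import math
import random

def add_noise(x0: list[float], alpha_bar: float, rng: random.Random) -> list[float]:
    """Mix clean data with Gaussian noise.

    alpha_bar = 1.0 returns the clean data; alpha_bar = 0.0 returns pure noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]
```

During training the model sees the noisy result and learns to predict the noise that was added; generation runs that prediction in reverse, starting from the pure-noise end.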
E
Embedding
A custom-trained concept you can inject into Stable Diffusion prompts using a short trigger word. Created through "textual inversion": learning a new token embedding from 5-20 images of a specific subject, style, or person while the base model itself stays unchanged. Lighter weight than LoRA, less versatile, but faster. Example: train an embedding on your face, then use it with <myface> in prompts. Popular on Civitai and HuggingFace.
ElevenLabs
Leading AI voice synthesis and text-to-speech platform. Creates incredibly realistic human-sounding voices from text or clones voices from short audio samples. Popular for audiobooks, video voiceovers, and character voices. Tiered pricing from free (10K characters/mo) to Creator ($22/mo, 100K characters). Controversial due to potential misuse for deepfakes.
Epochs
In AI training, one complete pass through the entire training dataset. More epochs = more learning, but too many causes "overfitting" where the model memorizes training data instead of generalizing. When training custom LoRAs or embeddings, you'll typically run 10-50 epochs. For end users (not training models), this is mostly invisible.
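Training tools often report progress in steps rather than epochs, so it helps to convert between the two. A rough sketch follows, assuming each step processes one batch; the `repeats` argument (how many times each image is shown per epoch) is an assumption borrowed from common LoRA training setups.

```python
import math

def training_steps(num_images: int, batch_size: int, epochs: int, repeats: int = 1) -> int:
    """Total optimizer steps for a training run."""
    # A partial final batch still counts as one step, hence the ceiling.
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs
```

For example, 20 images at batch size 4 for 10 epochs works out to 50 steps.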
Euler Sampler
A sampling algorithm used during Stable Diffusion image generation. Euler (and Euler A) are fast, stable, and produce good results with fewer steps. Alternative samplers include DPM++ 2M Karras, LMS, DDIM, each with different speed/quality tradeoffs. Most beginners can stick with Euler or DPM++ 2M Karras. Choice of sampler affects style subtly, so experiment to find your preference.
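To make the idea concrete, here is a toy Euler sampler that works on a single number instead of an image. It assumes the noise-level ("sigma") formulation, where each step moves the sample along a straight line toward the denoiser's current guess; `euler_sample` and the stand-in `denoise` callback are illustrative names, not any tool's API.

```python
from typing import Callable, Sequence

def euler_sample(x: float, sigmas: Sequence[float],
                 denoise: Callable[[float, float], float]) -> float:
    """Run Euler steps over a decreasing list of noise levels.

    sigmas must be positive except for an optional final 0.
    """
    for sigma_cur, sigma_next in zip(sigmas, sigmas[1:]):
        # Direction pointing from the denoiser's guess to the noisy sample.
        direction = (x - denoise(x, sigma_cur)) / sigma_cur
        # Step from the current noise level to the next, lower one.
        x = x + (sigma_next - sigma_cur) * direction
    return x
```

With a stand-in denoiser that always guesses 3.0, for instance `euler_sample(13.0, [10.0, 5.0, 1.0, 0.0], lambda x, s: 3.0)`, the sample lands on that guess once the noise level reaches zero. A real model's guess changes at every step, which is where the extra steps earn their keep.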
F
Fine-tuning
Further training a pre-trained AI model on specialized data to adapt it for specific styles, subjects, or domains. More involved than LoRA or embeddings; it essentially creates a custom model variant. Example: fine-tuning Stable Diffusion on 10,000 anime images creates an anime-specialized model. Requires technical knowledge, GPU resources, and significant time. Most users consume fine-tuned models rather than create them.
Firefly (Adobe)
Adobe's generative AI suite integrated into Creative Cloud. Includes image generation, generative fill (Photoshop), and more. Key differentiator: trained only on Adobe Stock and public domain content, giving it the cleanest commercial license for business use. Quality trails Midjourney but legal clarity is superior. Included with Creative Cloud subscription ($20-55/mo depending on plan).
Flux
An image generation model family from Black Forest Labs (former Stability AI team) released in 2024, with openly downloadable weights for some versions. Flux.1 competes with SDXL in quality with better prompt adherence and text rendering. Three versions: Flux.1 [pro] (paid API), Flux.1 [dev] (non-commercial), Flux.1 [schnell] (fast, Apache 2.0 license). Gaining rapid adoption in the open-source community.
G
GAN (Generative Adversarial Network)
An earlier AI architecture where two neural networks compete: a "generator" creates images, a "discriminator" judges if they're real or fake. The generator improves by trying to fool the discriminator. GANs powered pre-2022 AI art (like StyleGAN, BigGAN) but have largely been superseded by diffusion models for image generation. Still used in some specialized applications.
Guidance Scale
Another term for CFG Scale. See: CFG Scale above. Different tools use different names for the same concept: guidance scale, CFG scale, and prompt strength all refer to how strictly the model follows your prompt.
Generation Credits
The "currency" many AI platforms use to meter usage. Example: Midjourney Basic gives 3.3 GPU hours ≈ 200 images. Leonardo gives 150 daily tokens ≈ 30 images. Suno Pro gives 500 credits ≈ 100 songs. Different actions consume different amounts: higher resolution costs more, video costs more than images. Monitor credit usage to avoid unexpected overages or throttling.
H
Hallucination (AI)
When an AI generates content that seems plausible but is incorrect or fabricated. In text models, this means making up facts. In image models, it manifests as anatomical errors (extra fingers, warped faces), impossible physics, or nonsensical objects. Not technically a "mistake": the model is functioning as designed, but its training data or prompt interpretation led to implausible output. Quality has improved significantly in 2024-2026.
HuggingFace
A platform for sharing, discovering, and deploying open-source AI models. The "GitHub of AI models," home to thousands of Stable Diffusion checkpoints, LoRAs, embeddings, and text models. Also provides APIs and inference hosting. Essential resource for anyone working with open-source AI. Alternative to Civitai (which focuses more on image models specifically).
Hypernetwork
An older method for customizing Stable Diffusion models, now largely replaced by LoRA. Works by training a small network that modifies the main model's behavior. Hypernetworks were popular in 2022-2023 but have fallen out of favor due to LoRA's superior results and flexibility. Mentioned here for completeness; new users should use LoRA instead.
I
Image-to-Image (img2img)
Using an existing image as a starting point for AI generation rather than starting from noise. The model uses the input image's composition, colors, or structure as guidance while applying your text prompt. Controlled by "denoising strength" parameter. Perfect for variations, style transfer, or refining existing images. Available in most image generation tools.
Inpainting
Selectively editing or replacing parts of an image while keeping the rest unchanged. You "mask" the area to change, provide a new prompt, and the AI regenerates only that region while blending seamlessly with surroundings. Essential tool for fixing mistakes (removing extra fingers), changing elements (swap a red car for blue), or adding objects. DALL-E, Stable Diffusion, and most tools support inpainting.
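The "regenerate only the masked region" idea comes down to a per-pixel blend between the original and the newly generated content. A minimal sketch on flat lists of pixel values follows; real tools work on full image tensors and usually feather the mask edges, and the name `blend_with_mask` is invented for this example.

```python
def blend_with_mask(original: list[float], generated: list[float],
                    mask: list[float]) -> list[float]:
    """Combine two images using a mask.

    mask = 1.0 takes the generated pixel, 0.0 keeps the original,
    and values in between blend the two (a soft edge).
    """
    return [m * g + (1.0 - m) * o for o, g, m in zip(original, generated, mask)]
```

Fractional mask values along the boundary are what make the edited region blend into its surroundings instead of showing a hard seam.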
Inference
The process of using a trained AI model to generate outputs, as opposed to training the model itself. When you hit "generate" in Midjourney or Stable Diffusion, you're running inference. "Inference time" is how long generation takes. Cloud platforms charge for inference; local setups require a GPU capable of running it. Not to be confused with training (which is far more resource-intensive).
Iteration
The process of repeatedly generating, evaluating, and refining outputs to achieve desired results. Central to AI creative workflows: generate → review → adjust prompt → regenerate. Professional AI artists often iterate 10-50+ times per final image. Efficient iteration means systematic prompt refinement rather than random changes. Also: in technical terms, one step in the diffusion denoising process.
J
JPEG Artifacts
Visual distortions caused by lossy image compression: blocky patterns, color banding, blurring around edges. AI models can inadvertently generate images with artifact-like patterns, especially at lower quality settings or with poor prompts. Include "high quality, sharp, no artifacts" in prompts to minimize. Also: actual compression artifacts appear when you save/re-save AI outputs at low quality settings.
K
Kling AI
Chinese AI video generation model by Kuaishou that rivals Runway Gen-3 in quality. Known for excellent motion coherence and cinematic output. Available internationally at $10-30/mo. Prompt syntax and aesthetic differ from Western tools; it tends toward certain visual styles. Worth exploring if Runway's results don't match your needs. Access sometimes restricted by region.
KNN (K-Nearest Neighbor)
A machine learning algorithm sometimes used in AI image tools for similarity search or style matching. Finds the K most similar items to a query. Not central to generation itself, but used in some backend systems for organizing and retrieving training data or style references. Mostly invisible to end users.
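A bare-bones version of nearest-neighbor search fits in a few lines. The sketch below assumes each item (a style reference, say) is already represented as a list of numbers and ranks items by straight-line distance; the function name and the sample data are made up for illustration.

```python
import math

def k_nearest(query: list[float], items: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k items closest to the query vector."""
    ranked = sorted(items, key=lambda name: math.dist(query, items[name]))
    return ranked[:k]

# Hypothetical style vectors, purely for demonstration.
styles = {"watercolor": [0.0, 0.0], "ink": [1.0, 1.0], "neon": [5.0, 5.0]}
closest = k_nearest([0.9, 0.9], styles, k=2)
```

Production systems use approximate indexes rather than sorting everything, but the idea of "closest vectors win" is the same.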
L
Latent Diffusion
The specific type of diffusion model used by Stable Diffusion. Instead of working directly in pixel space (slow, expensive), latent diffusion compresses images into a smaller "latent space," runs diffusion there (fast, cheap), then decodes back to pixels. This innovation makes Stable Diffusion runnable on consumer GPUs. The "Stable" in the name comes from Stability AI, the company behind the model, not from the latent approach.
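The savings are easy to quantify. The sketch below assumes the layout used by Stable Diffusion 1.x and SDXL, where each side is shrunk by a factor of 8 and the latent has 4 channels; other models may use different values, and the function names are illustrative.

```python
def latent_shape(width: int, height: int,
                 downscale: int = 8, channels: int = 4) -> tuple[int, int, int]:
    """Shape of the latent as (channels, height, width)."""
    if width % downscale or height % downscale:
        raise ValueError("width and height must be divisible by the downscale factor")
    return channels, height // downscale, width // downscale

def compression_ratio(width: int, height: int) -> float:
    """How many RGB pixel values map onto one latent value."""
    pixel_values = width * height * 3
    c, h, w = latent_shape(width, height)
    return pixel_values / (c * h * w)
```

Under these assumptions a 512 by 512 image becomes a 4 by 64 by 64 latent, so the diffusion process handles 48 times fewer values than it would in pixel space.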
Latent Space
A compressed, abstract mathematical representation of data used internally by neural networks. Think of it as a "concept map" where similar concepts are located near each other. Moving through latent space produces smooth transitions between concepts, which is why interpolating between two AI-generated images creates coherent intermediate images. Fundamental to how generative models work, but mostly invisible to users.
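"Moving through latent space" can be as simple as blending two latent vectors. Here is a minimal linear-interpolation sketch on plain lists; it is a simplification, since real workflows operate on full latent tensors and may use other interpolation schemes.

```python
def lerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Linear blend between two vectors: t = 0 gives a, t = 1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def interpolation_path(a: list[float], b: list[float], frames: int) -> list[list[float]]:
    """Evenly spaced intermediate vectors, endpoints included (frames >= 2)."""
    return [lerp(a, b, i / (frames - 1)) for i in range(frames)]
```

Decoding each vector along the path is what produces a morphing sequence between two images.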
LoRA (Low-Rank Adaptation)
A lightweight method for fine-tuning AI models without retraining the entire model. LoRAs add specific capabilities (new art styles, characters, concepts) to base models with small file sizes (10-200MB vs 2-7GB for full models). Load multiple LoRAs simultaneously. Installed by placing files in a folder and referencing them in prompts with <lora:name:weight>. The dominant customization method for Stable Diffusion in 2024-2026.
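The small file sizes follow from the "low-rank" part of the name: instead of storing a full replacement for a weight matrix, a LoRA stores two thin matrices whose product is added to it. The sketch below shows the parameter arithmetic and the update itself on plain nested lists; the function names are illustrative, and the `alpha / rank` scaling follows the original LoRA formulation.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """(parameters in the full matrix, parameters in the LoRA pair)."""
    return d_out * d_in, rank * (d_out + d_in)

def lora_delta(b: list[list[float]], a: list[list[float]],
               alpha: float, rank: int) -> list[list[float]]:
    """Weight update (alpha / rank) * B @ A.

    b has shape d_out x rank, a has shape rank x d_in.
    """
    scale = alpha / rank
    return [[scale * sum(b_row[r] * a[r][j] for r in range(rank))
             for j in range(len(a[0]))]
            for b_row in b]
```

For a 768 by 768 layer at rank 8, the LoRA pair holds 12,288 values against 589,824 for the full matrix, which is roughly a 48x reduction for that layer.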
LDRM
Latent Diffusion Resolution Model, an uncommon technical term you may occasionally see in academic papers. Refers to variants of latent diffusion models optimized for specific resolutions. Not commonly used in practice; usually just called "Stable Diffusion" with resolution specified separately. Included here in case you encounter it in technical documentation.
Luma AI
AI company known for 3D capture technology and emerging video generation tools. Luma Dream Machine competes with Runway and Pika for text-to-video. Known for strong motion and camera movement. Pricing and availability have fluctuated, so check current offerings. Worth monitoring as the video generation space evolves rapidly.
M
Midjourney
The most popular AI image generator, known for stunning artistic results and strong aesthetic sensibility. Operates through Discord (divisive UX). Three paid tiers: Basic ($10/mo), Standard ($30/mo), Pro ($60/mo). Excels at: fantasy art, concept art, surrealism, cinematic imagery. Weaker at: photorealism (improving), text rendering, precise control. Best for creative exploration and polished artistic output.
Model (AI)
The trained neural network that performs AI generation. "The model" can refer to: (1) The architecture (e.g., "Stable Diffusion"), (2) A specific checkpoint/version (e.g., "SDXL 1.0"), or (3) A fine-tuned variant (e.g., "Realistic Vision v5"). Models are large files (2-7GB typically) containing billions of learned parameters. Swapping models dramatically changes output style and capability.
Multimodal
AI models that can understand and generate multiple types of content (text, images, audio, video). Examples: GPT-4V understands images and text, Runway Gen-3 uses text to generate video, DALL-E uses text to create images. The future of AI is increasingly multimodal: single models handling diverse inputs and outputs seamlessly.
MIDI
Musical Instrument Digital Interface, a protocol for encoding music as note events rather than audio waveforms. Some AI music tools (like AIVA) export MIDI files you can edit in DAWs. More flexible than fixed audio: change tempo, instrumentation, individual notes. Other tools (like Suno) only export audio files. MIDI support indicates production-oriented features.
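To see why note events are easier to edit than audio, consider this simplified model. It is not the actual MIDI file format; `NoteEvent` and the helpers are invented for illustration, though the pitch numbers follow the MIDI convention where 60 is middle C.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class NoteEvent:
    pitch: int            # MIDI note number (60 = middle C)
    start_beat: float     # when the note begins, in beats
    length_beats: float   # how long it lasts, in beats

def transpose(notes: list[NoteEvent], semitones: int) -> list[NoteEvent]:
    """Shift every note up or down without touching timing."""
    return [replace(n, pitch=n.pitch + semitones) for n in notes]

def beats_to_seconds(beats: float, bpm: float) -> float:
    """Timing is stored in beats, so tempo is a single number."""
    return beats * 60.0 / bpm
```

Transposing or changing tempo is a one-line edit here, whereas the same change on a rendered audio file requires pitch-shifting or time-stretching the waveform.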
N
Negative Prompt
Text describing what you DON'T want in your generation. Essential tool for quality control. Common negative prompts: "blurry, low quality, distorted, extra fingers, text, watermark." In Midjourney: --no hands. In DALL-E: "without hands." In Stable Diffusion: separate "negative prompt" field. Most effective when specific: "malformed hands" works better than "bad hands."
Noise
Random pixel values that serve as the starting point for diffusion model generation. Pure static is gradually refined into a coherent image. Controlled by the seed value: same seed = same initial noise = reproducible results (if other parameters match). Understanding noise helps explain why generation isn't deterministic without seed control.
Neural Network
The fundamental architecture of modern AI, inspired by biological brains. Consists of interconnected layers of "neurons" that process and transform data. "Deep learning" refers to neural networks with many layers. All AI creative tools (Midjourney, Stable Diffusion, Suno, etc.) are powered by various types of neural networks. Understanding this isn't necessary to use the tools, but explains their "black box" nature.
NSFW Filter
Content moderation system that blocks or flags "Not Safe For Work" (adult/explicit) content. All major commercial AI tools enforce NSFW filters: DALL-E (strict), Midjourney (strict, auto-bans violations), Stable Diffusion (optional: local installs can disable, but most hosted services enforce). Filters also block violence, hate speech, celebrity likeness. Occasionally triggers false positives on innocent prompts.
O
Outpainting
Extending an image beyond its original borders by generating new content that seamlessly continues the scene. Opposite of inpainting (which edits within borders). The AI analyzes the existing image and imagines what might exist beyond the frame. Useful for changing aspect ratios, expanding compositions, or creating panoramas. Available in DALL-E, Stable Diffusion, and specialized tools.
Overfitting
When an AI model trains "too well" on its dataset and memorizes rather than generalizes. Results in poor performance on new inputs. In custom LoRA training, overfitting happens with too many epochs or too few training images. Signs: generated images look almost identical to training data, no variation. Prevented by proper training techniques, regularization, and adequate dataset size. Mostly relevant if you're training models, not using them.
P
Parameters
Settings that control AI generation behavior. In Midjourney: flags like --ar, --stylize, --chaos. In Stable Diffusion: values like steps, CFG scale, sampler, seed. In Suno: genre, vocals, tempo. Understanding parameters is key to consistent, predictable results. Each tool has different parameters; see our comparison tools for syntax across platforms.
Pika
Budget-friendly AI video generation tool ($8/mo Standard tier). Produces 3-10 second clips from text or image prompts. Quality trails Runway Gen-3 but at 1/4 the price. Good for: social media clips, motion graphics, creative experiments. Improving rapidly. Best value option if video is occasional need rather than primary focus.
Prompt
The text description you provide to guide AI generation. The single most important factor in output quality. Good prompts balance specificity with flexibility, include style/mood/technical details, and match the platform's preferred syntax. Prompt engineering is the skill of crafting effective prompts. Different platforms respond to different prompt styles: Midjourney likes poetic descriptors, DALL-E prefers natural language, Stable Diffusion wants keywords.
Prompt Engineering
The practice of systematically designing prompts to reliably produce desired AI outputs. Involves understanding: how models interpret language, which descriptors produce which effects, how to structure multi-concept prompts, platform-specific syntax. A learnable skill that dramatically improves results. Professional AI artists spend 80% of their time on prompt engineering, 20% on post-processing.
Prompt Weighting
Adjusting the relative importance of different parts of your prompt. In Stable Diffusion: (keyword:1.5) emphasizes, (keyword:0.8) de-emphasizes. In Midjourney: implicit through word order and phrasing (earlier words carry more weight). Mastering weighting lets you fine-tune which elements dominate your composition versus serve as subtle influences.
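The `(keyword:1.5)` syntax is simple enough to parse by hand, which can be handy when auditing long prompts. The sketch below only handles flat, comma-separated prompts with explicit weights; nested parentheses and other emphasis shorthands are deliberately ignored, and `parse_weights` is a hypothetical helper rather than part of any tool.

```python
import re

# Matches "(some text:1.5)" with no nested parentheses.
_WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")

def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (phrase, weight) pairs; unweighted phrases get 1.0."""
    result: list[tuple[str, float]] = []

    def add_plain(text: str) -> None:
        result.extend((part.strip(), 1.0) for part in text.split(",") if part.strip())

    position = 0
    for match in _WEIGHTED.finditer(prompt):
        add_plain(prompt[position:match.start()])
        result.append((match.group(1).strip(), float(match.group(2))))
        position = match.end()
    add_plain(prompt[position:])
    return result
```

Calling it on `"portrait, (red hair:1.5), (background:0.8), bokeh"` yields four pairs, with "red hair" at 1.5, "background" at 0.8, and the rest at the default 1.0.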
Q
Quality Settings
Parameters controlling output fidelity, detail level, and generation time. Higher quality = better results but slower/costlier. In Midjourney: --quality 2. In Stable Diffusion: higher step counts (30-50), higher resolution. In Suno: "high fidelity" mode. Balance quality vs speed based on use case: high quality for final outputs, lower for rapid iteration.
R
Random Seed
A number (typically 0-4294967295) that initializes the random number generator, controlling the "randomness" of generation. Same seed + same prompt + same parameters = identical output (reproducible). Essential for: iterating on specific results, creating variations, debugging. In Midjourney/SD: --seed or seed parameter. DALL-E doesn't expose seeds. Pro tip: save seeds of successful generations.
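The reproducibility claim is easy to check with any seeded random number generator. This sketch uses Python's standard library as a stand-in for the noise generator inside an image model; `initial_noise` is an illustrative name.

```python
import random

def initial_noise(seed: int, size: int) -> list[float]:
    """Gaussian noise that depends only on the seed and the size."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

same = initial_noise(42, 4) == initial_noise(42, 4)
different = initial_noise(42, 4) != initial_noise(43, 4)
```

Here `same` and `different` are both true: identical seeds give identical starting noise, and changing the seed by one gives an unrelated starting point.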
Resolution
The dimensions of generated images in pixels (e.g., 1024×1024, 1920×1080). Higher resolution = more detail but slower generation and more memory. Most models have native training resolution: Stable Diffusion 1.5 (512×512), SDXL (1024×1024), Midjourney v6 (~1456×1456). Generating far from native resolution risks quality degradation. Use upscaling for larger final outputs.
Refiner
In SDXL (Stable Diffusion XL), an optional second model that adds fine details after the base model generates the composition. Used in a two-stage process: base model creates layout/structure, refiner adds texture/detail. Not always necessary; many workflows skip the refiner. Costs extra time/memory. Experiment to see if the refiner improves your specific style.
Runway Gen-3
State-of-the-art AI video generation model from Runway ML. Produces 5-10 second clips with exceptional motion coherence and cinematic quality. Three tiers: Standard ($15/mo), Pro ($35/mo), Unlimited ($95/mo). Used by professionals in film, advertising, music videos. Expensive but currently the quality benchmark. Gen-3 Turbo (faster, lower quality) available for previewing.
S
Sampling Method
The algorithm used to progressively denoise images during diffusion generation. Different samplers produce subtly different aesthetics and converge at different speeds. Popular options: Euler, Euler A, DPM++ 2M Karras, LMS, DDIM. Most users stick with Euler or DPM++ 2M Karras. Affects both speed (steps needed) and style (subtle). Stable Diffusion specific; Midjourney and DALL-E handle sampling internally.
Scheduler
Controls how aggressively noise is removed at each step during diffusion generation. Related to sampler but technically distinct. Common schedulers: Karras, Exponential, Normal. Karras is most popular. Scheduler choice affects: speed to convergence, final detail level, tendency toward artifacts. Another "sampler-adjacent" setting in Stable Diffusion that most users set once and forget.
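As an example of what a scheduler computes, here is a sketch of the Karras noise schedule, which spaces noise levels so that more steps are spent at low noise. The formula follows the published schedule; the `sigma_min` and `sigma_max` defaults are placeholders chosen for illustration, since real values depend on the model.

```python
def karras_sigmas(n: int, sigma_min: float = 0.1, sigma_max: float = 10.0,
                  rho: float = 7.0) -> list[float]:
    """Return n noise levels running from sigma_max down to sigma_min."""
    if n < 2:
        raise ValueError("need at least two noise levels")
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho), then raise back to the rho power.
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho for i in range(n)]
```

A larger `rho` packs more of the steps near the low-noise end of the range, where fine detail gets resolved.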
Seed
See Random Seed above. The two terms are used interchangeably.
Stable Diffusion
The leading open-source text-to-image AI model, developed by Stability AI. Unlike Midjourney or DALL-E, you can run it locally for free (requires GPU) or use hosted services. Fully customizable: swap models, load LoRAs, adjust every parameter. Steeper learning curve but unmatched flexibility. Versions: SD 1.5 (legacy), SDXL (current), SD3 (recent). Powers Leonardo, Clipdrop, and dozens of services.
SDXL
Stable Diffusion XL, the current flagship version of Stable Diffusion, released mid-2023. Native 1024×1024 resolution (vs 512×512 in SD 1.5), better prompt understanding, improved image quality. Larger model (7GB vs 2GB) requires more VRAM but produces noticeably better results. Most new models and LoRAs target SDXL. SD 1.5 still used for some specialized workflows.
Stems (Audio)
Individual instrument/vocal tracks that make up a complete song (drums, bass, vocals, etc.). Having stems allows remixing, volume balancing, or using parts separately. Most AI music generators (Suno, Udio) only export mixed stereo audio, not stems. Some offer "stem separation" features to extract drums/vocals after generation. True multitrack stems are rare in AI music generation currently.
Spectrogram
A visual representation of audio showing frequency content over time. Looks like a heatmap: time on X-axis, frequency on Y-axis, brightness shows intensity. Some AI audio tools use spectrograms internally or display them for analysis. Useful for understanding audio structure but not essential for most users. More relevant for audio engineers than creators.
Style Transfer
Applying the visual style of one image to the content of another. Example: make a photo look like a Van Gogh painting, or render a portrait in anime style. Core capability of AI image tools. In practice: use img2img with style-focused prompt, or use LoRAs trained on specific art styles. Some tools have dedicated "style transfer" modes. Powerful for consistent aesthetic across multiple images.
Suno
Leading AI music generation platform. Creates full songs with vocals, lyrics, and production in any genre in ~2 minutes. Free tier: 10 songs/day. Pro ($10/mo): 500 credits/mo (100 songs). Best for: quick music creation, demos, content soundtracks. Limitations: songs sound "AI-ish" on close listen, no stems export, limited control over structure. Still the most capable AI music tool available in 2026.
T
Text-to-Image
The core capability of AI image generators: creating images from text descriptions. Abbreviated txt2img. The "default mode" for Midjourney, DALL-E, Stable Diffusion. Opposite of image-to-image (which starts from an existing image). Text-to-image starts with pure noise, guided only by your prompt. The foundational AI creative capability that kicked off the current AI art revolution.
Text-to-Video
Generating video clips from text prompts. Significantly harder than text-to-image due to temporal consistency requirements (motion coherence, no flickering, logical progression). Current leaders: Runway Gen-3, Kling, Pika. Quality improving rapidly but still far from photorealistic human motion. Best current use cases: abstract motion, landscapes, establishing shots, motion graphics.
Text-to-Speech (TTS)
AI-generated human-like voice from written text. ElevenLabs is the market leader. Applications: audiobooks, video voiceovers, character voices, accessibility. Quality now nearly indistinguishable from real voices. Ethical concerns around voice cloning and deepfakes. Most platforms require consent for voice cloning, ban impersonation. Commercial TTS starts at $5-22/mo.
Textual Inversion
The training technique used to create embeddings in Stable Diffusion. You provide 5-20 images of a concept, and the model learns to associate a trigger word with that concept. Lighter-weight than LoRA training but less versatile. Useful for: specific faces, unique objects, niche art styles. Results saved as small embedding files (.pt or .safetensors). Mostly superseded by LoRA for new use cases.
Token
The basic unit of text that AI models process. Roughly: 1 token ≈ 0.75 words. Models have token limits: the original GPT-4 handles 8K-32K tokens (later variants accept far more), CLIP (for image prompts) ~75 tokens. Prompts that exceed the limit get truncated. Also: some services call generation credits "tokens" (confusing dual meaning). When discussing prompts, token count matters for very long descriptions.
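The words-per-token rule of thumb above can be turned into a quick prompt-length check. This is a minimal sketch, not a real tokenizer: `estimateTokens` and `mayBeTruncated` are hypothetical helper names, and the 75-token default simply reuses the approximate CLIP figure from the definition. Actual tokenizers split text differently, so treat the result as a rough estimate.

```javascript
// Heuristic from the definition: 1 token is roughly 0.75 words.
// This is an approximation, not the output of a real tokenizer.
const WORDS_PER_TOKEN = 0.75;
const CLIP_PROMPT_LIMIT = 75; // approximate CLIP prompt budget (assumed default)

// Estimate the token count of a prompt from its whitespace-separated words.
function estimateTokens(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  return Math.ceil(words.length / WORDS_PER_TOKEN);
}

// Flag prompts whose estimated length exceeds the given token limit.
function mayBeTruncated(prompt, limit = CLIP_PROMPT_LIMIT) {
  return estimateTokens(prompt) > limit;
}
```

For example, a 60-word prompt is estimated at 80 tokens, so `mayBeTruncated` would flag it against the ~75-token CLIP budget.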
Training Data
The images, text, or audio used to teach an AI model. Stable Diffusion was trained on billions of image-text pairs from the internet (LAION dataset). Midjourney's training data is proprietary. Training data determines: what the model "knows," potential biases, licensing questions. Major controversy: artists unhappy that their work appears in training sets without permission. Commercial tools increasingly use licensed-only data (Adobe Firefly).
U
Upscaling
Increasing image resolution while adding detail and sharpness (not just stretching pixels). AI upscalers (ESRGAN, Real-ESRGAN, Topaz Gigapixel) analyze and enhance. In Midjourney: click U1-U4 buttons for 2-4x upscale. In Stable Diffusion: use separate upscaler models or built-in options. Essential for print-ready outputs. Some upscaling is "creative" (adds invented detail), some is "conservative" (preserves exactly).
Udio
Suno's main competitor in AI music generation. Similar capabilities: full songs with vocals in any genre. Udio offers slightly better structural control and slightly worse vocal quality (subjective). Pricing matches Suno: ~$10/mo. Worth trying both to see which aesthetic you prefer. Some genres work better in one vs the other. Both improving rapidly.
V
VAE (Variational Autoencoder)
The component of Stable Diffusion that compresses images into latent space (encoder) and decodes them back to pixels (decoder). Different VAEs affect color saturation, sharpness, and overall aesthetic. Most users stick with the model's default VAE. Custom VAEs available for specific effects (more vibrant colors, etc.). Technical component most beginners don't need to worry about.
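To make the "compresses images into latent space" step concrete, here is a small arithmetic sketch. It assumes the 8x spatial downscale and 4 latent channels used by Stable Diffusion 1.x-style VAEs; other model families use different values, so both are exposed as parameters. `latentShape` and `compressionRatio` are hypothetical helper names for illustration only.

```javascript
// Assumed defaults: 8x spatial downscale and 4 latent channels
// (typical of Stable Diffusion 1.x-style VAEs; other models differ).
function latentShape(width, height, downscale = 8, latentChannels = 4) {
  if (width % downscale !== 0 || height % downscale !== 0) {
    throw new RangeError("width and height must be multiples of the downscale factor");
  }
  return {
    width: width / downscale,
    height: height / downscale,
    channels: latentChannels,
  };
}

// Ratio of RGB pixel values to latent values for a given image size.
function compressionRatio(width, height, downscale = 8, latentChannels = 4) {
  const latent = latentShape(width, height, downscale, latentChannels);
  const pixelValues = width * height * 3; // 3 channels: R, G, B
  const latentValues = latent.width * latent.height * latent.channels;
  return pixelValues / latentValues;
}
```

Under these assumptions a 512x512 image becomes a 64x64x4 latent, about 48 times fewer values than the original pixels, which is why running the diffusion steps in latent space is so much cheaper than working on pixels directly.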
Variation
Creating alternate versions of a generated output while maintaining core elements. In Midjourney: V1-V4 buttons create variations of one image from the grid. In Stable Diffusion: use same seed with slightly changed prompt, or use "variation seed strength." Essential for iterative refinement: generate, pick best, create variations, repeat until perfect.
Vector
Graphics defined by mathematical paths rather than pixels, allowing infinite scaling without quality loss. Most AI generators produce raster (pixel) images, not vectors. Some tools claim "AI vector generation" but actually output rasters. For true vectors: generate raster AI image, then trace in Illustrator or use specialized tools. Important for logos, icons, print graphics needing extreme scaling.
W
Weighting
See Prompt Weighting above.
Workflow
A systematic process for AI creative work, typically involving: ideation → prompt drafting → batch generation → selection → refinement → post-processing → export. Professional AI creators develop efficient workflows to produce consistent quality at scale. Also: in ComfyUI, "workflow" refers to saved node-based pipelines. Developing good workflows dramatically improves output quality and speed.
X, Y, Z
X/Y/Z Plot
A Stable Diffusion feature that generates grids of images systematically varying 2-3 parameters. Example: X-axis varies CFG scale (5, 7, 10, 15), Y-axis varies steps (20, 30, 50), generating a 4×3 grid showing all combinations. Invaluable for testing which parameter combinations work best for your prompt. Also called "parameter grid" or "prompt matrix." Helps find optimal settings efficiently.
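The grid in the example above is just every pairing of the X values with the Y values. The sketch below enumerates those combinations; `buildParameterGrid` and the `cfgScale`/`steps` field names are hypothetical and do not correspond to any particular tool's API. It only shows which settings each cell of the grid would use.

```javascript
// Hypothetical helper: returns one row per Y value, one column per X value,
// so each cell holds the settings for a single generated image.
function buildParameterGrid(xValues, yValues) {
  return yValues.map(steps =>
    xValues.map(cfgScale => ({ cfgScale, steps }))
  );
}

// The example from the definition: 4 CFG scales across, 3 step counts down.
const grid = buildParameterGrid([5, 7, 10, 15], [20, 30, 50]);

// Print one line per row, listing the settings for each cell.
for (const row of grid) {
  console.log(row.map(cell => `cfg=${cell.cfgScale}, steps=${cell.steps}`).join(" | "));
}
```

With four X values and three Y values this yields 12 cells, which is why adding a third axis or long value lists quickly multiplies generation time.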
No terms found matching your search. Try different keywords or browse alphabetically.
(function() {
const searchInput = document.getElementById('glossary-search');
const termEntries = document.querySelectorAll('.term-entry');
const letterSections = document.querySelectorAll('.letter-section');
const noResults = document.getElementById('no-results');
searchInput.addEventListener('input', function(e) {
const query = e.target.value.toLowerCase().trim();
let visibleCount = 0;
if (query === '') {
// Show all
termEntries.forEach(entry => entry.classList.remove('hidden'));
letterSections.forEach(section => section.style.display = 'block');
noResults.classList.remove('visible');
return;
}
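termEntries.forEach(entry => {
// Fall back to empty strings so an entry that lacks a child element
// or a data-terms attribute is skipped instead of throwing
const termName = (entry.querySelector('.term-name')?.textContent ?? '').toLowerCase();
const termDef = (entry.querySelector('.term-definition')?.textContent ?? '').toLowerCase();
const searchTerms = (entry.dataset.terms ?? '').toLowerCase();
// Match the query against the term name, its definition, and its extra keywords
if (termName.includes(query) || termDef.includes(query) || searchTerms.includes(query)) {
entry.classList.remove('hidden');
visibleCount++;
} else {
entry.classList.add('hidden');
}
});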
// Hide letter sections with no visible entries
letterSections.forEach(section => {
const visibleInSection = section.querySelectorAll('.term-entry:not(.hidden)').length;
section.style.display = visibleInSection > 0 ? 'block' : 'none';
});
// Show the "no results" message only when nothing matched
noResults.classList.toggle('visible', visibleCount === 0);
});
})();
How to Use This Glossary Effectively
Bookmark this page and return often; the AI creative space evolves rapidly, and terminology shifts as new tools emerge. When you encounter an unfamiliar term in a tutorial, community discussion, or tool documentation, search here first before Googling (which often returns outdated or context-less definitions). If you're new to AI creation, start by reading the definitions for: AI Art, Prompt, Prompt Engineering, Model, CFG Scale, Seed, and the specific tools you're using (Midjourney, Stable Diffusion, Suno, etc.).
Cross-reference with our guides: Each glossary entry links to related concepts and full articles where relevant. After understanding a term's definition, dive deeper with our comprehensive guides: Prompt Anatomy Guide, Midjourney Prompts Guide, Stable Diffusion Prompts, and AI Music Prompts Guide. Understanding terminology accelerates learning, but practical experimentation solidifies knowledge: read the definition, then immediately test it in your tool of choice.