What Is Data Modeling?
Data modeling is the process of creating a visual representation of an information system or parts of it to communicate connections between data points and structures. A data model defines the structure of data — the entities (things we want to store data about), their attributes (properties), and the relationships between them.
Data modeling is fundamental to database design, system integration, and data governance. A well-designed data model makes data easier to understand, query, and maintain. A poorly designed model leads to data redundancy, update anomalies, and performance problems.
The Three Levels of Data Modeling
Data models exist at three levels of abstraction, each serving a different audience and purpose:
1. Conceptual Data Model
The conceptual data model is the highest level of abstraction. It represents the business view of data — the key entities and their high-level relationships — without any technical detail. Its purpose is communication with business stakeholders, not technical implementers.
A conceptual model typically shows:
- Major business entities (e.g., Customer, Product, Order)
- High-level relationships between entities (e.g., "Customer places Order")
- No attributes, no data types, no keys
The conceptual model answers the question: What are the key things the business needs to store data about, and how are they related?
2. Logical Data Model
The logical data model adds detail to the conceptual model — defining the attributes of each entity, the data types, and the relationships in precise terms — but remains independent of any specific database technology.
A logical model typically shows:
- All entities and their attributes
- Primary keys and foreign keys
- Relationship cardinalities (one-to-one, one-to-many, many-to-many)
- Data types (but not database-specific types)
- Normalisation applied (typically to 3NF or BCNF)
The logical model answers the question: What data needs to be stored, in what structure, and with what constraints?
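A logical model can be captured in many tools and notations; as a minimal sketch, the Customer/Order example above can be written down as plain data structures. The entity names, attributes, and type labels here are illustrative, not drawn from any particular modeling standard.

```python
# A logical data model sketched as plain Python dictionaries: entities,
# attributes with platform-neutral types, primary keys, and foreign keys.
logical_model = {
    "Customer": {
        "attributes": {"customer_id": "integer", "name": "string",
                       "email": "string"},
        "primary_key": ["customer_id"],
        "foreign_keys": {},
    },
    "Order": {
        "attributes": {"order_id": "integer", "customer_id": "integer",
                       "order_date": "date", "total": "decimal"},
        "primary_key": ["order_id"],
        # Each Order references exactly one Customer: a one-to-many
        # relationship from Customer to Order.
        "foreign_keys": {"customer_id": "Customer.customer_id"},
    },
}

def referenced_entities(model, entity):
    """Return the entities that `entity` depends on via foreign keys."""
    return sorted({target.split(".")[0]
                   for target in model[entity]["foreign_keys"].values()})

print(referenced_entities(logical_model, "Order"))  # ['Customer']
```

Representing the model as data, rather than only as a diagram, makes checks like "does every foreign key point at a defined entity?" straightforward to automate.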
3. Physical Data Model
The physical data model is the implementation-specific version of the logical model. It translates the logical design into the specific syntax and features of the target database platform (e.g., MySQL, Oracle, PostgreSQL).
A physical model typically shows:
- Table names and column names (following database naming conventions)
- Database-specific data types (e.g., VARCHAR(255), BIGINT, TIMESTAMP)
- Indexes, partitioning, and storage parameters
- Stored procedures, triggers, and views
- Performance optimisations
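To make the jump from logical to physical concrete, here is the same Customer/Order model expressed as DDL. SQLite is used as a stand-in platform so the sketch is self-contained; the column types, naming conventions, and index choice are illustrative and would differ on MySQL, Oracle, or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         VARCHAR(255) NOT NULL,
        email        VARCHAR(255) NOT NULL UNIQUE
    );

    -- "order" is a reserved word in SQL, so the physical table
    -- gets a platform-friendly name.
    CREATE TABLE customer_order (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date   TIMESTAMP NOT NULL,
        total        DECIMAL(10, 2) NOT NULL
    );

    -- A secondary index supporting "orders for a given customer" lookups:
    -- a physical-level performance decision absent from the logical model.
    CREATE INDEX idx_order_customer ON customer_order(customer_id);
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['customer', 'customer_order']
```

Note how details like the reserved-word rename and the index only appear at this level; the logical model stays platform-neutral.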
Entity-Relationship (ER) Diagrams
The most widely used notation for data modeling is the Entity-Relationship (ER) diagram, developed by Peter Chen in 1976. ER diagrams use standardised symbols to represent entities (rectangles), attributes (ovals), and relationships (diamonds or lines with cardinality notation).
Modern ER diagrams typically use Crow's Foot notation, which uses lines with symbols at each end to show cardinality:
- A single vertical line (|) means "exactly one"
- A crow's foot (three lines) means "many"
- A circle means "zero"
So a line with a crow's foot at one end and a single line at the other represents a one-to-many relationship.
Normalisation
Normalisation is the process of organising data in a relational database to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, more focused tables and defining relationships between them.
The normal forms are:
- First Normal Form (1NF): Each column contains atomic (indivisible) values; no repeating groups
- Second Normal Form (2NF): Satisfies 1NF, and every non-key attribute is fully functionally dependent on the entire primary key (eliminates partial dependencies)
- Third Normal Form (3NF): Satisfies 2NF, with no transitive dependencies (non-key attributes do not depend on other non-key attributes)
- Boyce-Codd Normal Form (BCNF): A stricter version of 3NF — every determinant is a candidate key
Most production databases are normalised to 3NF or BCNF. Higher normal forms (4NF, 5NF) exist but are rarely used in practice.
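A small worked example shows the mechanics. In the flat rows below, the customer name is repeated on every order (a transitive dependency: it depends on `customer_id`, not on the key `order_id`), so normalisation moves it into its own relation. The data is invented for illustration.

```python
# Unnormalised rows: customer_name is stored redundantly per order,
# so renaming a customer would require updating many rows.
flat_orders = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme",  "total": 50},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme",  "total": 75},
    {"order_id": 3, "customer_id": 11, "customer_name": "Binco", "total": 20},
]

# Decomposition toward 3NF: customer_name depends on customer_id,
# so it moves into a Customer relation keyed by customer_id...
customers = {row["customer_id"]: row["customer_name"] for row in flat_orders}

# ...and the Order relation keeps only attributes that depend on order_id,
# plus the foreign key back to Customer.
orders = [{k: row[k] for k in ("order_id", "customer_id", "total")}
          for row in flat_orders]

print(customers)  # {10: 'Acme', 11: 'Binco'} — each name stored once
print(orders[0])  # {'order_id': 1, 'customer_id': 10, 'total': 50}
```

After the split, a customer rename is a single-row update, which is exactly the update anomaly normalisation exists to prevent.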
Dimensional Modeling
Dimensional modeling, developed by Ralph Kimball, is a data modeling technique optimised for analytical (OLAP) workloads — data warehouses and business intelligence systems. Unlike normalisation (which minimises redundancy), dimensional modeling deliberately denormalises data to optimise query performance.
The two key components of dimensional modeling are:
- Fact tables: Store the measurable, quantitative data about business events (e.g., sales transactions). Each row represents one event. Fact tables contain foreign keys to dimension tables and numeric measures (e.g., quantity, revenue).
- Dimension tables: Store the descriptive attributes that provide context for the facts (e.g., Customer, Product, Date, Store). Dimension tables are typically wide (many columns) and relatively small.
The star schema arranges one fact table at the centre surrounded by dimension tables — like a star. The snowflake schema normalises dimension tables into multiple related tables — like a snowflake. Star schemas are simpler and faster to query; snowflake schemas use less storage but are more complex.
Slowly Changing Dimensions (SCD)
Slowly Changing Dimensions (SCD) handle the challenge of tracking how dimension attributes change over time. The most common SCD types are:
- Type 0: Retain original — the attribute never changes
- Type 1: Overwrite — update the existing record; no history is kept
- Type 2: Add new row — create a new record for each change; full history is preserved
- Type 3: Add new attribute — add a column for the previous value; only one level of history
- Type 4: History table — maintain a separate history table
- Type 6: Hybrid — combines Types 1, 2, and 3
Type 2 is the most commonly used SCD type in data warehouses because it preserves full history while allowing point-in-time analysis.
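The Type 2 pattern can be sketched in a few lines: when an attribute changes, the current row is closed off and a new row is appended, so every historical value remains queryable. The field names here (`valid_from`, `valid_to`, `is_current`) are a common convention, not a standard.

```python
from datetime import date

# A customer dimension with one current row per customer.
dim_customer = [
    {"customer_id": 10, "city": "London",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(rows, customer_id, new_city, change_date):
    """SCD Type 2: close the current row, then append a new one."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None,
                 "is_current": True})

apply_scd2(dim_customer, 10, "Paris", date(2024, 6, 1))

def city_on(rows, customer_id, on_date):
    """Point-in-time lookup: the city that was current on a given date."""
    for row in rows:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= on_date
                and (row["valid_to"] is None or on_date < row["valid_to"])):
            return row["city"]

print(city_on(dim_customer, 10, date(2023, 6, 1)))  # London
print(city_on(dim_customer, 10, date(2024, 7, 1)))  # Paris
```

The point-in-time lookup at the end is exactly the analysis that Type 1 (overwrite) would make impossible: with history preserved as rows, any past state of the dimension can be reconstructed.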