Data Quality Profiling Metrics
This page explains the profiling metrics shown in the Data Quality & Profile tab of the Explore menu, so that you can interpret each field displayed for table and column profiling.
Table-Level Profiling
The table-level metrics provide an overall snapshot of the table's structure and health:
- Row Count: The total number of rows in the table.
  Usage: Useful for monitoring data volume and detecting unexpected changes in data influx.
- Column Count: The number of columns in the table.
  Usage: Helps verify the expected structure of the table's schema.
- Profile Creation Date: The timestamp indicating when the profile was generated (e.g., "Created Date 7 Mar 2025, 2:30").
  Usage: Indicates the freshness of the profile data, which is critical for data quality assessment.
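To make the three table-level fields concrete, the following is a minimal sketch of how they could be derived from an in-memory table. It is an illustration only, not the product's implementation: the function name `profile_table`, the dictionary keys, and the list-of-dictionaries input are all assumptions introduced here.

```python
from datetime import datetime, timezone


def profile_table(rows):
    """Return table-level metrics for a list of row dictionaries.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        # Timestamp of when this profile was generated (UTC, timezone-aware).
        "profile_created_at": datetime.now(timezone.utc),
    }


sample_rows = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 7.0},
]
table_profile = profile_table(sample_rows)
print(table_profile["row_count"], table_profile["column_count"])
```

Comparing `row_count` between two profile runs is one simple way to spot an unexpected change in data volume, and the creation timestamp tells you how stale the comparison is.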
Column-Level Profiling
Each column in the table is profiled with the following metrics:
- Check Name: The identifier for the data quality check applied to the column. It specifies the type or category of the check.
- Data Type: The type of data stored in the column (for example, NUMBER).
  Usage: Ensures that the data conforms to expected formats and supports type-specific validations.
- Null %: The percentage of null or missing values within the column.
  Usage: A high null percentage might indicate incomplete data or issues with data collection.
- Unique %: The percentage of unique values relative to the non-null entries.
  Usage: Provides insight into data diversity; values closer to 100% indicate high uniqueness.
- Distinct %: The proportion of distinct values out of the total entries.
  Usage: Useful for identifying redundancy or variety in the column.
- Value Count: The total count of valid data values in the column.
  Usage: Serves as an indicator of data completeness and is often used in calculating other metrics.
- # Of Tests: The number of data quality tests executed on the column.
  Usage: Helps track the coverage of quality checks applied to the data.
- Test Status: The overall results of the quality tests, typically divided into:
  - Failed Tests: The number of tests the column did not pass.
  - Warning Tests: The number of tests that issued warnings, suggesting potential issues.
  - Passed Tests: The number of tests that the column successfully passed.
  Usage: Provides insight into the overall health of the data. A higher number of failed or warning tests may signal the need for data cleansing or further investigation.
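The percentage metrics above can be illustrated with a short sketch. It rests on assumptions that this page does not confirm: "unique" is taken to mean values that occur exactly once, "distinct" to mean the number of different non-null values, "valid" to mean non-null, and the denominators follow the wording above (non-null entries for Unique %, total entries for Distinct %). The function name `profile_column` and its keys are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Return column-level metrics for a list of values (None = null).

    Hypothetical helper; the exact formulas used by the product may differ.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)

    value_count = len(non_null)  # assumption: "valid" means non-null
    unique_count = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    distinct_count = len(counts)  # number of different non-null values

    return {
        "null_pct": 100.0 * (total - value_count) / total if total else 0.0,
        # Assumption: unique values relative to non-null entries.
        "unique_pct": 100.0 * unique_count / value_count if value_count else 0.0,
        # Assumption: distinct values relative to all entries, nulls included.
        "distinct_pct": 100.0 * distinct_count / total if total else 0.0,
        "value_count": value_count,
    }


column_profile = profile_column(["a", "b", "b", None, "c"])
print(column_profile)
```

For the five sample entries, one is null, two values ("a" and "c") occur exactly once, and three different values are present, which is why Unique % and Distinct % can differ for the same column.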
How to Use These Metrics
- Overall Profiling: Start with the table-level metrics to get an overview of your dataset's volume and schema consistency.
- Drill Down into Columns: Use the column-level metrics to identify specific issues or areas for improvement in your data. For example, focus on columns with high null percentages or low distinct percentages.
- Quality Testing: Pay attention to the test status metrics. Columns with a significant number of failed or warning tests may require further data validation and cleansing.
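The drill-down and quality-testing steps above can be sketched as a simple triage pass over column profiles. Everything in this example is an assumption made for illustration: the function `columns_needing_attention`, the dictionary keys, the sample column names, and the thresholds (20% nulls, 5% distinct) are hypothetical and should be tuned to your own data.

```python
def columns_needing_attention(profiles, max_null_pct=20.0, min_distinct_pct=5.0):
    """Map each flagged column name to the reasons it was flagged.

    Hypothetical helper; thresholds are illustrative defaults, not product settings.
    """
    flagged = {}
    for name, profile in profiles.items():
        reasons = []
        if profile["null_pct"] > max_null_pct:
            reasons.append("high null %")
        if profile["distinct_pct"] < min_distinct_pct:
            reasons.append("low distinct %")
        # Failed tests take priority over warnings when reporting test status.
        if profile.get("failed_tests", 0) > 0:
            reasons.append("failed tests")
        elif profile.get("warning_tests", 0) > 0:
            reasons.append("warning tests")
        if reasons:
            flagged[name] = reasons
    return flagged


profiles = {
    "customer_id": {"null_pct": 0.0, "distinct_pct": 100.0, "failed_tests": 0, "warning_tests": 0},
    "email": {"null_pct": 35.0, "distinct_pct": 64.0, "failed_tests": 1, "warning_tests": 0},
    "country": {"null_pct": 1.0, "distinct_pct": 0.4, "failed_tests": 0, "warning_tests": 2},
}
flagged = columns_needing_attention(profiles)
for column, reasons in flagged.items():
    print(column, "->", ", ".join(reasons))
```

A low distinct percentage is not always a problem (a country or status column is expected to repeat), so treat the output as a list of columns to review rather than a list of defects.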