PII Handling in Precog
Overview
When you move data from business applications to your data warehouse, some of that data may contain personally identifiable information (PII) — email addresses, social security numbers, credit card numbers, and similar sensitive fields. Regulations like GDPR, HIPAA, and CCPA require organizations to control how this data is stored and accessed.
Precog automatically detects columns that contain PII and gives you control over how that data is handled before it lands in your destination. You choose a handling policy when you create or edit a connection, and Precog applies it consistently to all detected PII columns in that connection.
How Precog Detects PII
During schema discovery, Precog analyzes the values in each column and checks them against known PII patterns. Detection is automatic — you do not need to manually tag columns.
Precog currently detects these PII types:
| PII Type | Examples |
|---|---|
| Email addresses | jane@example.com |
| Social security numbers | 123-45-6789 |
| Credit card numbers | 4111 1111 1111 1111 |
| Phone numbers | +1 (555) 867-5309 |
| IP addresses | 192.168.1.1, 2001:db8::1 |
Detection is conservative: if even a single value in a column matches a PII pattern, the entire column is tagged. A column can be tagged with multiple PII types — for example, a column might contain both email addresses and phone numbers.
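The tagging rule above can be sketched in a few lines of Python. This is a minimal illustration, not Precog's detection logic: the regular expressions, the `PII_PATTERNS` table, and the `detect_pii_types` helper are all assumptions made for the example, and they cover only three of the listed PII types.

```python
import re

# Illustrative patterns only (assumed for this sketch, not Precog's real rules).
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d{0,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
}


def detect_pii_types(values):
    """Return every PII type matched by at least one value in the column."""
    found = set()
    for value in values:
        if value is None:
            continue
        text = str(value).strip()
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.match(text):
                # One matching value is enough to tag the whole column.
                found.add(pii_type)
    return found


# A mixed column is tagged with both types; a free-text column is not tagged.
contact_column = ["n/a", "jane@example.com", "+1 (555) 867-5309", ""]
notes_column = ["call back Tuesday", "prefers invoices by post"]
contact_tags = detect_pii_types(contact_column)
notes_tags = detect_pii_types(notes_column)
```

Because a single match tags the column, a mostly clean column with one stray email address is still treated as PII under this rule.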
The metadata API always shows the full source schema, including which columns were detected as PII and which policy is applied. This means you can see exactly what Precog found, regardless of the handling policy you choose.
PII Handling Policies
When you create or edit a connection, you choose one of four handling policies. The policy applies uniformly to all detected PII columns in that connection. Different connections can use different policies, so you can tailor the approach to each use case.
Passthrough (Default)
Data flows through unchanged. PII columns are loaded into your destination exactly as they appear in the source.
When to use: When your destination has its own access controls and you need the original data — for example, a secured analytics environment where authorized analysts need real email addresses for customer matching.
Block
PII columns are excluded entirely — removed from both the data and the destination table schema. Blocked columns do not appear in your destination at all.
When to use: When you want to ensure sensitive data never reaches your warehouse. This is the most restrictive option and is appropriate for compliance scenarios where the data should not exist outside the source system.
Important: Blocking a column also removes any primary keys or foreign keys that depend on it. If a PII column is part of a key relationship, consider using Hash instead.
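As a rough sketch of what blocking implies for a table, the following Python drops the PII columns from both the schema and the rows, and removes any key that referenced a dropped column. The schema layout (a dict with `columns` and `keys`) and the `apply_block` helper are hypothetical and exist only for this example.

```python
def apply_block(schema, rows, pii_columns):
    """Drop PII columns from the schema and rows, plus any key that uses them."""
    blocked = set(pii_columns)
    kept_columns = [c for c in schema["columns"] if c not in blocked]
    # A key survives only if none of its columns were blocked.
    kept_keys = [k for k in schema["keys"] if not blocked & set(k["columns"])]
    kept_rows = [{c: row[c] for c in kept_columns} for row in rows]
    return {"columns": kept_columns, "keys": kept_keys}, kept_rows


schema = {
    "columns": ["email", "plan", "signup_date"],
    "keys": [{"type": "primary", "columns": ["email"]}],
}
rows = [{"email": "jane@example.com", "plan": "pro", "signup_date": "2024-01-15"}]

new_schema, new_rows = apply_block(schema, rows, ["email"])
```

In this example the table loses its primary key along with the `email` column, which is the situation the note above warns about.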
Hash
PII values are replaced with a deterministic, keyed cryptographic digest based on SHA-256. For a given key, the same input always produces the same output, so joins and grouping still work, but the original value cannot be recovered from the digest.
When to use: When you need to preserve referential integrity (joins between tables) while protecting the actual values. For example, you can still join orders to customers by hashed email address without exposing the real email.
Important: Hashing is irreversible by design. Once data is hashed, the original values cannot be recovered from the hash. This is a feature, not a limitation: it supports data minimization under regulations such as GDPR, although hashing alone does not make a pipeline compliant.
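To illustrate why a deterministic keyed digest keeps joins working, here is a minimal Python sketch using HMAC-SHA-256 from the standard library. HMAC is one common way to build a keyed SHA-256 digest; whether Precog uses this exact construction is an assumption, and `SECRET_KEY` and `hash_pii` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would come from a secrets manager.
SECRET_KEY = b"example-key-not-for-production"


def hash_pii(value, key=SECRET_KEY):
    """Return the hex HMAC-SHA-256 digest of value under the given key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Two tables that both carry the hashed email instead of the real one.
customers = [{"email": hash_pii("jane@example.com"), "tier": "gold"}]
orders = [{"email": hash_pii("jane@example.com"), "total": 42.5}]

# Join orders to customers on the digest; the real address is never needed.
tier_by_email = {c["email"]: c["tier"] for c in customers}
joined = [{"tier": tier_by_email[o["email"]], "total": o["total"]} for o in orders]
```

The join only works when the inputs are identical byte for byte: `Jane@Example.com` and `jane@example.com` produce different digests, as do values hashed under different keys.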
Nullify
PII values are replaced with empty strings. The column remains in the destination schema with its original data type, but all values are blank.
When to use: When you need to preserve the table structure (column names and types) for compatibility with downstream queries and dashboards, but the actual PII values are not needed. This avoids breaking existing SQL queries that reference the column.
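A small Python sketch of the behavior described above: every column stays in place, and values in the PII columns are replaced with empty strings. The `apply_nullify` helper and the sample rows are hypothetical.

```python
def apply_nullify(rows, pii_columns):
    """Blank out PII values while leaving every column in place."""
    masked = set(pii_columns)
    return [
        {col: ("" if col in masked else val) for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"email": "jane@example.com", "plan": "pro"},
    {"email": "sam@example.com", "plan": "free"},
]
nullified = apply_nullify(rows, ["email"])
```

A downstream query that selects `email` still runs against this output, but every row returns a blank value, so filters or joins on that column no longer match anything useful.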
Choosing a Policy
For New Connections
When you create a connection, you can select a PII handling policy during the setup wizard. The policy applies immediately when the connection first loads data.
For Existing Connections
You can change the PII handling policy for an existing connection in the connection settings. The new policy takes effect on the next data load and applies to data loaded from that point on.
Note: Changing a policy does not retroactively transform data that has already been loaded. If you switch from Passthrough to Hash, historical data in your destination still contains the original values. You may need to perform a full reload or manually clean up historical data in your warehouse.
Policy Comparison
| Policy | Column in destination? | Data preserved? | Joins work? | Best for |
|---|---|---|---|---|
| Passthrough | Yes | Yes (original) | Yes | Secured environments with access controls |
| Block | No | No | No | Strictest compliance — data never lands |
| Hash | Yes | No (digest) | Yes | Preserving joins while hiding values |
| Nullify | Yes | No (empty) | No | Preserving schema without breaking queries |
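The comparison table can also be read as a short decision procedure. The sketch below encodes it as a hypothetical helper, `suggest_policy`, which is not part of Precog; the order of the checks is an assumption about which requirement should win when several apply.

```python
def suggest_policy(needs_original_values, needs_joins, needs_column_in_schema):
    """Map three yes/no requirements to one of the four policy names."""
    if needs_original_values:
        return "passthrough"  # the only policy that keeps the real values
    if needs_joins:
        return "hash"  # digests still match across tables
    if needs_column_in_schema:
        return "nullify"  # column stays, values are blank
    return "block"  # column never reaches the destination


print(suggest_policy(False, True, True))    # hash
print(suggest_policy(False, False, True))   # nullify
print(suggest_policy(False, False, False))  # block
```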
Practical Advice
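- Start with Passthrough if you are unsure. You can apply a stricter policy later, but data that has already been loaded keeps its original values until you reload it or clean it up in your warehouse.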
- Use Hash when joins matter. If tables reference each other by a PII column (like email or phone number), hashing preserves those relationships while protecting the data.
- Use Block for maximum protection. If a column should never reach your warehouse, blocking is the safest choice.
- Different connections, different policies. A production analytics connection might use Hash, while a development or testing connection might use Block.
- Changes take effect on the next sync. After updating a policy, run the connection to apply the change.