Shade V7
On-device PII detection with phonetic awareness

A 22M-parameter PhoneticDeBERTa model that detects 12 types of sensitive information in business text, using Double Metaphone phonetic encoding to learn names the way humans do. Runs entirely on-device.

97.6% F1 Score · 22M Parameters · 12 Entity Types · <50ms Inference Latency

How Shade compares

F1 Score (%), Shade vs. two comparison models:

Detection Accuracy — PII entity detection: 97.6 · 98.0 · 97.0
Financial Text — financial entity detection (BANKACCT + MONEY + CARD): 97.5 · 91.1 · 81.0
Identity Detection — name & identity classification (PERSON + GOVID + ORG): 97.2 · 96.0 · 64.0
Contact Information — contact & network detection (EMAIL + PHONE + IPADDR): 99.5 · 94.0 · 81.0
Model Efficiency — Parameters Required for 90%+ F1: 22M · 50M · 209M · 300M · 400M · 1.5B. Fewer parameters is better; Shade achieves 97.6% F1 at just 22M parameters.
PhoneticDeBERTa
Learning names the way humans do

Shade V7 uses a fine-tuned DeBERTa-v3-xsmall encoder enhanced with Double Metaphone phonetic encoding. By converting names to phonetic representations, the model learns to recognise names the same way a human would - by how they sound, not just how they're spelled.

Trained on 116K examples spanning business meetings, legal proceedings, financial transactions, and HR documents. The phonetic layer means the model generalises to unseen names from any culture or language without retraining.
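Shade uses the real Double Metaphone algorithm. As a rough illustration of why phonetic keying helps, here is a deliberately simplified stand-in (not the actual Double Metaphone rules): spelling variants of the same-sounding name collapse to one key, so the model sees them as the same signal.

```python
def phonetic_key(name: str) -> str:
    """Crude phonetic key: NOT Double Metaphone, just a sketch of the idea."""
    s = name.upper()
    # Collapse a few common sound-alike clusters.
    for src, dst in [("PH", "F"), ("GH", "F"), ("CK", "K"),
                     ("SH", "X"), ("TH", "0"), ("KN", "N"), ("WR", "R")]:
        s = s.replace(src, dst)
    head, tail = s[0], s[1:]
    # Drop non-initial vowels and semi-vowels, as metaphone-style codes do.
    tail = "".join(c for c in tail if c not in "AEIOUYHW")
    # Collapse runs of the same consonant.
    out = head
    for c in tail:
        if c != out[-1]:
            out += c
    return out

print(phonetic_key("Smith"), phonetic_key("Smyth"))    # same key for both
print(phonetic_key("Phillip"), phonetic_key("Filip"))  # same key for both
```

Feeding such keys alongside the raw subword tokens is what lets the encoder generalise to name spellings it never saw in training.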

Pipeline:

1. Input Text — meeting transcript, pre-tokenised
2. DeBERTa-v3 Encoder — 6 layers, 384 hidden dim, 22M params
3. BIO Classification Head — 25 labels: O + B-/I- for 12 entity types
4. Token Substitution — detected PII replaced with tokens like [PERSON_1]
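The 25-label BIO scheme (O plus B-/I- for each of the 12 entity types) and the merge from per-token labels back into entity spans can be sketched as follows; `decode_bio` is an illustrative helper, not the SDK's API.

```python
ENTITY_TYPES = ["PERSON", "ORG", "EMAIL", "PHONE", "MONEY", "DATE",
                "ADDRESS", "GOVID", "BANKACCT", "CARD", "IPADDR", "CASE"]
# O + B-/I- for each type = 1 + 2 * 12 = 25 labels.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]

def decode_bio(tokens, labels):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):                 # a new entity starts here
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)               # continuation of the open span
        else:                                    # O, or an inconsistent I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, " ".join(ws)) for t, ws in spans]

toks = ["Meet", "John", "Smith", "at", "Acme"]
tags = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG"]
print(decode_bio(toks, tags))  # [('PERSON', 'John Smith'), ('ORG', 'Acme')]
```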

12 PII types, one model

Per-entity F1: PERSON 96.3% · ORG 97.6% · EMAIL 100% · PHONE 98.4% · MONEY 99.6% · DATE 97.8% · ADDRESS 99.4% · GOVID 97.7% · BANKACCT 92.9% · CARD 100% · IPADDR 100% · CASE 97.8%

What we learned building Shade

01
Small models can match large ones on domain-specific NER
DeBERTa-v3-xsmall (22M) achieves 97.6% F1 on business PII detection, matching GLiNER fine-tuned (209M) and the Kaggle 1st-place ensemble (1.5B). Domain-specific training data matters more than model scale.
02
Targeted data generation closes the OOD gap
Adding 10,000 targeted examples for underperforming entity types (ADDRESS, BANKACCT, GOVID) alongside 20K diverse examples reduced the gap between in-domain and out-of-domain evaluation to just 0.3% (97.6% in-domain vs 97.3% out-of-domain F1).
03
Cloud-based PII detection is a contradiction
Sending sensitive text to a cloud API for PII detection means the data has already left your control. Shade runs entirely on-device. The sensitive data never leaves the machine where it was created.
04
Phonetic encoding bridges the name diversity gap
Names from different cultures and languages are systematically missed by models trained on Western text. Double Metaphone encoding converts names to phonetic representations, teaching the model to recognise 'Thabo', 'Sipho', and 'Naledi' the same way it recognises 'John' and 'Sarah'. Regional PII types like SA ID numbers and rand-denominated amounts are handled by specialised entity types.
05
BIO token classification outperforms span extraction for real-time use
The BIO tagging approach processes text in a single forward pass with O(n) complexity, enabling sub-50ms inference on consumer hardware. Span-based approaches require O(n²) comparisons and are impractical for real-time meeting transcription.
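To make the O(n) vs O(n²) contrast concrete: a span-based extractor must score every (start, end) token pair, while BIO tagging emits exactly one label per token.

```python
def num_candidate_spans(n: int) -> int:
    """Number of (start, end) pairs with start <= end over n tokens."""
    return n * (n + 1) // 2

n = 1000  # a 1,000-token transcript chunk
print(f"{n} BIO labels vs {num_candidate_spans(n)} candidate spans")
# 1000 BIO labels vs 500500 candidate spans
```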
06
Token-direct substitution preserves AI utility
By replacing detected PII with structured tokens like [PERSON_1] and [ORG_2] instead of masking with [REDACTED], downstream AI models can track entities across the document and produce coherent summaries and action items. The tokens are mapped back to real values on your device.
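A minimal sketch of token-direct substitution and on-device rehydration; the function names are illustrative, not VeilPhantom's actual API. Repeated mentions of the same value map to the same token, which is what lets a downstream LLM track entities coherently.

```python
import re

def pseudonymise(text, spans):
    """Replace detected (entity_type, value) spans with numbered tokens."""
    mapping, counters = {}, {}
    for etype, value in spans:
        if value not in mapping:  # same value -> same token every time
            counters[etype] = counters.get(etype, 0) + 1
            mapping[value] = f"[{etype}_{counters[etype]}]"
    # Replace longest values first to avoid partial-overlap artifacts.
    # (A production system would substitute by character offsets instead.)
    for value, token in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(value, token)
    return text, {tok: val for val, tok in mapping.items()}

def rehydrate(text, token_map):
    """Map tokens like [PERSON_1] back to their real values, on-device."""
    return re.sub(r"\[[A-Z]+_\d+\]",
                  lambda m: token_map.get(m.group(0), m.group(0)), text)

safe, token_map = pseudonymise("John asked Acme to pay John",
                               [("PERSON", "John"), ("ORG", "Acme")])
print(safe)                          # [PERSON_1] asked [ORG_1] to pay [PERSON_1]
print(rehydrate(safe, token_map))    # John asked Acme to pay John
```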

Use Shade V7 in your app

VeilPhantom is an open-source Python SDK that wraps Shade V7 and the full 7-layer pipeline into a drop-in privacy layer for any LLM. Redact PII, call your model with safe tokens, rehydrate the response.

$ pip install veil-phantom

Explore the SDK
93.3% Tool Accuracy · 885 PII Detected · 6ms Overhead

Try Veil for free

Free during beta. No account required. No data leaves your device.