
How to use Vault to securely search encrypted PII data

Build a secure search index of encrypted data for your application and perform substring searches

Overview

Piiano Vault secures your customers' data by encrypting it in the database. While many applications rely solely on encryption at rest, Piiano Vault adds an extra layer of application-level encryption. As a result, even if a rogue employee or hacker gains administrative access to your database, the data remains fully protected.

Vault then provides mechanisms that enable secure exact match and substring searches on the encrypted data.

But what if you have a mature system and don't want to migrate your data to Vault and re-architect your application? Fortunately, you can still take advantage of these features without that engineering overhead.

In this tutorial, you learn about:

  • How data stored in Piiano Vault is indexed for exact match and substring searches.
  • The considerations for integrating new and existing systems with Piiano Vault.
  • The options for adding application-level encryption to sensitive data stored in your database.
  • Using Piiano Vault to implement substring search on your local encrypted data.

Storing sensitive data in Piiano Vault

Ideally, all sensitive data should be stored in Piiano Vault rather than in your regular database. By adding an INDEX attribute to a property, you can easily perform exact-match searches using the Vault API. If you need to support substring-based lookups, use the SUBSTRING_INDEX attribute on the property. For more details on how substring search works, check out the substring search guide.

Integration considerations

New projects

For new applications, integrating with Vault typically involves minimal overhead because you can design your data access patterns to work with the Vault APIs.

Existing applications

Migrating an application can require more effort. Depending on the application's size, complexity, and data access methods, the changes required can be significant.
From a code perspective, each database access may need to be divided into two requests: one to retrieve non-sensitive information from the database and another to fetch sensitive information from Vault. The code then combines these results into one object. Moreover, queries involving sensitive information may need adjustments, even if they don't directly process the sensitive data but use it to reference other tables (e.g., in SQL join statements).
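The dual-fetch pattern described above can be sketched as follows. The fetch functions are hypothetical stand-ins for your database driver and the Vault SDK; only the merge logic is the point:

```typescript
interface ProfileRow {
  id: string;
  subscription_level: string;
}

interface PiiFields {
  email: string;
  ssn: string;
}

// stand-in: replace with a real database query for the non-sensitive columns
async function fetchProfileFromDb(id: string): Promise<ProfileRow> {
  return { id, subscription_level: "Premium" };
}

// stand-in: replace with a Vault SDK call that reads the object's sensitive properties
async function fetchPiiFromVault(id: string): Promise<PiiFields> {
  return { email: "john.doe@example.com", ssn: "123-45-6789" };
}

async function getUser(id: string): Promise<ProfileRow & PiiFields> {
  // one request for non-sensitive data, one for sensitive data,
  // combined into a single object for the rest of the application
  const [profile, pii] = await Promise.all([
    fetchProfileFromDb(id),
    fetchPiiFromVault(id),
  ]);
  return { ...profile, ...pii };
}
```

Running both requests in parallel keeps the added latency close to that of a single round trip.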

From a data perspective, sensitive information needs to be migrated to Vault and then removed from the original database. This typically involves a migration script that reads from your database, updates Vault, and then removes the migrated data from the source database.

Balancing security and ease of integration

Given the changes needed for a mature system, a full migration to Vault may involve a high integration cost. Therefore, Piiano recommends the alternative approach of encrypting sensitive data in place and replacing plaintext data with its encrypted equivalent. This strategy minimizes the need for extensive application changes while maintaining strong security. It enables you to use Vault's advanced security features, such as exact and substring search indexing, without requiring a complete overhaul of your data access architecture.

This in-place approach simplifies data migration for several reasons:

  1. Join statements involving sensitive encrypted data continue to function without modification.
  2. Queries that don’t require the plaintext version of the sensitive information can remain unchanged.
  3. Transactional behavior, such as rollbacks and commits, is preserved.
  4. Only one request to the database is needed, avoiding the need for a dual fetch (one from the database and another from Vault).
  5. ORM users can use a transformer to decrypt data during access, eliminating the need for additional changes.
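To illustrate point 5, here is a minimal transformer sketch. The interface is modeled on TypeORM's ValueTransformer and redeclared locally to keep the sketch self-contained; base64 encoding stands in for real encryption:

```typescript
// Shape modeled on TypeORM's ValueTransformer; other ORMs offer similar hooks.
interface ValueTransformer {
  to(plaintext: string): string; // applied when writing to the database
  from(stored: string): string; // applied when reading from the database
}

// base64 is a stand-in only: substitute your real encrypt/decrypt helpers here
const encryptionTransformer: ValueTransformer = {
  to: (plaintext) => Buffer.from(plaintext, "utf8").toString("base64"),
  from: (stored) => Buffer.from(stored, "base64").toString("utf8"),
};

// With TypeORM, you would then attach the transformer to a column, e.g.:
// @Column({ transformer: encryptionTransformer })
// email: string;
```

Because the ORM applies the transformer on every read and write, the rest of the application continues to see plaintext values.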

As part of this integration, you must identify all instances of the sensitive information that you want to encrypt and transition them consistently from plaintext to encrypted text. Using an LLM tool to scan your schema can help locate sensitive information.

warning

It's critical to identify every instance, as overlooking one could lead to application malfunctions.

Storing sensitive data in your database

It's crucial to encrypt sensitive data using application-level encryption before storing it in your database. See the column level encryption 101 blog post for more information. With encryption implemented, your regular queries are almost unaffected. You also minimize your integration effort, as it is limited to encrypting data before it's stored in the database and decrypting it when it's retrieved.

Some options for implementing encryption are:

  1. Database specific - Databases typically offer a built-in encryption module. For example, PostgreSQL supports encryption and decryption as part of an SQL query with pgcrypto.

  2. Natively - Most programming languages offer an encryption feature. For instance, in TypeScript, you use the crypto APIs.

  3. External package - An external package dedicated to encryption can be more straightforward to use. Piiano recommends Tink, which is maintained by Google and supports many languages, including Go, Java, and Python. Vault uses Tink to handle encryption.

  4. Using the Vault APIs - All the preceding options require an understanding of encryption, secure key management, key rotation, and more. You can offload these complexities to Vault using the Vault Crypto operations. Be aware that, as this involves an additional API call, it can introduce some latency.
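As a sketch of the native option (option 2), here is field-level encryption with Node's built-in crypto module using AES-256-GCM. The key is generated in place for illustration only; a real deployment would load it from a KMS or secrets manager and handle rotation:

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

const key = randomBytes(32); // illustration only: load a managed key in production

function encryptField(plaintext: string): string {
  const iv = randomBytes(12); // fresh IV per value: output is non-deterministic
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // store IV and auth tag alongside the ciphertext so the value is self-describing
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decryptField(encoded: string): string {
  const buf = Buffer.from(encoded, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ciphertext = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Note that because each value gets a fresh IV, this scheme is non-deterministic: encrypting the same plaintext twice yields different ciphertexts, which is safer but rules out searching by ciphertext equality.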

Consider a Users table containing these fields prior to encryption:

id  | first_name | last_name | ssn         | email                  | subscription_level | contact_method
001 | John       | Doe       | 123-45-6789 | john.doe@example.com   | Premium            | Email
002 | Jane       | Smith     | 987-65-4321 | jane.smith@example.com | Basic              | Phone
003 | Alice      | Johnson   | 456-78-9123 | alice.j@example.com    | Premium            | Email

The fields first_name, last_name, ssn, and email are sensitive. The rest aren't sensitive and can remain in plaintext. With the encryption in place, the data may look like this:

id  | first_name | last_name  | ssn        | email       | subscription_level | contact_method
001 | Jx8Nd3L... | Dk4Rt7P... | 4D9Fg+L... | eR3+8Lk9... | Premium            | Email
002 | Ax1Pd6T... | Sm2Nr9M... | 7P3Rx2N... | jQ7+6Tk3... | Basic              | Phone
003 | Al9Dr5K... | Jo3Fm4B... | 8L2Np7K... | aT4+9Lq8... | Premium            | Email

When you encrypt data, you're responsible for ensuring sufficient entropy to minimize exposure and prevent data leaks. Even with encryption in place, a naive approach can unintentionally reveal significant information. For instance, if all occurrences of John are encrypted deterministically, knowing one encryption for John would enable an attacker to identify all other instances of the name. This risk increases if an attacker can use the application to create entries in the database and analyze them, as this could help identify patterns and expose sensitive data.
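A quick way to see this risk is to compare encrypting the same value with a fixed IV (deterministic) versus a fresh random IV. A sketch using Node's crypto module:

```typescript
import { createCipheriv, randomBytes } from "node:crypto";

const key = randomBytes(32); // illustration only: load a managed key in production

// Deterministic: a fixed IV means identical plaintexts produce identical
// ciphertexts, so equality between rows leaks even without the key.
const fixedIv = Buffer.alloc(16, 0);
function deterministicEncrypt(value: string): string {
  const cipher = createCipheriv("aes-256-cbc", key, fixedIv);
  return Buffer.concat([cipher.update(value, "utf8"), cipher.final()]).toString("base64");
}

// Randomized: a fresh IV per value hides repetition, but a plain
// equality search over the ciphertext column no longer works.
function randomizedEncrypt(value: string): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  return Buffer.concat([iv, cipher.update(value, "utf8"), cipher.final()]).toString("base64");
}
```

Deterministic encryption is what makes ciphertext-equality searches possible, and it is also exactly what exposes the patterns described above.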

Assuming you accept these risks, you encrypt the text John and search for the ciphertext. The search looks something like this:

select * from users where first_name='Jx8Nd3L...';

While this approach works, it has several drawbacks:

  1. Your team must have a deep understanding of encryption, e.g., salt, initialization vector (IV), etc.
  2. You need to deal with secure key management and key rotation.
  3. You have to accept the information leakage security risk.
  4. It doesn't work with substring searches, only with exact match searches.

Use Vault as an encrypted search index

Migrating your sensitive data to Piiano Vault is ideal. To minimize integration costs, however, you may prefer not to migrate fully. Instead, you can take advantage of Vault's substring search by storing a copy of your sensitive data in Vault. This enables your main application logic to remain unchanged while Vault handles substring search queries.

The remainder of this article assumes that your data is encrypted in your database. It then shows you how to use Vault to create a search index for your data.

Setup

Assume your database has an encrypted users table, and you need to search by email and SSN. For brevity, the examples focus on these two fields; ssn is used to demonstrate exact match searches, and email for substring searches.

info

Why use exact match search rather than always perform substring searches? While substring indexing enables you to search for non-glob patterns and returns the same results as an exact match search, it comes with overheads. Vault uses an additional in-memory cache to process substring queries, which makes filtering slower compared to exact match searches. Additionally, there is a short synchronization delay before results appear in the index. Therefore, for scenarios that don't require substring searches, use exact match searches.

So, you have a table structure such as this:

-- create the users table
CREATE TABLE users (
  id UUID PRIMARY KEY,
  ssn TEXT UNIQUE,
  email TEXT UNIQUE
);

-- add search indices
CREATE INDEX idx_email ON users (email);
CREATE INDEX idx_ssn ON users (ssn);

You now create an equivalent collection in Vault to mirror that table, like this:

pvault collection add --collection-pvschema "
users PERSONS (
  email EMAIL NULL UNIQUE SUBSTRING_INDEX,
  ssn SSN NULL UNIQUE INDEX
);"

Vault collections include a built-in id property. If your objects are identified by a UUID, you can set the id to this value when you add data. Otherwise, you need to add a custom column in your database to store the Vault identifiers of the copied data. For simplicity and brevity, this tutorial assumes that your identifiers follow the UUID format and you set the Vault object id property to your identifier.

Initial data import

Now, you import your users table into the Vault collection using the add-objects bulk operation. For example, with the TypeScript SDK, the implementation might look like this:

import { VaultClient } from "@piiano/vault-client";

const client = new VaultClient({
  vaultURL: "https://vault.example.com", // replace with your vault URL
  apiKey: "vaultApiKey", // replace with your vault API key
});

const response = await client.objects.addObjects({
  collection: "users",
  requestBody: [
    {
      id: "b1e90723-358f-4252-93c5-abcab12ac38c", // this is the record's UUID from your database
      email: "john.doe@acme.com",
      ssn: "123-45-6789",
    },
    {
      id: "6b0323e6-b587-4f1b-8cfc-712fc4a4781f", // this is the record's UUID from your database
      email: "jane.smith@acme.com",
      ssn: "987-65-4321",
    },
  ],
});

Searching for sensitive data - exact match queries

With your data loaded and indexed in Vault, you can, for example, search for the user with the email john.doe@acme.com using the CLI object query command. In this case, requesting the id and ssn of the user found:

pvault object query --match email='john.doe@acme.com' --props id,ssn -c users

That gives you a response similar to this:

+--------------------------------------+-------------+
| id                                   | ssn         |
+--------------------------------------+-------------+
| b1e90723-358f-4252-93c5-abcab12ac38c | 123-45-6789 |
+--------------------------------------+-------------+

Alternatively, using TypeScript and the SDK, and again requesting the id and ssn of the found user:

const queryResponse = await client.objects.searchObjects({
  collection: "users",
  props: ["id", "ssn"],
  requestBody: {
    match: {
      email: "john.doe@acme.com",
    },
  },
});

That returns JSON similar to this:

{
  "results": [
    { "id": "b1e90723-358f-4252-93c5-abcab12ac38c", "ssn": "123-45-6789" }
  ],
  "paging": { "cursor": "", "size": 1, "remaining_count": 0 }
}

info

When sensitive data is stored in Vault, it's encrypted using a secure, non-deterministic encryption method. A blind index is then created for each entry to enable searches. The blind index is salted per property, which removes the correlation between identical values on different properties, reducing the risk of exposing exploitable patterns.
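Conceptually, a per-property salted blind index behaves like an HMAC keyed by a property-specific secret: the same value indexed under different properties yields unrelated digests. This sketch illustrates the idea only; Vault's actual construction may differ:

```typescript
import { createHmac, randomBytes } from "node:crypto";

// one secret salt per indexed property (illustrative stand-ins)
const propertySalts: Record<string, Buffer> = {
  email: randomBytes(32),
  ssn: randomBytes(32),
};

// For a given property, the same input always maps to the same digest,
// enabling exact-match lookups without revealing the plaintext.
function blindIndex(property: string, value: string): string {
  return createHmac("sha256", propertySalts[property]).update(value).digest("hex");
}
```

Because the salt differs per property, an attacker who learns that two email entries match gains no information about whether an ssn entry holds the same underlying value.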

Searching for sensitive data - substring queries

Substring search queries involve glob pattern matching, such as john* or *@some-domain.com. These patterns are valuable for management applications where you need to retrieve lists of users based on partial information. For this use case, exact match approaches are ineffective, because you cannot encrypt the glob pattern and expect it to match encrypted values.

Vault supports seamless substring search queries on encrypted data. So, by storing data in Vault, you can use its powerful search APIs to perform substring queries.

In Vault, for example, you can search for all users with an email beginning with john using the CLI object query command. In this case, requesting the id and ssn of the users found:

pvault object query --like email='john*' --props id,ssn -c users

That gives you a response similar to this:

+--------------------------------------+-------------+
| id                                   | ssn         |
+--------------------------------------+-------------+
| b1e90723-358f-4252-93c5-abcab12ac38c | 123-45-6789 |
+--------------------------------------+-------------+

Alternatively, using TypeScript and the SDK, and again requesting the id and ssn of the users found:

const queryResponse = await client.objects.searchObjects({
  collection: "users",
  props: ["id", "ssn"],
  requestBody: {
    like: {
      email: "john*",
    },
  },
});

That returns JSON similar to this:

{
  "results": [
    { "id": "b1e90723-358f-4252-93c5-abcab12ac38c", "ssn": "123-45-6789" }
  ],
  "paging": { "cursor": "", "size": 1, "remaining_count": 0 }
}

info

How does Vault perform substring searches on encrypted data? The technical blog Substring Matching over Field-Level Encrypted Data provides details.

Syncing data with Vault

You must keep your data in sync with Vault to ensure your searches remain relevant. Regardless of your encryption tool, you have several options for maintaining a synced copy of the data.

To illustrate, here are some strategies you can use when you deploy PostgreSQL. Similar approaches can be applied to other databases.

  1. Change Data Capture (CDC) or Logical Replication Use PostgreSQL's logical replication to stream updates from tables using libraries such as pg-logical-replication for Node.js. With this method, your code is called on every change or on a batch of changes. In this handler, you receive the changes detected by PostgreSQL and push them to Vault using the Vault APIs. This approach isolates all Vault-related code to the replication handler or hook. However, it has some drawbacks, including the need for additional PostgreSQL configuration and the need to decrypt the data before pushing it to Vault.

  2. Implement a lightweight logical replication Use an updated_at field to track changes. Most systems include an updated_at column for auditing and debugging purposes. Run a background job that polls your tables for records with updated timestamps and pushes these changes to Vault. This approach avoids additional PostgreSQL configuration and keeps Vault-related code in a background sync process. However, like logical replication, you must decrypt the data before sending it to Vault. Example query:

    SELECT * FROM users WHERE updated_at > '2024-12-24 10:00:00';
  3. Call Vault during add, update, and delete operations The simplest approach is to call the Vault API during your CRUD operations. This adds minimal latency, which is acceptable for most use cases. A significant advantage of this method is that you work with the plaintext data, simplifying the process as you don't need to decrypt your data before synchronizing. Example using the TypeScript SDK:

    const response = await client.objects.addObject({
      collection: "users",
      requestBody: {
        email: "john.doe@acme.com",
        ssn: "123-45-6789",
      },
    });
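The polling strategy (option 2) can be sketched as a single sync pass. The fetch and push functions are hypothetical stand-ins for your database query (with decryption) and the Vault SDK's bulk addObjects call; the watermark logic is the point:

```typescript
interface UserRow {
  id: string;
  email: string;
  ssn: string;
  updated_at: Date;
}

// stand-in: replace with a query such as
// SELECT * FROM users WHERE updated_at > $1, decrypting fields as needed;
// returns one sample row here so the sketch runs
async function fetchChangedSince(since: Date): Promise<UserRow[]> {
  return [
    {
      id: "6b0323e6-b587-4f1b-8cfc-712fc4a4781f",
      email: "jane.smith@acme.com",
      ssn: "987-65-4321",
      updated_at: new Date("2024-12-24T11:00:00Z"),
    },
  ];
}

// stand-in: replace with the Vault SDK's addObjects bulk call
async function pushToVault(rows: Omit<UserRow, "updated_at">[]): Promise<void> {}

// One polling pass: read changed rows, push plaintext copies to Vault,
// and advance the watermark to the newest timestamp seen.
async function syncOnce(lastSync: Date): Promise<Date> {
  const changed = await fetchChangedSince(lastSync);
  if (changed.length > 0) {
    await pushToVault(changed.map(({ id, email, ssn }) => ({ id, email, ssn })));
  }
  return changed.reduce(
    (max, row) => (row.updated_at > max ? row.updated_at : max),
    lastSync,
  );
}
```

Persist the returned watermark between runs so a crash or restart never skips changes; rows updated during a pass are simply picked up by the next one.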

Conclusion

When you encrypt data in your database, you can perform exact match searches. However, by storing your data in Vault, you can also perform substring searches on your encrypted data. This capability enhances your application's functionality while maintaining robust data security.

See also