Data Masking
The public pays increasing attention to the protection of personal information. Developers are therefore expected to encrypt sensitive information such as ID numbers, phone numbers, and bank card numbers, or to process it with a hash algorithm, before storing it. This has become common practice. However, developers inevitably run into legacy systems in which sensitive information is stored in plaintext without encryption.
If these plaintext fields are used in a core service system, masking the data becomes very difficult. First, the services of such a system usually cannot be interrupted. Second, the services are usually widely used and have accumulated a massive amount of data. Online data masking in these systems is therefore as hard as replacing the engine of an aircraft in flight. This article uses a real project as an example to introduce a solution to the most difficult challenge: data masking in online systems [1].
I. Decide the Encryption Method for Sensitive Data Storage
Although sensitive data is encrypted in storage, it must be converted back into plaintext when it is used, so the encryption must be reversible. To encrypt plaintext sensitive data in an online system, you can choose the widely used AES-256-CBC algorithm and add an xxx_secure field to store the encrypted value. However, because each encryption uses a random initialization vector (IV), encrypting the same plaintext multiple times produces different ciphertexts. As a result, the xxx_secure field cannot be indexed for lookups or compared directly. To solve this problem, a hash field, xxx_hash, is introduced for searching.
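The following is a minimal sketch of this pairing, not the project's actual code. The class name SensitiveFieldCodec, the Base64(IV + ciphertext) storage layout, and the use of SHA-256 for the hash field are assumptions made for illustration (note [2] mentions MD5 for the hash field). It uses only JDK classes.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration; the name and storage layout are assumptions.
final class SensitiveFieldCodec {
    private static final int IV_LENGTH = 16; // AES block size in bytes
    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    SensitiveFieldCodec(byte[] keyBytes) {
        if (keyBytes.length != 32) {
            throw new IllegalArgumentException("AES-256 needs a 32-byte key");
        }
        this.key = new SecretKeySpec(keyBytes, "AES");
    }

    // Value for xxx_secure: Base64(IV + ciphertext). A fresh random IV is drawn
    // on every call, so the same plaintext yields a different result each time.
    String encrypt(String plaintext) {
        try {
            byte[] iv = new byte[IV_LENGTH];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[IV_LENGTH + ct.length];
            System.arraycopy(iv, 0, out, 0, IV_LENGTH);
            System.arraycopy(ct, 0, out, IV_LENGTH, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    // Reverses encrypt(): the first 16 bytes are the IV, the rest is ciphertext.
    String decrypt(String stored) {
        try {
            byte[] in = Base64.getDecoder().decode(stored);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(in, 0, IV_LENGTH));
            byte[] pt = cipher.doFinal(in, IV_LENGTH, in.length - IV_LENGTH);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    // Value for xxx_hash: deterministic, so it can be indexed and compared.
    // SHA-256 is an assumption of this sketch.
    static String hash(String plaintext) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("hashing failed", e);
        }
    }
}
```

With this sketch, two calls to encrypt() on the same phone number return different strings that both decrypt to the original value, while hash() always returns the same string and can therefore back an equality query.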
II. Decide the Mapping of Data Objects
The system in this example has a sound object-relational mapping (ORM) mechanism and uses a data persistence framework based on the Java Persistence API (JPA). No code bypasses the ORM to read from or write to the database directly. Therefore, to mask the data, you only need to redirect the reads and writes of entity objects at the ORM layer to the corresponding encrypted fields. The read and write logic of the objects remains unchanged, and the business code does not need to be modified. This is the biggest advantage of the approach, and it demonstrates the value of data access objects (DAO): with a DAO design, even a simple framework offers more options for resolving serious system problems.
Generally, two methods are available for implementing data object mapping in JPA-based frameworks:
1. Define a Converter That Implements the AttributeConverter API for Bidirectional Conversion
AttributeConverter is a generic interface with two type parameters: the Java type of the entity attribute and the type of the database column. The conversion must work in both directions. We implemented an encrypting AttributeConverter in the common package. The following code example shows the related methods:
△ Figure 1: Code example
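As an illustration of the idea, here is a minimal sketch of an encrypting converter. It is not the converter from our common package. The AttributeConverter interface below is a local stand-in that mirrors the two methods of javax.persistence.AttributeConverter so that the sketch compiles without a JPA dependency; the class name EncryptedStringConverter, the all-zero demo key, and the Base64(IV + ciphertext) layout are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Local stand-in with the same two methods as javax.persistence.AttributeConverter.
// In a real project, import the JPA interface instead of declaring this one.
interface AttributeConverter<X, Y> {
    Y convertToDatabaseColumn(X attribute);
    X convertToEntityAttribute(Y dbData);
}

// Hypothetical converter: plaintext String in the entity, encrypted String in the column.
final class EncryptedStringConverter implements AttributeConverter<String, String> {
    private static final int IV_LENGTH = 16;
    // Assumption: demo-only key of 32 zero bytes. A real converter would load
    // the key from secure configuration or a key management service.
    private final SecretKeySpec key = new SecretKeySpec(new byte[32], "AES");
    private final SecureRandom random = new SecureRandom();

    @Override
    public String convertToDatabaseColumn(String attribute) {
        if (attribute == null) {
            return null; // keep NULL columns as NULL
        }
        try {
            byte[] iv = new byte[IV_LENGTH];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(attribute.getBytes(StandardCharsets.UTF_8));
            // Store the IV in front of the ciphertext so decryption can find it.
            byte[] out = Arrays.copyOf(iv, IV_LENGTH + ct.length);
            System.arraycopy(ct, 0, out, IV_LENGTH, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    @Override
    public String convertToEntityAttribute(String dbData) {
        if (dbData == null) {
            return null;
        }
        try {
            byte[] in = Base64.getDecoder().decode(dbData);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(in, 0, IV_LENGTH));
            byte[] pt = cipher.doFinal(in, IV_LENGTH, in.length - IV_LENGTH);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

The converter has a no-argument constructor because JPA instantiates converters itself. Null values pass through unchanged so that empty columns stay empty.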
Apply the @Convert annotation to an attribute of the Entity class to specify the converter. The attribute is then encrypted and decrypted automatically when it is written to and read from the database. Code example:
△ Figure 2: Code example
2. Implement the Custom UserType API
A custom UserType suits hashed data that needs automatic conversion, because it avoids the side effects of bidirectional conversion. During persistence, the JPA-based framework creates replicas of the original objects through deep copy. If we use an AttributeConverter for hashed data, AttributeConverterMutabilityPlanImpl.deepCopyNotNull(T value) performs the deep copy, and the copy operation eventually calls convertToEntityAttribute(String dbData) to produce the copied object. (For more information, see Figure 1.) However, the plaintext cannot be recovered when the database stores a hash value, because hashing is irreversible. When the data is stored, convertToDatabaseColumn(HashStringField attribute) is called to hash the data. If the value of a plaintext field is a null string, the data is lost.
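The problem can be illustrated with a small sketch. It assumes, as described above, that the deep copy is produced by a round trip through the converter. The class HashRoundTripDemo and its methods are hypothetical and only model that behavior with plain strings; SHA-256 is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical model of a one-way (hash) converter; not framework code.
final class HashRoundTripDemo {
    // Entity attribute to column: hash the plaintext (SHA-256 is an assumption).
    static String convertToDatabaseColumn(String plaintext) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Column to entity attribute: a hash cannot be reversed, so the best this
    // method can do is hand back the stored hash itself.
    static String convertToEntityAttribute(String dbData) {
        return dbData;
    }

    // Assumption: models a deep copy made by converting to the column form and back.
    static String deepCopy(String value) {
        return convertToEntityAttribute(convertToDatabaseColumn(value));
    }
}
```

In this model, deepCopy("13800138000") returns the hash rather than the phone number, and copying the copy hashes the value a second time. The copy is therefore never equal to the original, which is why a bidirectional converter is a poor fit for one-way data.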
A custom UserType lets us control how objects are copied and how data is mapped from Java types to SQL types. The interface defines many methods, including methods that specify the Java type and the SQL types and methods for the conversion between SQL data and Java objects in both directions. It also defines object identity methods such as hashCode() and equals(), deep copy methods, and serialization-related methods. Code example:
△ Figure 3: Code example
After implementing the custom UserType API, we can apply the type annotation to the corresponding attributes of the Entity class. The code example is as follows:
△ Figure 4: Code example
This gives us unidirectional conversion for the custom data type. In practice, however, this method is complicated. It is also awkward to declare a hash string, which is simply a String, as a HashStringField. Therefore, we finally chose to perform the type conversion directly in the getter and setter methods instead of using the custom UserType API.
III. "Replace the Engine of a Flying Aircraft": Seamless Data Masking for an Online System
Besides avoiding large-scale changes to the business logic, we also need to handle cross-node and cross-region service deployment when implementing automatic conversion of the plaintext fields, because new and old versions of the service run side by side in the online system during the upgrade and we cannot shut down a core system to upgrade it. So how do we "replace the engine of a flying aircraft"? After adding the fields to be masked to the database, we adopted a three-step solution. Step 1: Perform double write but trust plaintext fields. Step 2: Perform double write but trust encrypted fields. Step 3: Independently use encrypted fields. After these three steps, we can discard the plaintext fields to eliminate the potential security risk. The three steps are described in detail below:
Step 1: Perform double write but trust plaintext fields
Take the phone field in the Entity class as an example. First, add the fields to be masked, such as phone_hash and phone_secure, to the database. Their values can be empty by default. Then, update the Entity class and upgrade the service. The service reads only the plaintext fields but writes both the plaintext and the encrypted data to the database, which ensures that the data read during the service upgrade is accurate. The getter and setter methods perform the data type conversion.
△ Figure 5: Code example
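A minimal sketch of what such a step 1 entity could look like is shown here. The class names UserStep1 and Step1Crypto, the fromRow() helper that simulates loading a row, the all-zero demo key, and SHA-256 for the hash are all assumptions for illustration. The JPA annotations appear only as comments so that the sketch compiles with the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical crypto helper; the zero key is for demonstration only.
final class Step1Crypto {
    private static final SecretKeySpec KEY = new SecretKeySpec(new byte[32], "AES");

    static String encrypt(String plaintext) {
        try {
            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, KEY, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = Arrays.copyOf(iv, iv.length + ct.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    static String hash(String plaintext) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : d) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("hashing failed", e);
        }
    }
}

// Hypothetical entity for step 1: double write, reads trust the plaintext column.
final class UserStep1 {
    // @Column(name = "phone")
    private String phone;
    // @Column(name = "phone_secure")
    private String phoneSecure;
    // @Column(name = "phone_hash")
    private String phoneHash;

    // Simulates the framework populating the fields from a database row.
    static UserStep1 fromRow(String phone, String phoneSecure, String phoneHash) {
        UserStep1 u = new UserStep1();
        u.phone = phone;
        u.phoneSecure = phoneSecure;
        u.phoneHash = phoneHash;
        return u;
    }

    // Read path: trust the plaintext column, even if the new columns are empty.
    public String getPhone() {
        return phone;
    }

    // Write path: keep the plaintext and fill both new columns.
    public void setPhone(String phone) {
        this.phone = phone;
        this.phoneSecure = phone == null ? null : Step1Crypto.encrypt(phone);
        this.phoneHash = phone == null ? null : Step1Crypto.hash(phone);
    }

    String getPhoneSecure() { return phoneSecure; }
    String getPhoneHash() { return phoneHash; }
}
```

Because getPhone() ignores the new columns, rows that have not been backfilled yet, or that were written by nodes still running the old version, are read correctly.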
Step 2: Perform double write but trust encrypted fields
After the service is upgraded to write the fields to be masked, check all service functions to verify that the upgrade succeeded. Then run set phone_secure = Encrypt(phone) to fill in the encrypted field for all existing rows. After that, the phone_secure field holds accurate, encrypted data for every row.
Next, modify the code [2] to change the trust relationship. Because the service is deployed across nodes and regions, double write is still required to ensure that both new and old nodes read the latest data. However, the service now reads the encrypted fields when it loads objects from the database. Take the phone field as an example again. Code example:
△ Figure 6: Code example
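A minimal sketch of a step 2 entity follows, under the same assumptions as before: the names UserStep2 and Step2Crypto, the fromRow() helper, the all-zero demo key, and SHA-256 are illustrative only, and the JPA annotations appear as comments.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical crypto helper; the zero key is for demonstration only.
final class Step2Crypto {
    private static final SecretKeySpec KEY = new SecretKeySpec(new byte[32], "AES");

    static String encrypt(String plaintext) {
        try {
            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, KEY, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = Arrays.copyOf(iv, iv.length + ct.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    static String decrypt(String stored) {
        try {
            byte[] in = Base64.getDecoder().decode(stored);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, KEY, new IvParameterSpec(in, 0, 16));
            return new String(cipher.doFinal(in, 16, in.length - 16), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    static String hash(String plaintext) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(plaintext.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : d) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("hashing failed", e);
        }
    }
}

// Hypothetical entity for step 2: double write, reads trust the encrypted column.
final class UserStep2 {
    // @Column(name = "phone"), still written for nodes running the step 1 build
    private String plainPhone;
    // @Column(name = "phone_secure")
    private String phoneSecure;
    // @Column(name = "phone_hash")
    private String phoneHash;

    // Simulates the framework populating the fields from a database row.
    static UserStep2 fromRow(String plainPhone, String phoneSecure, String phoneHash) {
        UserStep2 u = new UserStep2();
        u.plainPhone = plainPhone;
        u.phoneSecure = phoneSecure;
        u.phoneHash = phoneHash;
        return u;
    }

    // Read path: decrypt phone_secure and ignore the plaintext column.
    public String getPhone() {
        return phoneSecure == null ? null : Step2Crypto.decrypt(phoneSecure);
    }

    // Write path: unchanged from step 1, all three columns are written.
    public void setPhone(String phone) {
        this.plainPhone = phone;
        this.phoneSecure = phone == null ? null : Step2Crypto.encrypt(phone);
        this.phoneHash = phone == null ? null : Step2Crypto.hash(phone);
    }

    String getPlainPhoneColumn() { return plainPhone; }
    String getPhoneHash() { return phoneHash; }
}
```

The only behavioral difference from the step 1 sketch is the getter. Callers still see a plaintext phone property, so the business code does not change.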
Step 3: Independently use encrypted fields
After step 2 is completed, all nodes trust the encrypted fields. In step 3, the service no longer writes plaintext, and the plaintext fields are discarded from the system. We now need to change the mapping relationship in the Entity class. Code example:
△ Figure 7: Code example
The Entity class is now restored to a clean version without the phoneSecure attribute. The code changes only slightly, with two annotations added to the phone field. If the system runs properly for a period of time after the plaintext fields are discarded, rename the table and then permanently delete the plaintext fields. This completes the data masking process for an online system.
[1] Note: The systems discussed in this article use relational databases, and the data masking solution introduced here applies only to such databases.
[2] Note: In this step, the code also needs to be modified if the phone field is used as a query condition. The query condition needs to be changed from the plaintext field to the hash (MD5) field.