Data Migration and Data Tiering Best Practices: Why Symlinks Fall Short for Serving Active Applications
Enterprises have been trying to make data easier to move for decades, but until now, solutions have required some manual oversight, making them complex and error-prone. Symlinks (symbolic links) are one of these solutions. A symlink decouples data from its location by creating a pointer that redirects applications to the new location. While symlinks work well for poor man’s cloning and system configuration, they have serious problems when used for data tiering and migration, especially for active data. Let’s take a closer look by comparing symlinks-based data movement with native data movement through a metadata engine like DataSphere, which automates data tiering and lets organizations place and move data non-disruptively using application and client metadata intelligence.
Automating Migration with Symlinks Can Cause Silent Data Loss or Corruption
Automating data migration with symlinks requires basic steps similar to the following:
1. If storage provides a specialized API, query storage and verify the original file is “closed”; otherwise, wait until it is closed before beginning.
2. Copy the original file to a target destination. This creates a “tgt/f1” file from the original “f1”.
3. Compare f1 and tgt/f1 to make sure there have been no changes. If changes have been made, delete the “tgt/f1” copy, wait for the file to be closed, and start over.
4. Rename the “f1” file on the source system as “old/f1”.
5. Create a symlink on the source system using the “f1” filename that points to the “tgt/f1” file located on the target system.
6. Verify “old/f1” has not changed and at some point in the future delete it.
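The steps above can be sketched in a few lines of Python. This is a minimal illustration for a POSIX system, not any particular product's implementation: the function name `migrate_with_symlink` and the scratch directory layout are assumptions made for the example, and step 1 (asking storage whether the file is closed) is omitted because it depends on a vendor-specific API.

```python
import filecmp
import os
import shutil
import tempfile


def migrate_with_symlink(src_path, tgt_dir, old_dir):
    """Hypothetical helper: move one file to tgt_dir and leave a symlink behind."""
    name = os.path.basename(src_path)
    tgt_path = os.path.join(tgt_dir, name)
    old_path = os.path.join(old_dir, name)

    # Step 2: copy the original to the target location.
    shutil.copy2(src_path, tgt_path)

    # Step 3: compare byte for byte; discard the copy if the source changed.
    if not filecmp.cmp(src_path, tgt_path, shallow=False):
        os.remove(tgt_path)
        return False

    # Step 4: set the original aside under old/.
    os.rename(src_path, old_path)

    # Step 5: put a symlink where the original used to be.
    os.symlink(tgt_path, src_path)
    return True


# Demo in a scratch directory (all paths are made up for the example).
root = tempfile.mkdtemp()
tgt_dir = os.path.join(root, "tgt")
old_dir = os.path.join(root, "old")
os.mkdir(tgt_dir)
os.mkdir(old_dir)

f1 = os.path.join(root, "f1")
with open(f1, "w") as fh:
    fh.write("cold data\n")

migrated = migrate_with_symlink(f1, tgt_dir, old_dir)
is_link = os.path.islink(f1)
with open(f1) as fh:
    content_via_link = fh.read()
```

Note that nothing in the sketch ties steps 3, 4, and 5 together atomically: a writer can still modify the source between the comparison and the rename, or keep writing to it afterward.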
For inactive or cold files, this process is seamless. When used in environments where there might be open/active files, this process opens a window for silent data loss or corruption. Data migration applications can’t check ahead of time to see if a file is open without specialized API calls to the storage device. Instead, they copy the file and then verify afterward that it was not modified, by checking that its “last accessed” attribute hasn’t changed and by performing checksums on the file.
Once the data migration application/script has performed the checksums, it can rename the original file and replace it with a symlink that points to the new file (steps 4 and 5 in the process described above). The problem is that, under POSIX semantics, applications that opened the original file (now old/f1) before the symlink was created keep their open handles to it and will continue to read from and modify it. This is true even after a delete (unlink) has occurred in the filesystem. When modifications happen to the old/f1 file, silent data loss or silent data corruption may occur, which is unacceptable for mission- and business-critical applications in most enterprise environments. This makes this type of automated migration and tiering unsuitable for active files and applications. While symlinks-based movement might still be useful for archiving unused cold data, applications must be paused or stopped during the archival process.
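The effect is easy to reproduce on a POSIX system. The following sketch is an illustration only, with scratch paths chosen for the example: an "application" opens f1, a migration then copies, renames, and symlinks it, and the application writes through the handle it already holds.

```python
import os
import tempfile

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "tgt"))
os.mkdir(os.path.join(root, "old"))
f1 = os.path.join(root, "f1")
tgt_f1 = os.path.join(root, "tgt", "f1")
old_f1 = os.path.join(root, "old", "f1")

with open(f1, "w") as fh:
    fh.write("v1\n")

# An application opens f1 for appending and keeps the handle open.
app = open(f1, "a")

# The migration runs: copy to the target, rename the source, create the symlink.
with open(f1) as src, open(tgt_f1, "w") as dst:
    dst.write(src.read())
os.rename(f1, old_f1)
os.symlink(tgt_f1, f1)

# The application writes through its existing handle, which still refers
# to the original file (now old/f1), not to the symlink's target.
app.write("v2 written after migration\n")
app.close()

with open(f1) as fh:
    seen_via_symlink = fh.read()   # what every new reader of f1 sees
with open(old_f1) as fh:
    seen_in_old = fh.read()        # where the late write actually landed
```

New readers of f1 see only the first line, while the later write sits in old/f1, which is scheduled for deletion. No call in the sequence returns an error, which is what makes the loss silent.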
Once a file has passed the hurdles of quiescing related applications, and a safe migration has completed, the symlinks must still be scanned continuously to protect data against the following failure modes, which can block access or even result in data loss/corruption:
1. If a symlink is moved or the mount namespace is changed, the symlink will no longer point to the correct path, and the file will become unreachable through it.
2. When a file is deleted, the normal delete process just deletes the symlink, leaving the file data intact in the archive location. The scanner must crawl the storage for orphaned data to actually delete it.
3. Suppose file f1 is archived and replaced with a symlink (f1 -> tgt/f1), and the symlink is then renamed to f2 (f2 -> tgt/f1). If a user or application creates a new file named “f1” and that file is archived, it overwrites the data f2 points to, because both symlinks now point to the same file (f1 -> tgt/f1 and f2 -> tgt/f1). The contents of that file will be those of the most recently archived f1.
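The third failure mode can be reproduced with a deliberately naive archiver. The `archive` function below is a hypothetical stand-in that names the archived copy after the file's basename; it assumes a POSIX system and a source and target on the same filesystem.

```python
import os
import tempfile


def archive(path, tgt_dir):
    """Hypothetical naive archiver: move the file to tgt_dir, leave a symlink."""
    tgt_path = os.path.join(tgt_dir, os.path.basename(path))
    os.replace(path, tgt_path)   # silently overwrites an existing tgt file
    os.symlink(tgt_path, path)


root = tempfile.mkdtemp()
tgt_dir = os.path.join(root, "tgt")
os.mkdir(tgt_dir)
f1 = os.path.join(root, "f1")
f2 = os.path.join(root, "f2")

# Archive the original f1, then rename the resulting symlink to f2.
with open(f1, "w") as fh:
    fh.write("original report\n")
archive(f1, tgt_dir)
os.rename(f1, f2)                # renames the link itself; f2 -> tgt/f1

# Someone creates an unrelated file that reuses the name f1, and it is archived.
with open(f1, "w") as fh:
    fh.write("unrelated new file\n")
archive(f1, tgt_dir)

with open(f2) as fh:
    f2_contents = fh.read()
same_target = os.readlink(f1) == os.readlink(f2)
```

After the second archive run, f2 no longer returns the original report. A real tool could avoid this by giving each archived copy a unique name, but it then needs its own bookkeeping to map names back to data.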
While a scanner can protect against these issues, the scanner itself creates a fourth failure mode: a large type 1 event, such as a long-distance move or namespace reorganization (possibly moving the symlinks outside of the scanner’s path), will appear to the scanner to be a type 2, or delete, event. In this fourth failure mode, standard administrative functions can accidentally and unknowingly cause the actual migrated data files to be deleted by the scanner.
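A short sketch shows how this fourth failure mode arises. The `find_orphans` function below is a hypothetical, simplified scanner that treats any archived file with no symlink under its scan root as orphaned; the directory names are made up for the example, and a POSIX system is assumed.

```python
import os
import tempfile


def find_orphans(scan_root, tgt_dir):
    """Hypothetical scanner: list tgt_dir files that no symlink under scan_root references."""
    referenced = set()
    for dirpath, dirnames, filenames in os.walk(scan_root):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                referenced.add(os.path.realpath(path))
    return sorted(
        os.path.join(tgt_dir, name)
        for name in os.listdir(tgt_dir)
        if os.path.realpath(os.path.join(tgt_dir, name)) not in referenced
    )


root = os.path.realpath(tempfile.mkdtemp())
projects = os.path.join(root, "projects")    # the only tree the scanner crawls
elsewhere = os.path.join(root, "elsewhere")  # a tree outside the scanner's path
tgt_dir = os.path.join(root, "tgt")
for d in (projects, elsewhere, tgt_dir):
    os.mkdir(d)

tgt_f1 = os.path.join(tgt_dir, "f1")
with open(tgt_f1, "w") as fh:
    fh.write("migrated data\n")
os.symlink(tgt_f1, os.path.join(projects, "f1"))

orphans_before = find_orphans(projects, tgt_dir)

# An administrator reorganizes the namespace and moves the symlink.
os.rename(os.path.join(projects, "f1"), os.path.join(elsewhere, "f1"))

orphans_after = find_orphans(projects, tgt_dir)
with open(os.path.join(elsewhere, "f1")) as fh:
    still_live = fh.read()
```

After the move, the scanner reports tgt/f1 as an orphan even though the relocated symlink still resolves to it. A scanner that deletes what it reports would destroy live data.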
DataSphere’s metadata-based intelligence and native movement is a much more elegant approach to data movement because it eliminates the risk of data loss and corruption presented by symlinks-based data tiering and migration, both during and after movement. The DataSphere architecture creates a global namespace and enables all storage in the namespace to be simultaneously available to applications. When an application needs access to data in the global namespace, it asks DataSphere for a layout that includes the location of the data, similar to the way a DNS server resolves www.google.com to the IP address of a Google server.
When DataSphere moves data (which can be automated according to user-defined policy), the Data Mover Extended Service temporarily moves into the data path and handles all application read and write access to the source file. All writes are performed to both the source and target files, guaranteeing data integrity. Once the migration is complete, DataSphere deletes the original source file, updates the data’s location in the metadata table, and seamlessly steps out of the data path. All future read/write requests go to the new location. The diagram below illustrates this process:
Figure 1. DataSphere maintains data integrity while migrating active files.
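The idea of mirroring writes during a move and then switching the recorded location can be modeled with a toy example. The sketch below is not DataSphere's implementation or API: the `ToyMetadataTable` class, its method names, and the tier directories are all hypothetical, and the model is single-threaded, so it ignores the concurrency a real data mover has to handle.

```python
import os
import tempfile


class ToyMetadataTable:
    """Hypothetical stand-in for a metadata service: logical name -> physical path."""

    def __init__(self):
        self.location = {}   # name -> current physical path
        self.mirror = {}     # name -> target path while a move is in progress

    def write(self, name, offset, data):
        # While a move is in flight, apply every write to source and target.
        paths = [self.location[name]]
        if name in self.mirror:
            paths.append(self.mirror[name])
        for path in paths:
            with open(path, "r+b") as fh:
                fh.seek(offset)
                fh.write(data)

    def read(self, name):
        with open(self.location[name], "rb") as fh:
            return fh.read()

    def begin_move(self, name, target_path):
        # Copy the existing bytes, then start mirroring writes.
        with open(self.location[name], "rb") as src, open(target_path, "wb") as dst:
            dst.write(src.read())
        self.mirror[name] = target_path

    def finish_move(self, name):
        # Switch the recorded location, then delete the old copy.
        source_path = self.location[name]
        self.location[name] = self.mirror.pop(name)
        os.remove(source_path)


root = tempfile.mkdtemp()
tier1 = os.path.join(root, "tier1")
tier2 = os.path.join(root, "tier2")
os.mkdir(tier1)
os.mkdir(tier2)
source_path = os.path.join(tier1, "f1")
with open(source_path, "wb") as fh:
    fh.write(b"hello world")

table = ToyMetadataTable()
table.location["f1"] = source_path

table.begin_move("f1", os.path.join(tier2, "f1"))
table.write("f1", 0, b"HELLO")        # a write that arrives mid-move
table.finish_move("f1")

result = table.read("f1")
```

Because clients in this model address the file by logical name and never by path, the write that arrives mid-move lands in both copies, and later reads are served from the new location without the client changing anything.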
Symlinks-Based Migration and Tiering Software Slows Application and System Performance
Data tiering with symbolic links degrades performance in two ways. First, every time a file is opened, accessing data through a symlink requires multiple open calls and multiple metadata operations. Second, software that uses symbolic links requires scanners to continually verify integrity, clean up data, and delete orphaned files. This continuous scanning of filesystems creates additional load on filers, which generally slows down the entire system, adding latency and lengthening response times for every request.
DataSphere’s metadata-based intelligence and movement eliminates the performance overhead created with symlinks-based solutions. Access between applications and data continues to be direct, avoiding the double open of symlink and file. In addition, DataSphere can distribute workloads down to the file level, ensuring available performance and capacity is evenly balanced across the entire system. Finally, as described in the previous section, since DataSphere maintains data integrity even while moving open/active files, it eliminates the burden of performance-consuming scanners to monitor and clean up after data migration operations.
Orphaned Files Waste Capacity
When data is deleted during a migration in an automated symlink-based environment, the initial delete removes just the symlink and not the actual file. Scanners continually search for files that no longer have symlinks, but don’t actually delete them until an unknown time in the future. This creates problems on both the source and the target file systems. It is common for organizations to want to free capacity on higher tier resources, as needed. Time lags in deletion on source file systems make this use case problematic and could even affect business if, for example, data was being migrated to rebalance workloads. Time lags on target file systems affect the efficiency of archival resources, since there will almost always be some data that has been “deleted” but continues to consume capacity. This time lag between when files are orphaned and when they are actually deleted may also violate compliance requirements, as the data in the deleted file can easily be recovered from the orphaned file.
Comparing Data Migration and Tiering with Metadata vs Symlinks
Symlinks are an incredibly powerful tool, useful in a vast number of use cases such as cold data archiving, but migration and tiering of active data aren’t a good use of symlink capabilities. Automated symlinks-based technologies create a risk of silent data loss and corruption. They also introduce scanners and environmental conditions that reduce clustered and individual filer performance. In contrast, DataSphere’s metadata engine enables even active data to migrate without risking data integrity. It moves the control path out of the data path to accelerate file access operations and enables file-granular workload distribution across all storage systems to deliver the highest performance. The result for enterprises is secure data migration and tiering, faster applications, simple scalability, smarter archiving to reduce costs, and fewer fire drills for IT.