Copy Data Management (CDM) is a relatively new category of IT solutions designed to manage the creation, use, distribution, retention, and clean-up of copies of production data, or “copy data.” Copy data is all data not currently being used in production: a snapshot, backup, or replica made for an IT or business function such as data recovery, Dev-Test, or analytics.
The CDM category is growing rapidly because of several IT problems:
- Copy data consumes the majority of storage capacity and is growing at a multiple of the production data environment (as much as 20x faster)
- Demand for access to recent copies of production data is soaring, driven both by traditional IT functions and by new business use cases
- Until recently, there have not been central solutions to manage creating and distributing copies, leaving most IT organizations with a complex mix of scripts and vendor tools without centralized control
Today, CDM solutions are available from several vendors in the market, large and small, and investment in this area is accelerating.
The Copy Data Challenge
The problem with copy data is there is too much of it. The problem starts in the IT department, which has a requirement to create protection copies. Multiple versions of data must be maintained for recovery, and this includes local copies for operational recovery and remote copies for disaster recovery. But protection copies are just the beginning.
The largest consumers of data are business users. This group includes:
- Software development and test
- Software quality assurance (QA) testing
- Reporting and analytics
- Legal compliance
- Internal system training
These groups and others demand frequent access to data. Not only does this grow the storage footprint dramatically, it also consumes large amounts of IT time since all these requests result in IT work.
The problem has become acute. According to a recent study by analyst firm IDC, copy data will carry a cost of $50 billion by 2018. IDC estimates that copy data accounts for up to 60 percent of current data growth, and there is no indication it will slow down.
Addressing the Copy Data Problem
Deploying a CDM solution can deliver immediate and significant benefits.
The basic function of a CDM solution is to take over the creation and use of data copies, unifying all the copy processes under a single tool. CDM also allows you to use a single data copy for multiple purposes. For example, if you have five software developers who each need a copy of an Oracle database, the traditional approach is to create five full copies of the database, requiring five times the amount of storage. With CDM, you may be able to use a single copy for all five developers.
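The storage arithmetic behind this example can be sketched in a few lines of Python. This is a rough illustration, not a sizing tool: the 1 TB database size and the 5 percent per-developer change rate are assumptions chosen for the example, and the idea that each developer's changes are stored separately from the shared copy is also an assumption about how a given CDM product works.

```python
def full_copy_storage_gb(db_size_gb: float, consumers: int) -> float:
    """Storage needed when every consumer gets a full physical copy."""
    return db_size_gb * consumers


def shared_copy_storage_gb(db_size_gb: float, consumers: int,
                           change_rate: float) -> float:
    """Storage for one shared copy plus each consumer's private changes.

    change_rate is a hypothetical fraction of the database that each
    consumer modifies while working on the shared copy.
    """
    return db_size_gb + db_size_gb * change_rate * consumers


db_size_gb = 1000    # assumed: a 1 TB database
developers = 5       # from the example in the text
change_rate = 0.05   # assumed: each developer changes 5% of the data

full = full_copy_storage_gb(db_size_gb, developers)
shared = shared_copy_storage_gb(db_size_gb, developers, change_rate)

print(f"Five full copies: {full:.0f} GB")
print(f"One shared copy:  {shared:.0f} GB")
print(f"Storage saved:    {full - shared:.0f} GB")
```

Under these assumptions the five full copies need 5,000 GB, while the shared copy needs 1,250 GB. The actual saving depends on how much data each developer changes.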
It is important to stay aware of the performance of this shared copy. If it relies on spinning disk storage, it may not have the IOPS to satisfy five developers at the same time. A system using all-flash storage, by contrast, can typically handle this workload.
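A quick way to reason about this is to compare the IOPS the storage can deliver with the combined demand of everyone sharing the copy. The sketch below does exactly that. All of the figures in it are hypothetical placeholders, so substitute measured numbers from your own environment.

```python
def can_share_copy(device_iops: int, consumers: int,
                   iops_per_consumer: int) -> bool:
    """Return True if the device can serve all consumers at once."""
    return device_iops >= consumers * iops_per_consumer


# Hypothetical figures for illustration only.
DISK_ARRAY_IOPS = 2_000      # assumed spinning-disk array
FLASH_ARRAY_IOPS = 100_000   # assumed all-flash array
IOPS_PER_DEVELOPER = 1_000   # assumed peak demand per developer
DEVELOPERS = 5

disk_ok = can_share_copy(DISK_ARRAY_IOPS, DEVELOPERS, IOPS_PER_DEVELOPER)
flash_ok = can_share_copy(FLASH_ARRAY_IOPS, DEVELOPERS, IOPS_PER_DEVELOPER)

print(f"Spinning disk can serve {DEVELOPERS} developers: {disk_ok}")
print(f"All-flash can serve {DEVELOPERS} developers: {flash_ok}")
```

This is a simplification: it ignores latency, read/write mix, and other workloads on the same array, all of which affect whether a shared copy performs well in practice.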
Copy Data Management Architecture
There are two common CDM architectures. The first is designed like a backup model. Data is copied from the primary storage through the host server to a secondary storage device. Copies are then mounted from that device and shared among end users. This method was popularized by Actifio and is used by other vendors such as Rubrik and Cohesity.
This is the typical design of a backup-type copy data management architecture.
The main advantage of this model is that it can work with any primary storage. Keep in mind that the data copy process may impact host performance, especially for high-transaction systems like databases. This model also adds a new hardware stack to manage, which can create additional IT work.
The second approach is to provide copy data services using the copy features of the primary storage array. This is known as in-place or integrated copy data management and was first provided by Catalogic. IBM and EMC have since released in-place CDM solutions of their own. With in-place CDM you can use copies on the primary storage device, or you can replicate data to a second device and use that as the CDM source.
This is a typical architecture of an in-place copy data management solution.
The limitation of the in-place architecture is that the CDM software needs to support your existing storage. But if support is available, the in-place model has key advantages. First, it works with existing storage, so there is no need to deploy and learn a new storage system. Second, if you are using CDM for Dev-Test purposes, it is helpful to develop applications on the same storage stack that will be used in production; otherwise, it can be difficult to troubleshoot the source of a problem. Finally, in-place storage snapshots can restore lost data in minutes because no data has to cross the network. This can be critical when responding to events such as an accidentally deleted database, data corruption, or a virus attack.
A few common characteristics should exist in any true CDM offering. Things you should look for are:
- Central Catalog – A CDM solution should have a central catalog that tracks all copies across the environment, including remote sites and cloud environments. The catalog should maintain the metadata for each copy so that you can locate every copy associated with a given application or VM.
- Policy Engine – The CDM should have a policy-based model that allows IT to define a set of policies for creating copies of data and then apply them as needed to workloads. This reduces the amount of management effort.
- Robust Reporting – CDM solutions should be able to run reports that help the IT team understand its copy environment, make informed decisions, and act on them, for example by addressing problems or cleaning up old, unused copies.
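To make these three characteristics more concrete, here is a minimal Python sketch of how a catalog, a retention policy, and a clean-up report could fit together. It is an illustration only and does not reflect any vendor's actual API: the names `CopyRecord`, `Policy`, and `Catalog`, the sample applications, and the 14-day retention period are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CopyRecord:
    """Metadata the catalog keeps for one copy (hypothetical fields)."""
    application: str
    location: str     # e.g. "primary", "remote-site", "cloud"
    purpose: str      # e.g. "backup", "dev-test", "analytics"
    created: datetime


@dataclass
class Policy:
    """A simple retention policy: how long a copy may be kept."""
    name: str
    retention: timedelta


class Catalog:
    """Central catalog tracking every copy across all locations."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def copies_for(self, application):
        """Locate all copies associated with one application."""
        return [r for r in self._records if r.application == application]

    def expired(self, policy, now):
        """Report copies older than the policy's retention period."""
        return [r for r in self._records
                if now - r.created > policy.retention]


catalog = Catalog()
catalog.register(CopyRecord("orders-db", "primary", "dev-test",
                            datetime(2024, 1, 1)))
catalog.register(CopyRecord("orders-db", "cloud", "backup",
                            datetime(2024, 1, 25)))
catalog.register(CopyRecord("hr-app", "remote-site", "analytics",
                            datetime(2024, 1, 28)))

policy = Policy("two-week-retention", timedelta(days=14))
now = datetime(2024, 1, 31)

print("Copies of orders-db:", len(catalog.copies_for("orders-db")))
for record in catalog.expired(policy, now):
    print("Clean-up candidate:", record.application, record.purpose,
          record.location)
```

In this sketch the policy is defined once and evaluated against every copy in the catalog, which is the point of a policy-based model: the rule is written a single time rather than scripted per workload.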
When deciding between copy data architectures, the backup-type model may be the right choice if you are already looking for a new backup solution. If you are happy with your current backup software, this approach would overlap and conflict with it.
The in-place approach may be right if you want to simplify the creation and use of copies on your primary storage, and to get more value out of your storage purchase. This is particularly true with all-flash systems that allow you to run more workloads from a single snapshot.