A question that I surprisingly never get asked is "what about putting the data lake in The Cloud?" Maybe I'm not asked that question because organizations are still confused about what a data lake is. Or maybe I'm not asked that question because everyone (but me) already knows the answer. Well, I thought I'd partner with my super smart friend Brandon Kaier (twitter: @bkaier) to write a blog post, mostly for my own benefit (it wouldn't be the first time). I need to start understanding how the data lake does or doesn't benefit from the cloud. There must be some overlap, because both are focused on driving down the economics of managing IT resources and on users' ability to get access to those resources. But I get the feeling that there are some serious considerations and issues in how organizations should be thinking about the data lake in The Cloud. I bet that the most serious issues come not from storing and managing the data itself. My bet is that the issues arise in providing an agile, fail-fast analytic sandbox environment, securely, with data and analytic mobility, the features we would expect from The Cloud. Let me explore that further.

## What is a Data Lake?

There are plenty of technical definitions you can google about what "it" is. More importantly, let's start the conversation by making sure that we understand what a data lake "does" and what it "means" to the business. Here is what I think is most important about a data lake: a Data Lake is a SINGLE repository for storing (either physically or logically) all the organization's data, including data generated from internal transactions and interactions as well as data gathered from third-party and publicly available sources. The Hadoop Distributed File System (HDFS) is the preferred data lake platform because it provides a cost-effective, powerful, agile, scale-out environment for assembling, preparing, aligning, enriching, and analyzing diverse structured and unstructured data sources. In short, the data lake gives the organization a single, cost-effective, agile place in which to land and analyze all of its data.
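To make the "single repository" idea concrete, here is a minimal sketch of landing a few diverse raw sources into one HDFS namespace. It assumes a configured Hadoop client on the path; the directory layout and file names are purely illustrative, not a prescribed standard.

```python
import subprocess

def hdfs(*args):
    """Thin wrapper around the standard `hdfs dfs` command-line client."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# One namespace, many sources: internal transactions, interaction
# logs, and third-party/public feeds all land in the same lake.
for source in ["erp/orders", "web/clickstream", "thirdparty/demographics"]:
    hdfs("-mkdir", "-p", f"/datalake/raw/{source}")

# Land a local extract as-is; schema-on-read means no up-front modeling.
hdfs("-put", "-f", "orders_2016-06-01.csv", "/datalake/raw/erp/orders/")
```

The point of the sketch is that nothing about the data has to be decided in advance: structured and unstructured sources alike simply land in the same scale-out file system and get interpreted later.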
## What is The Cloud?

The Cloud is a general term for the delivery of hosted services over the Internet. The Cloud should enable companies to consume compute resources as a utility, just like electricity, rather than having to build and maintain computing infrastructure in-house. This tends to be what people, especially in the lines of business, think of when they hear the phrase "The Cloud." Or maybe Jason Segel in the movie "Sex Tape" got it right.

When many folks think of The Cloud, they immediately think of the Amazon and Google public clouds providing an inexpensive option for organizations that want to quickly stand up a computing and storage environment. One can literally buy this environment with a credit card and (roughly) pay only for the computing and storage actually used. Again, this perception is especially true within the lines of business.

## Why Not Put The Data Lake In The Cloud?

If The Cloud delivers resources to me in a utility model, it seems like a natural match to put the data lake in the cloud; in fact, one might call a data lake a purpose-built cloud. But the conversation just isn't that simple. There are some important considerations, for both the business and the data science team, before making the jump to The Cloud, especially the public cloud.
For the data science team, the biggest challenges are physical: it's just darn difficult to move large volumes of data between an on-premises environment and the public cloud (and when it needs to be done repeatedly, it can get very expensive very quickly). In the world of data science, data scientists want to work with large sets of very detailed (highly granular) data that can change often. For example, let's say that we want to determine (predict) how valuable a customer might be to the organization. Organizations should be able to easily calculate (if they don't have data silos) how valuable a customer is today by looking at that customer's purchase history, returns history, payment history, product margins, frequency of purchases, time sequence of purchases, and any costs associated with selling to and servicing that customer (see the sketch below). But if we are trying to predict how valuable a customer might be to the organization, we might want to bring in other data sources such as social media, clickstream, and mobile data.
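Here is a minimal sketch of the kind of current-value calculation described above. The `Transaction` fields, the margin figures, and the flat service-cost number are illustrative assumptions, not a real scoring model; a production version would also fold in payment history and purchase timing.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    revenue: float      # sale amount
    margin_pct: float   # product margin on the sale
    returned: bool      # whether the item came back

def current_customer_value(transactions, service_costs):
    """Naive current-value score: margin earned on kept purchases,
    minus the cost of selling to and servicing the customer."""
    gross_margin = sum(t.revenue * t.margin_pct
                       for t in transactions if not t.returned)
    return gross_margin - service_costs

history = [
    Transaction(120.00, 0.30, False),
    Transaction(80.00, 0.25, True),   # returned, earns nothing
    Transaction(45.00, 0.40, False),
]
print(current_customer_value(history, service_costs=22.50))  # 36 + 18 - 22.5 = 31.5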
Each of these additional data sources is quite voluminous; integrating the organization's financial and operational data with its social media, clickstream, and mobile data requires a significant amount of bandwidth just to move the data into and out of the different data science sandboxes. Exacerbating the problem is the "fail fast / learn faster" mentality of the data science process. Your data scientist just loaded two extremely large data sets into your data lake, discovered that those data sets provide no appreciable value or insight into the problem at hand, and now just wants to delete them. There are incremental costs to every one of those movements of data to The Cloud.

So how would an organization move this data to the public cloud cost-effectively? Amazon's solution (which seems very economical) is its Snowball product. What is Snowball? It's a large storage appliance that is delivered to the organization by, yep, data transfer courtesy of FedEx! As the Amazon website states: "AWS Import/Export Snowball – Transfer 1 Petabyte Per Week Using Amazon-Owned Storage Appliances." This is nothing new technologically; it has existed for decades under the name "sneakernet," which is known for very high bandwidth but incredibly long latency. That's not exactly my idea of how to best support the data science "rapid data ingest / fail fast / learn faster" model development, testing, and refinement process.

## Summary

The data-lake-in-the-public-cloud problem is a physics problem: data movement is still the bitch of our industry. Let's face it, the real problem with Big Data is that it is big, and big things are hard to move. Given the data science team's need for "rapid data ingest / fail fast / learn faster" model development, testing, and refinement, I just don't see how the public cloud plays in the data lake deployment other than to support one-off, skunk-works types of projects that, once any level of value is demonstrated, get moved back to the on-premises cloud. That said, there is a place for The Cloud in the broader ecosystem. Global deployment of, or access to, the results of the data science is an excellent use case. Many organizations are doing just that today: the data science is all done in-house, but the applications that deliver the analytic results (recommendations, scores) are hosted in The Cloud, allowing for global access to those results. Am I missing something?
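P.S. For anyone who wants to put numbers on the physics argument, here's a back-of-the-envelope sketch. The 80% link-efficiency factor and the decimal petabyte are my assumptions; the petabyte-per-week figure is Amazon's own claim quoted above.

```python
def transfer_days(bytes_to_move, link_gbps, efficiency=0.8):
    """Days to push data over a network link, assuming the stated
    fraction of raw line rate is actually achievable."""
    bits = bytes_to_move * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400

PETABYTE = 10**15  # decimal petabyte, as storage vendors count

for gbps in (1, 10):
    print(f"1 PB over {gbps} Gbps: ~{transfer_days(PETABYTE, gbps):.0f} days")
# 1 PB over 1 Gbps: ~116 days; over 10 Gbps: ~12 days -- versus
# roughly a week per petabyte for the shipped-appliance approach.
```

Shipping the appliance really does win on raw throughput; the latency, and what it does to a fail-fast workflow, is the whole argument.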
