Capturing the Loch Ness Data Monster: How to Build A Data Lake

In 1934, Dr. Robert Kenneth Wilson took this picture that proved hundreds of years of speculation: the Loch Ness monster exists. Of course, the famous "surgeon’s photograph" was later proven to be an infamous hoax, but still. Can you tell me with 100% certainty that Nessie isn’t real?
Before you start rolling your eyes, please bear with me. This isn’t supposed to be a lesson in cryptozoology. This is a tale of our own lake—a data lake—and there’s no monster in it, just a monstrous amount of data. You can use this story as a foundation for setting up your own data lake.
Digital Transformation: We Need To Talk
It doesn’t matter where you are or what company you’re in. Almost every self-styled “business expert” can’t wait to tell you that you need to go digital, citing Blockbuster versus Netflix or taxi companies versus Uber examples as proof that they’re right. And, as annoying as they may be, they are right.
The digital era sets us up for a world where change happens at an ever faster pace, and decisions often have to be made in real time. Two years ago, not taking immediate action could mean a missed opportunity, but now it can mean jeopardizing a company’s survival. The danger is that anyone can make a fast decision, but it might not be a good one.
Leaders and decision-makers don’t look for some crystal ball to tell them what to do or else they would be out of a job. What they aim for is data — accurate, precise, clean, insightful, relevant, and contextualized data. When they have it, they can use their experiences, expertise, and knowledge to make better decisions that are resistant to the inherent biases and preconceptions that we all have. No bias can withstand the impact of a proper histogram or line chart.
Having accurate insights removes “I think that,” “my perception is,” and “the way I see it” from conversations. Reality is just there, shown on a big screen in the meeting room, and you can’t avoid it. This is when really productive conversations start.
Now here’s where it gets personal. I proudly work as part of the OutSystems digital team, which provides smooth and integrated experiences and innovative solutions to OutSystems customers. To do this and also help our whole company make the best decisions possible, we needed insights from all the data we’ve been collecting. And so we settled on a data lake.
What Is a Data Lake?
A data lake is a repository for storing all relevant business data, in its original form, to be used for reporting, analytics, advanced data science, AI, machine learning, and more. James Dixon, who coined the term, contrasts it with a data mart, which he compares to bottled water.
Why is this so cool? It’s simple. A data lake can collect information from any source, store and process it quickly and reliably, scale when needed, and ultimately provide insights to the whole company. As a result, everyone can understand and support decisions based on the monitoring of accessible, relevant data.
The waters of any data lake should remain calm in the face of a whirlwind business. And, like any good man-made lake, it should be easy to add it to the existing landscape, even if there are all kinds of different tools in that landscape. When teams use skills and tools they already have, setup time is minimal.
A Technology Search Results in… a Snowflake?
Almost every successful digital initiative started with technology research. Ours was no different. Armed with the knowledge of what we wanted our lake to do and be, we went on the hunt.
We chose Fivetran to collect structured data because it offers out-of-the-box connectors to some of the most common sources, which significantly reduces the extract/load effort. We selected Amazon Web Services to handle streaming and unstructured data because of its ability to scale and the quality of its services.
For data storage, we decided to use Snowflake. We can now store massive amounts of data with almost zero maintenance and navigate that information using SQL, the most common query language. We highly recommend this solution to anyone building their own data lake.
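To give a feel for what "navigating with SQL" looks like, here is a minimal sketch. It is not our actual setup: it uses Python's built-in sqlite3 module purely as a local stand-in for a warehouse, and the signups table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical rows, standing in for data a connector has already loaded.
RAW_SIGNUPS = [
    ("crm", "2020-01-06", "PT"),
    ("crm", "2020-01-06", "US"),
    ("website", "2020-01-07", "US"),
    ("website", "2020-01-07", "US"),
]


def signups_per_country(rows):
    """Load rows into an in-memory table and count signups per country with SQL."""
    conn = sqlite3.connect(":memory:")  # local stand-in, not a real warehouse
    try:
        conn.execute(
            "CREATE TABLE signups (source TEXT, signup_date TEXT, country TEXT)"
        )
        conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, COUNT(*) AS total FROM signups "
            "GROUP BY country ORDER BY total DESC, country"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for country, total in signups_per_country(RAW_SIGNUPS):
        print(country, total)
```

The point of the sketch is only that once data from different sources sits in one place, a single familiar query can answer a business question across all of it.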
In just two months, we had the lake fully functioning, and it was a thing of beauty. This is a 20-foot view:
But like a real lake or pond, a data lake needs conservators who can keep it clear of data debris while sharing important insights from new sources.
Building a Data Team
Choosing a team to be the gatekeepers of the data lake and all data-related matters should be part of any good data lake strategy. For our lake, we gathered a team of data engineers, data modelers, and data scientists, each one focused on one section of the data supply chain. We called them the “Mighty Lords and Ladies of Data” until someone pointed out that this description was much too long, so we went instead with the less epic name of “Data Team.”
This team implements, monitors, maintains, and evolves the data lake, transforming it into specific and consolidated views of the business, composed of all the relevant metrics and KPIs in each domain. Just as lake conservators stock fish and ensure water purity, our team adds new data sources, ensures data quality, shares insights with the company, standardizes metrics and KPIs, and answers data science requests. To put it simply, thanks to this team, everyone at OutSystems has access to rich data and can use it for the good of the company. And if you put a similar team to work in your company, well, there will be no data monsters there!
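If you are wondering what "ensuring data quality" can mean in practice, the sketch below shows one very simple kind of check: counting rows with missing required fields and rows that repeat an identifier. It is an illustration under assumptions, not our team's tooling; the QualityReport class, the check_quality function, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total_rows: int
    rows_with_missing: int  # rows where at least one required field is empty
    duplicate_ids: int      # rows whose id was already seen earlier

    @property
    def is_clean(self) -> bool:
        return self.rows_with_missing == 0 and self.duplicate_ids == 0


def check_quality(rows, id_field, required_fields):
    """Scan a list of dict rows and summarize two basic quality problems."""
    seen_ids = set()
    rows_with_missing = 0
    duplicate_ids = 0
    for row in rows:
        # A field counts as missing when it is absent, None, or an empty string.
        if any(row.get(field) in (None, "") for field in required_fields):
            rows_with_missing += 1
        row_id = row.get(id_field)
        if row_id in seen_ids:
            duplicate_ids += 1
        else:
            seen_ids.add(row_id)
    return QualityReport(len(rows), rows_with_missing, duplicate_ids)


if __name__ == "__main__":
    # Hypothetical sample data with one empty email, one missing country,
    # and one repeated id.
    sample = [
        {"id": 1, "email": "a@example.com", "country": "PT"},
        {"id": 2, "email": "", "country": "US"},
        {"id": 2, "email": "b@example.com", "country": "US"},
        {"id": 3, "email": "c@example.com", "country": None},
    ]
    report = check_quality(sample, "id", ["email", "country"])
    print(report, report.is_clean)
```

Checks like this are cheap to run on every new source before its data reaches anyone's dashboard, which is where a data team earns its keep.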
Taming the Data Monster
Loch Ness covers an area of 56.4 square kilometers, with a length of 36.3 kilometers, and reaches 226.96 meters at its deepest point. That’s a lot of space for a shy monster to hide. Now imagine if you could compress that space and use your favorite water-draining tool. You might find the Loch Ness Monster, befriend it, and encourage it to work for you. Your data lake can be the same. Care for it and properly dredge it, and you’ll be making it easy for everyone to find your version of “Nessie”: important insights that enable you to take the right action whenever it’s needed.