Kaggle, the Google-acquired data science platform, started as a virtual meeting point for machine-learning geeks to compete on predictive accuracy scores.
It evolved into a Swiss Army knife for data science and analytics—one that can help data professionals, including data-driven marketers, elevate their analytics game.
Despite being a free service, Kaggle can help address an increasing number of data challenges:
- How to find reliable data sources to enrich existing customer and marketing data;
- How to find ideas, inspiration, and relevant code for a new data analysis without reinventing the wheel;
- How to collaborate efficiently on a data project with colleagues;
- How to apply machine learning and artificial intelligence to marketing analytics projects.
This is, of course, just a partial list. This post focuses on these and other marketing-friendly use cases for Kaggle.
What is Kaggle?
Kaggle launched in 2010. It became known as a platform for hosting machine-learning competitions. The competitions were typically sponsored by large companies, governments, and research institutes.
Their goal was (and still is) to leverage the collective intelligence of thousands of data scientists around the world to solve a data problem.
In a Kaggle competition, you can compete for jobs or money (or glory). But the platform has evolved from that initial use case.
In 2017, Kaggle was acquired by Google. After the acquisition, it started branching out into more areas of data science and analytics. The aim is clear—to become a one-stop shop for data professionals. (It’s currently being rebranded as “the home for data science.”)
Below, I discuss five fresh and relevant features for marketers, regardless of technical ability:
- Kaggle datasets;
- Kaggle community analyses;
- Kaggle Notebooks;
- Kaggle cloud integrations;
- Machine learning with Kaggle.
To make the most of Kaggle, some ability to work with code is helpful. If you don’t code, however, no worries: this isn’t a technical post.
1. Kaggle datasets: Access high-quality, relevant data.
Have you ever been in the following situation? You’re gazing over a large data file with lots of numbers but little explanation. You’re trying to figure out what each row and column represent, and no one seems to have precise documentation.
What if we could ensure our datasets were clearly documented? This goes beyond just having a data dictionary for feature definitions.
What if we knew who collected the data, the sources and methodology they used, and if any data is missing? And, if so, why? Is it random? Is there a pattern or reason behind it? Wouldn’t it be nice to know, too, if someone, somewhere, is actively maintaining the dataset?
This is the idea behind Kaggle datasets, a collection of thousands of high-quality datasets, all with an automatic quality score based on availability of metadata. These datasets are searchable and have helpful tags attached to them (e.g., industry, data type, and associated analyses).
Where applicable, the data sources are verified, too. And there’s an added bonus: Given an initial dataset, Kaggle can make recommendations for relevant, complementary datasets.
There are more than 20,000 datasets in Kaggle, including census, employment, and geographic data, which analysts can access and analyze directly from their browsers. Most importantly, there’s a large variety of datasets related to marketing, ecommerce, and sales.
Some interesting marketing datasets to explore. They come with a quality score ranging from 1 to 10 based on how complete the documentation is.
How do you find datasets on Kaggle?
It couldn’t be easier:
- Connect to kaggle.com. (There’s an optional Google login.)
- Look for the datasets section near the top of the page.
- Enter a keyword to search the datasets database.
- Scan the results and review the dataset quality scores, interestingness scores, and short descriptions.
- Select the dataset that resonates most with you.
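If you're comfortable with a little code, the scanning step above can be mimicked in a few lines of Python. This is only a sketch and not Kaggle's API: the `DatasetHit` record, its field names, the `min_score` cutoff, and the sample entries are all hypothetical stand-ins for whatever a search returns.

```python
from dataclasses import dataclass


@dataclass
class DatasetHit:
    """Hypothetical search result; field names are illustrative, not Kaggle's schema."""
    title: str
    quality_score: float  # 1 to 10, based on how complete the documentation is
    description: str


def shortlist(hits, keyword, min_score=7.0):
    """Keep hits that mention the keyword, then sort best-documented first."""
    keyword = keyword.lower()
    matches = [
        h for h in hits
        if keyword in h.title.lower() or keyword in h.description.lower()
    ]
    good = [h for h in matches if h.quality_score >= min_score]
    return sorted(good, key=lambda h: h.quality_score, reverse=True)


# Made-up entries for illustration only.
sample_hits = [
    DatasetHit("Online retail transactions", 8.5, "Ecommerce orders by customer"),
    DatasetHit("Ad campaign results", 6.2, "Marketing spend and conversions"),
    DatasetHit("Email marketing log", 9.1, "Opens and clicks per campaign"),
]

for hit in shortlist(sample_hits, "marketing"):
    print(f"{hit.quality_score:>4}  {hit.title}")
```

Lowering `min_score` widens the shortlist; the cutoff of 7 is an arbitrary choice for this sketch.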
Bonus dataset: Google Analytics data from the Google Merchandise store
If you work with Google Analytics, there’s a bonus for you: the dataset from the first Kaggle machine-learning competition based on Google Analytics data, which concluded earlier this year.
Digital analysts can access raw, hit-level data (with full ecommerce implementation) that spans a full year of customer activity in the Google Merchandise store.
Working with this dataset is a valuable way to understand the underlying structure of Google Analytics data and to experiment with advanced statistical and data-mining techniques that can’t be applied when the data is in aggregate form (which is the norm with standard Google Analytics).
2. Kaggle community analyses: Jump-start your analysis by reviewing others’ work.
When starting to analyze your marketing data, finding relevant datasets to combine with your original one is useful. But it’s even better if you can see all existing work that’s been published on a given dataset by other Kagglers. This can be a source of inspiration but also a time saver, especially in the initial stage of an analysis.
It’s sometimes daunting to choose among all available analyses. Similar to a social network, Kaggle shows you how the community has interacted with each piece of work, which can help you spot ideas and analyses that stand out. It’s also a good opportunity to interact and network with members of the Kaggle community who have overlapping interests.
Kaggle has 3.5 million members contributing code and data. It’s always possible to find inspiration in other Kagglers’ work. (Image courtesy of Kaggle)
A good example of this is the Google Analytics dataset from the previous section. It’s accompanied by hundreds of approaches on how to analyze digital analytics data from the Kaggle community—including some from Kaggle grandmasters.
How do you find relevant marketing analyses on Kaggle?
- After selecting a dataset as described in the previous step, you’ll notice that there are several independent Notebooks associated with it. (Notebooks are discussed below in more detail.)
- Every Notebook represents an analysis that includes narrative, code, and output, such as visualizations and data tables with summary statistics.
- To get started, select the one with the highest number of upvotes, a sign of quality and approval from the community.
- If the analysis is indeed of high interest, it’s possible to “fork” the Notebook, thus generating a copy of both the code and data.
- Then, either run the script as is or make changes by creating your own version. An interesting option is to substitute the original author’s data with your own similar dataset before executing the code.
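Before swapping your own data into a forked Notebook, it helps to confirm that your file has the columns the original code expects. Here is a minimal sketch using only Python's standard library. The column names and the two tiny CSV samples are invented for illustration, and `header_of` and `compare_columns` are hypothetical helper names.

```python
import csv
import io


def header_of(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def compare_columns(original_csv, replacement_csv):
    """List columns the forked code expects but the new data lacks, plus extras."""
    expected = header_of(original_csv)
    provided = header_of(replacement_csv)
    missing = [c for c in expected if c not in provided]
    extra = [c for c in provided if c not in expected]
    return missing, extra


# Tiny made-up samples standing in for the author's file and your own export.
original = "date,channel,sessions,revenue\n2019-01-01,organic,120,340.5\n"
mine = "date,channel,sessions,cost\n2019-01-01,paid,80,55.0\n"

missing, extra = compare_columns(original, mine)
print("Missing columns:", missing)
print("Unexpected columns:", extra)
```

In this example the check flags `revenue` as missing and `cost` as unexpected, so the forked code would need a renamed or added column before it runs cleanly on the new data.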
3. Kaggle Notebooks: Access a powerful laptop on the cloud.
By now, you’ve selected a dataset and collected some good ideas from the Kaggle community to help you get started. As a next step, you’ll want to apply this to your own data.
What’s the most suitable place for all this to happen? An obvious option is your local desktop or laptop. Alternatively, you can go the Kaggle way by working with Kaggle Notebooks (previously known as Kaggle Kernels). This has benefits, especially in cases when:
- The dataset is several gigabytes in size and impractical to move around or load into local memory every time you analyze it.
- The task is computationally intensive, and you don’t want to slow down your laptop for the rest of the day.
- You’re planning to share your analysis with collaborators.
Let’s have a closer look.
Kaggle Notebooks contain code, computation, and narrative. Work with R, Python, and SQL code directly from the browser—no need to install anything.
Notebooks and computation
A Kaggle Notebook is essentially a powerful computer that Kaggle lets you access in the cloud. It used to be available only for use with public data during competitions. Recently, Kaggle started offering it for private projects at no cost and with the option to use private datasets.
Visually, Kaggle Notebooks look like Jupyter Notebooks, containing computation, code, and narrative—but they come with some nice extras:
- They’re equipped with processing hardware, CPUs and GPUs, for computationally demanding analyses. This processing power is useful if you have a lengthy computation or expect a high volume of data to be returned after an API call.
- They have 16 gigabytes of RAM, which can be used to fit large datasets into memory. (This is more than the average laptop has.)
- Notebooks have all the latest software libraries preinstalled, as well as versions of R and Python, the main programming languages for data science and analytics.
- You can attach one or more datasets to a Notebook in a single click, with a total size of up to 100 gigabytes.
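To get a feel for what 16 gigabytes of RAM means in practice, a back-of-the-envelope calculation is enough. The sketch below assumes a purely numeric table at 8 bytes per value and keeps half of the memory free as working room. Both figures are my assumptions, not Kaggle limits, and text-heavy data will usually need more.

```python
def estimated_gigabytes(n_rows, n_cols, bytes_per_value=8):
    """Rough in-memory size of a numeric table (assumes 8 bytes per value)."""
    return n_rows * n_cols * bytes_per_value / 1024**3


def fits_in_ram(n_rows, n_cols, ram_gb=16, headroom=0.5):
    """True if the table needs no more than the given share of available RAM."""
    return estimated_gigabytes(n_rows, n_cols) <= ram_gb * headroom


# Example: 50 million rows by 20 numeric columns.
size = estimated_gigabytes(50_000_000, 20)
print(f"Estimated size: {size:.1f} GB, fits: {fits_in_ram(50_000_000, 20)}")
```

If the estimate comes out too large, loading fewer columns or a sample of rows is usually the simplest way to shrink it.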
Notebooks and collaboration
You can share your analyses with colleagues—without the dreaded “but it works on my machine” scenario. When you share a private Notebook with your collaborators, they automatically access the same isolated computational environment, including the software libraries and version of the programming languages.
Thanks to Docker, the popular containerization technology, there’s no need to install or update software, and no risk of causing software conflicts.
As soon as your work is done, select public or private visibility for the Notebook and share it with collaborators. They can view and run the analysis interactively with one click, straight from their browser.
4. Kaggle cloud integrations: Get access to Google Cloud tech.
Working within the Kaggle environment acquaints you with cloud workflows. It also offers exposure to new tools and tech—opportunities to pick up new skills, many of which are vital to marketers and digital analysts.
This is thanks largely to integrations Kaggle has with BigQuery, BigQuery ML, and Google Data Studio.
I won’t discuss these integrations in great detail here—CXL has several sources (linked above) with detailed product walkthroughs. When it comes to how this works with Kaggle, the essence is that you can:
- Access data stored in BigQuery directly via Kaggle with some SQL code, then analyze it directly on Kaggle with R or Python.
- Build and evaluate regression and clustering models with BigQuery ML, without extensive knowledge of machine-learning frameworks.
- Load a dataset in Kaggle, shape it, and then—via the Data Studio connector—explore the data visually in the Data Studio interface or create dashboards to share with your team.
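For readers who do code, the first item above might look roughly like this in a Python Notebook. Treat it as a sketch under assumptions: the table `my-project.marketing.sessions` and its columns are hypothetical, and `run_query` relies on the `google-cloud-bigquery` client library plus a linked Google Cloud account, so it's defined here but not called.

```python
def build_revenue_query(table, start_date, end_date):
    """Build SQL that sums revenue per channel for a date range.

    The table and column names are hypothetical placeholders.
    """
    return (
        "SELECT channel, SUM(revenue) AS total_revenue\n"
        f"FROM `{table}`\n"
        f"WHERE event_date BETWEEN '{start_date}' AND '{end_date}'\n"
        "GROUP BY channel\n"
        "ORDER BY total_revenue DESC"
    )


def run_query(sql):
    """Run the SQL in BigQuery and return the result as a pandas DataFrame.

    Needs the google-cloud-bigquery package and Google Cloud credentials,
    so the import is deferred until the function is actually called.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.query(sql).to_dataframe()


sql = build_revenue_query("my-project.marketing.sessions", "2019-01-01", "2019-12-31")
print(sql)
```

Building the SQL as a plain string keeps the example short. For anything beyond experimentation, passing dates as query parameters is the safer habit.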
There’s also an integration with Google Sheets and a brand new one with Google AutoML (see the next section). I wouldn’t be surprised to see more integrations since Kaggle is now part of Google Cloud.
5. Machine learning with Kaggle: High-quality machine learning and AI with zero code.
Integration with Google’s AutoML was announced in November 2019. It deserves a section of its own because of its potential impact.
As a concept, AutoML isn’t entirely new, but making it accessible as a product en masse via Kaggle is a noteworthy development. The human expertise that’s required for machine-learning development is scarce, a fact often brought up as a bottleneck for the field.
AutoML can lower the barrier to entry for development of machine-learning applications in marketing. It allows marketers with a general understanding of the machine-learning process to use advanced, powerful AI models safely—and without needing to be programmers.
AutoML, which is now available on Kaggle, can also save massive amounts of time spent developing and testing a model manually (the typical case right now).
This won’t, of course, be “AI at the push of a button.” The marketer (or whoever applies AutoML) will need to understand the basics of the process. Unlike other features in Kaggle, its use may result in costs for computation.
In any case, AutoML is a hands-on way to get started with machine learning and AI for marketing, directly within Kaggle.
Conclusion
Kaggle doesn’t cover all aspects of a data and analytics workflow. It’s not the tool to develop production-level systems or store and manage all of your analysis code and artifacts. However, it’s a practical collaboration tool with which marketers can access relevant datasets, explore data, and get ideas to jump-start their analysis.
Computationally, it’s like a powerful, cloud-based laptop that’s always available for public or private projects. It’s also a bridge to many other cloud services provided by Google, such as BigQuery and Google Data Studio.
Last but not least, AutoML has the potential to be a game changer. It lowers the barrier to entry and empowers marketers to get directly involved in the development of AI and machine learning for projects.
Becoming familiar with Kaggle Notebooks, the Cloud integrations, and all the other elements of the Kaggle environment can make a future transition to a full-fledged AI platform, including Google’s AI platform, much easier.
The best way to get started? Explore the datasets and ways the Kaggle community has analyzed them. Try the Google Analytics revenue prediction dataset and analysis Notebooks, or the conversion optimization dataset with ROI analysis for Facebook marketing campaigns.
Happy Kaggling.