Part 1: Taming Google Scholar Data using Microsoft Fabric

Leverage the SERP REST API to efficiently query Google Scholar data with Fabric data pipelines


Introduction

Welcome to our blog post on taming Google Scholar data using Microsoft Fabric Data Pipelines. In this series of videos, we will guide you through a mini project where we leverage data pipelines in Microsoft Fabric to extract and transform data from Google Scholar. By the end of this journey, you will have a solid understanding of how to use data pipelines effectively and efficiently.

Longer term, we will leverage what we build here to fully integrate a Dataverse higher-ed academic unit manager app with Power BI reporting, a data warehouse, and perhaps some fun AI workflows in Teams (all using fake data). Microsoft Fabric tools like Data Factory will help orchestrate our data movement and integration activities.

Make sure you like and subscribe to my YouTube channel @ Chat JPT.

Prerequisite context and knowledge

This article and accompanying video series assume you have used Azure Data Factory before, are familiar with Microsoft Fabric, and understand where Data Factory and data pipelines fit in the data engineering process. If not, start with some context: https://learn.microsoft.com/en-us/fabric/data-factory/data-factory-overview. Better yet, I recommend going through the full hands-on walkthrough of Fabric's features first: https://learn.microsoft.com/en-us/fabric/data-factory/tutorial-end-to-end-introduction.

Getting Started

To begin, we'll start with a simple setup. In the first video, we'll walk you through the process of creating a data pipeline called "Get Author Articles" in the data factory persona. This pipeline will enable us to retrieve a list of authors from Google Scholar, extract their publications, and save them into our bronze layer. Let's dive in!

(My YouTube channel is new, as I am just beginning this journey. Please like and subscribe to help others find my content.)

Setting up the Data Pipeline

I won't transcribe the video here, since the steps are demonstrated in the embedded recording, but these are the general steps we walk through:

  1. Open the data pipeline workspace and click on "Copy Data" to launch the setup wizard.

  2. Choose the REST data source option, as we'll be using the SERPAPI to convert Google Scholar pages into JSON format. (These SERPAPI mentions are not affiliate links, and I am not being paid; I simply like the service and the support they give developers.)

  3. If you haven't already, create a new connection by entering the SERPAPI web address in the connection settings.

  4. Proceed to the next step and provide the relative URL: everything after the base web address and its trailing slash (see the sketch after this list).

  5. Select the option to export the data as JSON files.

  6. Now, let's set the data destination. We'll store the retrieved data in our lakehouse, specifically in the "bronze" folder under "pubs".

  7. Save the configuration and run the pipeline. You can monitor the progress in the output section.
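
If you want to sanity-check these REST settings outside of Fabric, here is a minimal Python sketch of the same request the Copy Data activity issues. This is an illustration under assumptions, not pipeline code: the author ID and API key are placeholders, and the query parameters follow SerpApi's Google Scholar Author engine.

```python
import requests

# Placeholders: substitute your own SerpApi key and a real
# Google Scholar author ID before running.
BASE_URL = "https://serpapi.com/"   # the connection's web address
RELATIVE_URL = (                    # everything after the trailing slash
    "search.json"
    "?engine=google_scholar_author"
    "&author_id=AUTHOR_ID"
    "&api_key=YOUR_SERPAPI_KEY"
)

response = requests.get(BASE_URL + RELATIVE_URL, timeout=30)
response.raise_for_status()

# This JSON document is what the pipeline lands as a file in the
# lakehouse destination from step 6.
profile = response.json()
print(profile.get("author", {}).get("name"))
```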

Enhancing the Pipeline

In the next video, we'll take things up a notch by introducing more complexity to the pipeline. We'll demonstrate how to loop through a list of author IDs, fetch their profiles from the SERPAPI, and save them to our lakehouse. This step will allow us to retrieve data for multiple authors efficiently.

To achieve this, we'll make use of the relative URL field mentioned earlier. By composing the full query in that field dynamically, we can pass in different author IDs and pull each profile in turn, as sketched below. Stay tuned for this exciting step!
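
For intuition, here is a rough, hypothetical Python analogue of what that loop will do. In the actual pipeline, a lookup supplies the author list and a ForEach activity with a parameterized relative URL plays the role of this loop; the author IDs below are made-up placeholders.

```python
import requests

BASE_URL = "https://serpapi.com/"
API_KEY = "YOUR_SERPAPI_KEY"  # placeholder

# Hypothetical author IDs; in the pipeline these come from a lookup.
author_ids = ["AUTHOR_ID_1", "AUTHOR_ID_2"]

for author_id in author_ids:
    # Rebuild the relative URL for each author, mirroring how the
    # pipeline parameterizes the relative URL field.
    relative_url = (
        "search.json?engine=google_scholar_author"
        f"&author_id={author_id}&api_key={API_KEY}"
    )
    profile = requests.get(BASE_URL + relative_url, timeout=30).json()
    # The pipeline would land each JSON document in the lakehouse;
    # here we just confirm a profile came back.
    print(author_id, profile.get("author", {}).get("name"))
```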

Conclusion

Congratulations! You've completed the first part of our series on taming Google Scholar data using Microsoft Fabric Data Pipelines. In this blog post, we introduced you to the concept of data pipelines and guided you through the initial setup. We hope you found this information helpful and are excited to continue this journey with us.

In the next video, we'll begin to make our pipeline a LITTLE more dynamic to set the stage for looping through a list of authors and pulling all of their data from Google Scholar to our Microsoft Fabric Lakehouse. The playlist will probably advance faster than this blog, so subscribe and watch there first for updates.

Remember, the key to success is to start simple and gradually build on complexity. With each step, you'll gain more confidence and skills in working with data pipelines. So, hang in there and keep learning!

See you in the next video!