Challenges Turned Learning Opportunities (A Short Story)

Nancy Amandi
11 min read · Mar 9, 2023
Image by Author

I was thinking of a better title for this story I’m about to tell you, but “Challenges Turned Learning Opportunities” was the best I could come up with. I really hope it does the story justice.

This story is about how I worked on a tiring project for as long as 62 days (Dec. 28th, 2022 to March 6th, 2023) because of the different hurdles I faced, but came out strong with these skills in the bag:

  1. Python
  2. PySpark
  3. Knowledge of the AWS services required for data analysis
  4. Geopy
  5. Setting up Dataproc clusters on Google Cloud
  6. Linux
  7. Apache Sedona
  8. SQL
  9. PostGIS
  10. Amazon S3
  11. Amazon Redshift
  12. Fivetran

I know you’re finding it hard to believe. That’s why I want to tell you the story.

So hold my hand and let me lead you through my journey, shall we?

It all started when I wanted to work on my fourth official data project.

My aim was to follow one of Mrs Jessica Ayodele’s posts about showcasing SQL skills by querying data from the cloud (specifically BigQuery). So I targeted data in the TransTech (Transportation Technology) industry.

I hadn’t even signed up on Google Cloud yet and was already feeling myself.

Image by makeameme.org

It was time to begin my newfound project idea, so I tried signing up on Google Cloud, and my card didn’t work. Well, it’s just one card I’ve tried, I said to myself. Let’s try others.

I tried my second card. It didn’t work.

Got a virtual card on Kuda and tried it. It didn’t work.

Tried my friend’s Africard. It didn’t work.

Tried my friend’s Chipper Cash Card. It still didn’t work.

Then I started freaking out. I hadn’t even started yet, and it already felt like I needed to rethink whether I should continue at all.

GIF by equipdefrance on giphy.com

But I didn’t want to give up so I ran to my mentor, Mr Ayo Samuel. He advised me to use Amazon Redshift if BigQuery wasn’t allowing me to sign up.

I checked for public datasets on AWS and I couldn’t find any. At that point, I knew that I was about to see more shege.

Getting the Data Outside AWS

Getting the data wasn’t hard, actually. All I did was search for “Uber public dataset” on the internet, and I found some on Kaggle. The data was divided into different tables according to their respective months and also included datasets from other transportation companies.

Here’s the link to where I found them.

Knowing the Problem Statement

It was a little bit difficult to define the problem statement because I didn’t have any context for the datasets. When I viewed them on Kaggle, I noticed they were from different companies and some of them had the same kind of columns.

Hence, I went with the idea of identifying trends while comparing the different transportation companies. This problem statement wasn’t entirely clear to me, but I ran with it anyway.

Learning Python

Before this project, the only tools I knew were Excel and SQL. I could say I knew Python too. Buuuuttttt…

Okay, let me explain.

I learnt the basics of Python in April 2022. And this wasn’t Python for data science. Just the basic syntax of Python: variables, for and while loops, etc.

Seeing that the data I wanted to handle was too big for Excel, I knew I had to go back to learning Python, and this was already December 2022.

I couldn’t even remember a single line of syntax, but DataCamp came to my rescue. I gave myself five days to learn Python for Data Science. Three modules per day. Imagine the audacity😂.

I went straight into it. The second day was almost over and I was still in module three. I looked at the curriculum again and noticed that I didn’t need to learn everything to clean the data I had, so I resolved to stop at module four, dragging my brain to reach that finish line before the second day ended.

Fortunately, I succeeded.

Cleaning the Data

Cleaning the data with Python wasn’t hard after taking the course. I had handled the necessary things, but there was a problem.

Another problem!

Image by makeameme.org

One of the columns in some of my datasets contained readable house addresses, while the other datasets had columns containing longitudes and latitudes instead.

I needed to unify them if I really wanted to analyze the data but how??

I thought and thought and thought. The process of thinking almost split my brain in half, besides the two halves that already existed in there.

I finally got the answer. I just needed to extract their cities and countries and I was good to go. But how??

I started researching thoroughly and came across the terms “geocoding” and “reverse geocoding”.

Geocoding turns addresses into longitudes and latitudes, while reverse geocoding turns longitudes and latitudes back into addresses. I needed to do the latter to some of the datasets.

The total number of datasets I selected was 14, and 7 of them had readable addresses. When I studied them, I noticed the addresses were mostly in New York City, so I knew I needed a combination of a def function and if conditions to get what I needed.

But I still had a lot of work ahead of me.

Understanding New York City Addresses

I had to go the extra mile to research how addresses in New York City worked and what I found wasn’t what I expected. Here’s a list of my findings:

  1. New York City (NYC) has five boroughs: Manhattan, Queens, Bronx, Brooklyn, and Staten Island.
  2. Bronx addresses could have either “Bronx” or “BX” in them.
  3. Manhattan, also known as New York County, could have “MN”, “New York”, “NYC”, or “NY” in the addresses.
  4. Brooklyn, also known as Kings County, simply had “Brooklyn” or “BK” in the addresses.
  5. Queens addresses had “QN”, “QNs”, or “-” in them.
  6. Staten Island, also known as Richmond County, had “Staten Island” or “SI” in them.

Yes, it was this complicated. And I had a gut feeling it was even more complicated than this, but I couldn’t find further information, so I moved on to working on those 7 of my 14 datasets, extracting the counties from them based on these conditions.

Writing working code was a bit messy, but I did it anyway. The code ran for 3 hours on 800,000+ rows.
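For context, here’s a minimal sketch of the kind of function I wrote. The file and column names (like pickup_address) are placeholders I’m using for illustration, not my exact ones:

```python
import pandas as pd

def extract_borough(address):
    """Map a raw NYC address string to one of the five boroughs,
    using the markers from my research above."""
    if pd.isna(address):
        return None
    addr = address.upper()
    if "BRONX" in addr or "BX" in addr:
        return "Bronx"
    if "BROOKLYN" in addr or "BK" in addr:
        return "Brooklyn"
    if "STATEN ISLAND" in addr or "SI" in addr:
        return "Staten Island"
    if "QN" in addr or "-" in addr:   # covers "QN", "QNs" and the dash pattern
        return "Queens"
    if "NEW YORK" in addr or "NYC" in addr or "NY" in addr or "MN" in addr:
        return "Manhattan"
    return None

# Applying it row by row is part of why this version took hours on 800,000+ rows.
df = pd.read_csv("uber_trips.csv")                      # placeholder file name
df["borough"] = df["pickup_address"].apply(extract_borough)
```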

When my mentor found out about the 3-hour runtime via my WhatsApp status, he advised that I learn PySpark to make my code run faster.

This was my first time learning PySpark. Another challenge had met me just when I was recovering from the last one.

Learning Pyspark

I don’t know if it’s right to say it was hard or easy for me to learn because honestly, I don’t even know. All I know is, it took me a day to pick up the syntax from a course. I even bagged a certificate from the course that day.

And it took me another day to understand that there were Spark in Scala, PySpark, and Spark SQL. That day, I also tried to find out their differences, and to understand what it means to run PySpark on a local computer versus on clusters. I’m not sure I got the whole gist at the time because my head started spinning.

Well, I started working with PySpark on my local computer, converting my Python code to PySpark. What took Python 3 hours to process took PySpark about 30 minutes.
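If you’re curious what the translation looked like, here’s a rough PySpark sketch of the same logic, again with placeholder file and column names. Swapping the row-by-row apply for Spark’s built-in column functions is a big part of where the speed-up came from:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-boroughs").getOrCreate()

# Placeholder file and column names, just to show the shape of the code.
df = spark.read.csv("uber_trips.csv", header=True, inferSchema=True)
addr = F.upper(F.col("pickup_address"))

df = df.withColumn(
    "borough",
    F.when(addr.contains("BRONX") | addr.contains("BX"), "Bronx")
     .when(addr.contains("BROOKLYN") | addr.contains("BK"), "Brooklyn")
     .when(addr.contains("STATEN ISLAND") | addr.contains("SI"), "Staten Island")
     .when(addr.contains("QN") | addr.contains("-"), "Queens")
     .when(
         addr.contains("NEW YORK") | addr.contains("NYC")
         | addr.contains("NY") | addr.contains("MN"),
         "Manhattan",
     )
     .otherwise(None),
)
```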

I was very excited.

7 out of my 14 datasets were good to go. What about the other 7 that had longitudes and latitudes to be converted to addresses?

Getting to know and use Geopy

I researched Python libraries that could reverse geocode and found Geopy. I took a subset of just 100 rows of my data, tested my code using Geopy, and it worked. I also wrote about it here.
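Here’s a minimal sketch of that 100-row Geopy test, assuming the coordinates sit in placeholder columns called lat and lon:

```python
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Nominatim is Geopy's free geocoder; it expects a descriptive user agent
# and at most about one request per second, hence the rate limiter.
geolocator = Nominatim(user_agent="uber-reverse-geocoding-test")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

df = pd.read_csv("uber_sample_100.csv")   # placeholder: the 100-row subset

def to_city(row):
    location = reverse((row["lat"], row["lon"]), language="en")
    if location is None:
        return None
    address = location.raw.get("address", {})
    # The locality can come back under different keys.
    return address.get("city") or address.get("town") or address.get("county")

df["city"] = df.apply(to_city, axis=1)
```

At roughly one request per second, 100 rows finish in a couple of minutes, which already hints at why the same approach crawled on the full dataset.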

I tried to use the same code on my main dataset, this time with PySpark, and it was going on 6 hours and I still hadn’t gotten my output.

At that point, I knew another challenge had hit me.

GIF by Robert on tenor

Setting up Dataproc Clusters on Google Cloud

I think I was getting exhausted at this point. I ran to the Young Data Professionals Community for help, and Mr Ability asked that I set up a call so we could run my code on the cloud instead of my local computer to increase the speed.

We did that, and he showed me how to set up Dataproc clusters on Google Cloud, but our time was limited, so we couldn’t run the code and would need to start all over again the next day.

The next day, he told me to set it up while he watched, and I did it successfully. We uploaded the datasets and ran the code, but it was still taking too long.

We did this back to back from 2 am every single day because of our limited time, and the code wouldn’t finish within that window. So I thought it could be that my code wasn’t well optimized.

Learning to Optimize my Pyspark Code

I read several articles on optimizing my PySpark code and learnt a whole lot. Things started getting clearer. I learnt about the following (there’s a rough sketch of these changes right after the list):

  1. Lazy evaluation, which made me stop using the .show() action after each transformation.
  2. The importance of caching my results and the different ways to do it.
  3. Partitioning my datasets.
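Here’s roughly what those changes looked like in practice, as a hedged sketch with placeholder file names and partition counts rather than my exact code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimized-cleaning").getOrCreate()

df = spark.read.csv("uber_trips.csv", header=True, inferSchema=True)  # placeholder file

# Lazy evaluation: chain the transformations and drop the .show() calls,
# so Spark builds one plan instead of materializing results at every step.
cleaned = (
    df.dropna(subset=["pickup_address"])
      .withColumnRenamed("pickup_address", "address")
)

# Partitioning: spread the rows evenly across the available cores/executors.
cleaned = cleaned.repartition(8)

# Caching: keep this result around because later steps reuse it.
cleaned.cache()

# A single action at the end triggers the whole (lazily built) plan.
cleaned.write.mode("overwrite").parquet("cleaned_trips.parquet")
```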

With this knowledge, I optimized the code and we set up a meeting again to rerun it on Google Cloud, but it still took longer than the limited time we had.

Now I knew I had to use another tool.

Knowing Apache-Sedona

I found out that Apache Sedona was a tool built to transform geospatial data, so I opted to use it. Unfortunately, the installation didn’t work on my Windows system, so I had to use a Linux environment.

Before installing it, I had already gone through the documentation on how to reverse geocode my dataset by joining it to a reference open-source dataset. I also learnt some functions like ST_Contains, ST_Transform, etc.
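I never got to run it, but from what I read in the documentation, the spatial-join idea looks roughly like this. Treat it as a sketch: the table, column, and file names are placeholders, and it assumes the Sedona jars are already on Spark’s classpath:

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("sedona-reverse-geocode").getOrCreate()
SedonaRegistrator.registerAll(spark)   # registers the ST_* functions for Spark SQL

# Placeholder inputs: trip points with lon/lat, and reference polygons stored as WKT.
spark.read.csv("trips.csv", header=True, inferSchema=True).createOrReplaceTempView("trips")
spark.read.csv("boundaries.csv", header=True).createOrReplaceTempView("boundaries")

# Attach each point to the boundary polygon that contains it.
result = spark.sql("""
    SELECT t.*, b.name AS borough
    FROM trips t
    JOIN boundaries b
      ON ST_Contains(ST_GeomFromWKT(b.wkt_geom), ST_Point(t.lon, t.lat))
""")
```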

Knowing and Using Linux

At first, I was scared that my computer was going to get damaged if I installed Linux on it, because of the stories I’d heard. I swallowed a heavy load of saliva and resolved to buy another system if this one got spoilt. Not like I had the money, though.

I carefully followed a guide online to install Ubuntu on my system, and it was successful. I then battled with installing a Python version different from the default one and succeeded at that too.

I also installed Apache Sedona, but when I tried to use it, I got an error.

I uninstalled Apache Sedona and reinstalled it, carefully following the guide in the documentation, but I was still getting the error whenever I tried to use it.

Sighs! I got frustrated and cried my eyes out. I was very, very tired already. But seeing that I had already spent close to two months on this project, I didn’t want to back down yet, so I researched other tools and found out about PostGIS.

Knowing and Using PostGIS

I researched whether reverse geocoding could be done using PostGIS, and it was possible, but I had to join my dataset to a reference dataset, just like with Apache Sedona.

It took me time to choose between the TIGER and OSM datasets, but I went for TIGER because it was more detailed.

I installed the PostGIS extension in my PostgreSQL database and started researching the different geospatial functions to use, where I found out about more functions like ST_SetSRID, ST_MakePoint, ST_Distance, etc.
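To give you an idea, here’s a sketch of the kind of reverse-geocoding join this involves in PostGIS. The table and column names are placeholders, and it assumes the reference geometries are stored in SRID 4326 (plain longitude/latitude):

```python
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(dbname="uber", user="postgres", password="...", host="localhost")
cur = conn.cursor()

# Join each trip point to the reference polygon that contains it.
# trips(id, lon, lat) and boundaries(name, geom) are placeholder tables.
cur.execute("""
    SELECT t.id, b.name AS borough
    FROM trips t
    JOIN boundaries b
      ON ST_Contains(
           b.geom,
           ST_SetSRID(ST_MakePoint(t.lon, t.lat), 4326)
         );
""")
rows = cur.fetchall()
conn.close()
```

A nearest-neighbour join with ST_Distance is the usual alternative when a point doesn’t fall neatly inside any polygon.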

Surprisingly, my journey with PostGIS was really smooth. Before the end of the day, I was looking at my reverse geocoded dataset and I just couldn’t believe my eyes.

I had finally overcome! I even wrote an article about it here.

GIF by Reddit on giphy

But I wasn’t done yet. At least most of the job was done, though.

Taking a Tough Decision

When the other 7 of my datasets were reverse geocoded, they included several locations aside from the five boroughs of New York City. If I merged these with my other 7 datasets, there wouldn’t be consistency and my analysis would be flawed.

So I made the decision to drop the other datasets and go with the 6 Uber datasets that had successfully been reverse geocoded. This made me revise my problem statement to studying trends in the data as a data analyst at Uber.

It was time to load my data to Amazon Redshift.

Choosing between Airflow and Fivetran

I was contemplating the right tool to use to move my data from PostgreSQL to Amazon Redshift so I went back to asking the Young Data Professionals Community. Most of them advised that I use Airflow.

However, the idea of learning a new tool was already exhausting for me so I went with Fivetran.

Setting up Fivetran was difficult. I needed a public IP address, which I didn’t have, so I opted to export my data to a CSV file since it had just 4M+ rows. Then I loaded it to Amazon S3 before loading it into Amazon Redshift.
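For anyone curious about that route, it looks roughly like this. Everything here (bucket, table, cluster endpoint, and IAM role) is a placeholder:

```python
import boto3
import psycopg2

# Upload the exported CSV to S3.
s3 = boto3.client("s3")
s3.upload_file("trips_cleaned.csv", "my-project-bucket", "uber/trips_cleaned.csv")

# Load it into Redshift with a COPY command (Redshift speaks the Postgres protocol).
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()
cur.execute("""
    COPY trips
    FROM 's3://my-project-bucket/uber/trips_cleaned.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV IGNOREHEADER 1;
""")
conn.commit()
conn.close()
```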

Using Amazon Redshift

Oh, using Amazon Redshift was sweeeeet. I was tired of going through articles, so I resorted to finding my way around on my own, and while doing that, I discovered new things.

I successfully loaded and queried my data in Amazon Redshift. I made a YouTube video on this, so you can check it out here.

Let’s Recap

Heaves a sigh of relief! Sorry if this was longer than you expected. I know you might have expected me to write “A Long Story” in my title but this is the irony of my project.

I saw myself finishing this project in one month or less, but it took 62 days and here I am. I would say it was worth it, though, because over this period, I’ve gathered these skills:

  1. Python
  2. PySpark
  3. Knowledge of the AWS services required for data analysis
  4. Geopy
  5. Setting up Dataproc clusters on Google Cloud
  6. Linux
  7. Apache Sedona
  8. SQL
  9. PostGIS
  10. Amazon S3
  11. Amazon Redshift
  12. Fivetran

Thanks to one of my huge role models in the data space, Mr Ayoade, for encouraging me to make a YouTube video on how I used Amazon Redshift.

Here are the project documentation, the video I made, and the articles I wrote during the course of this journey.

Connect with me on Twitter, LinkedIn and GitHub.
