PyDev of the Week: Kevin Kho

This week we welcome Kevin Kho as our PyDev of the Week! Kevin is a core developer on the Fugue package. You can catch up with Kevin on Medium where Kevin writes about Fugue, Python, and more. You can also see what projects Kevin is working on over at GitHub.

Let’s spend some more time getting to know Kevin better!

Can you tell us a little about yourself (hobbies, education, etc):

I grew up in Manila, Philippines. Both of my grandfathers immigrated to the Philippines from China, so I am Filipino-Chinese. I did all my pre-college schooling there and came to the US for college. I studied Civil Engineering at the University of Illinois at Urbana-Champaign (both Bachelor’s and Master’s), with a focus on water resources. Volunteering has been a big part of my life, so I was heavily involved in my school chapter of Engineers Without Borders (for water and bridge projects). When I became a data scientist, I started volunteering for DataKind.

Professionally, I was a data scientist for four years across two companies, and then I joined Prefect, a Python-based workflow orchestrator. I was there for just over a year before I left to work more on the open-source Fugue project. I currently contract part-time for Citibank around distributed computing tooling.

I love watching and playing basketball, but am mostly a homebody. If I watch a show, it’s likely to be an anime or Korean drama. I used to play more computer games (mainly DOTA and League of Legends), but I can’t keep up anymore. Since COVID, I have been infected by the keyboard bug and spend time assembling and tinkering with keyboards. It’s an expensive hobby, though!

Why did you start using Python?

The short answer is I wanted to go into data science, so I came across it when self-studying. There is a longer story around that.

March 2016 was when AlphaGo went against Lee Sedol in a five-game series. I stayed up until 3 am or 4 am those nights watching the games. I don’t even play Go and only know the basic rules, but this event inspired me to pursue machine learning. I had no idea where to start and wasn’t even a heavy coder, but it did get me looking into data science.

In May 2016, I graduated with my master’s degree. I interviewed for a couple of civil engineering jobs, but it didn’t go so well because I wanted a coding component to my job. At this point, I had been doing research with the US Geological Survey (USGS) for one year, with a lot of work in R. I decided to try to see if I could self-study and break into data science myself.

Over the following six months, I took a bunch of Coursera courses around Python, machine learning, and algorithms, and then I got my first data science job at the start of 2017. My first job primarily did things in R, so I didn’t get to use Python professionally until late 2019 when I changed jobs.

What other programming languages do you know, and which is your favorite?

I don’t know a lot, especially because I frequently don’t finish courses if I don’t have a use case.

MATLAB and C were the CS 101 requirement in my college program. I know R and Python well and have tried out JavaScript and Java. I definitely like Python the most because it’s very accessible yet versatile enough to do most things you need. It’s incredible how quickly people new to programming can learn it, while it remains capable of advanced use cases with packages like PyTorch or Dask.

What projects are you working on now?

I am primarily working on Fugue. Fugue takes SQL, Python, and Pandas code and scales it to Spark, Dask, and Ray. We make big data projects easier to develop and maintain. One of the problems with distributed computing is that the code is coupled with the infrastructure. If you write Spark code, it needs to run on the Spark engine. Fugue decouples the business logic and execution. This lets users develop on their local machine, and then bring it to the cluster just by specifying the backend.
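To make the idea of decoupling business logic from execution concrete, here is a minimal conceptual sketch using only the Python standard library. It is not Fugue's API: the `run` dispatcher, the `engine` names, and the `add_total` function are all hypothetical and exist only for illustration. The point is that the logic is written once and the backend is picked by a single argument.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]


def add_total(row: Row) -> Row:
    """Business logic: knows nothing about how or where it will run."""
    return {**row, "total": row["price"] * row["qty"]}


def run(fn: Callable[[Row], Row], rows: Iterable[Row], engine: str = "local") -> List[Row]:
    """Hypothetical dispatcher (not a Fugue function): apply fn on the chosen backend."""
    rows = list(rows)
    if engine == "local":
        # Plain loop, convenient for developing and testing on a laptop.
        return [fn(r) for r in rows]
    if engine == "threads":
        # Stand-in for a "bigger" backend; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(fn, rows))
    raise ValueError(f"unknown engine: {engine}")


rows = [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 4}]
local_result = run(add_total, rows)                      # develop locally
threaded_result = run(add_total, rows, engine="threads")  # same logic, different backend
```

Both calls return the same rows; only the execution backend differs, which is the property described above.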

More recently, we added BigQuery as a SQL backend, so we are interested in having combinations like BigQuery-Ray or Snowflake-Spark. Connectivity is our focus so that users can utilize the optimal combination of different tools depending on the task.

Which Python libraries are your favorite (core or 3rd party)?

I’m only going to mention libraries I have not been significantly involved in.

docker-py – from a code standpoint, I found it interesting because of the organization and mixins.
whylogs – I believe they are laying building blocks that will redefine data validation.
pyswmm – it’s inspiring to see Python being more widely adopted in civil engineering.

How did you get involved with the Fugue project?

I saw the main Fugue author, Han Wang, present at the Databricks and AI Summit. I reached out to him immediately afterward because I thought it could solve some problems we had at work. We had small data projects using Pandas, and big data projects using Spark, but we were implementing the same business logic twice. One version for Spark and one version for Pandas. I wanted to consolidate that with Fugue.

I was expecting just to be an end-user, but then I talked to Han and got involved. It has been a lot of work, but I am also heavily inspired by the other open-source developers I have met through it.

What are your top three features of Fugue?

1. Incremental adoption – users frequently only need to scale out one expensive step of their pipeline. For example, maybe you want to train ten machine learning models and bring the training time down by running them in parallel on a cluster. Fugue can distribute a single step because it’s non-invasive, and you can leave the rest in Pandas. Actually, one of the cool things Fugue does is read the type hints and comments to perform conversions. If users choose to move off Fugue, these just stay as helpful comments. Example here: https://fugue-tutorials.readthedocs.io/tutorials/beginner/schema.html#defining-schema

2. Interoperable SQL and Python. SQL code tends to be a second-class citizen, often invoked in-between Python code. FugueSQL elevates SQL as a first-class interface, so SQL can be the one invoking Python instead. SQL lovers can now utilize distributed backends like Spark and Dask without learning framework-specific code because of added keywords like LOAD, SAVE, PERSIST, PREPARTITION. Both the SQL and Python interfaces of Fugue can be used independently and are equivalent.

3. Easily extensible. Fugue can scale Python code, or it can be used as a backend by existing libraries that need to scale. For example, whylogs, pycaret, and statsforecast can all use Fugue as a backend to scale to Spark, Dask, and Ray. These open-source maintainers benefit from not having to maintain three separate implementations to support all distributed backends.
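As a rough illustration of the second point, SQL being the side that invokes Python, here is a small sketch using SQLite from the standard library. This is not FugueSQL and does not use its keywords; it only shows the general pattern of registering a Python function and calling it from inside a query. The table, the data, and the `to_fahrenheit` function are made up for the example.

```python
import sqlite3


def to_fahrenheit(celsius: float) -> float:
    """Ordinary Python function that the SQL query will call."""
    return celsius * 9 / 5 + 32


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL-callable name taking one argument.
conn.create_function("to_fahrenheit", 1, to_fahrenheit)

conn.execute("CREATE TABLE readings (city TEXT, celsius REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Manila", 30.0), ("Chicago", 0.0)],
)

# The query drives the computation; Python is invoked once per row.
result = conn.execute(
    "SELECT city, to_fahrenheit(celsius) FROM readings ORDER BY city"
).fetchall()
conn.close()
```

Here the query is the entry point and Python is the helper, which is the inversion of roles described above, just on a single-machine database instead of a distributed backend.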

Is there anything else you’d like to say?

1. Contributing to open-source is a lot easier than people think. You can always start with smaller issues, and if there are none, documentation and tutorials are always helpful and appreciated. Don’t hesitate to reach out to project maintainers (especially smaller maintainer teams). They will likely appreciate it.

2. Keeping an open mind – it’s very common for data scientists to completely avoid SQL. There are debates on Python vs. SQL, and I genuinely don’t understand this because they can be used powerfully together (enabled by Fugue, but even without it). Data practitioners can be very set in their ways for some reason and excessively in love with their tooling (see the R vs. Python debates). I don’t think these debates matter as much as social media would lead you to expect.

Thanks for doing the interview, Kevin!