Diego Hernández Jiménez

Welcome to my personal website! Here I share some of my little projects.

My first internship: data mining in a real-world setting

Even though I’ve played with real data sets on other occasions, I had never had the chance to work on a real project. But in September 2021 things changed: I started working as an intern at an engineering company called GRUPO STIN. My role wasn’t extremely well defined; most of the time I was doing data analysis, but I also designed and ran field experiments, which sounds more data-scientist-ish to me. The project I was involved in is related to smartwatch data, but I cannot say much about it due to the non-disclosure agreement I signed. That doesn’t really matter, though. What I would like to share are some of the things I learnt, or that I consider most relevant about the experience.

The importance of data preprocessing

This is something I already knew: data cleaning, data integration and the like can easily take up half or more of the time spent on a project. I can now say that’s not an overestimation. When the data you’re going to use comes from different sources and is gathered by far-from-reliable devices (in this case smartwatches) in far-from-perfect conditions, you end up devoting quite a lot of time to all kinds of data preprocessing. In my case I had to deal with very noisy data, full of outliers and “strange” values. Data integration was also crucial: I constantly needed to combine different types of data arriving in different file formats. But all that time is worth it. Data preprocessing is a step you’ll revisit every time you receive new data, but once you have identified and dealt with the main issues the first time, the process becomes more automated.
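
To make this a bit more concrete, here is a minimal sketch (with pandas) of the kind of cleaning and integration step I mean. The file names, columns (`heart_rate`, `session_id`, `timestamp`) and thresholds are invented for illustration; this is not the actual project data or pipeline.

```python
import pandas as pd

# Hypothetical example: combine sensor readings exported as CSV with
# session metadata stored as JSON, then filter obvious outliers.
readings = pd.read_csv("readings.csv", parse_dates=["timestamp"])
sessions = pd.read_json("sessions.json")

df = readings.merge(sessions, on="session_id", how="left")

# Drop physically implausible values first.
df = df[(df["heart_rate"] > 30) & (df["heart_rate"] < 220)]

# Flag and remove remaining extreme values with a simple z-score rule.
z = (df["heart_rate"] - df["heart_rate"].mean()) / df["heart_rate"].std()
df = df[z.abs() < 3]
```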

Advice to myself (or anyone interested): Spend some time on data cleaning and examine your data carefully before doing any kind of predictive or inferential analysis.

The importance of data exploration

Exploratory Data Analysis (EDA) is also one of those steps in which it’s worth investing a decent amount of time. That was especially true in my case, because I was working on a novel project and little was known about what to expect from the data. “Fortunately” for me, I’m still a rookie in the machine learning area, so I wasn’t especially eager to jump straight to some fancy algorithm, and I didn’t overlook EDA.

Advice to myself (or anyone interested): Don’t forget to do EDA; it’s sometimes surprising how much information you can get from some boxplots and descriptive analysis.
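
As a small illustration of how little it takes to get started, here is a sketch of the kind of quick first look I mean, assuming a hypothetical cleaned CSV (the file name and columns are made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned dataset; any tabular data works the same way.
df = pd.read_csv("clean_data.csv")

# Quick numerical summary: counts, means, quartiles, extremes.
print(df.describe())

# One boxplot per numeric column, to spot skew and outliers at a glance.
df.select_dtypes("number").boxplot()
plt.show()
```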

The importance of visualization

Again, a topic mentioned many times, but for a reason. I think visualizations and plots are among the most powerful ways to convey complex information, both to experts and non-experts. Based on my experience, I would go further: sometimes they are the only, or at least the most useful, way to extract knowledge from data. In my case, for example, we were interested in analyzing a specific pattern in the data, but there was no (to my knowledge) appropriate statistical technique for it. However, there was a way to put the data into a plot and do some “visual analytics”.

Advice to myself (or anyone interested): It can be hard with multidimensional data, but always try to make some plots and display the information graphically. They can make you notice things that “plain numbers” and tables can’t, and they make the communication of results easier.
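
For what it’s worth, here is a sketch of the kind of plots I have in mind for multidimensional data, using seaborn. The dataset and the column names (`heart_rate`, `timestamp`, `subject`) are hypothetical; the point is the idea of pairwise views and small multiples, not the specific analysis from the project.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("clean_data.csv", parse_dates=["timestamp"])  # hypothetical

# Pairwise scatter plots and histograms: a cheap way to "see" a
# multidimensional dataset and spot relationships or clusters.
sns.pairplot(df.select_dtypes("number"), corner=True)
plt.show()

# Small multiples over time, one panel per subject, to eyeball the kind
# of temporal pattern that may be hard to test formally.
sns.relplot(data=df, x="timestamp", y="heart_rate",
            col="subject", col_wrap=4, kind="line", height=2)
plt.show()
```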

The importance of versatility

Most tasks in a data science workflow, from data cleaning to data transformation and statistical analysis, can be done entirely with one tool, like Python (assuming you don’t need to manage Big Data and SQL databases, which was my case). Nonetheless, it can be very beneficial to know other programming languages and software tools, because some tasks are better suited to certain languages. For example, R works great if you need inferential statistics and linear models. You can use Python too, of course, but in my opinion R is just better and offers more options there. Sometimes you can even use SPSS or JASP if you need to get some p-values quickly. That was precisely my case: almost all the time I was using Python, but I switched to R or even other statistical packages when I thought I needed to. And I believe I worked faster that way, because I didn’t need to learn how to do X in Python when I already knew how to do it in R, or vice versa.
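
As an example of what I mean by inferential statistics in this context: when I did stay in Python, statsmodels covers much of the classic output you get from lm() in R. This is just a generic sketch; the file and the variables (`reaction_time`, `condition`, `age`) are invented.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment.csv")  # hypothetical experiment results

# R-style formula interface: fits an ordinary least squares model and
# prints coefficients, standard errors, t statistics and p-values,
# similar to summary(lm(...)) in R.
model = smf.ols("reaction_time ~ condition + age", data=df).fit()
print(model.summary())
```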

Advice to myself (or anyone interested): I don’t think it’s bad to stick with one programming language, but be flexible and learn others; it can come in very handy.

The importance of organization

This is something I admit I could have done better. I worked mainly with JupyterLab, and because I was always exploring new data, testing interesting functions, making prototypes and so on, I ended up generating a lot of files and folders. I managed to find everything I needed when I needed it, but it was a bit chaotic.

Advice to myself (or anyone interested): If you are responsible for storing the data and the files generated by your analysis, organize the folder dedicated to the project. Don’t improvise and create folder after folder just because you don’t know where to save your data/notebooks/scripts (like I did). Remove everything you don’t need.
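
For reference, this is roughly the layout I would aim for next time. It’s just one possible convention, not something we actually used in the project:

```
project/
├── data/
│   ├── raw/          # original exports, never modified
│   └── processed/    # cleaned, merged datasets
├── notebooks/        # exploratory notebooks, numbered or dated
├── scripts/          # reusable cleaning and analysis code
├── results/          # figures and exported tables
└── README.md         # what lives where, how to reproduce
```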

(extra) The importance of efficiency and scalability

The project I participated in was a proof of concept (POC), like a pilot study, so I didn’t really work with huge data sets, but I still got a taste of the challenges Big Data can create. At one point I had to discard an algorithm I was considering because I ran into memory issues when I tested it on just a sample of the data. At that moment I realized that sooner or later I would need to use cloud services to move to the “next level”.
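
For what it’s worth, here is a sketch of one common workaround for that kind of memory issue: streaming the file in chunks and aggregating incrementally instead of loading everything at once. The file and column names are invented, and this is not the algorithm from the project.

```python
import pandas as pd

# Read a large CSV in chunks and accumulate per-subject sums and counts,
# so the full dataset never has to fit in memory at once.
totals = {}
for chunk in pd.read_csv("big_readings.csv", chunksize=100_000):
    # Downcasting numeric columns also helps reduce memory per chunk.
    chunk["heart_rate"] = pd.to_numeric(chunk["heart_rate"], downcast="float")
    stats = chunk.groupby("subject")["heart_rate"].agg(["sum", "count"])
    for subject, row in stats.iterrows():
        s, c = totals.get(subject, (0.0, 0))
        totals[subject] = (s + row["sum"], c + row["count"])

# Per-subject means computed from the accumulated partial results.
means = {subject: s / c for subject, (s, c) in totals.items()}
```

Of course, past a certain data size this kind of trick stops being enough, which is where cloud services come in.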

Advice to myself (or anyone interested): Learn cloud computing; you’ll need it.

Hope you’ll find it useful!