How can a computer science graduate student prepare himself for data scientist/machine learning intern interviews?

Please mention the skills that are expected or good to have, such as scikit-learn, R, or Weka. With 2-3 months to prepare, which platforms, such as Kaggle, are useful for improving these skills?
Sean Owen
First, here is my list of all skills I might want to see for this position:

Academic
  • CS coursework
  • Stats and linear algebra
  • Some ML coursework, covering at least
    • regression
    • classification
    • clustering
    • recommendation
    • graphical models

Data Collection Tools
  • Hadoop-based tools like Flume / Sqoop
  • Text munging languages like Python, or maybe Perl
  • Basic SQL

Data Modeling Tools
  • A library like scipy / numpy or Weka
  • A tool like R (or commercial equivalents like SAS, SPSS)

Model Serving Tools
  • (Ideally) some familiarity with PMML
  • Basic knowledge of a NoSQL store
  • Systems language skill, like Java

Business Smarts
  • Communication skills
  • Some facility with a visualization tool, even if gnuplot or Excel
  • Domain knowledge relevant to my business

You certainly don't need all of that. In fact, for an internship, you can't be expected to have most of it. I assume you are in school, so I would expect you to have much of the academic background, and would like to see that you have some of the tool skills. I would not expect business skills, but believe me, communication skills are a big differentiator.


So what to focus on? First, academics. If I were interviewing you I would probably ask about this as a filter. If you're not able to explain the very basics, like what linear regression does, that means there's a big lack of either knowledge or communication skills. So make sure you are comfortable with the very basics. I'd also ask you to explain one moderately advanced algorithm of your choice, and why it works. Same reasoning: if you can't pick something out of everything you know and explain it reasonably, we're probably not going to proceed.
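To make "what linear regression does" concrete, here is a minimal sketch in plain Python. It is only an illustration: the function name fit_line and the toy numbers are made up for this example and are not from the answer. It fits a straight line to one feature using the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Being able to say in words what each line is doing (the line that minimizes squared error, slope as covariance over variance) is roughly the level of explanation the answer is asking for.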

Unfortunately I do think a lot of interviews focus too much on the math and algorithms, as if it were an exam. I would not want to work at places that think that's the important thing. I personally would want to see that you're smart, communicate well and know the basics. Chances are that whatever math is relevant to my business is something you'll need to learn (more) anyway.


I know you're asking about tools though. The tools that are relevant really depend on the kind of place you're applying to. A classic research department is going to focus mostly on modeling tools. Since you can't get SAS / SPSS easily, focus on R and Weka.

At the other end of the spectrum, say, a small startup, the requirement is broader and shallower. They won't need you to know R. They will need you to quickly understand a business problem and put together a production-ready system to solve it. So it's much more about data collection, munging, a little modeling, and then integration. For that I would make sure you know how to get data out of a DB or log files, into a modeling tool, and then how to transform a model into some code someone could put in a web server. So: basic SQL, Python or Java, and whatever DB / web serving tools the company uses.


Kaggle is great practice although it will not 'test' your data collection skills or the serving side of things. But it will challenge you to understand a business problem, munge real data and model it. I would look favorably on an intern who had taken the time to solve a Kaggle problem and done reasonably well.
Keith Callenberg
If you are applying for a position in a large, established data science department, it would be worthwhile to figure out what software they're using. While we all have preferences, you should be comfortable picking up and running with any common environment like Python, R or MATLAB/Octave. I often run into situations where a package I want to use isn't available, or isn't as mature, in my environment of choice (Python), so it's good to be at least comfortable in several of them. You should also be comfortable working with a SQL database -- maybe not in a design sense, but at least for collecting data. Knowledge of a more contemporary NoSQL solution would be a plus, but not necessary unless you know that's what they're using.

Equally important to those skill sets is your ability to understand the data and where it's coming from. The domain may be really foreign to you, but read up and at least get the basics. Start with a few Wikipedia articles. Check out a textbook or two from the library if necessary. Some people may not agree with this, but when it comes time to manipulate features for some machine learning task, you'll have much more power at your disposal if you have some intuition about why certain features are noisier than others, or why some interact with one another, etc. I may be outside the norm on this issue, but if I were hiring for a data science position, I wouldn't want to hire anyone who couldn't show some enthusiasm for the data itself.
Here are the skills employers most often look for in data scientist positions, based on an analysis of job postings:
  • Big Data
  • Python
  • Hadoop
  • R
  • Machine learning
  • Data analysis
  • Data mining
  • SQL
  • Statistics

I would try to read a bit and watch relevant videos in order to learn more about these skills and technologies. If you have time, consider taking an eLearning course to improve your chances in this interview and in the ones that follow.

Here are some resources that may be useful:


Big Data Tutorial - Apache Hadoop Tutorial

Big Data & Hadoop Fundamentals
