February 19, 2015
Modern computing has no shortage of tools for the data scientist. The open source community alters the landscape every six to twelve months, and competition keeps you on the bleeding edge. In my career as a data scientist, I use everything from scientific Python™ packages to the newest cloud computing architectures—and sometimes all within the same project, as the initial stages of data exploration and mining are often done in a different language than the final product implementation. Here is a brief tour of my experiences with some essential tools for data science:
Python - Python has the richest collection of packages I have come across. When I see a new data set, I am inclined to tackle my problem by dissecting the data with one of the scientific libraries (SciPy, scikit-learn, etc.) and then visualizing the results (Matplotlib). Python is the easiest tool to use and has provided me with the most mileage for my data set investigations.
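As a concrete illustration of that workflow, here is a minimal sketch that dissects a synthetic data set with scikit-learn and leaves the Matplotlib step as a comment. The data and parameters are invented for the example:

```python
# A minimal data-exploration sketch: cluster a small synthetic data set
# with scikit-learn, then summarize the result. The data is synthetic
# and purely illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a toy data set: 300 points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means and inspect the discovered cluster labels.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(sorted(set(labels)))  # the three cluster ids found in the data

# To visualize, add:
# import matplotlib.pyplot as plt
# plt.scatter(X[:, 0], X[:, 1], c=labels)
# plt.show()
```

In practice I swap the synthetic blobs for the real data set under investigation; the fit-then-visualize loop stays the same.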
Java™ - Java is the backbone of cloud computing and a must-know for any data scientist who wants to build portable, production-quality products.
Scala - Scala runs on the Java Virtual Machine and interoperates closely with Java, while offering a concise, functional programming style. It is the natural language for the latest distributed computing platform, Spark. As a high-level language, Scala allows the author to focus on what should be done with the data set rather than on how to position the data to do it effectively. This is one of my favorite tools.
Hadoop® - The quintessential cloud environment, Hadoop provides long-term storage of data across a cluster of computers.
Storm - Originally developed at Twitter, this tool enables stream processing of data collected from live feeds. This is an easy tool for Java developers working on streaming analytics projects.
Spark™ - Spark supersedes the old MapReduce paradigm by hiding the processing details behind the functional language Scala. It works with both streaming data and large static data sets. This is the latest and greatest tool in distributed computing.
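The functional style that Spark builds on can be sketched in plain Python, with no cluster required: a word count expressed as a map phase (emit a pair per word) followed by a reduce phase (sum counts per word). The input lines here are invented for the example, and this is a conceptual illustration rather than Spark's actual API:

```python
# Plain-Python sketch of the map/reduce style Spark hides behind its
# functional interface: count word occurrences in a collection of lines.
from collections import Counter
from functools import reduce

lines = ["spark hides mapreduce", "scala is functional", "spark is fast"]

# "Map" phase: split each line into words and emit (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Reduce" phase: sum the counts for each word.
def combine(acc, pair):
    word, count = pair
    acc[word] += count
    return acc

counts = reduce(combine, pairs, Counter())
print(counts["spark"])  # 2
```

On a real Spark cluster, the same logic distributes across machines; the author writes the what (map, reduce) and the platform handles the how (partitioning, shuffling, fault tolerance).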
HBase - A sturdy Hadoop-based database system that is easy to use.
Kafka - This message-passing system sits between the raw data source and a consuming process such as Spark or Storm. Kafka prevents data loss between the source and the analytics engine in a streaming analytics pipeline.
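The buffering role Kafka plays can be sketched in plain Python, with a thread-safe queue standing in for the broker. This is a conceptual illustration of decoupling a fast producer from a slower consumer, not Kafka's actual API:

```python
# In-memory sketch of Kafka's role: a buffer between a data source
# (producer) and an analytics engine (consumer) running at different
# speeds, so no records are dropped. queue.Queue stands in for the broker.
import queue
import threading

buffer = queue.Queue()  # stands in for a Kafka topic
SENTINEL = None         # marks the end of the stream

def producer():
    # The raw data source emits records as fast as it can.
    for i in range(100):
        buffer.put(f"record-{i}")
    buffer.put(SENTINEL)

consumed = []

def consumer():
    # The analytics engine pulls records at its own pace.
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        consumed.append(record)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

print(len(consumed))  # all 100 records arrive despite differing speeds
```

A real Kafka deployment adds what this sketch lacks: persistence to disk, replication across brokers, and replayable offsets, which is what actually prevents data loss when a consumer falls behind or crashes.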
Mahout™ - A configurable cloud analytics toolkit that exposes advanced computing techniques through an API, letting the data scientist think about the data set instead of the code.
Elasticsearch - The latest search-engine platform, on top of which you can place the friendly Kibana user interface.
Lucene™ - This tool is the backbone of open source text processing—and the basis for text processing in many other tools, including Mahout, Solr™, and Elasticsearch. This is a must-have for text processing gurus.
I hope you will find these tools useful in examining the data sets in the National Data Science Bowl and in your everyday life as a data scientist. Good luck!