11 things you should know as a Data Scientist
The meetups we conduct draw a mixed audience. From complete beginners in data science to experts in the field, everyone attacks the problem under a single roof. However, one thing stands out when we interact with these people: a large proportion of them (including some experts) have not set up and tuned their machines for data science. A lot of them never took time out to prepare themselves for the data science journey. As a result, they came across some of the industry resources purely by chance.
No one told them which blogs to follow, which newsletters to subscribe to, or where to read industry news. They also never tuned their machines, or did not have the necessary hardware or software. This leads to lower productivity and even frustration in some cases, when they should actually be loving the experience.
Still can’t relate? Think of visiting a website which takes more than 10 seconds to load. You will likely get bored in this time and open a new tab for another site, or simply steer away from what you meant to do. The same thing happens in data science: the longer your code runs, the higher the chances of you steering away from work!
This is how we came across this unspoken problem people face in the industry, and hence we decided to create a guide to help people get ready for data science.
Who is this guide meant for?
As mentioned above, this guide is meant for anyone in the data science industry who has not tuned their machine for performance. I think it will be of more use to beginners than to experts, but I have seen experts benefit from these tips equally well.
Let’s start with setting up the machine
1. Hardware – choice of your machine
The first thing to ensure is that you are on the right hardware for data science. There is not much anyone can do if your hardware does not have what you need. Since laptops are the mainstream computing device nowadays, my recommendations below are for laptops. If you use a desktop / iMac, you can go with an even better configuration.
While this choice will ultimately boil down to how much you can shell out for a machine, I would recommend a machine with a quad-core processor, preferably an i7 (in the case of Intel chips). Make sure you check that the processor you choose is quad core and not dual core. Lately, it has been really difficult to find good quad core chips. You can compare the benchmark performance of various chips in your budget using sites like cpuboss.
Next, it is always recommended to maximize your RAM to the extent possible. A lot of tools use RAM for computations and you don’t want to run out of RAM while doing them (you eventually will in some cases!).
If your budget allows, you should upgrade to an SSD, as read / write operations on your datasets will take a fraction of the time they take on a conventional hard disk. For those who are really serious about machine learning and deep learning, an NVIDIA GPU is recommended, so that you can run intense computations using CUDA.
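As a rough way to see how your current machine measures up against the recommendations above, here is a minimal Python sketch that uses only the standard library. The function name machine_summary is my own choice, and the RAM check relies on os.sysconf, which exists only on POSIX-style systems, so on Windows it simply reports None.

```python
import os
import platform
import shutil


def machine_summary(path="."):
    """Return a dict with CPU count, RAM (if detectable) and free disk space."""
    summary = {
        "os": platform.system(),
        # os.cpu_count() reports logical cores and may return None if unknown.
        "logical_cpus": os.cpu_count(),
    }
    try:
        # POSIX-style systems only (e.g. Linux); os.sysconf is missing on Windows.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        summary["ram_gb"] = round(ram_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        # RAM could not be detected with the standard library alone.
        summary["ram_gb"] = None
    # Free space on the disk that holds `path`.
    disk = shutil.disk_usage(path)
    summary["disk_free_gb"] = round(disk.free / 1024**3, 1)
    return summary


info = machine_summary()
for key, value in info.items():
    print(f"{key}: {value}")
```

Note that on a quad core chip with hyper-threading, logical_cpus will typically read 8 rather than 4, because os.cpu_count() counts logical cores.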
Here are a few good recommendations available currently:
- MacBook Pro – 15 inch model.
- I had purchased a Lenovo Z510 model about 3 years back – i7 (3632QM) chip, 16 GB RAM with NVIDIA GPU and it has served me well. It is still one of the better machines in the market (in terms of performance).
- If you are based in the US and want something out of this world, you can check out the Malibal 9000 – it’s a beauty, if you can live with a bit of extra weight.
A few additional notes:
- Skylake chips (6th generation) from Intel were announced recently and machines based on them are just round the corner. I believe that they will push the envelope once again. You can check out the Lenovo ThinkPad P50 & P70 configurations as evidence. So, even if you have a moderate machine today, I would recommend that you stay put for another 2 – 3 months and then buy a machine based on a 6th generation quad core chip.
- If you have to buy a machine today, it might be a good idea to stick with a 4th generation quad core i7 chip. There weren’t many options available with the 5th generation chipset at the time of writing this article.
People might argue that you don’t need to invest in such an advanced machine, and that you might be better off working on a modest machine and running the heavy computations in the cloud. I personally like the accessibility provided by a personal machine and the fact that I can start working at any place without connecting to the internet.
2. Operating System (OS)
Once you have selected your machine, the next most important choice would be your OS.
- If you have a Mac, then this choice is already made for you. A few tools do not offer Mac-compatible products (e.g. QlikView), but you can run them in a virtual machine.
- If you are on a PC, I would recommend setting up a dual boot. Linux is better for doing any serious computations, whereas Windows is better for using Office products from Microsoft and a few other products which are only available for Windows. So, you get the best of both worlds.
- Another option I have seen people using is running virtual Linux machines within Windows, but that limits the amount of memory and performance you can achieve.
- It is also possible to stay on Linux and use the Office 365 offering from Microsoft. I have not done that myself, so I cannot comment on it, but it looks like a viable option. Again, there might be some software you will not be able to run in this scenario.
Once you have finalized the OS, make sure you tune it for high performance. For example, in Windows you can disable transition effects and animations (run sysdm.cpl, go to the Advanced tab -> Performance section -> Settings and then disable the visual effects), remove unnecessary startup programs and switch the power plan to High performance.
3. Software – general
Here is a list of software you will need apart from the analytics / data science tools (which are discussed in the coming points).
- MS Office for Excel, building presentations and writing documents.
- FileZilla for transferring files using FTP
- Git & GitHub for version management.
- VMware / Oracle VirtualBox / Vagrant for running virtual machines
- Cygwin / PuTTY (for Windows)
- I use Evernote for taking notes. In case of Linux, I run it in the browser.
- Terminator (for Linux) to run multiple terminals in a single view (it is awesome!)
- Sublime Text for editing code. You should install additional plugins for the languages you use.
4. Software – Analytics / Data Science
This section will vary depending on the main tool you choose for data mining. If you are yet to choose your main tool, check out this comparison – SAS vs. R vs. Python. If you already have a tool of choice, select the one which applies to you:
- SAS – Base SAS along with Enterprise Guide (for GUI driven interface) and Enterprise Miner and the modules depending on the license you have. It also offers TextMiner / JMP and a lot of industry specific modules.
- R – R along with all the key libraries. RStudio is a good choice of environment.
- Python – IPython notebooks, Dato (GraphLab), vowpal-wabbit and import.io are some interesting additional tools, apart from the other scientific libraries.
Other options include MATLAB / Octave / RapidMiner.
5. Software – Data Visualization
In addition to the software mentioned above, it makes sense to have a tool specifically for data visualization. Such tools help a lot during data exploration and when you present the data story to your customers at the end of a project. Again, there are a lot of options here, and comprehensive coverage of them would be an article in itself. If you just want one, I would recommend QlikView: it is easy to use, has a personal edition which is free to download and handles large data really well. Tableau is another popular choice; it is very intuitive, but in my experience not as effective on large datasets.
6. Databases / File storage
At times, when the dataset is huge or you are building an application for end users, you will need a database, SQL databases being the most common kind. You can use MySQL or PostgreSQL. SQLite, which ships with Python’s standard library (the sqlite3 module), can be an effective option for small applications as well. If you frequently work on huge datasets, setting up a Hadoop cluster is inevitable. If you work on real-time streams of data, you will need Spark as well.
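To give a feel for how little setup SQLite needs, here is a minimal sketch using Python’s built-in sqlite3 module. The sales table, its columns and the sample rows are made up purely for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained; pass a file path
# such as "practice.db" (a hypothetical name) to persist the data instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Parameterized inserts: the ? placeholders are filled from each tuple.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 30.0)],
)
conn.commit()

# Aggregate in SQL rather than pulling every row into Python first.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)

conn.close()
```

The same queries would carry over to MySQL or PostgreSQL with a different driver, so SQLite is a convenient place to practise SQL before setting up a full database server.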
In addition to these databases, you should also keep a couple of NoSQL databases at hand, in case you need them. I would recommend MongoDB and Neo4j.
By this time, your machine has almost all the resources you need for your data science journey. Now, let us look at a few other resources you should use along the way.
6. Cloud services
What if you want to work on a dataset which is 400 GB in size? Even the machines I recommended above would fail to load it into memory while using R! It is in scenarios like this that a cloud account comes in handy. You can use either of two cloud services: Amazon Web Services (popularly known as AWS) or Microsoft Azure. Both provide highly scalable solutions. The Azure platform in its new avatar is probably more user friendly, but Amazon is still the king of cloud services. You can sign up for accounts on both and give them a try.
7. Industry blogs and newsletters
8. Mobile apps
I use my mobile to read a lot of content on the go. Whether I am travelling in the metro or just have 5 minutes to sneak a look at the latest publications, I rely a lot on my mobile for that. I use a combination of Prismatic and Flipboard to find new content. Together, they provide me with all the latest gems published in the industry.
In addition, I have Termux, a fully functional Linux terminal, just in case I need to ssh into a server while on the go. I use it occasionally to play around in a Python shell for quick prototyping as well.
9. Meetups
You can look out for meetups happening in your area. They give you an opportunity to interact with like-minded people. Analytics Vidhya conducts its hackathons in several cities in India. DataKind has several meetups as well.
10. Datasets for practice
You can also look at data.gov to find data from open sources.
11. Communities and Social Media
If you have not done so already, sign up for our discussion portals. You will not only interact with other data scientists from the community, but can also participate in the various hackathons we conduct. In addition to this, you should check out Kaggle competitions and DataTau for Hacker News style industry news.
I think you are all set now. You have a machine with all the necessary software, tuned for performance. You are also part of multiple communities and portals to stay in tune with the industry.
If you have done all of this, you might be wondering what comes next. Stay tuned: we are coming up with a resource finder shortly, which will assume you have done all of this and will point you to the resources you need to master various concepts, tools and techniques in data science.
In the meantime, if you think there are more steps or resources I have missed, please feel free to add them here. I hope this article proves helpful to everyone who works with non-optimized machines and resources, which leads to frustration and loss of productivity.