Before you start learning Big Data Hadoop, there are a few prerequisites. Learning is easier if you already know Core Java, Python, Linux or Unix, and databases. Let’s look at why each of these is needed to work with Big Data Hadoop. A variety of languages and tools are used to collect, compute on and analyse data. The figure below illustrates the Hadoop stack.
The figure above shows that data needs to be loaded into Hadoop and then processed. Data can be loaded from several kinds of sources, and each one calls for a different tool or skill:
- From databases – knowledge of Sqoop
- From web servers – knowledge of Flume
- From messaging systems – knowledge of Kafka
- From files – knowledge of Unix/Linux shell commands
- Data stored in NoSQL databases, mainly MongoDB and HBase, can be loaded directly into Hadoop.
The loaded data can be structured, semi-structured or unstructured. The following frameworks and languages can be used to process data loaded into Hadoop:
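- MapReduce – programs are typically written in Java
- YARN – Hadoop's resource manager, implemented in Java; it schedules the jobs run by the other tools rather than being a processing language itself
- Pig – scripts are written in Pig Latin
- Hive – queries are written in HiveQL, a SQL-like language
- Spark – written in Scala, with APIs for Scala, Java, Python and R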
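To make the MapReduce idea concrete, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It only imitates the programming model on a single machine and does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster and the framework performs the grouping step, but the flow of data is the same.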
Before starting the installation, make sure the machine meets the following recommended system requirements:
- RAM: 16 GB
- HDD: 1 TB
- Operating System: Windows 10
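The recommendation above can be turned into a quick check. The following is a minimal Python sketch and not part of any Hadoop tooling: the names `shortfalls` and `total_disk_gb` are hypothetical, RAM is passed in by hand because the standard library has no single cross-platform call to read it, and 1 TB is assumed to mean 1000 GB.

```python
import shutil

# Thresholds taken from the recommended requirements listed above.
RECOMMENDED_RAM_GB = 16
RECOMMENDED_DISK_GB = 1000  # assumption: 1 TB counted as 1000 GB


def total_disk_gb(path="."):
    """Return the total size, in decimal GB, of the disk that holds `path`."""
    return shutil.disk_usage(path).total / 1e9


def shortfalls(ram_gb, disk_gb):
    """Return a list of unmet recommendations; an empty list means all are met."""
    problems = []
    if ram_gb < RECOMMENDED_RAM_GB:
        problems.append(f"RAM: {ram_gb} GB is below {RECOMMENDED_RAM_GB} GB")
    if disk_gb < RECOMMENDED_DISK_GB:
        problems.append(f"HDD: {disk_gb:.0f} GB is below {RECOMMENDED_DISK_GB} GB")
    return problems


if __name__ == "__main__":
    installed_ram_gb = 16  # fill in the value shown in your system settings
    for problem in shortfalls(installed_ram_gb, total_disk_gb()):
        print("Below recommendation:", problem)
```

A machine below these figures may still work for learning purposes, since the values are a recommendation rather than a hard limit.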
To set up Big Data Hadoop, follow these steps:
- Hadoop runs on Linux, so the first step is to get a Linux environment in place. The best option is to install Linux directly on the machine. If that is not possible, we can run Linux in a virtual machine using VMware Workstation Player.
  - Search Google for “VMware Workstation Player download”.
  - Open the official VMware link (usually the first result) and download the latest version. (Current latest version is 12.5.)
  - After the download finishes, install the software.
  - Double-click the .exe file and follow the installation wizard.
  - Accept the terms and conditions and continue until the wizard finishes.
- Once VMware Workstation Player is installed, we need to load the Linux operating system on which Hadoop will be set up. Follow these steps:
  - Download the latest version of Ubuntu from https://www.ubuntu.com/download/desktop
  - Once the download is complete, run VMware Workstation Player.
  - Click “Create a New Virtual Machine” > in the new window, select “I will install the operating system later” > click Next > select “Linux” as the guest operating system > select “Ubuntu 64-bit” as the version > click Next > change the virtual machine name to “Big Data-Ubuntu 64 bit” and click Next. (See the images below.)
  - Select “Store virtual disk as a single file” > click Next > click “Customize Hardware” > select Processor under Device > set “Number of processor cores” to 2 > select New CD/DVD (SATA) under Device > select “Use ISO image file” > click Browse and select the downloaded Ubuntu ISO file (Ubuntu 17.04 in this walkthrough) > click Open.
  - Select USB Controller > click Remove > remove Printer in the same way > select Display > uncheck “Accelerate 3D Graphics” > click Close > click Finish.
  - In the left pane, select the virtual machine “Big Data-Ubuntu 64 bit” > click “Play virtual machine” > the Ubuntu installation will begin > in the left pane, select your preferred language > click Install Ubuntu > check “Download updates while installing Ubuntu” and “Install third-party software for graphics and Wi-Fi hardware, Flash, MP3 and other media” > click Continue > select “Erase disk and install Ubuntu” > click Install Now > a warning prompt appears; click Continue.
  - Select your location and click Continue > select the keyboard language and click Continue > enter your details and password and click Continue > the installation begins and will take some time.
  - When the message “Installation is complete. You need to restart the computer in order to use the new installation” appears, click “Restart Now”.
  - The operating system restarts and prompts for the password.
  - Enter the password and press the Enter key. The virtual machine is now ready for use.
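Once the virtual machine is running, a short script inside the guest can confirm that it matches the setup described above: a Linux system with the two processor cores chosen under “Customize Hardware”. This is a rough sketch only; `check_guest` is a hypothetical helper, and it assumes Python 3 is available in the Ubuntu guest.

```python
import os
import platform

EXPECTED_SYSTEM = "Linux"  # Hadoop is set up on Linux in this guide
EXPECTED_MIN_CORES = 2     # value chosen for "Number of processor cores"


def check_guest(system_name, cpu_count):
    """Return a list of warnings; an empty list means the guest looks as expected."""
    warnings = []
    if system_name != EXPECTED_SYSTEM:
        warnings.append(f"expected {EXPECTED_SYSTEM}, found {system_name}")
    if cpu_count is None or cpu_count < EXPECTED_MIN_CORES:
        warnings.append(
            f"expected at least {EXPECTED_MIN_CORES} cores, found {cpu_count}"
        )
    return warnings


if __name__ == "__main__":
    # os.cpu_count() can return None when the count cannot be determined.
    found = check_guest(platform.system(), os.cpu_count())
    for warning in found:
        print("Warning:", warning)
    if not found:
        print("Guest matches the expected setup.")
```

If the core count is lower than expected, it can be changed in the virtual machine settings while the machine is powered off.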