Big data refers to large volumes of complex data that grow rapidly. The data can be of any type and is often inconsistent, so it cannot be processed efficiently with traditional methods.
Characteristics of Big data
There are four main characteristics of big data:
1) Volume: the amount of data is very large.
2) Variety: the data can be of any type, structured or unstructured.
3) Veracity: the data can be inconsistent or unreliable at times.
4) Velocity: the data is generated at high speed.
Importance of big data
In today's world, with a population of about 8 billion, big data plays an important role.
1) Business: big data tools can turn large datasets into meaningful insights that help increase customer satisfaction.
Ex: a company can use big data to analyze customer behavior and recommend relevant products.
2) Science: big data also plays an important role in science.
A large number of test results can be analyzed effectively using big data techniques.
3) Personal use: big data can also be useful to individuals.
Ex: Spotify gives each user a yearly recap that shows what kind of music they listened to that year.
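The yearly recap idea can be sketched as a simple count over a listening log. This is only an illustration: the log, the user names, and the function yearly_recap are made up, and a real service would run this kind of aggregation over far larger data with distributed tools.

```python
from collections import Counter

# Hypothetical listening log of (user, genre) pairs; the data is invented.
listening_log = [
    ("asha", "rock"),
    ("asha", "jazz"),
    ("asha", "rock"),
    ("ben", "pop"),
    ("ben", "pop"),
    ("ben", "rock"),
]


def yearly_recap(log, user):
    """Return the user's genres ordered from most to least played."""
    counts = Counter(genre for name, genre in log if name == user)
    return counts.most_common()


print(yearly_recap(listening_log, "asha"))  # [('rock', 2), ('jazz', 1)]
```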
Applications of Big Data
Big data is now used in almost every sector.
Tools like Hadoop make this possible.
Some of the applications of big data are:
1) In hospitals, to manage records for large numbers of patients
2) In business, to bring out meaningful insights
3) In educational institutions, to evaluate student performance
4) In social media
5) In search engines
6) In online shopping
7) In scientific research
Data sources: these are the sources from which we can acquire big data.
Internal data sources: these belong to the organization itself. They commonly use sensor technologies, i.e. the organization collects data such as audio, video, and temperature from its own devices.
Ex: mobile phones, IoT devices
Third-party data sources: a small business that does not have enough money or equipment for internal data sources can use third-party data sourcing, i.e. it collects data using a third party's technology.
This is commonly seen on small websites.
It collects data such as the number of clicks and opens.
Ex: Google Analytics
External data sources: this data is collected by other parties and is open for anyone to access. Ex: social media
Open data sources: these are similar to external sources, but the data is often very complex and may not be directly relevant to everyday users. It can be scientific data, research data, or government data.
Ex: www.govt.uk
Through these sources we can acquire big data.
Structured vs unstructured
Structured: the data is in tabular format (rows and columns).
It can be interpreted easily by machines.
It is easy to analyze, by both machines and humans.
Tools: SQL databases such as Oracle
Unstructured: the data is in formats such as video, audio, and images.
It can be difficult for machines to interpret.
It is difficult to analyze and often needs specialized tools or human interpretation.
Tools: NoSQL databases, Hadoop
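The difference can be seen in a small sketch. It assumes an in-memory SQLite table as the structured example and a free-text note as the unstructured one. The table, the names, and the note are invented for illustration.

```python
import re
import sqlite3

# Structured data: rows and columns with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("Ann", 34), ("Raj", 61), ("Li", 47)],
)
over_40 = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM patients WHERE age > 40 ORDER BY name"
    )
]
conn.close()

# Unstructured data: free text with no schema, so even a simple question
# needs custom parsing (here, a regular expression for two-digit numbers).
note = "Patient Raj, aged 61, reports chest pain. Patient Li (47) reports headaches."
ages_in_note = [int(n) for n in re.findall(r"\b\d{2}\b", note)]

print(over_40)       # ['Li', 'Raj']
print(ages_in_note)  # [61, 47]
```

The SQL query states what is wanted and the schema does the rest. The text version depends on a pattern that would break as soon as the note is worded differently.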
Pig architecture
There are four main parts of Pig:
1) Parser: it checks the syntax of the Pig script and performs semantic checks such as type checking. It then converts the script into a DAG (directed acyclic graph) of logical operators and sends this DAG to the optimizer.
2) Optimizer: it takes the DAG from the parser and applies logical optimizations such as projection and pushdown, which drop unnecessary columns early. It then sends the optimized logical plan to the compiler.
3) Compiler: it takes the optimized logical plan and compiles it into a series of MapReduce jobs. Pig supports multi-query execution, and the compiler can rearrange the order of operations so that the script executes efficiently.
4) Execution engine: it takes the compiled MapReduce jobs and executes them.
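The four stages above can be sketched as a toy pipeline. This is not Pig's real API or Pig Latin: the mini-language (LOAD, FILTER_GT, PROJECT), the PRUNE step, the TABLES data, and all function names are assumptions made for illustration, and plain Python functions stand in for MapReduce jobs.

```python
# Hypothetical in-memory "storage" used instead of HDFS.
TABLES = {
    "people": [
        {"name": "Ann", "age": 34, "city": "Leeds"},
        {"name": "Raj", "age": 61, "city": "Pune"},
    ]
}


def parse(script):
    """Parser: check each statement and build a logical plan."""
    plan = []
    for line in script.strip().splitlines():
        parts = line.split()
        if not parts or parts[0] not in ("LOAD", "FILTER_GT", "PROJECT"):
            raise SyntaxError(f"cannot parse line: {line!r}")
        plan.append((parts[0], parts[1:]))
    return plan


def optimize(plan):
    """Optimizer: add a PRUNE step after LOAD so unused columns go early."""
    needed = set()
    for op, args in plan:
        if op == "FILTER_GT":
            needed.add(args[0])
        elif op == "PROJECT":
            needed.update(args)
    optimized = []
    for op, args in plan:
        optimized.append((op, args))
        if op == "LOAD" and needed:
            optimized.append(("PRUNE", sorted(needed)))
    return optimized


def compile_plan(plan):
    """Compiler: turn each plan step into a runnable job (a function)."""
    jobs = []
    for op, args in plan:
        if op == "LOAD":
            jobs.append(lambda rows, t=args[0]: [dict(r) for r in TABLES[t]])
        elif op in ("PRUNE", "PROJECT"):
            jobs.append(
                lambda rows, cols=args: [{c: r[c] for c in cols} for r in rows]
            )
        elif op == "FILTER_GT":
            jobs.append(
                lambda rows, c=args[0], v=int(args[1]): [
                    r for r in rows if r[c] > v
                ]
            )
    return jobs


def execute(jobs):
    """Execution engine: run the jobs in order, feeding each the last output."""
    rows = []
    for job in jobs:
        rows = job(rows)
    return rows


script = """
LOAD people
FILTER_GT age 40
PROJECT name
"""
result = execute(compile_plan(optimize(parse(script))))
print(result)  # [{'name': 'Raj'}]
```

In this sketch the optimizer notices that only age and name are used, so the city column is dropped right after loading instead of being carried through every later step.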
The other components are:
i) Grunt shell: the interactive command-line interface for Pig
ii) Apache Pig: where all the Pig libraries are stored
iii) MapReduce: where the mapping and reducing are done
iv) HDFS: where the MapReduce output is stored
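The mapping and reducing done by the MapReduce component can be illustrated with the classic word count, written here in plain Python on one machine. The function names and input lines are assumptions for illustration. In Hadoop the same phases run in parallel across many nodes and the result is written to HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the grouped values into one result per key."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```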
Hive Architecture
The flow in Hive is similar to that of Apache Pig.
Hive server: it accepts requests from the clients and passes them to the Hive driver.
Hive driver: it compiles and optimizes the queries into an execution plan (a DAG) and sends it to the execution engine as MapReduce tasks.
Execution engine: it executes all the MapReduce jobs.
Metastore: it stores metadata about the data in Hive, such as tables, columns, and their types. It also stores the serializer and deserializer (SerDe) information needed to read and write the data.
CLI: the Hive command-line interface
Hive web UI: a web-based GUI for Hive
Hive clients:
1) Thrift server: it lets clients written in any programming language that supports Thrift connect to Hive.
2) JDBC driver: Hive is written in Java and runs on top of MapReduce; the JDBC driver lets Java applications connect to Hive.
3) ODBC driver: it lets applications that support ODBC connect to Hive.
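The way a request moves through these components can be sketched with a toy model. None of these classes exist in Hive: Metastore, Driver, ExecutionEngine, Server, and the sales table are hypothetical names chosen for illustration, and a plain dictionary stands in for the data stored in HDFS.

```python
class Metastore:
    """Holds table metadata (column names and types), not the data itself."""

    def __init__(self):
        self.tables = {}

    def register(self, table, columns):
        self.tables[table] = columns

    def columns(self, table):
        return self.tables[table]


class Driver:
    """Checks a request against the metastore and builds a simple plan."""

    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, table, wanted):
        known = self.metastore.columns(table)
        missing = [c for c in wanted if c not in known]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        return [("scan", table), ("select", wanted)]


class ExecutionEngine:
    """Runs the plan steps in order against the stored rows."""

    def __init__(self, storage):
        self.storage = storage

    def run(self, plan):
        rows = []
        for step, arg in plan:
            if step == "scan":
                rows = self.storage[arg]
            elif step == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


class Server:
    """Accepts client requests and passes them to the driver and engine."""

    def __init__(self, driver, engine):
        self.driver = driver
        self.engine = engine

    def handle(self, table, wanted):
        return self.engine.run(self.driver.compile(table, wanted))


metastore = Metastore()
metastore.register("sales", {"item": "string", "amount": "int"})
storage = {"sales": [{"item": "pen", "amount": 3}, {"item": "book", "amount": 12}]}
server = Server(Driver(metastore), ExecutionEngine(storage))

print(server.handle("sales", ["item"]))  # [{'item': 'pen'}, {'item': 'book'}]
```

The point of the sketch is the separation of roles: the metastore only describes the data, the driver plans the work, and the execution engine is the only part that touches the rows.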