In this tutorial, we delve into unstructured data processing in Hadoop and show you how to use Hive and Impala to collect and analyse data. We'll walk you through loading raw web-access logs into a Hive warehouse table using a regular-expression SerDe, and then demonstrate how to run SQL on that data with Impala. Along the way, we'll share industry tricks and best practices that will help you tackle real-life projects with confidence. Whether you're new to Hadoop or a seasoned pro, this tutorial is a must-watch for anyone looking to master unstructured data processing in Hadoop.
👉 Please like, share, and subscribe to stay updated on more in-depth tech analyses and hands-on tutorials! 🚀🔥
-----------------------------------------------------------------------------------------
Download data: https://t.ly/WWs0f
Cloudera QuickStart VM: https://downloads.cloudera.com/demo_v...
-----------------------------------------------------------------------------------------
Hive Commands:
DROP TABLE IF EXISTS intermediate_logs;
CREATE EXTERNAL TABLE intermediate_logs (
  ip STRING,
  datee STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,       -- HTTP status code
  code2 STRING,       -- response size in bytes
  dash STRING,        -- referrer (often just "-")
  user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) - - \\[(\\d{2}/[A-Za-z]{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} -\\d{4})\\] "([^ ]*) ([^ ]*) ([^ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
  'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s')
LOCATION '/user/hive/warehouse/access_log';
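Before loading anything, it's worth sanity-checking the pattern outside Hive. A minimal sketch in Python (the sample log line below is illustrative, not from the tutorial's dataset; note that each `\\` in the Hive DDL becomes a single `\` in the raw regex):

```python
import re

# Raw form of the 'input.regex' above: Hive's doubled backslashes
# collapse to single backslashes once the string reaches the regex engine.
LOG_PATTERN = re.compile(
    r'([^ ]*) - - \[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})\] '
    r'"([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Hypothetical line in Apache combined log format
line = ('66.249.73.135 - - [17/May/2015:10:05:03 -0500] '
        '"GET /downloads/product_1 HTTP/1.1" 304 0 "-" '
        '"Debian APT-HTTP/1.3 (0.8.16~exp12ubuntu10.21)"')

m = LOG_PATTERN.match(line)
if m:
    # Groups line up with the table columns, in order
    ip, datee, method, url, http_version, code1, code2, dash, user_agent = m.groups()
    print(ip, method, url, code1)
```

If a line fails to match, the RegexSerDe emits NULLs for every column rather than raising an error, so testing the pattern up front saves a confusing debugging session later.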
DROP TABLE IF EXISTS logs;
-- Plain delimited copy of the same columns; Impala can query this
-- directly, since it does not support Hive's contrib RegexSerDe.
CREATE EXTERNAL TABLE logs (
  ip STRING,
  datee STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/logs';
-- Make the contrib JAR that provides RegexSerDe available in this session
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
INSERT OVERWRITE TABLE logs SELECT * FROM intermediate_logs;
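Once the delimited table is populated, you can analyse it from impala-shell. Run INVALIDATE METADATA first so Impala picks up tables created through Hive; the two queries below are illustrative examples, not part of the tutorial's script:

```
-- Make Hive-created tables visible to Impala
INVALIDATE METADATA;

-- Top 10 most-requested URLs
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;

-- Breakdown of HTTP status codes (code1 holds the status)
SELECT code1 AS status, COUNT(*) AS cnt
FROM logs
GROUP BY code1
ORDER BY cnt DESC;
```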