大型数据库应用技术课堂测试05

题目：

Result文件数据说明：

Ip：106.39.41.166,（城市）

Date：10/Nov/2016:00:01:02 +0800,（日期）

Day：10,（天数）

Traffic: 54 ,（流量）

Type: video,（类型：视频video或文章article）

Id: 8701（视频或者文章的id）

1、 数据清洗：按照进行数据清洗，并将清洗后的数据导入hive数据库中。

两阶段数据清洗：

（1）第一阶段：把需要的信息从原始日志中提取出来

ip: 199.30.25.88

time: 10/Nov/2016:00:01:03 +0800

traffic: 62

文章： article/11325

视频： video/3235

（2）第二阶段：根据提取出来的信息做精细化操作

ip—>城市 city（IP）

date–> time:2016-11-10 00:01:03

day: 10

traffic:62

type:article/video

id:11325

（3）hive数据库表结构:

create table data( ip string, time string , day string, traffic bigint,

type string, id string )

2、数据分析：在HIVE统计下列数据。

（1）统计最受欢迎的视频/文章的Top10访问次数（video/article）

（2）按照地市统计最受欢迎的Top10课程（ip）

（3）按照流量统计最受欢迎的Top10课程（traffic）

步骤1：

由于不会使用MySQL格式化数据格式所以使用Java程序将数据提出清洗存入Arraylist再导入新表。

import java.sql.*;
import java.util.ArrayList;

public class test002 {
    public static void main(String[] args) {
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        Connection conn=null;
        try {
            conn= DriverManager.getConnection("jdbc:mysql://localhost:3306/test03","root","zber1574");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        String sql="select * from data";
        try {
            ArrayList<beanlist> temp=new ArrayList<beanlist>();
            PreparedStatement pt=conn.prepareStatement(sql);
            ResultSet rs=pt.executeQuery();
            while(rs.next()){
                String c=rs.getString("ip");
                String time=rs.getString("time");
                String day=rs.getString("day");
                String traffic=rs.getString("traffic");
                String type=rs.getString("type");
                String id=rs.getString("id");
                String[]a=c.split("\\.");
                String ip=""+a[0]+"."+a[1];
                beanlist temp1;
                temp1 = new beanlist(ip,time,day,traffic,type,id);
                temp.add(temp1);
            }
            add a=new add();
            for(int i=0;i<temp.size();i++){
                String sql1="INSERT INTO data2 (ip,time,day,traffic,type,id) VALUES('"+temp.get(i).getIp()+"','"+temp.get(i).getTime()+"','"+temp.get(i).getDay()+"',"+temp.get(i).getTraffic()+",'"+temp.get(i).getType()+"','"+temp.get(i).getId()+"')";
                a.insert(sql1);
                System.out.println(i);
            }

        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class add {
    public void insert(String sql) throws SQLException {
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        Connection conn=null;
        try {
            conn= DriverManager.getConnection("jdbc:mysql://localhost:3306/test03","root","zber1574");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        PreparedStatement pt= null;
        try {
            pt = conn.prepareStatement(sql);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        pt=conn.prepareStatement(sql);
        int a=pt.executeUpdate();
        pt.close();
        conn.close();
    }
}

以上就是数据库的提取解析重构再导入

ps：此时从数据库导出的.csv文件可能会是带有””的

会影响hive的正常读取

之后将csv文件通过xftp传入虚拟机

打开到hive

创建表

create table city(
  ip             string  comment "城市"
  , c_date        string  comment "日期"
  , c_day     string  comment "天数"
  , traffic         string  comment "流量"
  , type          string  comment "类型"
  ,Id string comment "id"
)
row format delimited fields terminated by ',';

将文件导入表

hadoop fs -put /opt/software/city.csv /user/hive/warehouse/city

题目一表

create table type10(
id String,
total String
)ROW format delimited fields terminated by ',' STORED AS TEXTFILE;



insert into type10 select id, count(*) as total from result1 group by id order by total desc limit 10;

create table if not exists tb_ip
comment "ip"
as
select
    ip,
    type,
    id,
    count(*) as paixu from city1 group by ip,type,id
order by paixu desc
limit 10;

create table if not exists tb_traffic
comment "traffic"
as
select
    type,
    id,
    sum(traffic) as liuliang from city1
                             group by type,id
order by liuliang desc
limit 10;

最后用fineUI可视化就完成了

原文地址：http://www.cnblogs.com/Rebz/p/16804499.html

1. 本站所有资源来源于用户上传和网络，如有侵权请邮件联系站长！ 2. 分享目的仅供大家学习和交流，请务用于商业用途! 3. 如果你也有好源码或者教程，可以到用户中心发布，分享有积分奖励和额外收入！ 4. 本站提供的源码、模板、插件等等其他资源，都不包含技术服务请大家谅解！ 5. 如有链接无法下载、失效或广告，请联系管理员处理！ 6. 本站资源售价只是赞助，收取费用仅维持本站的日常运营所需！ 7. 如遇到加密压缩包，默认解压密码为"gltf",如遇到无法解压的请联系管理员！ 8. 因为资源和程序源码均为可复制品，所以不支持任何理由的退款兑现，请斟酌后支付下载声明：如果标题没有注明"已测试"或者"测试可用"等字样的资源源码均未经过站长测试.特别注意没有标注的源码不保证任何可用性

大型数据库应用技术课堂测试05

排行榜展示

3D打印机glb模型下载-机械glb模型

树glb模型下载-树2

水稻glb模型下载-水稻1

变电箱1glb模型下载-机械glb模型

树glb模型下载-树3

模型

水稻glb模型下载-水稻1

树glb模型下载-树3

树glb模型下载-树2

变电箱1glb模型下载-机械glb模型

3D打印机glb模型下载-机械glb模型

树glb模型下载-树1

大型数据库应用技术课堂测试05

排行榜展示

标签

模型