[BigDataHadoop: Hadoop&Spark.V05] [BigDataHadoop. Spark, the in-memory fast compute engine] [Chapter 4 | Hadoop | Spark | Spark SQL: Spark SQL programming & Transformation operations]
1. Transformation Operations
### --- 1. Overview: the DSL covers the equivalent of select * from tab where ... group by ... having ... order by ...
~~~ RDD-like operations
~~~ persistence: cache and checkpoint
~~~ select
~~~ where
~~~ group by / aggregation
~~~ order by
~~~ join
~~~ set operations
~~~ null-value handling (functions)
~~~ functions
### --- 2. RDD-like operations
~~~ map, filter, flatMap, mapPartitions, sample, randomSplit, limit, distinct, dropDuplicates, describe
scala> df1.map(row => row.getAs[Int](0)).show
+-----+
|value|
+-----+
| 7369|
| 7499|
| 7521|
| 7566|
| 7654|
| 7698|
| 7782|
| 7788|
| 7839|
| 7844|
| 7876|
| 7900|
| 7902|
| 7934|
+-----+
~~~ # randomSplit (similar to the RDD operation: splits a DF/DS into multiple parts by the given weights; weights that do not sum to 1 are normalized)
scala> val df2 = df1.randomSplit(Array(0.5, 0.6, 0.7))
df2: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields], [EMPNO: int, ENAME: string ... 6 more fields])

scala> df2(0).count
res76: Long = 2

scala> df2(1).count
res77: Long = 4

scala> df2(2).count
res78: Long = 8
~~~ # limit: take the first 10 rows as a new Dataset
scala> val df2 = df1.limit(10)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # distinct: deduplicate rows
scala> val df2 = df1.union(df1)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df2.distinct.count
res79: Long = 14
~~~ # dropDuplicates: deduplicate by column values
scala> df2.dropDuplicates.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
+-----+------+---------+----+-------------------+----+----+------+

scala> df2.dropDuplicates("mgr", "deptno").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df2.dropDuplicates("mgr").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df2.dropDuplicates("deptno").show
+-----+-----+--------+----+-------------------+----+----+------+
|EMPNO|ENAME|     JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+--------+----+-------------------+----+----+------+
| 7369|SMITH|   CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7782|CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499|ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
+-----+-----+--------+----+-------------------+----+----+------+
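~~~ # The ds1 used by describe() below is not built in this section. A hypothetical stand-in
~~~ # (the name/age/height pairings are guessed, but chosen so they reproduce the output shown):
case class Person(name: String, age: Int, height: Int)
val ds1 = spark.createDataset(Seq(Person("Tom", 28, 184), Person("Andy", 16, 165), Person("Bob", 10, 144)))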
~~~ # describe: statistics for all columns (count, mean, stddev, min, max)
scala> ds1.describe().show
+-------+----+----------------+------------------+
|summary|name|             age|            height|
+-------+----+----------------+------------------+
|  count|   3|               3|                 3|
|   mean|null|            18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
|    min|Andy|              10|               144|
|    max| Tom|              28|               184|
+-------+----+----------------+------------------+
~~~ # describe: statistics for the specified columns ("*" selects all of them)
scala> ds1.describe("*").show
+-------+----+----------------+------------------+
|summary|name|             age|            height|
+-------+----+----------------+------------------+
|  count|   3|               3|                 3|
|   mean|null|            18.0|164.33333333333334|
| stddev|null|9.16515138991168|20.008331597945226|
|    min|Andy|              10|               144|
|    max| Tom|              28|               184|
+-------+----+----------------+------------------+
### --- 3. Storage-related
~~~ cacheTable, persist, checkpoint, unpersist, cache
~~~ Note: the default storage level of a Dataset is MEMORY_AND_DISK
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> spark.sparkContext.setCheckpointDir("hdfs://hadoop01:9000/checkpoint")
scala> df1.show()
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+
scala> df1.checkpoint()
res36: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df1.cache()
res37: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df1.persist(StorageLevel.MEMORY_ONLY)
21/10/20 15:45:46 WARN CacheManager: Asked to cache already cached data.
res38: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df1.count()
res39: Long = 14

scala> df1.unpersist(true)
res40: df1.type = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df1.createOrReplaceTempView("t1")

scala> spark.catalog.cacheTable("t1")

scala> spark.catalog.uncacheTable("t1")
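~~~ # Note: checkpoint() is eager by default and returns a NEW Dataset whose lineage is
~~~ # truncated to the files under the checkpoint directory; df1 itself is unchanged,
~~~ # so the result must be captured to be useful. A minimal sketch:
val df1Chk = df1.checkpoint()        // capture the returned, checkpointed Dataset
df1Chk.groupBy("DEPTNO").count.show  // downstream work reads from the checkpoint files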
### --- 4. select-related
~~~ the different ways of referring to a column; select, selectExpr
~~~ drop, withColumn, withColumnRenamed, cast (built-in function)
~~~ # A column can be written several ways: "colname", $"colname", 'colname, col("colname"), df("colname")
~~~ # Note: do not mix the forms carelessly; import spark.implicits._ when needed; not every form works everywhere
scala> df1.select($"ename", $"hiredate", $"sal").show
+------+-------------------+----+
| ename|           hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
|  WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
|  KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
|  FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+

scala> df1.select("ename", "hiredate", "sal").show
+------+-------------------+----+
| ename|           hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
|  WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
|  KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
|  FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+

scala> df1.select('ename, 'hiredate, 'sal).show
+------+-------------------+----+
| ename|           hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
|  WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
|  KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
|  FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+

scala> df1.select(col("ename"), col("hiredate"), col("sal")).show
+------+-------------------+----+
| ename|           hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
|  WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
|  KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
|  FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+

scala> df1.select(df1("ename"), df1("hiredate"), df1("sal")).show
+------+-------------------+----+
| ename|           hiredate| sal|
+------+-------------------+----+
| SMITH|2001-01-02 22:12:13| 800|
| ALLEN|2002-01-02 22:12:13|1600|
|  WARD|2003-01-02 22:12:13|1250|
| JONES|2004-01-02 22:12:13|2975|
|MARTIN|2005-01-02 22:12:13|1250|
| BLAKE|2005-04-02 22:12:13|2850|
| CLARK|2006-03-02 22:12:13|2450|
| SCOTT|2007-03-02 22:12:13|3000|
|  KING|2006-03-02 22:12:13|5000|
|TURNER|2009-07-02 22:12:13|1500|
| ADAMS|2010-05-02 22:12:13|1100|
| JAMES|2011-06-02 22:12:13| 950|
|  FORD|2011-07-02 22:12:13|3000|
|MILLER|2012-11-02 22:12:13|1300|
+------+-------------------+----+
~~~ # The following do not work (the other column notations above do):
~~~ # "sal"+100 evaluates to the Scala string "sal100", and "sal+100" is taken as a literal
~~~ # column name; neither column exists in df1
scala> df1.select("ename", "hiredate", "sal"+100).show
scala> df1.select("ename", "hiredate", "sal+100").show
~~~ # These forms are syntactically valid:
scala> df1.select($"ename", $"hiredate", $"sal"+100).show
+------+-------------------+-----------+
| ename|           hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13|        900|
| ALLEN|2002-01-02 22:12:13|       1700|
|  WARD|2003-01-02 22:12:13|       1350|
| JONES|2004-01-02 22:12:13|       3075|
|MARTIN|2005-01-02 22:12:13|       1350|
| BLAKE|2005-04-02 22:12:13|       2950|
| CLARK|2006-03-02 22:12:13|       2550|
| SCOTT|2007-03-02 22:12:13|       3100|
|  KING|2006-03-02 22:12:13|       5100|
|TURNER|2009-07-02 22:12:13|       1600|
| ADAMS|2010-05-02 22:12:13|       1200|
| JAMES|2011-06-02 22:12:13|       1050|
|  FORD|2011-07-02 22:12:13|       3100|
|MILLER|2012-11-02 22:12:13|       1400|
+------+-------------------+-----------+

scala> df1.select('ename, 'hiredate, 'sal+100).show
+------+-------------------+-----------+
| ename|           hiredate|(sal + 100)|
+------+-------------------+-----------+
| SMITH|2001-01-02 22:12:13|        900|
| ALLEN|2002-01-02 22:12:13|       1700|
|  WARD|2003-01-02 22:12:13|       1350|
| JONES|2004-01-02 22:12:13|       3075|
|MARTIN|2005-01-02 22:12:13|       1350|
| BLAKE|2005-04-02 22:12:13|       2950|
| CLARK|2006-03-02 22:12:13|       2550|
| SCOTT|2007-03-02 22:12:13|       3100|
|  KING|2006-03-02 22:12:13|       5100|
|TURNER|2009-07-02 22:12:13|       1600|
| ADAMS|2010-05-02 22:12:13|       1200|
| JAMES|2011-06-02 22:12:13|       1050|
|  FORD|2011-07-02 22:12:13|       3100|
|MILLER|2012-11-02 22:12:13|       1400|
+------+-------------------+-----------+
~~~ # expr() expressions may also be used (inside expr only a quoted SQL string is allowed)
scala> df1.select(expr("comm+100"), expr("sal+100"), expr("ename")).show
+------------+-----------+------+
|(comm + 100)|(sal + 100)| ename|
+------------+-----------+------+
|        null|        900| SMITH|
|         400|       1700| ALLEN|
|         600|       1350|  WARD|
|        null|       3075| JONES|
|        1500|       1350|MARTIN|
|        null|       2950| BLAKE|
|        null|       2550| CLARK|
|        null|       3100| SCOTT|
|        null|       5100|  KING|
|         100|       1600|TURNER|
|        null|       1200| ADAMS|
|        null|       1050| JAMES|
|        null|       3100|  FORD|
|        null|       1400|MILLER|
+------------+-----------+------+

scala> df1.selectExpr("ename as name").show
+------+
|  name|
+------+
| SMITH|
| ALLEN|
|  WARD|
| JONES|
|MARTIN|
| BLAKE|
| CLARK|
| SCOTT|
|  KING|
|TURNER|
| ADAMS|
| JAMES|
|  FORD|
|MILLER|
+------+

scala> df1.selectExpr("power(sal, 2)", "sal").show
+---------------------------------------------+----+
|POWER(CAST(sal AS DOUBLE), CAST(2 AS DOUBLE))| sal|
+---------------------------------------------+----+
|                                     640000.0| 800|
|                                    2560000.0|1600|
|                                    1562500.0|1250|
|                                    8850625.0|2975|
|                                    1562500.0|1250|
|                                    8122500.0|2850|
|                                    6002500.0|2450|
|                                    9000000.0|3000|
|                                        2.5E7|5000|
|                                    2250000.0|1500|
|                                    1210000.0|1100|
|                                     902500.0| 950|
|                                    9000000.0|3000|
|                                    1690000.0|1300|
+---------------------------------------------+----+

scala> df1.selectExpr("round(sal, -3) as newsal", "sal", "ename").show
+------+----+------+
|newsal| sal| ename|
+------+----+------+
|  1000| 800| SMITH|
|  2000|1600| ALLEN|
|  1000|1250|  WARD|
|  3000|2975| JONES|
|  1000|1250|MARTIN|
|  3000|2850| BLAKE|
|  2000|2450| CLARK|
|  3000|3000| SCOTT|
|  5000|5000|  KING|
|  2000|1500|TURNER|
|  1000|1100| ADAMS|
|  1000| 950| JAMES|
|  3000|3000|  FORD|
|  1000|1300|MILLER|
+------+----+------+
~~~ # drop, withColumn, withColumnRenamed, cast
~~~ # drop: remove one or more columns, producing a new DF
scala> df1.drop("mgr")
res42: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 5 more fields]

scala> df1.drop("empno", "mgr")
res43: org.apache.spark.sql.DataFrame = [ENAME: string, JOB: string ... 4 more fields]
~~~ # withColumn: modify a column's values
scala> val df2 = df1.withColumn("sal", $"sal"+1000)
df2: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]

scala> df2.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| sal|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13|1800|null|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|2600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|2250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|3975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|2250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|3850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|3450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|4000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|6000|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|2500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|2100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13|1950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|4000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|2300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # withColumnRenamed: rename a column
~~~ Note: drop, withColumn and withColumnRenamed all return a DF
scala> df1.withColumnRenamed("sal", "newsal")
res45: org.apache.spark.sql.DataFrame = [EMPNO: int, ENAME: string ... 6 more fields]
~~~ # cast: type conversion
scala> df1.selectExpr("cast(empno as string)").printSchema
root
 |-- empno: string (nullable = true)

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> df1.select('empno.cast(StringType)).printSchema
root
 |-- empno: string (nullable = true)
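~~~ # A common pattern (sketch): cast a column in place with withColumn, keeping all the other columns
df1.withColumn("empno", $"empno".cast(StringType)).printSchema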
### --- 5. where-related
~~~ where == filter
~~~ # filter
scala> df1.filter("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.filter("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME|    JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
+-----+-----+-------+----+-------------------+----+----+------+
~~~ # where (identical to filter)
scala> df1.where("sal>1000").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.where("sal>1000 and job=='MANAGER'").show
+-----+-----+-------+----+-------------------+----+----+------+
|EMPNO|ENAME|    JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+-----+-------+----+-------------------+----+----+------+
| 7566|JONES|MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698|BLAKE|MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782|CLARK|MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
+-----+-----+-------+----+-------------------+----+----+------+
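~~~ # The same predicates can also be written as Column expressions instead of SQL strings;
~~~ # note the &&, === and > operators on Columns (sketch):
df1.filter($"sal" > 1000 && $"job" === "MANAGER").show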
### --- 6. groupBy-related
~~~ groupBy, agg, max, min, avg, sum, count (the last five are built-in functions)
~~~ # groupBy with sum, max, min, avg, count (note: this count differs from df1.count)
scala> df1.groupBy("Job").sum("sal").show
+---------+--------+
|      Job|sum(sal)|
+---------+--------+
|  ANALYST|    6000|
| SALESMAN|    5600|
|    CLERK|    4150|
|  MANAGER|    8275|
|PRESIDENT|    5000|
+---------+--------+

scala> df1.groupBy("Job").max("sal").show
+---------+--------+
|      Job|max(sal)|
+---------+--------+
|  ANALYST|    3000|
| SALESMAN|    1600|
|    CLERK|    1300|
|  MANAGER|    2975|
|PRESIDENT|    5000|
+---------+--------+

scala> df1.groupBy("Job").min("sal").show
+---------+--------+
|      Job|min(sal)|
+---------+--------+
|  ANALYST|    3000|
| SALESMAN|    1250|
|    CLERK|     800|
|  MANAGER|    2450|
|PRESIDENT|    5000|
+---------+--------+

scala> df1.groupBy("Job").avg("sal").show
+---------+------------------+
|      Job|          avg(sal)|
+---------+------------------+
|  ANALYST|            3000.0|
| SALESMAN|            1400.0|
|    CLERK|            1037.5|
|  MANAGER|2758.3333333333335|
|PRESIDENT|            5000.0|
+---------+------------------+

scala> df1.groupBy("Job").count.show
+---------+-----+
|      Job|count|
+---------+-----+
|  ANALYST|    2|
| SALESMAN|    4|
|    CLERK|    4|
|  MANAGER|    3|
|PRESIDENT|    1|
+---------+-----+
~~~ # Equivalent of a having clause
scala> df1.groupBy("Job").avg("sal").where("avg(sal) > 2000").show
+---------+------------------+
|      Job|          avg(sal)|
+---------+------------------+
|  ANALYST|            3000.0|
|  MANAGER|2758.3333333333335|
|PRESIDENT|            5000.0|
+---------+------------------+

scala> df1.groupBy("Job").avg("sal").where($"avg(sal)" > 2000).show
+---------+------------------+
|      Job|          avg(sal)|
+---------+------------------+
|  ANALYST|            3000.0|
|  MANAGER|2758.3333333333335|
|PRESIDENT|            5000.0|
+---------+------------------+
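~~~ # Aliasing the aggregate first keeps the "having" condition readable (sketch):
df1.groupBy("Job").agg(avg("sal").as("avgsal")).where($"avgsal" > 2000).show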
~~~ # agg
scala> df1.groupBy("Job").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+---------+--------+--------+------------------+--------+----------+
|      Job|max(sal)|min(sal)|          avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
|  ANALYST|    3000|    3000|            3000.0|    6000|         2|
| SALESMAN|    1600|    1250|            1400.0|    5600|         4|
|    CLERK|    1300|     800|            1037.5|    4150|         4|
|  MANAGER|    2975|    2450|2758.3333333333335|    8275|         3|
|PRESIDENT|    5000|    5000|            5000.0|    5000|         1|
+---------+--------+--------+------------------+--------+----------+

scala> df1.groupBy("deptno").agg("sal"->"max", "sal"->"min", "sal"->"avg", "sal"->"sum", "sal"->"count").show
+------+--------+--------+------------------+--------+----------+
|deptno|max(sal)|min(sal)|          avg(sal)|sum(sal)|count(sal)|
+------+--------+--------+------------------+--------+----------+
|    20|    3000|     800|            2175.0|   10875|         5|
|    10|    5000|    1300|2916.6666666666665|    8750|         3|
|    30|    2850|     950|1566.6666666666667|    9400|         6|
+------+--------+--------+------------------+--------+----------+
~~~ # This form is easier to read
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"), sum("sal"), count("sal")).show
+---------+--------+--------+------------------+--------+----------+
|      Job|max(sal)|min(sal)|          avg(sal)|sum(sal)|count(sal)|
+---------+--------+--------+------------------+--------+----------+
|  ANALYST|    3000|    3000|            3000.0|    6000|         2|
| SALESMAN|    1600|    1250|            1400.0|    5600|         4|
|    CLERK|    1300|     800|            1037.5|    4150|         4|
|  MANAGER|    2975|    2450|2758.3333333333335|    8275|         3|
|PRESIDENT|    5000|    5000|            5000.0|    5000|         1|
+---------+--------+--------+------------------+--------+----------+
~~~ # Renaming a result column
scala> df1.groupBy("Job").agg(max("sal"), min("sal"), avg("sal"),
     |   sum("sal"), count("sal")).withColumnRenamed("min(sal)",
     |   "min1").show
+---------+--------+----+------------------+--------+----------+
|      Job|max(sal)|min1|          avg(sal)|sum(sal)|count(sal)|
+---------+--------+----+------------------+--------+----------+
|  ANALYST|    3000|3000|            3000.0|    6000|         2|
| SALESMAN|    1600|1250|            1400.0|    5600|         4|
|    CLERK|    1300| 800|            1037.5|    4150|         4|
|  MANAGER|    2975|2450|2758.3333333333335|    8275|         3|
|PRESIDENT|    5000|5000|            5000.0|    5000|         1|
+---------+--------+----+------------------+--------+----------+
~~~ # Renaming result columns with as(): the most convenient form
scala> df1.groupBy("Job").agg(max("sal").as("max1"),
     |   min("sal").as("min2"), avg("sal").as("avg3"),
     |   sum("sal").as("sum4"), count("sal").as("count5")).show
+---------+----+----+------------------+----+------+
|      Job|max1|min2|              avg3|sum4|count5|
+---------+----+----+------------------+----+------+
|  ANALYST|3000|3000|            3000.0|6000|     2|
| SALESMAN|1600|1250|            1400.0|5600|     4|
|    CLERK|1300| 800|            1037.5|4150|     4|
|  MANAGER|2975|2450|2758.3333333333335|8275|     3|
|PRESIDENT|5000|5000|            5000.0|5000|     1|
+---------+----+----+------------------+----+------+
### --- 7. orderBy-related
~~~ orderBy == sort
~~~ # orderBy
scala> df1.orderBy("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

~~~ # Descending order
scala> df1.orderBy(-$"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.orderBy(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # sort: the following statements are all equivalent
scala> df1.sort("sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort($"sal").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort($"sal".asc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort('sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort(col("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort(df1("sal")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort($"sal".desc).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort(-'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.sort(-'deptno, -'sal).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+
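~~~ # The asc/desc built-in functions are yet another equivalent notation (sketch):
df1.orderBy(desc("deptno"), desc("sal")).show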
### --- 8. join-related
~~~ # 1. Cartesian product (14 x 14 = 196 rows)
df1.crossJoin(df1).count
~~~ # 2. Equi-join on a single field (the join column empno appears only once in the result)
df1.join(df1, "empno").count
~~~ # 3. Equi-join on multiple fields (the join columns empno and ename appear only once in the result)
df1.join(df1, Seq("empno", "ename")).show
~~~ # Define the first dataset
case class StudentAge(sno: Int, name: String, age: Int)
val lst = List(
  StudentAge(1, "Alice", 18),
  StudentAge(2, "Andy", 19),
  StudentAge(3, "Bob", 17),
  StudentAge(4, "Justin", 21),
  StudentAge(5, "Cindy", 20)
)
val ds1 = spark.createDataset(lst)
ds1.show()
~~~ # Define the second dataset
case class StudentHeight(sname: String, height: Int)
val rdd = sc.makeRDD(List(
  StudentHeight("Alice", 160),
  StudentHeight("Andy", 159),
  StudentHeight("Bob", 170),
  StudentHeight("Cindy", 165),
  StudentHeight("Rose", 160)
))
val ds2 = rdd.toDS
~~~ # Note: plain double-quoted column names do not work here, and the operator is ===
~~~ # (the last line referenced ds1("sname"), which does not exist; ds1's column is "name")
ds1.join(ds2, $"name" === $"sname").show
ds1.join(ds2, 'name === 'sname).show
ds1.join(ds2, ds1("name") === ds2("sname")).show
ds1.join(ds2, ds1("name") === ds2("sname"), "inner").show
~~~ # The various join types
ds1.join(ds2, $"name"===$"sname").show
ds1.join(ds2, $"name"===$"sname", "inner").show
ds1.join(ds2, $"name"===$"sname", "left").show
ds1.join(ds2, $"name"===$"sname", "left_outer").show
ds1.join(ds2, $"name"===$"sname", "right").show
ds1.join(ds2, $"name"===$"sname", "right_outer").show
ds1.join(ds2, $"name"===$"sname", "outer").show
ds1.join(ds2, $"name"===$"sname", "full").show
ds1.join(ds2, $"name"===$"sname", "full_outer").show
~~~ # Note: a DS becomes a DF after a join
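~~~ # With a Column join condition both key columns survive in the joined DF;
~~~ # dropping the redundant one is a common follow-up (sketch):
ds1.join(ds2, $"name" === $"sname", "inner").drop("sname").show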
### --- 9. Set operations
~~~ union == unionAll (deprecated); intersect, except
~~~ # union, unionAll, intersect, except: set union, intersection and difference
scala> val ds3 = ds1.select("name")
scala> val ds4 = ds2.select("sname")
~~~ # union: set union, without deduplication
scala> ds3.union(ds4).show
~~~ # unionAll is equivalent to union; it is deprecated and not recommended
scala> ds3.unionAll(ds4).show
~~~ # intersect: set intersection
scala> ds3.intersect(ds4).show
~~~ # except: set difference
scala> ds3.except(ds4).show
### --- 10. Null handling
~~~ na.fill, na.drop
~~~ # NaN (Not a Number)
scala> math.sqrt(-1.0)
res90: Double = NaN

scala> math.sqrt(-1.0).isNaN()
res91: Boolean = true

scala> df1.show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # Drop rows with null or NaN in any column
scala> df1.na.drop.show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME|     JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
+-----+------+--------+----+-------------------+----+----+------+
~~~ # Drop rows with null or NaN in the given column(s)
scala> df1.na.drop(Array("mgr")).show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME|     JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7369| SMITH|   CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES| MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE| MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK| MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT| ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|   CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|   CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD| ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|   CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+--------+----+-------------------+----+----+------+
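~~~ # null and NaN are distinct values, but the na.* methods treat both as missing.
~~~ # A throwaway sketch (nanDF and its columns are made up for illustration; isnan comes
~~~ # from org.apache.spark.sql.functions):
val nanDF = Seq(("a", 1.0), ("b", Double.NaN), ("c", 2.0)).toDF("k", "v")
nanDF.filter(isnan($"v")).show   // only the NaN row
nanDF.na.drop.show               // the NaN row is dropped, just like null rows would be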
~~~ # Fill all columns; fill a single column; fill several columns
scala> df1.na.fill(1000).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|1000|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|1000|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|1000|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|1000|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|1000|    20|
| 7839|  KING|PRESIDENT|1000|2006-03-02 22:12:13|5000|1000|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|1000|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|1000|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|1000|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|1000|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.na.fill(1000, Array("comm")).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|1000|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|1000|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|1000|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|1000|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|1000|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|1000|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|1000|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|1000|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|1000|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|1000|    10|
+-----+------+---------+----+-------------------+----+----+------+

scala> df1.na.fill(Map("mgr"->2000, "comm"->1000)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|1000|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|1000|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|1000|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|1000|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|1000|    20|
| 7839|  KING|PRESIDENT|2000|2006-03-02 22:12:13|5000|1000|    10|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|1000|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|1000|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|1000|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|1000|    10|
+-----+------+---------+----+-------------------+----+----+------+
~~~ # Replace specific values in the given columns
scala> df1.na.replace("comm" :: "deptno" :: Nil, Map(0 -> 100, 10 -> 100)).show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7499| ALLEN| SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7654|MARTIN| SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|   100|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|   100|
| 7844|TURNER| SALESMAN|7698|2009-07-02 22:12:13|1500| 100|    30|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|   100|
+-----+------+---------+----+-------------------+----+----+------+
~~~     # Query null / non-null values; isNull and isNotNull are built-in functions
scala> df1.filter("comm is null").show
+-----+------+---------+----+-------------------+----+----+------+
|EMPNO| ENAME|      JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+---------+----+-------------------+----+----+------+
| 7369| SMITH|    CLERK|7902|2001-01-02 22:12:13| 800|null|    20|
| 7566| JONES|  MANAGER|7839|2004-01-02 22:12:13|2975|null|    20|
| 7698| BLAKE|  MANAGER|7839|2005-04-02 22:12:13|2850|null|    30|
| 7782| CLARK|  MANAGER|7839|2006-03-02 22:12:13|2450|null|    10|
| 7788| SCOTT|  ANALYST|7566|2007-03-02 22:12:13|3000|null|    20|
| 7839|  KING|PRESIDENT|null|2006-03-02 22:12:13|5000|null|    10|
| 7876| ADAMS|    CLERK|7788|2010-05-02 22:12:13|1100|null|    20|
| 7900| JAMES|    CLERK|7698|2011-06-02 22:12:13| 950|null|    30|
| 7902|  FORD|  ANALYST|7566|2011-07-02 22:12:13|3000|null|    20|
| 7934|MILLER|    CLERK|7782|2012-11-02 22:12:13|1300|null|    10|
+-----+------+---------+----+-------------------+----+----+------+

~~~     # The Column-based forms are equivalent and print the same table
scala> df1.filter($"comm".isNull).show
scala> df1.filter(col("comm").isNull).show

scala> df1.filter("comm is not null").show
+-----+------+--------+----+-------------------+----+----+------+
|EMPNO| ENAME|     JOB| MGR|           HIREDATE| SAL|COMM|DEPTNO|
+-----+------+--------+----+-------------------+----+----+------+
| 7499| ALLEN|SALESMAN|7698|2002-01-02 22:12:13|1600| 300|    30|
| 7521|  WARD|SALESMAN|7698|2003-01-02 22:12:13|1250| 500|    30|
| 7654|MARTIN|SALESMAN|7698|2005-01-02 22:12:13|1250|1400|    30|
| 7844|TURNER|SALESMAN|7698|2009-07-02 22:12:13|1500|   0|    30|
+-----+------+--------+----+-------------------+----+----+------+

~~~     # Equivalent Column-based form, same output
scala> df1.filter(col("comm").isNotNull).show
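~~~     # Beyond filtering, nulls can also be replaced or dropped through the
~~~     # DataFrameNaFunctions API; a minimal sketch on the same df1
~~~     # (the replacement value 0 is our choice, not from the original text):
scala> df1.na.fill(0, Seq("comm")).show                              // replace null comm with 0
scala> df1.na.drop(Seq("comm")).count                                // drop rows whose comm is null
scala> df1.select($"ename", coalesce($"comm", lit(0)).as("comm0")).show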
### --- 11. Window functions
~~~     Window functions are usually awkward to express in the DSL; plain SQL is more convenient.
~~~     For reference, see the sources Window.scala and WindowSpec.scala (the latter is the main one).
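~~~     # The window examples below operate on a click-log DataFrame df with the
~~~     # columns cookieid / createtime / pv, which the text never defines. A minimal
~~~     # sketch of one way to build such a DataFrame (the sample values are ours):
scala> val df = Seq(("cookie1","2020-04-10",1), ("cookie1","2020-04-11",5), ("cookie1","2020-04-12",7), ("cookie2","2020-04-10",2), ("cookie2","2020-04-11",3), ("cookie2","2020-04-12",4)).toDF("cookieid","createtime","pv")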
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val w1 = Window.partitionBy("cookieid").orderBy("createtime")
w1: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@32a97ef8

scala> val w2 = Window.partitionBy("cookieid").orderBy("pv")
w2: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@dac633

scala> val w3 = w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)
w3: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@2b1e2939

scala> val w4 = w1.rowsBetween(-1, 1)
w4: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@217d9e78
~~~     # Aggregate functions over windows (on the analytic-function dataset):
~~~     # w1 gives a running total in createtime order, w3 makes the
~~~     # unbounded-preceding-to-current-row frame explicit, and w4 sums a
~~~     # sliding frame of the previous, current and next row
scala> df.select($"cookieid", $"pv", sum("pv").over(w1).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w3).alias("pv1")).show
scala> df.select($"cookieid", $"pv", sum("pv").over(w4).as("pv1")).show
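~~~     # As noted above, the same running total is often clearer in plain SQL.
~~~     # A sketch, assuming df is registered under the view name t2 (the name is ours):
scala> df.createOrReplaceTempView("t2")
scala> spark.sql("select cookieid, pv, sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1 from t2").show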
~~~     # Ranking: rank leaves gaps after ties, dense_rank does not,
~~~     # row_number is always gap-free and unique
scala> df.select($"cookieid", $"pv", rank().over(w2).alias("rank")).show
scala> df.select($"cookieid", $"pv", dense_rank().over(w2).alias("denserank")).show
scala> df.select($"cookieid", $"pv", row_number().over(w2).alias("rownumber")).show
~~~     # lag / lead: a negative offset to lag looks forward, i.e. behaves like lead
scala> df.select($"cookieid", $"pv", lag("pv", 2).over(w2).alias("lag2")).show
scala> df.select($"cookieid", $"pv", lag("pv", -2).over(w2).alias("lead2")).show
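~~~     # A common use of lag is the change between consecutive rows; a minimal
~~~     # sketch over the same data (the alias delta and the default value 0 are ours):
scala> df.select($"cookieid", $"createtime", $"pv", ($"pv" - lag("pv", 1, 0).over(w1)).alias("delta")).show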
### --- 12. Built-in functions
~~~     The full list is documented at http://spark.apache.org/docs/latest/api/sql/index.html

二、Code implementation
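~~~     # A small taste of the built-in functions, applied to df1; all of these are
~~~     # standard Spark SQL functions, only the column aliases are ours:
scala> df1.select($"ename", lower($"ename").as("lname"), year($"hiredate").as("hire_year"), datediff(current_date(), $"hiredate").as("days_employed")).show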
### --- Code implementation
package cn.yanqi.sparksql

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object TransformationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Demo1")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("warn")
    import spark.implicits._

    val df1: DataFrame = spark.read
      .option("header", "true")
      .option("inferschema", "true")
      .csv("data/emp.dat")

//    df1.printSchema()
//    df1.map(row => row.getAs[Int](0)).show
//
//    // randomSplit (like the RDD version: split the DF/DS by the given weights)
//    val Array(dfx, dfy, dfz) = df1.randomSplit(Array(0.5, 0.6, 0.7))
//    dfx.count
//    dfy.count
//    dfz.count
//
//    // take 10 rows as a new Dataset
//    val df2 = df1.limit(10)
//
//    // distinct: deduplicate
//    val df3 = df1.union(df1)
//    df3.distinct.count
//
//    // dropDuplicates: deduplicate by column values
//    df2.dropDuplicates.show
//    df2.dropDuplicates("mgr", "deptno").show
//    df2.dropDuplicates("mgr").show
//    df2.dropDuplicates("deptno").show
//
//    // statistics for all columns (count, mean, stddev, min, max)
//    df1.describe().show
//
//    // statistics for the given columns
//    df1.describe("sal").show
//    df1.describe("sal", "comm").show
//
//    df1.createOrReplaceTempView("t1")
//    spark.sql("select * from t1").show
//    spark.catalog.cacheTable("t1")
//    spark.catalog.uncacheTable("t1")

    df1.groupBy("Job")
      .agg(min("sal").as("minsal"), max("sal").as("maxsal"))
      .where($"minsal" > 2000)
      .show

    val lst = List(
      StudentAge(1, "Alice", 18),
      StudentAge(2, "Andy", 19),
      StudentAge(3, "Bob", 17),
      StudentAge(4, "Justin", 21),
      StudentAge(5, "Cindy", 20))
    val ds1 = spark.createDataset(lst)
    ds1.show()

    // define the second dataset
    val rdd = sc.makeRDD(List(
      StudentHeight("Alice", 160),
      StudentHeight("Andy", 159),
      StudentHeight("Bob", 170),
      StudentHeight("Cindy", 165),
      StudentHeight("Rose", 160)))
    val ds2 = rdd.toDS

    spark.close()
  }
}

case class StudentAge(sno: Int, sname: String, age: Int)
case class StudentHeight(sname: String, height: Int)
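~~~     # The two Datasets above are built but never combined. A minimal sketch of
~~~     # the join they appear to be set up for (the join itself is our addition):
    // inner join on the shared sname column; Justin and Rose have no match and drop out
    val joined = ds1.join(ds2, "sname")
    joined.show()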
### --- Build and run
~~~     # Prepare the data file: data/emp.dat
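~~~     # The file layout is not shown; judging from the inferred schema and the
~~~     # rows printed earlier, a header CSV along these lines would reproduce the
~~~     # examples (only the first three data rows are listed):
EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902,2001-01-02 22:12:13,800,,20
7499,ALLEN,SALESMAN,7698,2002-01-02 22:12:13,1600,300,30
7521,WARD,SALESMAN,7698,2003-01-02 22:12:13,1250,500,30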
~~~     # Run output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+---------+------+------+
|      Job|minsal|maxsal|
+---------+------+------+
|  ANALYST|  3000|  3000|
|  MANAGER|  2450|  2975|
|PRESIDENT|  5000|  5000|
+---------+------+------+

+---+------+---+
|sno| sname|age|
+---+------+---+
|  1| Alice| 18|
|  2|  Andy| 19|
|  3|   Bob| 17|
|  4|Justin| 21|
|  5| Cindy| 20|
+---+------+---+

Process finished with exit code 0
===============================END===============================
Walter Savage Landor: I strove with none, for none was worth my strife. Nature I loved and, next to Nature, Art: I warm'd both hands before the fire of life; It sinks, and I am ready to depart. ——W.S. Landor