While testing a previously written piece of code against real data, I ran into a very strange bug: the SQL looked correct, yet when filtering on index_value>0 there were clearly 4 rows greater than zero, but the query returned only 2. The real data set was too large to verify against conveniently, so I wrote a demo to reproduce the problem.
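The demo relies on a simple JavaBean named Index. Its definition is not shown in this post; judging from the constructor calls and the output columns, it is presumably something like the sketch below. Note that both fields are declared as String, which turns out to matter for the comparison later.

// Assumed shape of the Index bean used by the demo (not the original source):
// both fields are Strings, so indexValue is inferred as a string column.
public class Index implements java.io.Serializable {
    private String indexCode;
    private String indexValue;

    public Index() {}

    public Index(String indexCode, String indexValue) {
        this.indexCode = indexCode;
        this.indexValue = indexValue;
    }

    public String getIndexCode() { return indexCode; }
    public void setIndexCode(String indexCode) { this.indexCode = indexCode; }
    public String getIndexValue() { return indexValue; }
    public void setIndexValue(String indexValue) { this.indexValue = indexValue; }
}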
public static void main(String[] args) {
    Date startTime = new Date(System.currentTimeMillis());
    SparkUtil sparkUtil = SparkUtil.getInstance();
    JavaSparkContext sparkContext = sparkUtil.getSparkContext(null, "f251");
    HiveContext hiveContext = sparkUtil.getHiveContext(sparkContext.sc());
    // Four sample rows; indexValue is stored as a String.
    List<Index> indexList = new ArrayList<>();
    indexList.add(new Index("F005", "100"));
    indexList.add(new Index("F005", "0.1"));
    indexList.add(new Index("F005", "5"));
    indexList.add(new Index("F005", "0.6"));
    JavaRDD<Index> statMonth0100Rdd = sparkContext.parallelize(indexList);
    Dataset<Row> dataSet = hiveContext.createDataFrame(statMonth0100Rdd, Index.class);
    dataSet.show();
    // Register the DataFrame as a temporary table so it can be queried with SQL.
    dataSet.toDF().registerTempTable("F005TB");
    // Filter the string column against the integer literal 0.
    String sql = "select * from F005TB where indexValue>0";
    Dataset<Row> groupPowerDS = sparkUtil.executeSqlHive(hiveContext, sql);
    groupPowerDS.show();
}
The query result is:
+---------+----------+
|indexCode|indexValue|
+---------+----------+
| F005| 100|
| F005| 5|
+---------+----------+
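A plausible explanation (not confirmed here) is that the analyzer implicitly casts the string column to match the integer literal, in which case values such as 0.1 and 0.6 would no longer survive the filter. One way to check is to print the schema and the analyzed plan; a minimal sketch using the same dataSet and hiveContext as above:

// Confirm that indexValue was inferred as a string column.
dataSet.printSchema();
// Print the analyzed/optimized plans; an implicit cast inserted around
// indexValue by the analyzer would show up here.
hiveContext.sql("select * from F005TB where indexValue>0").explain(true);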
Suspecting a type-conversion problem, I modified the code as follows:
public static void main(String[] args) {
    Date startTime = new Date(System.currentTimeMillis());
    SparkUtil sparkUtil = SparkUtil.getInstance();
    JavaSparkContext sparkContext = sparkUtil.getSparkContext(null, "f251");
    HiveContext hiveContext = sparkUtil.getHiveContext(sparkContext.sc());
    List<Index> indexList = new ArrayList<>();
    indexList.add(new Index("F005", "100"));
    indexList.add(new Index("F005", "0.1"));
    indexList.add(new Index("F005", "5"));
    indexList.add(new Index("F005", "0.6"));
    JavaRDD<Index> statMonth0100Rdd = sparkContext.parallelize(indexList);
    Dataset<Row> dataSet = hiveContext.createDataFrame(statMonth0100Rdd, Index.class);
    dataSet.show();
    dataSet.toDF().registerTempTable("F005TB");
    // Explicitly cast the string column to float before comparing.
    String sql = "select * from F005TB where cast(indexValue AS float) > 0";
    Dataset<Row> groupPowerDS = sparkUtil.executeSqlHive(hiveContext, sql);
    groupPowerDS.show();
}
The result is:
+---------+----------+
|indexCode|indexValue|
+---------+----------+
| F005| 100|
| F005| 0.1|
| F005| 5|
| F005| 0.6|
+---------+----------+
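The same explicit cast can also be written with the DataFrame column API instead of embedding it in the SQL string; a minimal sketch against the same dataSet, using org.apache.spark.sql.functions.col:

// Cast the string column to float before comparing, mirroring the SQL cast above.
Dataset<Row> castFiltered = dataSet.filter(
        org.apache.spark.sql.functions.col("indexValue").cast("float").gt(0));
castFiltered.show();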
I then modified the code again as follows:
public static void main(String[] args) {
    Date startTime = new Date(System.currentTimeMillis());
    SparkUtil sparkUtil = SparkUtil.getInstance();
    JavaSparkContext sparkContext = sparkUtil.getSparkContext(null, "f251");
    HiveContext hiveContext = sparkUtil.getHiveContext(sparkContext.sc());
    List<Index> indexList = new ArrayList<>();
    indexList.add(new Index("F005", "100"));
    indexList.add(new Index("F005", "0.1"));
    indexList.add(new Index("F005", "5"));
    indexList.add(new Index("F005", "0.6"));
    JavaRDD<Index> statMonth0100Rdd = sparkContext.parallelize(indexList);
    Dataset<Row> dataSet = hiveContext.createDataFrame(statMonth0100Rdd, Index.class);
    dataSet.show();
    dataSet.toDF().registerTempTable("F005TB");
    // Compare against the floating-point literal 0.0 instead of the integer 0.
    String sql = "select * from F005TB where indexValue>0.0";
    Dataset<Row> groupPowerDS = sparkUtil.executeSqlHive(hiveContext, sql);
    groupPowerDS.show();
}
The result is:
+---------+----------+
|indexCode|indexValue|
+---------+----------+
| F005| 100|
| F005| 0.1|
| F005| 5|
| F005| 0.6|
+---------+----------+
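For anyone who wants to dig into the mechanism further, the comparison can be reproduced without the SparkUtil wrapper. Below is a minimal standalone sketch using a plain SparkSession in local mode, with the same assumed Index bean; no row counts are asserted here, since the implicit-cast behavior may differ across Spark versions.

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IndexFilterDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("index-filter-demo")
                .master("local[*]")
                .getOrCreate();

        // Same four sample rows as the demo above; indexValue is a String.
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                new Index("F005", "100"),
                new Index("F005", "0.1"),
                new Index("F005", "5"),
                new Index("F005", "0.6")), Index.class);
        df.createOrReplaceTempView("F005TB");

        // Compare the three predicates side by side on the Spark version in use.
        spark.sql("select * from F005TB where indexValue > 0").show();
        spark.sql("select * from F005TB where cast(indexValue as float) > 0").show();
        spark.sql("select * from F005TB where indexValue > 0.0").show();

        spark.stop();
    }
}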
This problem does not occur when querying an actual table.
The exact cause and mechanism remain to be investigated in depth.
Copyright notice: This is an original article by chishuazhi5205, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when republishing.