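1 Requirements
Crawl Baidu/Tencent street view imagery for the whole of Hangzhou, including the "time machine" feature (current and historical images), for later image analysis.
2 Analysis
In Baidu Maps' street view mode, clicking forward shows that the street view images are loaded asynchronously. Open street view mode, press F12 to open the developer tools, clear the recorded requests, then click forward: a large number of image requests appear.
2.1 A brief look at the street view requests
This article uses the street view near Haichuang Park on Wenyi West Road, Yuhang District, Hangzhou (heading west to east) as the example and goes through what each request does.
First request: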
https://mapsv0.bdimg.com/?qt=qsdata&x=13361258.73664768&y=3518572.1440338427&time=201709&mode=day&px=1336124156&py=351856780&pz=14.89&roadid=eeaa41-bd79-2cf9-3491-2ca35d&type=street&action=0&pc=1&auth=8Nv8G8DBLbO2F1vLvHNAgXOTz4UFPbxHuxHLxBLVHTRt1qo6DF%3D%3DCcvY1SGpuztAFwWv1GgvPUDZYOYIZuVt1cv3uHxtOmm0mEb1PWv3GuxNVt%3DErpTgZp1GHJMP6V8%40aDcEWe1GD8zv7u%40ZPuVteuEthjzgjyBKOBEEUWWOxtx77INHu%3D%3D8x35&udt=20190619&fn=jsonp.p30899897
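The request can be shortened to the following (for brevity, none of the requests below carry the auth parameter; this does not affect the result):
https://mapsv0.bdimg.com/?qt=qsdata&x=13361258.73664768&y=3518572.1440338427&time=201709&mode=day
Purpose: turn the position clicked on the map into Baidu map coordinates x and y and fetch the server's JSON response. The id in that response is the Baidu street view ID of the position, i.e. the panoID used in everything that follows.
Second request: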
https://mapsv0.bdimg.com/?qt=sdata&sid=09025200121709031616142855K&pc=1
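Purpose: this request carries the panoID (as the sid parameter) and returns street view metadata for the position (nearby panoIDs, the historical images at this position, capture dates and so on). It is analysed in detail below. The JSON response is shown in the figure below:
Third request: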
https://mapsv0.bdimg.com/?qt=guide&sid=09025200121709031616142855K&fn=jsonp29109114
Purpose: fetch the companies or schools near the position. The JSON response is shown in the figure below:
Fourth request:
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=0_0&z=1&udt=20190619
Purpose: this request returns the panorama image of the position, as shown below:
Fifth request:
https://mapsv0.bdimg.com/?qt=pr3d&fovy=35&quality=80&panoid=09025200121709031616142855K&heading=72.801&pitch=0&width=198&height=108
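Purpose: this request returns a small image of the current viewing angle, as shown in the figure:
For the meaning of the parameters I found an image online; some of the parameters I have not verified myself.
The small image turns blurry when enlarged, so it is not downloaded.
Sixth request (a group of large-image requests):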
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=1_4&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=2_4&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=1_5&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=2_5&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=1_6&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=2_6&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=1_7&z=4
https://mapsv0.bdimg.com/?qt=pdata&sid=09025200121709031616142855K&pos=2_7&z=4
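Purpose: fetch the large street view tiles from left to right. pos=1_* is the high-angle row (the one we want): four tiles from left to right, which together make up the full image seen in Baidu street view. pos=2_* is the low-angle row.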
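To make "four tiles from left to right make up the full image" concrete, here is a minimal sketch of joining already-downloaded tiles into one row. TileStitcher and stitchRow are hypothetical names, not part of the original project, and the sketch assumes the tiles are passed in left-to-right order (pos=1_4 through pos=1_7 in the example above).

```java
import java.awt.image.BufferedImage;
import java.util.List;

final class TileStitcher {
    /** Pastes the tiles side by side in list order and returns the combined image. */
    static BufferedImage stitchRow(List<BufferedImage> tiles) {
        if (tiles.isEmpty()) {
            throw new IllegalArgumentException("no tiles to stitch");
        }
        int width = 0;
        int height = 0;
        for (BufferedImage tile : tiles) {
            width += tile.getWidth();
            height = Math.max(height, tile.getHeight());
        }
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int x = 0;
        for (BufferedImage tile : tiles) {
            int w = tile.getWidth();
            int h = tile.getHeight();
            // Copy the tile's pixels into the output, starting at the current x offset.
            int[] pixels = tile.getRGB(0, 0, w, h, null, 0, w);
            out.setRGB(x, 0, w, h, pixels, 0, w);
            x += w;
        }
        return out;
    }
}
```

Tiles read with javax.imageio.ImageIO.read can be passed in directly, and the result written back out with ImageIO.write.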
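So to get the street view images of a city or a street, we first need the street view IDs of every position in that city or on that street, and then replay the image requests above. But how do we get the street view IDs for every street in Hangzhou?
Searching online turned up these suggestions:
- 1. Brute-force loop over panoIDs: ignore the errors and keep whatever succeeds.
- 2. Find one seed panoID on a road, then crawl all images along that road.
- 3. Define a region in Baidu map coordinates and traverse every coordinate in it: keep the panoID when the lookup succeeds, skip it when it fails.
At first glance these sound reasonable, but on closer analysis they do not hold up. Take idea 1: brute-force looping is very inefficient. Baidu offers no API for the coordinates along a whole street, nor one that returns all panoIDs of a street (at least I could not find one), so guessing the next panoID by trial and error from a given starting panoID would take tens or hundreds of thousands of requests, and Baidu would very likely ban the IP before the next ID turned up. As for idea 3, I did want to define a region, but how do you traverse every coordinate in it, and how do you guarantee that the images belong to the same street and are consecutive? If anyone has a solution, please let me know. That leaves idea 2, which looks workable; the key is how to get the next panoID. The answer is in "request 2" above.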
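To put a rough number on the cost of idea 3, the sketch below counts how many lookups a regular grid over a bounding box would need. GridProbe and pointCount are hypothetical names, and the sketch assumes the x/y values of request 1 are planar coordinates measured in meters, which the values suggest but which was not verified here.

```java
final class GridProbe {
    /** Number of sample points on a regular grid covering [minX, maxX] x [minY, maxY]. */
    static long pointCount(double minX, double minY, double maxX, double maxY, double stepMeters) {
        if (stepMeters <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        // One point at the start of each axis plus one per full step.
        long cols = (long) Math.floor((maxX - minX) / stepMeters) + 1;
        long rows = (long) Math.floor((maxY - minY) / stepMeters) + 1;
        return cols * rows;
    }
}
```

Under that assumption, a 10 km by 10 km box sampled every 10 m already needs about a million request-1 lookups, most of which would land on buildings rather than roads.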
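2.2 Detailed analysis of request 2
We can see that Roads[0].Panos holds several panoIDs. Clicking forward about 10 meters in street view mode makes the browser load the images of the next panoID (09025200121709031616155125K). There is a pattern: moving forward about 20 meters loads the panoID after next, and so on.
Clicking forward again until the images of panoID 09025200121709031616211215K are loaded, the JSON response of its "request 2" looks like the figure below:
As the figure shows, Roads[0] describes the road of the current position (IsCurrent: 1). Roads is an array that usually holds further elements as candidate roads; here there is only one, Roads[1], and Roads[1].ID matches Links[0].RID. Clicking forward again loads exactly the panoID visible in the figure: 09025200121709031616224035K. Looking at the "request 2" response for 09025200121709031616224035K, we again find a set of forward panoIDs for the new position. A few more clicks confirm the pattern. The furrowed brow relaxes; time for a coffee.
Back from coffee, let's go over the idea once more. Given a starting panoID, how do we crawl every street view image on its street?
The workflow is roughly:
- Choose a suitable starting panoID (it can be stored in a config file or a database);
- Build "request 2" from the panoID, send it and read the response;
- Parse the JSON response to get the set of forward panoIDs near the current position;
- Iterate over the set, build the image download links for each position (4 images) and download them;
- On reaching the last element of the set, also parse the non-current elements of the Roads array and map them to the panoIDs in the Links array; we will call such a panoID an anchor node;
- Feed the anchor node's panoID back into step 2, which turns the whole procedure into a recursive function.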
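The steps above can be sketched as a loop over road segments. This is a minimal sketch rather than the project's implementation: RoadWalker, Segment and fetch are hypothetical names, fetch stands in for sending "request 2" and parsing its response, and the set of visited IDs is an addition that stops the walk when an anchor leads back to a pano that was already seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class RoadWalker {
    /** What one "request 2" response boils down to for the traversal (hypothetical type). */
    static final class Segment {
        final List<String> forwardPanos; // panoIDs ahead of the current one in Roads[0].Panos
        final String anchorPano;         // panoID taken from Links, or null at a dead end

        Segment(List<String> forwardPanos, String anchorPano) {
            this.forwardPanos = forwardPanos;
            this.anchorPano = anchorPano;
        }
    }

    /** Returns the panoIDs in the order they would be crawled, each at most once. */
    static List<String> walk(String startPanoId, int maxSegments, Function<String, Segment> fetch) {
        Set<String> seen = new LinkedHashSet<>(); // keeps visiting order and rejects repeats
        String current = startPanoId;
        for (int i = 0; i < maxSegments && current != null; i++) {
            if (!seen.add(current)) {
                break; // already visited: the anchor led back into a loop
            }
            Segment segment = fetch.apply(current);
            String last = current;
            for (String id : segment.forwardPanos) {
                seen.add(id); // a real crawler would download the four tiles of id here
                last = id;
            }
            // The anchor is read from the response of the last pano in the forward set.
            Segment tail = last.equals(current) ? segment : fetch.apply(last);
            current = tail.anchorPano;
        }
        return new ArrayList<>(seen);
    }
}
```

Writing the walk as a loop with a segment budget plays the same role as a recursion depth limit without growing the call stack.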
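2.3 Image storage
Depending on the business requirements, different databases can be used here; relational and non-relational both work. In some respects a non-relational, column-oriented store may be the better fit: when crawling the street view of every street in a city, some positions have historical data and others do not, so on a large dataset the flexible schema of a column store makes storage and access efficient. On the other hand, the later image analysis may need image models over spatially continuous (or discrete) positions, which is better served by a relational database. Taking MySQL and HBase as examples, here is a recap of the typical differences between relational and non-relational databases:
2.3.1 Differences between HBase and MySQL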
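Property | HBase | MySQL |
---|---|---|
Storage | Column-oriented; columns can be added freely and empty columns take no space | Row-oriented |
Scalability | Built in | Needs a third-party middleware layer |
High-concurrency reads/writes | Supported | Limited |
Conditional queries | Lookup by rowkey only | Supported |
Data types | Strings (stored as raw bytes) | Many types |
Data operations | Only query, insert, delete, truncate and the like | Also joins and other relational operations |
Updates | Actually inserts new data, keeping multiple versions | Replaces the value in place |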
2.3.2 Storing binary streams
Download a few images to check their size before deciding on the column type. The comparison leads to these points:
- Convert each image to a binary byte stream
- Declare the image column in the table as a BLOB type
For how MySQL stores BLOB data, see here
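Checking the image size first matters because MySQL has four BLOB types with different maximum lengths, and a plain BLOB holds at most 65,535 bytes. The helper below is a small sketch of that decision; BlobTypes and smallestBlobType are hypothetical names, and the limits are MySQL's documented maxima for each type.

```java
final class BlobTypes {
    /** Returns the smallest MySQL BLOB type that can hold a value of the given size. */
    static String smallestBlobType(long sizeInBytes) {
        if (sizeInBytes < 0) {
            throw new IllegalArgumentException("size must not be negative");
        }
        if (sizeInBytes <= 255L) {
            return "TINYBLOB";
        }
        if (sizeInBytes <= 65_535L) {
            return "BLOB";
        }
        if (sizeInBytes <= 16_777_215L) {
            return "MEDIUMBLOB";
        }
        if (sizeInBytes <= 4_294_967_295L) {
            return "LONGBLOB";
        }
        throw new IllegalArgumentException("too large for any MySQL BLOB type");
    }
}
```

If some tiles turn out larger than about 64 KB, the column should be declared MEDIUMBLOB rather than BLOB.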
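3 Code
The code below follows the workflow above, one step at a time.
3.1 Creating the table and reading its records
Create the table and store the starting panoIDs in the database. The CREATE TABLE statement: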
CREATE TABLE `test`.`baidu_pano` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`pano_id` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'street view ID',
`name` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'street name',
`lati` double NULL DEFAULT NULL COMMENT 'latitude',
`lonti` double NULL DEFAULT NULL COMMENT 'longitude',
`direction` bit(1) NULL DEFAULT NULL COMMENT '0: west to east; 1: east to west',
PRIMARY KEY USING BTREE (`id`)
) ENGINE = InnoDB AUTO_INCREMENT = 8 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'InnoDB free: 30720 kB' ROW_FORMAT = Compact;
The inserted panoIDs are shown in the figure:
Before coding, add all the dependencies to build.gradle in one go:
ext {
commonsDbutilsVersion = '1.6'
druidVersion = '1.0.18'
mysqlConnectorVersion = '5.1.37'
}
dependencies {
compile "com.cetiti.ddc:ddc-core:0.1.11-alpha1"
compile "commons-dbutils:commons-dbutils:${commonsDbutilsVersion}"
compile 'org.apache.logging.log4j:log4j-core:2.8.2'
compile "com.alibaba:druid:${druidVersion}"
compile "mysql:mysql-connector-java:${mysqlConnectorVersion}"
compile 'org.postgresql:postgresql:42.2.5'
testCompile 'junit:junit:4.9'
compile 'org.apache.commons:commons-pool2:2.4.2'
compile 'redis.clients:jedis:2.9.0'
}
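// In addition, the snakeyaml-1.18, fastjson and okhttp jars are required; look up suitable versions yourself. This project depends on an in-house company jar, so they are not declared explicitly here
We need a helper class, MySQLHelper, that reads records from MySQL and wraps the result set. Source: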
public class MySQLHelper {
private static final Logger logger = Logger.getLogger(MySQLHelper.class);
private static QueryRunner runner;
private static ResultSetHandler<HashMap<String,Object>> h = new ResultSetHandler<HashMap<String,Object>>() {
@Override
public HashMap<String,Object> handle(ResultSet rs) throws SQLException {
if (!rs.next()) {
return null;
}
ResultSetMetaData meta = rs.getMetaData();
int cols = meta.getColumnCount();
HashMap<String,Object> result = new HashMap<>(16);
for (int i = 0; i < cols; i++) {
result.put(meta.getColumnLabel(i + 1),rs.getObject(i + 1));
}
return result;
}
};
public MySQLHelper(){
runner = MySQLPool.getInstance().getRunner();
}
@SuppressWarnings("unchecked")
public List<Pano> getAllPanoFromDB(){
try {
String qSql = "select pano_id as panoId, name from baidu_pano limit 10";
@SuppressWarnings("rawtypes")
BeanListHandler blh = new BeanListHandler(Pano.class);
return (ArrayList<Pano>) runner.query(qSql,blh);
} catch (SQLException e) {
logger.error("getAllPanoFromDB", e);
return null;
}
}
}
The connection pool class MySQLPool:
public class MySQLPool {
private static volatile MySQLPool instance = null;
private static final Logger logger = Logger.getLogger(MySQLPool.class);
private DruidDataSource dds;
private QueryRunner runner;
private Properties properties;
public QueryRunner getRunner() {
return this.runner;
}
private MySQLPool() {
ConfigParser parser = ConfigParser.getInstance();
String dbAlias = "mysql-data";
Map<String, Object> dbConfig = parser.getModuleConfig("database");
Map<String, Object> mysqlConfig = (Map)parser.assertKey(dbConfig, dbAlias, "database");
Properties properties = new Properties();
String url = (String)parser.assertKey(mysqlConfig, "url", "database." + dbAlias);
String username = (String)parser.assertKey(mysqlConfig, "username", "database." + dbAlias);
String password = (String)parser.assertKey(mysqlConfig, "password", "database." + dbAlias);
properties.setProperty("url", url);
properties.setProperty("username", username);
properties.setProperty("password", password);
properties.setProperty("maxActive", "20");
this.properties = properties;
try {
this.dds = (DruidDataSource)DruidDataSourceFactory.createDataSource(properties);
} catch (Exception var10) {
logger.error("Failed to connect data MySQL db,Exception:{}", var10);
}
this.runner = new QueryRunner(this.dds);
}
public static MySQLPool getInstance() {
if (instance == null) {
Class var0 = MySQLPool.class;
synchronized(MySQLPool.class) {
if (instance == null) {
instance = new MySQLPool();
}
}
}
return instance;
}
}
ConfigParser is a custom class that reads and parses the configuration file. Source:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;
public class ConfigParser {
private static final Logger logger = Logger.getLogger(ConfigParser.class);
private static ConfigParser instance = new ConfigParser();
private static final String CONFIG_FILENAME = "config.yml";
private Yaml yaml = null;
private Object config;
private ConfigParser() {
if (this.yaml == null) {
this.yaml = new Yaml();
}
File f = ResourceUtils.loadResouces(CONFIG_FILENAME);
try {
this.config = this.yaml.load(new FileInputStream(f));
logger.info("file {} is loaded", f.getAbsoluteFile());
} catch (FileNotFoundException var3) {
var3.printStackTrace();
}
}
public static ConfigParser getInstance() {
return instance;
}
public Object getConfig() {
return this.config;
}
public Map<String, Object> getModuleConfig(String name) {
return this.getModuleConfig(name, this.config);
}
public Map<String, Object> getModuleConfig(String name, Object parent) {
Map<String, Object> rtn = (Map)((Map)parent).get(name);
return rtn;
}
public Object assertKey(Map<String, Object> config, String key, String parent) {
Object value = config.get(key);
if (value == null) {
logger.error("{}.{} is a mandatory configuration", new Object[]{parent, key});
System.exit(1);
}
return value;
}
public Object getValue(Map<String, Object> config, String key, Object def, String parent) {
Object value = config.get(key);
if (value == null) {
logger.warn("{}.{} isn't configured, default value {} is used", new Object[]{parent, key, def});
config.put(key, def);
return def;
} else {
return value;
}
}
public void dumpConfig() {
System.out.println(this.yaml.dump(this.config));
}
}
3.2 Configuration file
config.yml is configured as follows:
apps:
  ## Basic properties
  spider-baidupano:
    common:
      group: ipproxy-xundaili-zhg
      cron: "0 0 0 */1 * ?"
      firstpage: 1
      totalpages: 1
      distribute: false
      fixed: true
      order: desc
    ## Data source
    source:
      baseurl: https://mapsv0.bdimg.com/?qt=pdata&sid=09024600011606211814253666L&pos=1_4&z=4
      listpageregex: "https://mapsv0\\.bdimg\\.com/\\?qt\\=sdata"
    storage:
      ## dbType: MySQL HBase Hive MongoDB Kafka PostgreSQL
      dbtype: MySQL
      dbalias: mysql-data
      ## Image storage location
      piclocation: \baidupano
    filter:
      searchfilter: true
      contentfilter: false
    ## Anti-crawler countermeasures
    antirobot:
      ipproxy: false
      listipproxy: false
      sleeptime: 900000
    analysis:
      sentiment: false
distribute:
  scheduler: com.demo.ddc.scheduler.MemberScheduler
  # dbtype: redis
  # dbalias: redis
database:
  mysql-data:
    url: "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf8"
    username: root
    password: '123456'
3.3 Image download and storage logic
First, a utility class for sending requests:
public class OkHttpUtils {
private static volatile OkHttpClient okHttpClient;
private OkHttpUtils(){
}
public static OkHttpClient getInstance(){
if (null==okHttpClient){
synchronized (OkHttpUtils.class){
if (okHttpClient==null){
okHttpClient = new OkHttpClient();
return okHttpClient;
}
}
}
return okHttpClient;
}
}
Utility class for image download and file path handling:
public class PicLoadUtils {
private static SpiderConfig spiderConfig;
private final static String WINDOWS_DISK_SYMBOL = ":";
private final static String WINDOWS_PATH_SYMBOL = "\\";
public PicLoadUtils(){
String spiderId = "spider-baidupano";
spiderConfig = new SpiderConfig(spiderId);
}
private String getFileLocation(String storeDirName){
String separator = "/";
ConfigParser parser = ConfigParser.getInstance();
String spiderId = "spider-baidupano";
SpiderConfig spiderConfig = new SpiderConfig(spiderId);
Map<String,Object> storageConfig = (Map<String, Object>) parser.assertKey(spiderConfig.getSpiderConfig(),"storage", spiderConfig.getConfigPath());
String fileLocation = (String) parser.getValue(storageConfig,"piclocation",null,spiderConfig.getConfigPath()+".storage");
String pathSeparator = getSeparator();
String location;
if(fileLocation!=null){
// First detect the operating system, then check whether the path is absolute
if (separator.equals(pathSeparator)){
//linux
if(fileLocation.startsWith(separator)){
location = fileLocation + pathSeparator + "data";
}else {
location = System.getProperty("user.dir") + pathSeparator + fileLocation;
}
location = location.replace("//", pathSeparator);
return location;
}else {
//windows
if (fileLocation.contains(WINDOWS_DISK_SYMBOL)){
// absolute path
location = fileLocation + pathSeparator + "data";
}else {
// relative path
location = System.getProperty("user.dir") + pathSeparator + fileLocation;
}
location = location.replace("\\\\",pathSeparator);
}
}else{
// default location
location = System.getProperty("user.dir") + pathSeparator + storeDirName;
}
return location;
}
public String dateToPath(long timestamp) {
String pathSeparator = getSeparator();
Calendar calendar = Calendar.getInstance();
calendar.setTimeInMillis(timestamp*1000);
String year = String.format("%04d",calendar.get(Calendar.YEAR));
String month = String.format("%02d",calendar.get(Calendar.MONTH)+1);
String date = String.format("%02d",calendar.get(Calendar.DATE));
return year + pathSeparator + month + pathSeparator + date;
}
private String getSeparator(){
String pathSeparator = File.separator;
if(!WINDOWS_PATH_SYMBOL.equals(File.separator)){
pathSeparator = "/";
}
return pathSeparator;
}
public void mkDir(File file){
String directory = file.getParent();
File myDirectory = new File(directory);
if (!myDirectory.exists()) {
myDirectory.mkdirs();
}
}
public String downloadPic(String url, String panoId){
okhttp3.Request request = new okhttp3.Request.Builder()
.url(url)
.build();
Response response = null;
InputStream inputStream = null;
FileOutputStream out = null;
String localLocation = null;
String relativePath = null;
try {
response = OkHttpUtils.getInstance().newCall(request).execute();
// Read the response body as an input stream
inputStream = response.body().byteStream();
byte[] buffer = new byte[2048];
localLocation = this.getFileLocation("baidupano");
Date nowTime = new Date(System.currentTimeMillis());
relativePath = this.dateToPath(nowTime.getTime()/1000) + File.separator + panoId + File.separator + nowTime.getTime()+".jpg";
File myPath = new File(localLocation + File.separator + relativePath);
this.mkDir(myPath);
out = new FileOutputStream(myPath);
int len;
while ((len = inputStream.read(buffer)) != -1){
out.write(buffer,0,len);
}
// Flush the file stream
out.flush();
} catch (IOException e) {
e.printStackTrace();
}finally {
if (inputStream!=null){
try {
inputStream.close();
}catch (IOException e){
e.printStackTrace();
}
}
if (null!=out){
try {
out.close();
}catch (IOException e){
e.printStackTrace();
}
}
if (null!=response){
response.body().close();
}
}
return localLocation + File.separator + relativePath;
}
public String downloadPic(String url, String curPanoId, String hisPanoId){
okhttp3.Request request = new okhttp3.Request.Builder()
.url(url)
.build();
Response response = null;
InputStream inputStream = null;
FileOutputStream out = null;
String localLocation = null;
String relativePath = null;
try {
response = OkHttpUtils.getInstance().newCall(request).execute();
// Read the response body as an input stream
inputStream = response.body().byteStream();
byte[] buffer = new byte[2048];
localLocation = this.getFileLocation("baidupano");
Date nowTime = new Date(System.currentTimeMillis());
relativePath = this.dateToPath(nowTime.getTime()/1000) + File.separator + curPanoId + File.separator + hisPanoId + File.separator + nowTime.getTime()+".jpg";
File myPath = new File(localLocation + File.separator + relativePath);
this.mkDir(myPath);
out = new FileOutputStream(myPath);
int len;
while ((len = inputStream.read(buffer)) != -1){
out.write(buffer,0,len);
}
// Flush the file stream
out.flush();
} catch (IOException e) {
e.printStackTrace();
}finally {
if (inputStream!=null){
try {
inputStream.close();
}catch (IOException e){
e.printStackTrace();
}
}
if (null!=out){
try {
out.close();
}catch (IOException e){
e.printStackTrace();
}
}
if (null!=response){
response.body().close();
}
}
return localLocation + File.separator + relativePath;
}
}
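The two downloadPic overloads above differ only in the directory they write to. As a minimal sketch of how the shared part could be factored out, the helper below uses try-with-resources and java.nio.file.Files.copy. TileSaver and save are hypothetical names, not part of the original project, and the InputStream parameter stands in for response.body().byteStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TileSaver {
    /**
     * Writes the stream to baseDir/subDirs.../fileName, creating directories as needed.
     * The stream is closed when the copy finishes or fails.
     */
    static Path save(InputStream body, Path baseDir, String fileName, String... subDirs)
            throws IOException {
        Path dir = baseDir;
        for (String sub : subDirs) {
            dir = dir.resolve(sub);
        }
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        try (InputStream in = body) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

With a helper like this, the current-image case would pass one sub-directory (the panoID) and the historical case two (the current panoID and the historical panoID), and a failed download surfaces as an IOException instead of a path to a file that was never written.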
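The core business logic follows the workflow analysed above. Because it is recursive, a level parameter caps the depth to guard against stack overflow: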
public void getBaiduPanoPics(String curPanoId, int level){
// Recursion depth limit
if (level == 0){
logger.info("Finished crawling this street!");
return;
}
// Send the JSON request, then download this pano's images and store them
JSONObject jsonObject = sendPanoJsonRequest(curPanoId);
processStorePanoByID(curPanoId,jsonObject,tableName);
// Get the upcoming panoIds on this road segment
List<String> forwardNodes = getForwardNodes(curPanoId, jsonObject);
if (forwardNodes!=null&&!forwardNodes.isEmpty()){
// Crawl each of them
int forwardNodeSize = forwardNodes.size();
JSONObject tempJsonObject;
String id;
for(int i = 0; i < forwardNodeSize-1; i++){
id = forwardNodes.get(i);
tempJsonObject = sendPanoJsonRequest(id);
processStorePanoByID(id,tempJsonObject,tableName);
}
// Handle the last element separately: its response yields the anchor node for the recursion
id = forwardNodes.get(forwardNodeSize-1);
tempJsonObject = sendPanoJsonRequest(id);
processStorePanoByID(id, tempJsonObject, tableName);
String anchorNode = getEasyAnchorNode(tempJsonObject);
if (StringUtils.isBlank(anchorNode)){
return;
}
// Recurse
getBaiduPanoPics(anchorNode,level-1);
}else {
// No forward nodes: fall back to an anchor node from Links
String anchorNode = getEasyAnchorNode(jsonObject);
if (StringUtils.isBlank(anchorNode)){
return;
}
// Recurse
getBaiduPanoPics(anchorNode,level-1);
}
}
The complete code, including the other methods called by the method above:
public class BaiduPanoPics {
private final Logger logger = Logger.getLogger(BaiduPanoPics.class);
private static final String REQUEST_PREX_FORWARD_PANOS = "https://mapsv0.bdimg.com/?qt=sdata&sid=";
private static final String REQUEST_PREX_PICS = "https://mapsv1.bdimg.com/?qt=pdata&sid=";
private static final int PIC_NUM = 8;
private String tableName;
public BaiduPanoPics(String tableName){
this.tableName = tableName;
}
private List<String> getPicsUrl(String panoId){
// No direction-detection logic here
final String param = "&z=4";
String picUrl;
List<String> picUrlList = new ArrayList<>(4);
for (int i = 4; i < PIC_NUM; i++){
picUrl = REQUEST_PREX_PICS + panoId + "&pos=1_"+ i + param;
picUrlList.add(picUrl);
}
return picUrlList;
}
public void getBaiduPanoPics(String curPanoId, int level){
// Recursion depth limit
if (level == 0){
logger.info("Finished crawling this street!");
return;
}
// Send the JSON request, then download this pano's images and store them
JSONObject jsonObject = sendPanoJsonRequest(curPanoId);
processStorePanoByID(curPanoId,jsonObject,tableName);
// Get the upcoming panoIds on this road segment
List<String> forwardNodes = getForwardNodes(curPanoId, jsonObject);
if (forwardNodes!=null&&!forwardNodes.isEmpty()){
// Crawl each of them
int forwardNodeSize = forwardNodes.size();
JSONObject tempJsonObject;
String id;
for(int i = 0; i < forwardNodeSize-1; i++){
id = forwardNodes.get(i);
tempJsonObject = sendPanoJsonRequest(id);
processStorePanoByID(id,tempJsonObject,tableName);
}
// Handle the last element separately: its response yields the anchor node for the recursion
id = forwardNodes.get(forwardNodeSize-1);
tempJsonObject = sendPanoJsonRequest(id);
processStorePanoByID(id, tempJsonObject, tableName);
String anchorNode = getEasyAnchorNode(tempJsonObject);
if (StringUtils.isBlank(anchorNode)){
return;
}
// Recurse
getBaiduPanoPics(anchorNode,level-1);
}else {
// No forward nodes: fall back to an anchor node from Links
String anchorNode = getEasyAnchorNode(jsonObject);
if (StringUtils.isBlank(anchorNode)){
return;
}
// Recurse
getBaiduPanoPics(anchorNode,level-1);
}
}
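/**
* Extracts the panoIds ahead of the current one on the road from the JSON response.
* Returns null at a fork (which may also be a sidewalk).
* @param curPanoId current panoId
* @param jsonObject "request 2" JSON response
* @return list of forward panoIds
*/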
private List<String> getForwardNodes(String curPanoId, JSONObject jsonObject){
JSONArray roadJsonArray = jsonObject.getJSONArray("Roads");
JSONArray panoJsonArray = roadJsonArray.getJSONObject(0).getJSONArray("Panos");
// Size of the Panos array
int nearByPanoIdSize = panoJsonArray.size();
List<String> panoJsonIdList = new ArrayList<>();
String panoJsonId;
if (nearByPanoIdSize>1){
for (int i=0; i < nearByPanoIdSize; i++){
panoJsonId = panoJsonArray.getJSONObject(i).getString("PID");
// Collect the panoIds of the nearest road segment
panoJsonIdList.add(panoJsonId);
}
int currentPanoIdIndex = panoJsonIdList.indexOf(curPanoId);
if (currentPanoIdIndex >= 0){
return panoJsonIdList.subList(currentPanoIdIndex+1,nearByPanoIdSize);
}else{
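System.out.println("The current node is not in the set of nearby nodes!");
// What should be returned here? For now this falls through and returns the whole list
}
}else if(nearByPanoIdSize==1){
// A fork lies ahead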
panoJsonId = panoJsonArray.getJSONObject(0).getString("PID");
if (curPanoId.equals(panoJsonId)){
return null;
}else {
panoJsonIdList.add(panoJsonId);
}
}else {
logger.info("Abnormal response: the Panos array is empty");
}
return panoJsonIdList;
}
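/**
* Gets the starting panoId of the next road segment in each direction.
* Returns several IDs at a fork, otherwise one.
* @param jsonObject "request 2" JSON response of the pano in question
* @return list of anchor panoIds
*/
private List<String> getAnchorNode(JSONObject jsonObject){
// Direction is not considered for now
// Read the array in the Links field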
JSONArray anchorIdArray = jsonObject.getJSONArray("Links");
List<String> anchorLinks = new ArrayList<>();
for (Object linkJson:anchorIdArray){
JSONObject anchor = (JSONObject)linkJson;
String anchorId = anchor.getString("PID");
anchorLinks.add(anchorId);
}
return anchorLinks;
}
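/**
* Picks an anchor node from Links, at random when there are several.
* Prone to cycles.
* @param jsonObject "request 2" JSON response
* @return anchor panoId, or null when Links is empty
*/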
private String getEasyAnchorNode(JSONObject jsonObject){
JSONArray anchorIdArray = jsonObject.getJSONArray("Links");
int size = anchorIdArray.size();
if (size==0){
return null;
}else if (size ==1){
return ((JSONObject)anchorIdArray.get(0)).getString("PID");
}else {
int index = new Random().nextInt(anchorIdArray.size());
return ((JSONObject)anchorIdArray.get(index)).getString("PID");
}
}
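/**
* Picks the anchor node with the smallest DIR value.
* This can also produce cycles.
* @param jsonObject "request 2" JSON response
* @return panoId
*/
private String getAnchorNodeByDir(JSONObject jsonObject){
JSONArray anchorIdArray = jsonObject.getJSONArray("Links");
if (anchorIdArray.size()==1){
return ((JSONObject)anchorIdArray.get(0)).getString("PID");
}else{
// Choose the link with the smallest DIR value; it is most likely the forward node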
Map<Integer,String> map = new HashMap<>(4);
int dirTemp = 400;
for (Object linkJson:anchorIdArray){
JSONObject anchor = (JSONObject)linkJson;
Integer anchorDir = anchor.getInteger("DIR");
String anchorPid = anchor.getString("PID");
map.put(anchorDir,anchorPid);
if (dirTemp>anchorDir){
dirTemp = anchorDir;
}
}
return map.get(dirTemp);
}
}
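/**
* Gets an anchor node while trying to stay on the same road, although road
* conditions cannot always guarantee that.
* Prone to cycles.
* @param jsonObject "request 2" JSON response
* @return anchor panoId, or null
*/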
private String getAdvancedAnchorNode(JSONObject jsonObject){
JSONArray anchorIdArray = jsonObject.getJSONArray("Links");
if (anchorIdArray.isEmpty()){
return null;
}
// Only one link: return it directly
if(anchorIdArray.size()==1){
return ((JSONObject)anchorIdArray.get(0)).getString("PID");
}
// Several links: keep the anchor nodes in a map keyed by road ID for easy lookup
Map<String,String> anchorMap = new HashMap<>(8);
for (Object linkJson:anchorIdArray){
JSONObject anchor = (JSONObject)linkJson;
String anchorPid = anchor.getString("PID");
String anchorRid = anchor.getString("RID");
anchorMap.put(anchorRid, anchorPid);
}
// Collect the road information as RoadBean objects
JSONArray roadJsonArray = jsonObject.getJSONArray("Roads");
List<RoadBean> roadBeanList = new ArrayList<>(8);
for (Object roadJson:roadJsonArray){
JSONObject road = (JSONObject)roadJson;
String roadId = road.getString("ID");
boolean isCurrent = road.getBoolean("IsCurrent");
String roadName = road.getString("Name");
RoadBean roadBean = new RoadBean(roadId,isCurrent,roadName);
roadBeanList.add(roadBean);
}
if (roadBeanList.size()>1&&roadBeanList.get(0).isCurrent()){
// Name of the street the current position belongs to
String currentStreetName = roadBeanList.get(0).getRoadName();
for (int i = 1; i < roadBeanList.size(); i++){
RoadBean roadBean = roadBeanList.get(i);
// Try to keep going along the same road
if (currentStreetName.equals(roadBean.getRoadName())){
return anchorMap.get(roadBean.getRid());
}
}
// No road with the same name: take the first entry in Links
return ((JSONObject)anchorIdArray.get(0)).getString("PID");
}else {
logger.info("Abnormal road data or parsing error");
}
return null;
}
/**
* Sends "request 2" and returns the server's JSON response.
* @param panoId id
* @return JSONObject
*/
private JSONObject sendPanoJsonRequest(String panoId){
String suffixParam = "&pc=1";
String url = REQUEST_PREX_FORWARD_PANOS + panoId + suffixParam;
// Send the JSON request (the response holds the historical panoIds and the panoIds of nearby road segments)
String jsonPanoResponse = sendGetRequest(url);
JSONObject jsonObject = JSON.parseObject(jsonPanoResponse);
// The "content" field is a JSON array
JSONArray jsonArray = JSON.parseArray(jsonObject.get("content").toString());
return jsonArray.getJSONObject(0);
}
/**
* Downloads the street view images of curPanoId to local disk and stores them in the database.
* @param curPanoId current panoId
*/
private void processStorePanoByID(String curPanoId,JSONObject jsonObject,String tableName){
String curPanoDate = jsonObject.getString("Time");
// Download the current images of this pano to local disk
List<String> curPanoPicPath = downloadCurPanoPics(curPanoId);
PanoPic panoPicBean = new PanoPic(curPanoId,curPanoDate,curPanoPicPath);
// Download the historical images of this pano
JSONArray timeLineJsonArray = jsonObject.getJSONArray("TimeLine");
if (timeLineJsonArray.size()>1) {
// For simplicity, when history exists only the entry at index 1 is used
JSONObject timeLineJson = timeLineJsonArray.getJSONObject(1);
if (!timeLineJson.getBooleanValue("IsCurrent")){
String historyPanoId = timeLineJson.getString("ID");
String timeLine = timeLineJson.getString("TimeLine");
List<String> hisPanoPicPath = downloadHisPanoPics(curPanoId,historyPanoId);
panoPicBean.setHisPanoId(historyPanoId);
panoPicBean.setHisShootDate(timeLine);
panoPicBean.setHisPicPath(hisPanoPicPath);
}
}
// Store the local images in the database
storePanoPicsToDb(panoPicBean,tableName);
}
private List<String> downloadCurPanoPics(String panoId){
String localPath;
List<String> picsRequestList = this.getPicsUrl(panoId);
PicLoadUtils picLoadUtils = new PicLoadUtils();
List<String> localPathList = new ArrayList<>(4);
for (String picRequest:picsRequestList){
localPath = picLoadUtils.downloadPic(picRequest, panoId);
localPathList.add(localPath);
}
return localPathList;
}
private List<String> downloadHisPanoPics(String panoId,String hisPanoId){
String localPath;
List<String> picsRequestList = this.getPicsUrl(hisPanoId);
PicLoadUtils picLoadUtils = new PicLoadUtils();
List<String> localPathList = new ArrayList<>(4);
for (String picRequest:picsRequestList){
localPath = picLoadUtils.downloadPic(picRequest, panoId, hisPanoId);
localPathList.add(localPath);
}
return localPathList;
}
private void storePanoPicsToDb(PanoPic panoPic,String tableName){
// Read the local images and store them in the database
BlobInsertUtils blobInsertUtils = new BlobInsertUtils(tableName);
blobInsertUtils.insertAllImage2DBWithNoCheck(panoPic);
}
public String sendGetRequest(String url){
okhttp3.Request request = new okhttp3.Request.Builder()
.url(url).build();
Response response;
String result = null;
try {
response = OkHttpUtils.getInstance().newCall(request).execute();
result = response.body().string();
response.body().close();
} catch (IOException e) {
logger.error("Request failed--"+url);
e.printStackTrace();
}
return result;
}
}
Then a test driver:
public class ImageStoreTest {
private static final Logger logger = Logger.getLogger(ImageStoreTest.class);
public static void main(String[] args) {
// Read the database records into a list
List<Pano> panoList = new MySQLHelper().getAllPanoFromDB();
BaiduPanoPics baiduPanoPics = new BaiduPanoPics("baidu_pano_pics");
for (Pano pano:panoList){
String panoID = pano.getPanoId();
logger.info("----------Start crawling from id:"+ panoID+"-----------");
baiduPanoPics.getBaiduPanoPics(panoID,50);
logger.info("-----------"+panoID+" crawl finished!"+"-----------");
}
}
}
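BlobInsertUtils, which the code above uses to store the binary stream in MySQL, is covered in another post, "Storing Baidu street view images in MySQL":
private void storePanoPicsToDb(PanoPic panoPic,String tableName){
// Read the local images and store them in the database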
BlobInsertUtils blobInsertUtils = new BlobInsertUtils(tableName);
blobInsertUtils.insertAllImage2DBWithNoCheck(panoPic);
}
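With this, the crawl results show up in the database. If requests fail along the way, a likely cause is an HTTPS bug in older JDK versions; upgrading to 1.8.0_211 or later should resolve it. That is all for crawling Baidu street view images; comments are welcome.
4 Open issues
- The recursive traversal can run into something like a circular linked list. The only remedy so far is trial and error: pick a starting ID that allows deeper recursion, which is still enough for this project's basic needs;
- Ideally, anchor node selection would let us follow a road to its end from a single starting panoID, but in practice it does not; anchor selection needs further work;
- Most image download links use column indices 4-7, but some do not (for west-to-east or south-to-north travel some use 0-3). This depends on the direction of travel, so the link-building rule should ideally take direction and road information into account, which requires a deeper analysis of the Baidu street view interfaces. Judging by the crawl results, though, the 4-7 range still yields street view images from different angles.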
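One way to sidestep the third issue until the direction rule is understood is to request a configurable column range instead of hard-coding 4-7. The sketch below builds the URLs for one row; TileUrls and rowUrls are hypothetical names, and requesting columns 0 through 7 assumes that the 0-3 and 4-7 ranges mentioned above are the two halves of a single eight-column row at z=4, which was not verified here.

```java
import java.util.ArrayList;
import java.util.List;

final class TileUrls {
    private static final String PREFIX = "https://mapsv0.bdimg.com/?qt=pdata&sid=";

    /** Builds the tile URLs of one row (pos=row_col) for columns firstCol..lastCol at z=4. */
    static List<String> rowUrls(String panoId, int row, int firstCol, int lastCol) {
        List<String> urls = new ArrayList<>();
        for (int col = firstCol; col <= lastCol; col++) {
            urls.add(PREFIX + panoId + "&pos=" + row + "_" + col + "&z=4");
        }
        return urls;
    }
}
```

Calling rowUrls(panoId, 1, 0, 7) doubles the number of downloads per position but covers both ranges regardless of the direction of travel.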
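Issues 1 and 2 both come down to not having the street view IDs or Baidu coordinates of each street. After discussing it with colleagues on the GIS team, we found a way around it: get the city's road network data from OpenStreetMap, take the Google coordinates of the road network, convert them to Baidu coordinates (a test page for coordinate conversion is linked here), and then send request 1 above with the Baidu coordinates to get the street view ID at each position. Interested readers can move on to another post on obtaining road network data.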
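For the coordinate conversion step, the sketch below shows the widely circulated, unofficial approximation for turning GCJ-02 longitude/latitude into Baidu's BD-09 longitude/latitude. It assumes the input is already GCJ-02, and CoordConvert is a hypothetical name. Note that the result is still longitude/latitude in degrees, while request 1 uses planar x/y values, so a further conversion (not shown here) would be needed before calling it.

```java
final class CoordConvert {
    private static final double X_PI = Math.PI * 3000.0 / 180.0;

    /** Approximate GCJ-02 to BD-09 conversion; returns {bdLon, bdLat} in degrees. */
    static double[] gcj02ToBd09(double lon, double lat) {
        // Small perturbations of radius and angle, then a fixed offset.
        double z = Math.sqrt(lon * lon + lat * lat) + 0.00002 * Math.sin(lat * X_PI);
        double theta = Math.atan2(lat, lon) + 0.000003 * Math.cos(lon * X_PI);
        double bdLon = z * Math.cos(theta) + 0.0065;
        double bdLat = z * Math.sin(theta) + 0.006;
        return new double[] {bdLon, bdLat};
    }
}
```

Around Hangzhou the shift is roughly 0.0065 degrees in longitude and 0.006 degrees in latitude. Baidu's own coordinate conversion service remains the authoritative way to convert.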
If anyone has a better way to solve the street coordinate problem, I would be glad to hear it!
References
- https://www.jianshu.com/p/3a0fa1e57ff6