I am trying to override the HBase method MultiTableInputFormat.getSplits(), and I have the following implementation:
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<Scan> scans = getScans();
    List<InputSplit> splits = new ArrayList<>();
    // The table name is taken from the first scan only.
    Scan sampleScan = scans.get(0);
    byte[] tableNameBytes = sampleScan.getAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME);
    TableName tableName = TableName.valueOf(tableNameBytes);
    // try-with-resources closes the connection, locator and admin once the splits are built.
    // The locator comes straight from the connection; the unused Table and the cast are gone.
    try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
         RegionLocator regionLocator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {
        Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
        RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, admin);
        int regionCount = keys.getFirst().length;
        for (int i = 0; i < regionCount; i++) {
            calculateSplits(
                keys.getFirst()[i],
                keys.getSecond()[i],
                regionLocator,
                sizeCalculator,
                splits
            );
        }
    }
    return splits;
}
private void calculateSplits(
    final byte[] startKey,
    final byte[] endKey,
    RegionLocator regionLocator,
    RegionSizeCalculator sizeCalculator,
    List<InputSplit> splits
) throws IOException {
    HRegionLocation hregionLocation = regionLocator.getRegionLocation(startKey, false);
    String regionHostname = hregionLocation.getHostname();
    HRegionInfo regionInfo = hregionLocation.getRegionInfo();
    for (Scan scan : getScans()) {
        byte[] startRow = scan.getStartRow();
        byte[] stopRow = scan.getStopRow();
        // determine if the given start and stop keys fall into the range
        if (
            (startRow.length == 0 || endKey.length == 0 || Bytes.compareTo(startRow, endKey) < 0) &&
            (stopRow.length == 0 || Bytes.compareTo(stopRow, startKey) > 0)
        ) {
            byte[] splitStart = startRow.length == 0 || Bytes.compareTo(startKey, startRow) >= 0 ?
                startKey : startRow;
            byte[] splitStop =
                (stopRow.length == 0 || Bytes.compareTo(endKey, stopRow) <= 0) && endKey.length > 0 ?
                    endKey : stopRow;
            long regionSize = sizeCalculator.getRegionSize(regionInfo.getRegionName());
            TableSplit split = new TableSplit(
                regionLocator.getName(), scan, splitStart, splitStop, regionHostname, regionSize
            );
            splits.add(split);
        }
    }
}
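The basic idea of this code is to fetch all regions together with their start and end keys. We also have a list of scans. We check every scan against every region (scans × regions) to produce all the splits. But this code is very slow, mainly because we have about 10,000 regions, so looking up and computing the information for each region one at a time takes a lot of time.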
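I noticed that RegionLocator also has a method called getAllRegionLocations(), and I think I could use it to fetch all regions at once and save a lot of time. The problem is that with this method I can't get the corresponding start and end keys, so I can't determine the range of each split. Does anyone have a better idea for making this method faster?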
1 answer
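Solved! I found that we can get the start key and end key from the region info (the HRegionInfo obtained from each HRegionLocation). So first fetch the list of all region locations once, then iterate over that list; the second method becomes: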
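A minimal, dependency-free sketch of that idea follows. It is not the answerer's actual code. The classes Region, ScanRange and Split and the methods computeSplits and key are hypothetical stand-ins so the example can run without HBase. In a real job, the region list would presumably come from a single RegionLocator.getAllRegionLocations() call, and each region's keys and host from HRegionLocation.getRegionInfo().getStartKey(), getEndKey() and HRegionLocation.getHostname(). The overlap and clipping rules are the same as in calculateSplits above; only the per-region location lookup is gone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitSketch {
    /** Hypothetical stand-in for one HRegionLocation; an empty key means "unbounded". */
    static final class Region {
        final byte[] startKey;
        final byte[] endKey;
        final String host;
        Region(byte[] startKey, byte[] endKey, String host) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }
    }

    /** Hypothetical stand-in for the row range of a Scan. */
    static final class ScanRange {
        final byte[] startRow;
        final byte[] stopRow;
        ScanRange(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /** Hypothetical stand-in for a TableSplit. */
    static final class Split {
        final byte[] start;
        final byte[] stop;
        final String host;
        Split(byte[] start, byte[] stop, String host) {
            this.start = start;
            this.stop = stop;
            this.host = host;
        }
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** One pass over a pre-fetched region list; no per-region lookup inside the loop. */
    static List<Split> computeSplits(List<Region> regions, List<ScanRange> scans) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            for (ScanRange s : scans) {
                // Unsigned lexicographic comparison, the same ordering HBase row keys use.
                boolean startsBeforeRegionEnd = s.startRow.length == 0 || r.endKey.length == 0
                        || Arrays.compareUnsigned(s.startRow, r.endKey) < 0;
                boolean stopsAfterRegionStart = s.stopRow.length == 0
                        || Arrays.compareUnsigned(s.stopRow, r.startKey) > 0;
                if (!(startsBeforeRegionEnd && stopsAfterRegionStart)) {
                    continue; // the scan does not touch this region
                }
                // Clip the scan range to the region boundaries.
                byte[] splitStart = s.startRow.length == 0
                        || Arrays.compareUnsigned(r.startKey, s.startRow) >= 0
                        ? r.startKey : s.startRow;
                byte[] splitStop = (s.stopRow.length == 0
                        || Arrays.compareUnsigned(r.endKey, s.stopRow) <= 0)
                        && r.endKey.length > 0
                        ? r.endKey : s.stopRow;
                splits.add(new Split(splitStart, splitStop, r.host));
            }
        }
        return splits;
    }
}
```

With roughly 10,000 regions, the saving in the real job would come from replacing one getRegionLocation(startKey, false) call per region with a single bulk fetch; the scans × regions loop itself is unchanged and works purely in memory.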