I have a text file (about 10,000 lines); some of its lines are shown below.
confusables.txt
1F110 ; 0028 0041 0029 ; MA #* ( 🄐 → (A) ) PARENTHESIZED LATIN CAPITAL LETTER A → LEFT PARENTHESIS, LATIN CAPITAL LETTER A, RIGHT PARENTHESIS #
FF21 ; 0041 ; MA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A # →А→
FF22 ; 0042 ; MA # ( Ｂ → B ) FULLWIDTH LATIN CAPITAL LETTER B → LATIN CAPITAL LETTER B # →Β→
212C ; 0042 ; MA # ( ℬ → B ) SCRIPT CAPITAL B → LATIN CAPITAL LETTER B #
1F110 ; 0028 0041 0029 ; MA #* ( 🄐 → (A) ) PARENTHESIZED LATIN CAPITAL LETTER A → LEFT PARENTHESIS, LATIN CAPITAL LETTER A, RIGHT PARENTHESIS #
1D435 ; 0042 ; MA # ( 𝐵 → B ) MATHEMATICAL ITALIC CAPITAL B → LATIN CAPITAL LETTER B #
213B ; 0046 0041 0058 ; MA #* ( ℻ → FAX ) FACSIMILE SIGN → LATIN CAPITAL LETTER F, LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER X #
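Each sample line above appears to follow the same layout: source code point, target code point(s), a type tag, then a comment holding `( char → prototype )` and the two character names. As a minimal sketch under that assumption, the pieces could be pulled apart with one regular expression. `parseConfusableLine()` and its array keys are hypothetical names, not something from the post.

```php
<?php
// Hypothetical helper (not from the post): split one confusables-style line into named parts.
// Assumes the layout seen in the sample lines:
//   <source hex> ; <target hex> ; <type> # ( <char> → <prototype> ) <source name> → <target name> #
function parseConfusableLine(string $line): ?array
{
    // Whitespace is required around the closing " ) " so that a prototype
    // such as "(A)" is not cut short at its own parenthesis.
    $pattern = '/^([0-9A-F]+)\s*;\s*([0-9A-F ]+?)\s*;\s*(\w+)\s*#\*?\s*'
             . '\(\s+(.+?)\s+→\s+(.+?)\s+\)\s+(.+?)\s+→\s+(.+?)\s+#/u';

    if (!preg_match($pattern, $line, $m)) {
        return null; // comment, blank line, or a line with a different shape
    }

    return [
        'source_hex'  => $m[1],
        'target_hex'  => $m[2],
        'type'        => $m[3],
        'char'        => $m[4], // the confusable character itself
        'prototype'   => $m[5], // what it is confused with
        'source_name' => $m[6],
        'target_name' => $m[7],
    ];
}

$parsed = parseConfusableLine(
    '213B ; 0046 0041 0058 ; MA #* ( ℻ → FAX ) FACSIMILE SIGN → '
    . 'LATIN CAPITAL LETTER F, LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER X #'
);
print_r($parsed);
```

Working with named fields like `char` and `target_name` avoids the chained `substr()`/`stristr()` slicing, at the cost of a pattern that has to be adjusted if the real file deviates from the sample.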
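I want to get the first character after the opening parenthesis on every line that matches a search string (that is, the Unicode character that can be confused with the original character), for example 'LATIN CAPITAL LETTER B' in line 4 above. I can do this with the following code: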
<?php
/**
 * Read the file line by line.
 *
 * @return Generator
 */
$fileData = function () {
    $file = fopen(__DIR__ . '/confusables.txt', 'r');
    if (!$file) {
        return;
    }
    while (($line = fgets($file)) !== false) {
        yield $line;
    }
    fclose($file);
};
// output array
$output_string = [
    'uni-code' => '',
    'original' => '',
    'des' => '',
];
$search_string = 'LATIN CAPITAL LETTER B';
$initial_line_count = 1; // variable to count lines before we start slicing
$final_count = 0; // final line count
// loop to get final count
foreach ($fileData() as $line) {
    // $line contains current line
    if (preg_match_all("/{$search_string}/i", $line)) {
        $initial_line_count++;
        $final_count = $initial_line_count;
        // echo $final_count.'<br>';
    }
}
$line_count = 1; // loop termination counter
$html = '<table>
    <tr>
        <th style="border:1px solid #000">ORIGINAL LETTERS</th>
        <th style="border:1px solid #000">UNICODE CHARACTER</th>
        <th style="border:1px solid #000">Description</th>
    </tr>';
// loop to slice and append in array
foreach ($fileData() as $line) {
    // $line contains current line
    if (preg_match_all("/{$search_string}/i", $line)) {
        // start slicing
        $slice_after = substr($line, 0, strpos($line, ' ) ')); // keep everything before ' ) '
        $slice_before = ltrim(stristr($slice_after, '('), '('); // drop everything up to and including '('
        $first_char = substr($slice_before, 0, strpos($slice_before, "→")); // text before the arrow: the confusable character
        $split_Real_char = ltrim(stristr($search_string, 'LETTER'), 'LETTER'); // the letter at the end of the search string
        $real_Char = $output_string['original'] .= $split_Real_char; // append to the 'original' string
        $split_Unicode_char = $output_string['uni-code'] .= $first_char . ','; // append to the 'uni-code' string
        $line_count++; // loop termination counter
        // loop termination
        if ($line_count == $final_count) {
            $html .= ' <tr>
                <td style=" border:1px solid black;"><pre>' . $split_Real_char . '</pre></td>
                <td style=" border:1px solid black;"><pre>' . $split_Unicode_char . '</pre></td>
                <td style=" border:1px solid black;"><pre>' . $search_string . '</pre></td>
            </tr>';
            $html .= '</table>';
            echo $html;
            break;
        }
    }
}
This is the output I get:
| ORIGINAL LETTER | UNICODE CHARACTER | Description |
| -------------------- | ------------------------- | -------------------------------- |
| B | Ｂ, ℬ, 𝐵 | LATIN CAPITAL LETTER B |
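For a single hard-coded search string the output looks fine, but I need to automate the process for the whole file (all 10,000 lines). This is what I have tried so far: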
<?php
/**
 * Read the file line by line.
 *
 * @return Generator
 */
$fileData = function () {
    $file = fopen(__DIR__ . '/confusables.txt', 'r');
    if (!$file) {
        return;
    }
    while (($line = fgets($file)) !== false) {
        yield $line;
    }
    fclose($file);
};
$searchStringArray = array();
// loop to generate search strings
foreach (range('A', 'B') as $alphabet) {
    $searchStringArray[] = 'LATIN CAPITAL LETTER ' . $alphabet;
}
// output array
$output_string = [
    'uni-code' => '',
    'original' => '',
    'des' => '',
];
$initial_line_count = 1; // variable to count lines before we start slicing
$final_count = 0; // final line count
for ($i = 0; $i < count($searchStringArray); $i++) {
    $search_string = $searchStringArray[$i];
    // loop to get final count
    foreach ($fileData() as $line) {
        // $line contains current line
        if (preg_match_all("/{$search_string}/i", $line)) {
            $initial_line_count++;
            $final_count = $initial_line_count;
            // echo $final_count.'<br>';
        }
    }
}
$line_count = 1; // loop termination counter
$html = '<table>
    <tr>
        <th style="border:1px solid #000">ORIGINAL LETTERS</th>
        <th style="border:1px solid #000">UNICODE CHARACTER</th>
        <th style="border:1px solid #000">Description</th>
    </tr>';
for ($i = 0; $i < count($searchStringArray); $i++) {
    $search_string = $searchStringArray[$i];
    // loop to slice and append in array
    foreach ($fileData() as $line) {
        // $line contains current line
        if (preg_match_all("/{$search_string}/i", $line)) {
            // start slicing
            $slice_after = substr($line, 0, strpos($line, ' ) ')); // keep everything before ' ) '
            $slice_before = ltrim(stristr($slice_after, '('), '('); // drop everything up to and including '('
            $first_char = substr($slice_before, 0, strpos($slice_before, "→")); // text before the arrow: the confusable character
            $split_Real_char = ltrim(stristr($search_string, 'LETTER'), 'LETTER'); // the letter at the end of the search string
            $real_Char = $output_string['original'] .= $split_Real_char; // append to the 'original' string
            $split_Unicode_char = $output_string['uni-code'] .= $first_char . ','; // append to the 'uni-code' string
            $line_count++; // loop termination counter
            // loop termination
            if ($line_count == $final_count) {
                $html .= ' <tr>
                    <td style=" border:1px solid black;"><pre>' . $split_Real_char . '</pre></td>
                    <td style=" border:1px solid black;"><pre>' . $split_Unicode_char . '</pre></td>
                    <td style=" border:1px solid black;"><pre>' . $search_string . '</pre></td>
                </tr>';
                $html .= '</table>';
                echo $html;
                break;
            }
        }
    }
}
and then I get this output:
| ORIGINAL LETTER | UNICODE CHARACTER | Description |
| -------------------- | ------------------------- | --------------------------- |
| B | Ａ, 🄐, Ｂ, ℬ, 𝐵, ℻ | LATIN CAPITAL LETTER B |
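I get all the Unicode characters, but the original letter and the search string are wrong. The Unicode characters should not all end up in a single table cell, and although the loop runs several times, I only get one row.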
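One way to read this symptom from the code above: `$output_string['uni-code']` is a single string that is never reset between search strings, and a row is written only once, when `$line_count` reaches the grand total `$final_count`. The following is a minimal sketch of the difference between one shared accumulator and one bucket per search string. The in-memory `$matches` list is a hypothetical stand-in for the lines matched in the file.

```php
<?php
// Hypothetical stand-in for matched lines: [search string, confusable character].
$matches = [
    ['LATIN CAPITAL LETTER A', 'Ａ'],
    ['LATIN CAPITAL LETTER A', '🄐'],
    ['LATIN CAPITAL LETTER B', 'Ｂ'],
    ['LATIN CAPITAL LETTER B', 'ℬ'],
];

$shared  = ''; // one string for everything, as in the attempt above
$grouped = []; // one list per search string

foreach ($matches as [$name, $char]) {
    $shared .= $char . ',';   // keeps growing across search strings
    $grouped[$name][] = $char; // stays separated by search string
}

// The shared string can only ever fill one table cell.
echo $shared, PHP_EOL;

// Each key of $grouped can become its own table row.
foreach ($grouped as $name => $chars) {
    echo $name, ': ', implode(', ', $chars), PHP_EOL;
}
```

With the grouped form there is also no need for a counter that decides when to print: every key produces exactly one row after the loop has finished.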
Expected output:
| ORIGINAL LETTER | UNICODE CHARACTER | Description |
| -------------------- | ------------------------- | ---------------------- |
| A | Ａ, 🄐, ℻ | LATIN CAPITAL LETTER A |
| B | Ｂ, ℬ, 𝐵 | LATIN CAPITAL LETTER B |
Any suggestions on how I can achieve this?
1 Answer
Answering my own question: I was able to get this working using a function and `array_map()`.
Output
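The code from this answer is not shown here, so the following is not the answerer's code. It is only a minimal sketch of one way a helper function plus `array_map()` could produce one row per letter. `confusablesFor()` and `renderRow()` are hypothetical names, the `$lines` array stands in for the file, and matching is assumed to work as in the question (any line whose text mentions the search string counts).

```php
<?php
// Sample lines standing in for the file. In practice they could be loaded with
// file(__DIR__ . '/confusables.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES).
$lines = [
    '1F110 ; 0028 0041 0029 ; MA #* ( 🄐 → (A) ) PARENTHESIZED LATIN CAPITAL LETTER A → LEFT PARENTHESIS, LATIN CAPITAL LETTER A, RIGHT PARENTHESIS #',
    'FF21 ; 0041 ; MA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A #',
    'FF22 ; 0042 ; MA # ( Ｂ → B ) FULLWIDTH LATIN CAPITAL LETTER B → LATIN CAPITAL LETTER B #',
    '212C ; 0042 ; MA # ( ℬ → B ) SCRIPT CAPITAL B → LATIN CAPITAL LETTER B #',
    '1D435 ; 0042 ; MA # ( 𝐵 → B ) MATHEMATICAL ITALIC CAPITAL B → LATIN CAPITAL LETTER B #',
    '213B ; 0046 0041 0058 ; MA #* ( ℻ → FAX ) FACSIMILE SIGN → LATIN CAPITAL LETTER F, LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER X #',
];

// Hypothetical helper: every confusable character whose line mentions $searchString.
function confusablesFor(string $searchString, array $lines): array
{
    // \b keeps "LETTER A" from also matching names such as "LETTER AE".
    $needle = '/' . preg_quote($searchString, '/') . '\b/i';
    $chars = [];
    foreach ($lines as $line) {
        // The confusable character sits between "( " and " →".
        if (preg_match($needle, $line) && preg_match('/\(\s+(.+?)\s+→/u', $line, $m)) {
            $chars[] = $m[1];
        }
    }
    return array_values(array_unique($chars)); // drop characters from repeated lines
}

// Hypothetical helper: build one table row for one search string.
function renderRow(string $searchString, array $lines): string
{
    // "LATIN CAPITAL LETTER B" => "B"
    $letter = substr($searchString, strrpos($searchString, ' ') + 1);
    $cells = [$letter, implode(', ', confusablesFor($searchString, $lines)), $searchString];
    return '<tr><td>' . implode('</td><td>', array_map('htmlspecialchars', $cells)) . '</td></tr>';
}

$searchStrings = array_map(function ($letter) {
    return 'LATIN CAPITAL LETTER ' . $letter;
}, range('A', 'B'));

// array_map() calls renderRow() once per search string, so each letter gets its own row.
$rows = array_map(function ($searchString) use ($lines) {
    return renderRow($searchString, $lines);
}, $searchStrings);

echo "<table>\n<tr><th>ORIGINAL LETTER</th><th>UNICODE CHARACTER</th><th>Description</th></tr>\n"
    . implode("\n", $rows) . "\n</table>\n";
```

Because each call starts with a fresh `$chars` array, nothing leaks from one letter into the next. The sketch rescans all lines for every search string, which is fine for a few letters; for the full alphabet a single pass that groups characters by letter may be preferable.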