shell: How to strip all links out of an HTML file with Bash, grep, or batch and store them in a text file

vulvrdjw  posted on 2023-03-30  in  Shell

I have an HTML file with about 150 anchor tags. I need only the links from these tags, i.e. <a href="http://www.google.com"></a>. I want to get only the http://www.google.com part.
When I run grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

This returns the entire line in which the link was found, which is not what I want, so I tried using the cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d’”’ --output-delimiter=$'\n' > links.txt

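Except that it is wrong and does not work, giving me an error about wrong parameters... So I assume the file was supposed to be passed along too. Maybe like cut -d’”’ --output-delimiter=$'\n' grepedText.txt > links.txt
But I wanted to do this in one command if possible... So I tried an AWK command.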

cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’

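But this one does not run either. It keeps asking me for more input, as if I had not finished the command.
I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables were messed up, and rather than fixing that I tried installing grep on Windows, but that gave me the same error...
The question is: what is the right way to strip the HTTP links out of HTML? With that I will make it work for my situation.
P.S. I have read so many links and Stack Overflow posts that listing my references would take too long... If example HTML is needed to show the complexity of the process, I will add it.
I also have a Mac and a PC, which I switch back and forth between to use their shell/batch/grep/terminal commands, so either will help me.
I also want to point out that I am in the correct directory.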

HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.

deikduxw 1#

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
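The one-liner above prints at most one URL per input line: the leading .* is greedy, so when a line holds several anchors only the last href survives. A minimal sketch for that case, assuming attribute values are double-quoted and contain no < character; the function name extract_all_hrefs and the file names in the usage comment are illustrative, not part of the answer:

```shell
# Print the href value of every <a ...> tag read from stdin, one per line.
# Assumption: href values are double-quoted and contain no '<'.
extract_all_hrefs() {
    # Put each tag on its own line by turning every '<' into a newline,
    # then keep only lines that start with an anchor tag carrying an href.
    tr '<' '\n' | sed -n 's/^[Aa] [^>]*href="\([^"]*\).*/\1/p'
}

# Usage (file names are placeholders):
# extract_all_hrefs < website.htm > links.txt
```

Because the input is split at every tag, the sed expression only ever sees one anchor at a time.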

vmpqdwk3 2#

You can use grep for this:

grep -Po '(?<=href=")[^"]*' file

It prints everything after href=" until a new double quote appears.
With the given input it returns:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Note that there is no need to write cat drawspace.txt | grep '<a href=".*">'; you can write grep '<a href=".*">' drawspace.txt and get rid of the useless use of cat.

Another example

$ cat a
hello <a href="httafasdf">asdas</a>
hello <a href="hello">asdas</a>
other things

$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
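The lookbehind above relies on grep's -P (PCRE) option, which GNU grep provides but the BSD grep shipped with macOS does not. Since the asker switches between a Mac and a PC, here is a sketch that avoids -P by letting grep -o cut out the whole attribute and sed trim it. The function name extract_hrefs is illustrative, and double-quoted href values are assumed:

```shell
# Print every href value found on stdin, one per line, without grep -P.
# Assumption: href values are double-quoted.
extract_hrefs() {
    # grep -o emits each match (href="...") on its own line;
    # sed then strips the leading href=" and the trailing quote.
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Usage (file names taken from the question):
# extract_hrefs < website.htm > links.txt
```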

2fjabf4q 3#

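I guess your PC or Mac will not have the lynx command installed by default (it is freely available on the web), but lynx lets you do things like this:
$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html
Output:

References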

  1. file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
  2. http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
It is then a simple matter to grep for the http: lines. There may even be lynx options to print just the http: lines (lynx has many, many options).
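The filtering step mentioned above could look roughly like this. The sketch assumes the numbered "References" layout shown in the output (a number followed by a dot, then the URL); lynx_urls_only is an illustrative name, not a lynx feature:

```shell
# Keep only the http/https URLs from lynx's numbered reference list on stdin.
# Assumption: each entry looks like "   2. http://..." as in the output above.
lynx_urls_only() {
    awk '$1 ~ /^[0-9]+\.$/ && $2 ~ /^https?:\/\// { print $2 }'
}

# Usage (page.html and links.txt are placeholders):
# lynx -dump -listonly page.html | lynx_urls_only > links.txt
```

Matching on the numbered first field skips the "References" heading and the file:// entry for the page itself.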

vmjh9lq9 4#

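Use grep to extract every anchor tag that contains a link, then use sed to strip the markup around the URL. The -h option keeps grep from prefixing file names when several files match, and [^"]* stops at the closing quote instead of running to the last "> on the line:

grep -oh '<a href="[^"]*">' *.html | sed -e 's/<a href="//' -e 's/">$//' > link.txt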

zynd9foi 5#

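As per triplee's comment, parsing HTML or XML files with regular expressions is essentially not done. Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data such as XML, HTML, or JSON, they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Handling such delicate files takes a bit more finesse and a more targeted set of tools.
For parsing XML or HTML, one can easily use xmlstarlet.
For an XHTML file, you can use: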

xmlstarlet sel --html  -N "x=http://www.w3.org/1999/xhtml" \
               -t -m '//x:a/@href' -v . -n

where -N gives the XHTML namespace, if any; the namespace is recognizable from

<html xmlns="http://www.w3.org/1999/xhtml">

However, since HTML pages are often not well-formed XML, it can be handy to clean them up first with tidy. For the example above, this gives:

$ tidy -q -numeric -asxhtml --show-warnings no file.html \
  | xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
                   -t -m '//x:a/@href' -v . -n
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

vecaoik1 6#

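Assuming a well-formed HTML document with only one href link per line, here is an awk approach that needs neither backreferences nor regex capturing groups. The answer targets mawk and gawk; file is the input file. The field separator matches everything before and after the URL, so lines with a link have three fields and are printed with the empty outer fields dropped, while all other lines are suppressed:

gawk 'NF*=2<NF' OFS= FS='^.*<[Aa] [^>]*[Hh][Rr][Ee][Ff]=\"|\".*$' file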
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
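For readers who find the field-separator trick hard to follow, roughly the same extraction can be written with awk's standard match() function. A sketch under the same assumption of one double-quoted href per line; first_href_per_line is an illustrative name:

```shell
# Print the first href value of each line read from stdin.
# Assumption: one double-quoted href per line, as in the answer above.
first_href_per_line() {
    # match() sets RSTART/RLENGTH to the span of href="...";
    # skip the 6 characters of href=" and drop the closing quote.
    awk 'match($0, /[Hh][Rr][Ee][Ff]="[^"]*"/) {
        print substr($0, RSTART + 6, RLENGTH - 7)
    }'
}

# Usage (file names are placeholders):
# first_href_per_line < drawspace.txt > links.txt
```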

3npbholx 7#

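Here is a (more general) dash script that can compare the URLs (delimited by ://) in two files, or the URLs in one file against those in a group of files. Invoke the script with the --help flag to learn how to use it; the script should work out of the box on Linux and Mac OS: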

#!/bin/dash

PrintURLs () {
    extract_urls_command="$insert_NL_after_URLs_command|$strip_NON_URL_text_command"
    if [ "$domains_flag" = "1" ]; then
        extract_urls_command="$extract_urls_command|$get_domains_command"
    fi
    {
        eval path_to_search=\"\$$1\"
        current_file_group="$2"
        
        if [ ! "$skip_non_text_files_flag" = "1" ]; then
            printf "\033]0;%s\007" "Loading non text files from group [$current_file_group]...">"$print_to_screen"
            eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.docx' \\\) "$find_params" -exec unzip -q -c '{}' 'word/_rels/document.xml.rels' \\\;
            eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.xlsx' \\\) "$find_params" -exec unzip -q -c '{}' 'xl/worksheets/_rels/*' \\\;
            eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.pptx' -o -name '*.ppsx' \\\) "$find_params" -exec unzip -q -c '{}' 'ppt/slides/slide1.xml' \\\;
            eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.odt' -o -name '*.ods' -o -name '*.odp' \\\) "$find_params" -exec unzip -q -c '{}' 'content.xml' \\\;
            eval find \"\$path_to_search\" ! -type d ! -path '.' -a \\\( -name '*.pdf' \\\) "$find_params" -exec pdftotext '{}' '-' \\\;
        fi
        eval find \"\$path_to_search\" ! -type d ! -path '.' "$find_params"|{
            count=0
            while IFS= read file; do
                if [ ! "$(file -bL --mime-encoding "$file")" = "binary" ]; then
                    count=$((count+1))
                    printf "\033]0;%s\007" "Loading text files from group [$current_file_group] - file $count...">"$print_to_screen"
                    cat "$file"
                fi
            done
        }
        printf "\033]0;%s\007" "Extracting URLs from group [$current_file_group]...">"$print_to_screen"
    } 2>/dev/null|eval "$extract_urls_command"
}

StoreURLsWithLineNumbers () {
    
    count_all="0"
    mask="00000000000000000000"
    
    #For <file group 1>: initialise next variables:
    file_group="1"
    count=0
    
    dff_command_text=""
    if [ ! "$dff_command_flag" = "0" ]; then
        dff_command_text="Step $dff_command_flag - "
    fi
    
    for line in $(PrintURLs file_params_1 1; printf "%s$NL" "### Sepparator ###";    for i in $(seq 2 $file_params_0); do PrintURLs file_params_$i 2; done;); do
        if [ "$line" = "### Sepparator ###" ]; then
            eval lines$file_group\1\_0=$count
            eval lines$file_group\2\_0=$count
            
            #For <file group 2>: initialise next variables:
            file_group="2";
            count="0"
            continue;
        fi
        
        printf "\033]0;%s\007" "Storing URLs into memory [$dff_command_text""group $file_group]: $((count + 1))...">"$print_to_screen"
        count_all_prev=$count_all
        count_all=$((count_all+1))
        count=$((count+1))
        if [ "${#count_all_prev}" -lt "${#count_all}" ]; then
            mask="${mask%?}"
        fi
        number="$mask$count_all"
        
        eval lines$file_group\1\_$count=\"\$number\"
        eval lines$file_group\2\_$count=\"\$line\" #URL
    done;
    eval lines$file_group\1\_0=$count
    eval lines$file_group\2\_0=$count
}

GetOSType () {
    if [ -n "$1" ]; then
        case "$(uname -s)" in
        *"Darwin"* | *"BSD"* )
            eval $1="BSD-based"
            ;;
        *"Linux"* )
            eval $1="Linux"
            ;;
        * )
            eval $1="Other"
            ;;
        esac
    else
        echo 'ERROR: GetOSType: Expected 1 parameter!'>&2
        echo 'Press Enter to exit...'>"$print_to_screen"
        read temp
    fi
}

trap1 () {
    CleanUp
    #if not running in a subshell: print "Aborted"
    if [ "$dff_command_flag" = "0" ]; then
        printf "$NL""Aborted.""$NL">"$print_to_screen"
    fi
    
    #kill all children processes, suppressing "Terminated" message:
    kill -s PIPE -- -$$
    
    exit
}

CleanUp () {
    
    #Restore "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
    trap - INT
    trap - TSTP
    
    #Clear the title:
    printf "\033]0;%s\007" "">"$print_to_screen"
    
    #Restore initial IFS:
    #IFS="$initial_IFS"
    unset IFS
    
    #Restore initial directory:
    cd "$initial_dir"
    
    DestroyArray flag_params
    DestroyArray file_params
    DestroyArray find_params
    DestroyArray lines11
    DestroyArray lines12
    DestroyArray lines21
    DestroyArray lines22
    
    ##Kill current shell with PID $$:
    #kill -INT $$

}

DestroyArray () {
    eval eval array_length=\'\$$1\_0\'
    if [ -z "$array_length" ]; then array_length=0; fi
    for i in $(seq 1 $array_length); do
        eval unset $1\_$i
    done
    eval unset $1\_0
}

PrintErrorExtra () {
    {
    
        printf "%s$NL" "Command path:"
        printf "%s$NL" "$current_shell '$current_script_path'"
        
        printf "$NL"
        
        #Flag parameters are printed non-quoted:
        printf "%s$NL" "Flags:"
        for i in $(seq 1 $flag_params_0); do
            eval current_param="\"\$flag_params_$i\""
            printf "%s$NL" "$current_param"
        done
        if [ "$flag_params_0" = "0" ]; then printf "%s$NL" "<none>"; fi
        printf "$NL"
        
        #Path parameters are printed quoted with '':
        printf "%s$NL" "Paths:"
        for i in $(seq 1 $file_params_0); do
            eval current_param="\"\$file_params_$i\""
            printf "%s$NL" "'$current_param'"
        done
        if [ "$file_params_0" = "0" ]; then printf "%s$NL" "<none>"; fi
        printf "$NL"

        #Find parameters are printed quoted with '':
        printf "%s$NL" "'find' parameters:"
        for i in $(seq 1 $find_params_0); do
            eval current_param="\"\$find_params_$i\""
            printf "%s$NL" "'$current_param'"
        done
        if [ "$find_params_0" = "0" ]; then printf "%s$NL" "<none>"; fi
        printf "$NL"
    }>"$print_error_messages"
}

DisplayHelp () {
    printf "%s$NL" ""
    printf "%s$NL" " - A script to compare URLs ( containing '://' ) in a file compared to a group of files"
    printf "%s$NL" "     "
    printf "%s$NL" "     Usage:"
    printf "%s$NL" "         "
    printf "%s$NL" "         dash '/path/to/this/script.sh' <flags> '/path/to/file1' ... '/path/to/fileN' [ --find-parameters <find_parameters> ]"
    printf "%s$NL" "         where:"
    printf "%s$NL" "         - The group 1: '/path/to/file1' and the group 2: '/path/to/file2' ... '/path/to/fileN' - are considered the two groups of files to be compared"
    printf "%s$NL" "         "
    printf "%s$NL" "         - <flags> can be:"
    printf "%s$NL" "             --help"
    printf "%s$NL" "                 - displays this help information"
    printf "%s$NL" "             --different or -d"
    printf "%s$NL" "                 - find URLs that differ"
    printf "%s$NL" "             --common or -c"
    printf "%s$NL" "                 - find URLs that are common"
    printf "%s$NL" "             --domains"
    printf "%s$NL" "                 - compare and print only the domains (plus subdomains) of the URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag"
    printf "%s$NL" "             --domains-full"
    printf "%s$NL" "                 - compare only the domains (plus subdomains) of the URLs but print the full URLs for: the group 1 and the group 2 - for the '-c' or the '-d' flag"
    printf "%s$NL" "             --preserve_order or -p"
    printf "%s$NL" "                 - preserve the order and the occurrences in which the links appear in group 1 and in group 2"
    printf "%s$NL" "                 - Warning: when using this flag - process substitution is used by this script - which does not work with the \"dash\" shell (throws an error). For this flag, you can use other \"dash\" syntax compatible shells, like: bash, zsh, ksh"
    printf "%s$NL" "             --skip-non-text"
    printf "%s$NL" "                 - skip non-text files from search (does not look into: .docx, .xlsx, .pptx, .ppsx, .odt, .ods, .odp and .pdf files)"
    printf "%s$NL" "             --find-parameters <find_parameters>"
    printf "%s$NL" "                 - all the parameters given after this flag, are considered 'find' parameters"
    printf "%s$NL" "                 - <find_parameters> can be: any parameters that can be passed to the 'find' utility (which is used internally by this script) - such as: name/path filters"
    printf "%s$NL" "             -h"
    printf "%s$NL" "                 - also look in hidden files"
    printf "%s$NL" "     "
    printf "%s$NL" "     Output:"
    printf "%s$NL" "         - '<' - denote URLs from the group 1: '/path/to/file1'"
    printf "%s$NL" "         - '>' - denote URLs from the group 2: '/path/to/file2' ... '/path/to/fileN'"
    printf "%s$NL" "     "
    printf "%s$NL" "     Other commands that might be useful:"
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines containing string (highlight):"
    printf "%s$NL" "             ...|grep \"string\""
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines not containing string:"
    printf "%s$NL" "             ...|grep -v \"string\""
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines containing: string1 or string2 or ... stringN:"
    printf "%s$NL" "             ...|awk '/string1|string2|...|stringN/'"
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines not containing: string1 or string2 or ... stringN:"
    printf "%s$NL" "             ...|awk '"'!'"/string1|string2|...|stringN/'"
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines in '/file/path/2' that are in '/file/path/1':"
    printf "%s$NL" "             grep -F -f '/file/path/1' '/file/path/2'"
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print lines in '/file/path/2' that are not in '/file/path/1':"
    printf "%s$NL" "             grep -F -vf '/file/path/1' '/file/path/2'"
    printf "%s$NL" "         "
    printf "%s$NL" "         - filter results - print columns <1> and <2> from output:"
    printf "%s$NL" "             awk '{print \$1, \$2}'"
    printf "%s$NL" ""
}

# Print to "/dev/tty" = Print error messages to screen only
print_to_screen="/dev/tty"

#print_error_messages='&2'
print_error_messages="$print_to_screen"

initial_dir="$PWD" #Store initial directory value

initial_IFS="$IFS" #Store initial IFS value

GetOSType OS_TYPE

if [ "$OS_TYPE" = "Linux" -o "$OS_TYPE" = "Other" ]; then
    NL=$(printf '%s' "\n")
    #or:
    #NL=$'\n'
    
    NL2=$(printf '%s' "\n\n") #Store New Line for use with sed
elif [ "$OS_TYPE" = "BSD-based" ]; then
    NL=$(printf '%s' "\r")
    #or:
    #NL=$'\r'
    
    NL2=$(printf '%s' "\r\r") #Store New Line for use with sed
fi

insert_NL_after_URLs_command='sed -E '"'"'s/([a-zA-Z]*\:\/\/)/'"\\${NL2}"'\1/g'"'"
strip_NON_URL_text_command='sed -n '"'"'s/\(\(.*\([^a-zA-Z+]\)\|\([a-zA-Z]\)\)\)\(\([a-zA-Z]\)*\:\/\/\)\([^ ^\t^>^<]*\).*/\4\5\7/p'"'"
get_domains_command='sed '"'"'s/.*:\/\/\(.*\)/\1/g'"'"'|sed '"'"'s/\/.*//g'"'"
prepare_for_output_command='sed -E '"'"'s/ *([0-9]*)[\ *](<|>) *([0-9]*)[\ *](.*)/\2 \4 \1/g'"'"
remove_angle_brackets_command='sed -E '"'"'s/(<|>) (.*)/\2/g'"'"

find_params=""

#Process parameters:

different_flag="0"
common_flag="0"
domains_flag="0"
domains_full_flag="0"
preserve_order_flag="0"
dff_command1_flag="0"
dff_command2_flag="0"
dff_command3_flag="0"
dff_command4_flag="0"
dff_command_flag="0"
skip_non_text_files_flag="0"
find_parameters_flag="0"
hidden_files_flag="0"
help_flag="0"

flag_params_count=0
file_params_count=0
find_params_count=0

for param; do
    if [ "$find_parameters_flag" = "0" ]; then
        case "$param" in
            "--different" | "-d" | "--common" | "-c" | "--domains" | \
            "--domains-full" | "--preserve_order" | "-p" | "--dff_command1" | "--dff_command2" | \
            "--dff_command3" | "--dff_command4" | "--skip-non-text" | "--find-parameters" | "-h" | \
            "--help" )
                flag_params_count=$((flag_params_count+1))
                eval flag_params_$flag_params_count=\"\$param\"
                case "$param" in
                    "--different" | "-d" )
                        different_flag="1"
                    ;;
                    "--common" | "-c" )
                        common_flag="1"
                    ;;
                    "--domains" )
                        domains_flag="1"
                    ;;
                    "--domains-full" )
                        domains_full_flag="1"
                    ;;
                    "--preserve_order" | "-p" )
                        preserve_order_flag="1"
                    ;;
                    "--dff_command1" )
                        dff_command1_flag="1"
                        dff_command_flag="1"
                    ;;
                    "--dff_command2" )
                        dff_command2_flag="1"
                        dff_command_flag="2"
                    ;;
                    "--dff_command3" )
                        dff_command3_flag="1"
                        dff_command_flag="3"
                    ;;
                    "--dff_command4" )
                        dff_command4_flag="1"
                        dff_command_flag="4"
                    ;;
                    "--skip-non-text" )
                        skip_non_text_files_flag="1"
                    ;;
                    "--find-parameters" )
                        find_parameters_flag="1"
                    ;;
                    "-h" )
                        hidden_files_flag="1"
                    ;;
                    "--help" )
                        help_flag="1"
                    ;;
                esac
            ;;
            * )
                file_params_count=$((file_params_count+1))
                eval file_params_$file_params_count=\"\$param\"
            ;;
        esac
    elif [ "$find_parameters_flag" = "1" ]; then
        find_params_count=$((find_params_count+1))
        eval find_params_$find_params_count=\"\$param\"
    fi
done
flag_params_0="$flag_params_count"
file_params_0="$file_params_count"
find_params_0="$find_params_count"

if [ "$help_flag" = "1" -o \( "$file_params_0" = "0" -a "$find_params_0" = "0" \) ]; then
    DisplayHelp
    exit 0
fi

#Check if any of the necessary utilities is missing:

error="false"
man -f find >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'find' utility is not installed!"; error="true"; }
man -f file >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'file' utility is not installed!"; error="true"; }
man -f kill >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'kill' utility is not installed!"; error="true"; }
man -f seq >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'seq' utility is not installed!"; error="true"; }
man -f ps >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'ps' utility is not installed!"; error="true"; }
man -f sort >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'sort' utility is not installed!"; error="true"; }
man -f uniq >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'uniq' utility is not installed!"; error="true"; }
man -f sed >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'sed' utility is not installed!"; error="true"; }
man -f grep >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'grep' utility is not installed!"; error="true"; }

if [ "$skip_non_text_files_flag" = "0" ]; then
    man -f unzip >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'unzip' utility is not installed!"; error="true"; }
    man -f pdftotext >/dev/null 2>/dev/null || { printf '\n%s\n' "ERROR: the 'pdftotext' utility is not installed!"; error="true"; }
fi

if [ "$error" = "true" ]; then
    printf "\n"
    CleanUp; exit 1
fi

#Process parameters/flags and check for errors:

find_params="$(for i in $(seq 1 $find_params_0;); do eval printf \'\%s \' "\'\$find_params_$i\'"; done;)"
if [ -z "$find_params" ]; then
    find_params='-name "*"'
fi

if [ "$hidden_files_flag" = "1" ]; then
    hidden_files_string=""
elif [ "$hidden_files_flag" = "0" ]; then
    hidden_files_string="\( "'! -path '"'"'*/.*'"'"" \)"
fi

find_params="$hidden_files_string"" -a ""$find_params"

current_shell="$(ps -p $$ 2>/dev/null)"; current_shell="${current_shell##*" "}"
current_script_path=$(cd "${0%/*}" 2>/dev/null; printf '%s' "$(pwd -P)/${0##*/}")

error="false"

if [ "$different_flag" = "0" -a "$common_flag" = "0" ]; then
    error="true"
    printf '\n%s\n' "ERROR: Expected either -c or -d flag!">"$print_error_messages"
elif [ "$different_flag" = "1" -a "$common_flag" = "1" ]; then
    error="true"
    printf '\n%s\n' "ERROR: The '-c' flag cannot be used together with the '-d' flag!">"$print_error_messages"
fi

if [ "$preserve_order_flag" = "1" -a "$common_flag" = "1" ]; then
    error="true"
    printf '\n%s\n' "ERROR: The '-p' flag cannot be used together with the '-c' flag!">"$print_error_messages"
fi

if [ "$preserve_order_flag" = "1" -a "$current_shell" = "dash" ]; then
    error="true"
    printf '\n%s\n' "ERROR: When using the '-p' flag, the \"process substitution\" feature is needed, which is not available in the dash shell (it is available in shells like: bash, zsh, ksh)!">"$print_error_messages"
fi

eval find \'/dev/null\' "$find_params">/dev/null 2>&1||{
    error="true"
    printf '\n%s\n' "ERROR: Invalid parameters for the 'find' command!">"$print_error_messages"
}

if [ "$error" = "true" ]; then
    printf "\n"
    PrintErrorExtra
    CleanUp; exit 1;
fi

#Check if the file paths given as parameters do exist:
error="false"
for i in $(seq 1 $file_params_0); do
    eval current_file=\"\$file_params_$i\"
    # If current <file> does not exist:
    if [ ! -e "$current_file" ]; then # If current file does not exist:
        printf '\n%s\n' "ERROR: File '$current_file' does not exist or is not accessible!">"$print_error_messages"
        error="true"
    elif [ ! -r "$current_file" ]; then # If current file is not readable:
        printf '\n%s\n' "ERROR: File <$i> = '$current_file' is not readable!">"$print_error_messages"
        error="true"
    fi
done

if [ "$error" = "true" ]; then
    printf "\n"
    PrintErrorExtra
    CleanUp; exit 1;
fi

#Proceed to finding and comparing URLs:

IFS='
'

#Trap "INTERRUPT" (CTRL-C) and "TERMINAL STOP" (CTRL-Z) signals:
trap 'trap1' INT
trap 'trap1' TSTP

if [ "$domains_full_flag" = "0" -o ! "$dff_command_flag" = "0" ]; then
    
    StoreURLsWithLineNumbers

fi

if [ "$domains_full_flag" = "0" ]; then
    if [ "$preserve_order_flag" = "0" ]; then
        {
            for i in $(seq 1 $lines11_0); do
                printf "\033]0;%s\007" "Processing group [1] - URL: $i...">"$print_to_screen"
                eval printf \'\%s\\\n\' \"\< \$lines11_$i \$lines12_$i\"
            done|sort -k 3|uniq -c -f 2
            for i in $(seq 1 $lines21_0); do
                printf "\033]0;%s\007" "Processing group [2] - URL: $i...">"$print_to_screen"
                eval printf \'\%s\\\n\' \"\> \$lines21_$i \$lines22_$i\"
            done|sort -k 3|uniq -c -f 2
        }|sort -k 4|{
            if [ "$different_flag" = "1" ]; then
                uniq -u -f 3|sort -k 3|eval "$prepare_for_output_command"
            elif [ "$common_flag" = "1" ]; then
                uniq -d -f 3|sort -k 3|eval "$prepare_for_output_command"|eval "$remove_angle_brackets_command"
            fi
        }
    elif [ "$preserve_order_flag" = "1" ]; then
        
        if [ "$different_flag" = "1" ]; then
            {
                URL_count=0
                current_line=""
                for line in $(eval diff \
                        \<\(\
                            count1=0\;\
                            for i in \$\(seq 1 \$lines11_0\)\; do\
                                count1=\$\(\(count1 + 1\)\)\;\
                                eval URL=\\\"\\\$lines12_\$i\\\"\;\
                                printf \'\%s\\n\' \"File group: 1 URL: \$count1\"\;\
                                printf \'\%s\\n\' \"\$URL\"\;\
                            done\;\
                            printf \'\%s\\n\' \"\#\#\# Sepparator 1\"\;\
                        \) \
                        \<\(\
                            count2=0\;\
                            for i in \$\(seq 1 $lines21_0\)\; do\
                                count2=\$\(\(count2 + 1\)\)\;\
                                eval URL=\\\"\\\$lines22_\$i\\\"\;\
                                printf \'\%s\\n\' \"File group: 2 URL: \$count2\"\;\
                                printf \'\%s\\n\' \"\$URL\"\;\
                            done\;\
                            printf \'\%s\\n\' \"\#\#\# Sepparator 2\"\;\
                        \) \
                    ); do
                    URL_count=$((URL_count + 1))
                    previous_line="$current_line"
                    current_line="$line"
                    #if ( current line starts with "<" and previous line starts with "<" ) OR ( current line starts with ">" and previous line starts with ">" ):
                    if [ \( \( ! "${current_line#"<"}" = "${current_line}" \) -a \( ! "${previous_line#"<"}" = "${previous_line}" \) \) -o \( \( ! "${current_line#">"}" = "${current_line}" \) -a  \( ! "${previous_line#">"}" = "${previous_line}" \) \) ]; then
                        printf "%s$NL" "$previous_line"
                    fi
                done
            }
        
        fi
    fi
elif [ "$domains_full_flag" = "1" ]; then
    # Command to find common domains:
    script_command1="$current_shell '$current_script_path' -c --domains $(for i in $(seq 1 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
    
    # URLs that are only in first parameter file (file group 1):
    script_command2="$current_shell '$current_script_path' -d '$file_params_1' \"/dev/null\""
    
    # Command to find common domains:
    script_command3="$current_shell '$current_script_path' -c --domains $(for i in $(seq 1 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
    
    # URLs that are only in 2..N parameter files (file group 2):
    script_command4="$current_shell '$current_script_path' -d \"/dev/null\" $(for i in $(seq 2 $file_params_0); do eval printf \'%s \' \\\'\$file_params_$i\\\'; done)"
    
    #Store one <command substitution> at a time (synchronously):
    script_command1_output="$(eval $script_command1 --dff_command1 --find-parameters "$find_params"|sed 's/\([^ *]\) \(.*\)/\1/')"
    script_command2_output="$(eval $script_command2 --dff_command2 --find-parameters "$find_params")"
    script_command3_output="$(eval $script_command3 --dff_command3 --find-parameters "$find_params"|sed 's/\([^ *]\) \(.*\)/\1/')"
    script_command4_output="$(eval $script_command4 --dff_command4 --find-parameters "$find_params")"
    
    if [ "$different_flag" = "1" ]; then
    # Find URLs (second escaped process substitution: \<\(...\)) that are not in the common domains list (first escaped process substitution: \<\(...\)):
        # URLs in the first file given as parameter (second escaped process substitution: \<\(...\)):
        eval grep \-F \-vf \<\( printf \'\%s\' \"\$script_command1_output\"\; \) \<\( printf \'\%s\' \"\$script_command2_output\"\; \)
        # URLs in the files 2..N - given as parameters (second escaped process substitution: \<\(...\)):
        eval grep \-F \-vf \<\( printf \'\%s\' \"\$script_command3_output\"\; \) \<\( printf \'\%s\' \"\$script_command4_output\"\; \)
    elif [ "$common_flag" = "1" ]; then
    # Find URLs (second escaped process substitution: \<\(...\)) that are in the common domains list (first escaped process substitution: \<\(...\)):
        # URLs in the first file given as parameter (second escaped process substitution: \<\(...\)):
        eval grep \-F \-f \<\( printf \'\%s\' \"\$script_command1_output\"\; \) \<\( printf \'\%s\' \"\$script_command2_output\"\; \)
        # URLs in the files 2..N - given as parameters (second escaped process substitution: \<\(...\)):
        eval grep \-F \-f \<\( printf \'\%s\' \"\$script_command3_output\"\; \) \<\( printf \'\%s\' \"\$script_command4_output\"\; \)
    fi
    # grep flags explained:
    #    -F = do not interpret pattern string (treat string literally)
    #    -v = select non-matching lines
fi

CleanUp

For the question asked, this should do it:

dash '/path/to/the/above/script.sh' -d '/path/to/file1/containing/URLs.txt' '/dev/null'
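For a rough idea of what the -d comparison amounts to, here is a much smaller sketch. It is not the script above and covers only plain text files. It assumes a URL is any run of characters of the form scheme://... that ends at a space, quote, or angle bracket; list_urls and diff_urls are illustrative names:

```shell
# Print the URLs (anything shaped like scheme://...) found on stdin,
# sorted and de-duplicated.
list_urls() {
    grep -o '[a-zA-Z][a-zA-Z]*://[^ <>"]*' | sort -u
}

# Print URLs that occur in only one of two files:
# "<" marks URLs only in the first file, ">" URLs only in the second.
diff_urls() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
    list_urls < "$1" > "$tmp1"
    list_urls < "$2" > "$tmp2"
    comm -23 "$tmp1" "$tmp2" | sed 's/^/< /'
    comm -13 "$tmp1" "$tmp2" | sed 's/^/> /'
    rm -f "$tmp1" "$tmp2"
}

# Usage, mirroring the command above (the path is a placeholder):
# diff_urls '/path/to/file1/containing/URLs.txt' /dev/null
```

Comparing against /dev/null lists every URL of the first file, which is why that invocation answers the original question.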
