nodriver: how to get the browser's current URL

nodriver is a library for web automation, web scraping, and bots. It is available at:
https://github.com/ultrafunkamsterdam/nodriver

There are several ways to get the current URL in nodriver. One of them is:

async def nodriver_current_url(tab):
    is_quit_bot = False
    exit_bot_error_strings = [
        "server rejected WebSocket connection: HTTP 500",
        "[Errno 61] Connect call failed ('127.0.0.1',",
        "[WinError 1225] ",
    ]

    url = ""
    if tab:
        url_dict = {}
        try:
            url_dict = await tab.js_dumps('window.location.href')
        except Exception as exc:
            print(exc)
            str_exc = str(exc)
            # Treat known connection failures as a signal to stop the bot.
            for each_error_string in exit_bot_error_strings:
                if each_error_string in str_exc:
                    is_quit_bot = True

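        # The loop below expects the result to be a dict keyed by character
        # index, with each character stored under the key "0".
        url_array = []
        if url_dict:
            for k in url_dict:
                if k.isnumeric():
                    if "0" in url_dict[k]:
                        url_array.append(url_dict[k]["0"])
            url = ''.join(url_array)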
    return url, is_quit_bot

This solution uses JavaScript to read window.location.href.
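
The reassembly step at the end of that function can be pulled out into a small pure helper, which makes it easier to test without a browser. This is only a minimal sketch: the dict shape (character index as key, character under "0") is inferred from the loop above and not checked against the nodriver documentation, and the name join_js_dumps_string and the sample data are hypothetical. Unlike the original loop, it sorts the keys numerically instead of relying on dict order.

```python
def join_js_dumps_string(url_dict):
    """Rebuild a string from a dict shaped like {"0": {"0": "h"}, "1": {"0": "t"}, ...}."""
    if not url_dict:
        return ""
    # Keep only numeric keys and sort them by integer value,
    # so that "10" comes after "9" rather than after "1".
    numeric_keys = sorted((k for k in url_dict if str(k).isnumeric()), key=int)
    chars = []
    for k in numeric_keys:
        entry = url_dict[k]
        if isinstance(entry, dict) and "0" in entry:
            chars.append(entry["0"])
    return "".join(chars)


# Hypothetical sample in the assumed shape.
sample = {str(i): {"0": ch} for i, ch in enumerate("https://example.com/")}
sample["length"] = len(sample)  # non-numeric keys are ignored
print(join_js_dumps_string(sample))
```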

Another solution:

async def nodriver_current_url(driver, tab):
    exit_bot_error_strings = [
        "server rejected WebSocket connection: HTTP 500",
        "[Errno 61] Connect call failed ('127.0.0.1',",
        "[WinError 1225] ",
    ]
    # return value
    url = ""
    is_quit_bot = False
    last_active_tab = None

    driver_info = await driver._get_targets()
    if tab.target not in driver_info:
        print("Tab may have been closed by the user, or a confirm dialog popped up.")
        tab = None
        await driver
        try:
            for i, each_tab in enumerate(driver):
                target_info = each_tab.target.to_json()
                target_url = ""
                if target_info:
                    if "url" in target_info:
                        target_url = target_info["url"]
                if len(target_url) > 4:
                    if target_url[:4]=="http" or target_url == "about:blank":
                        print("found tab url:", target_url)
                        last_active_tab = each_tab
        except Exception as exc:
            print(exc)
            if str(exc) == "list index out of range":
                print("Browser closed, start to exit bot.")
                is_quit_bot = True
                tab = None
                last_active_tab = None

        if last_active_tab is not None:
            tab = last_active_tab

    if tab:
        try:
            target_info = tab.target.to_json()
            if target_info:
                if "url" in target_info:
                    url = target_info["url"]
            #url = await tab.evaluate('window.location.href')
        except Exception as exc:
            print(exc)
            str_exc = str(exc)
            if str_exc == "server rejected WebSocket connection: HTTP 404":
                print("nodriver is not ready yet. Please wait until this message stops appearing before using it.")

            # Treat known connection failures as a signal to stop the bot.
            for each_error_string in exit_bot_error_strings:
                if each_error_string in str_exc:
                    is_quit_bot = True
    return url, is_quit_bot, last_active_tab

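Compared with the previous solution, this one also returns the last active tab. Instead of reading 'window.location.href', it takes the URL directly from the tab's target info (tab.target.to_json()). In theory this should be slightly more efficient, because it does not go through the higher-level and more complex js_dumps call; the only awaited calls are on driver (driver._get_targets() and await driver).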
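
The tab-selection rule in that function (keep the last tab whose URL starts with "http" or is exactly "about:blank") can be sketched on plain dicts, without a running browser. The helper names is_usable_tab_url and pick_last_usable, and the sample target-info dicts, are hypothetical; real nodriver targets carry more fields than shown here.

```python
def is_usable_tab_url(url):
    """Same filter as the loop above: http(s) pages or a blank tab."""
    return url.startswith("http") or url == "about:blank"


def pick_last_usable(target_infos):
    """Return the index of the last entry with a usable URL, or None."""
    last = None
    for index, info in enumerate(target_infos):
        url = (info or {}).get("url", "")
        if is_usable_tab_url(url):
            last = index
    return last


# Hypothetical target-info dicts, standing in for each_tab.target.to_json().
infos = [
    {"url": "chrome://newtab/"},
    {"url": "https://example.com/"},
    {"url": "about:blank"},
    {"type": "service_worker"},
]
print(pick_last_usable(infos))
```

Keeping the filter in its own function means the "which tab counts" rule can be changed in one place.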
