Using Python to remove HTML tags/formatting from a string


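When using selenium, element.text used to return the TEXT content correctly. Strangely, with selenium 4.13.0 + python 3.9.13 on Win 10, it now sometimes works and sometimes fails. The content retrieved is definitely correct; the innerHTML I get looks like this: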

<div data-v-337697a8="" class="sesstion-item"><div data-v-337697a8="" class="row pa-4 flex-column flex-sm-row no-gutters"><div data-v-337697a8="" class="col-sm-10 col-md-10 col-12"><div data-v-337697a8="" class="row mx-0 flex-column flex-md-row no-gutters"><div data-v-337697a8="" class="d-flex text-left font-weight-bold text-regular py-2 is-word-break col-sm-12 col-md-4 col-12 align-self-center">
                2023 JO1 1ST ASIA TOUR 'BEYOND THE DARK' LIMITED EDITION IN TAIPEI
              </div><div data-v-337697a8="" class="px-md-2 d-flex justify-md-center align-left align-md-center col-sm-12 col-md-2 col-12"><div data-v-337697a8="" class="d-flex mb-2 mb-md-0"><div data-v-337697a8="" class="grey--text text--darken-1 d-block d-md-none mr-2 wordBreakKeepAll">日期</div><div data-v-337697a8="">2023-11-11(六)</div></div></div><div data-v-337697a8="" class="px-md-2 d-flex justify-md-center align-md-center col-sm-12 col-md-2 col-12"><div data-v-337697a8="" class="d-flex mb-2 mb-md-0"><div data-v-337697a8="" class="grey--text text--darken-1 d-block d-md-none mr-2 wordBreakKeepAll">時間</div><div data-v-337697a8="">19:00</div></div></div><div data-v-337697a8="" class="px-md-2 d-flex justify-md-center align-md-center is-word-break col-sm-12 col-md-4 col-12"><div data-v-337697a8="" class="d-flex mb-2 mb-md-0"><div data-v-337697a8="" class="grey--text text--darken-1 d-block d-md-none mr-2 wordBreakKeepAll">地點</div><div data-v-337697a8="" class="text-left text-md-center location-content">
                    Zepp New Taipei
                    <div data-v-337697a8="" class="grey--text text--darken-3 text-small d-inline d-md-block">
                      新北市新莊區新北大道四段3號8樓
                    </div></div></div></div></div></div><div data-v-337697a8="" class="font-weight-bold col-sm-2 col-md-2 col-12 align-self-center"><hr data-v-337697a8="" role="separator" aria-orientation="horizontal" class="d-block d-sm-none mb-5 mt-3 v-divider theme--light"><div data-v-9536f93e="" data-v-337697a8=""><!----><!----><!----><!----><button data-v-9536f93e="" type="button" class="nextBtn float-right v-btn v-btn--block v-btn--has-bg v-btn--rounded theme--dark v-size--default"><span class="v-btn__content">立即購票</span></button><!----></div></div></div><!----></div>

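Reading the content with .text returned an empty value! After a few more tries it occasionally returned the correct text. Since innerHTML is available, I might as well strip the tags myself.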

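import re

# Non-greedy match of everything between '<' and '>'; DOTALL also covers tags that span lines.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def striphtml(data):
    # Delete every tag (and HTML comments such as <!---->), keeping only the text.
    return TAG_RE.sub('', data)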

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'
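
A regex works for simple fragments like the one above, but it leaves entities such as &amp; undecoded and glues adjacent blocks together (日期2023-11-11(六) for the sample innerHTML). As an alternative sketch, not the method used in this post, the standard library's html.parser can collect the text nodes instead. The names _TextExtractor and strip_html_text are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html_text(markup):
    """Return the text of `markup` with whitespace runs collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Join chunks with a space so neighbouring blocks do not run together,
    # then collapse newlines and indentation into single spaces.
    return " ".join(" ".join(parser.chunks).split())
```

Joining the chunks with a space keeps the contents of neighbouring div elements apart, at the cost of also splitting inline markup such as <b>bo</b>ld into two words.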

Below is the syntax for retrieving the HTML, which differs by language binding. Change innerHTML to outerHTML if you also need the element's own tag.

Python:

element.get_attribute('innerHTML')
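
Putting the pieces together, a small helper can try .text first and fall back to stripping innerHTML only when .text comes back empty. This is a minimal sketch under assumptions: element_text is a hypothetical name, and element is anything that exposes .text and get_attribute() the way a Selenium WebElement does. One possible cause of the empty value, which this post does not confirm, is that .text only returns text Selenium considers visible.

```python
import re

# Matches a single tag or comment: '<', anything except '>', then '>'.
_TAG_RE = re.compile(r"<[^>]*>")


def element_text(element):
    """Best-effort text of a WebElement-like object (hypothetical helper)."""
    text = (element.text or "").strip()
    if text:
        return text
    # .text was empty: fall back to the raw markup and strip the tags.
    html = element.get_attribute("innerHTML") or ""
    # Replace tags with a space so adjacent blocks stay separated,
    # then collapse all whitespace runs into single spaces.
    return " ".join(_TAG_RE.sub(" ", html).split())
```

With a real driver this would be called as element_text(driver.find_element(...)), using whatever locator already finds the element.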
