[Python Experiment] Text Cleaning Instead of XPath! A Simple Report Info Parser

2025-06-14 08:23:36

This code is a simple parser that extracts the company name, fiscal year, and report type

from the report HTML served by the DART system,

using BeautifulSoup and regular expressions.

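Previously, element positions were hardcoded as XPath expressions,

but switching to text-based cleaning lets the parser keep working when the page layout changes, as long as the visible text stays the same.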
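To make the flexibility claim concrete, here is a minimal sketch, not taken from the original parser. The helper name `report_type_from_html` and both HTML snippets are made up for illustration. The same title is wrapped in two different layouts, and because the lookup runs on the extracted text rather than on a fixed node path, both produce the same result.

```python
import re

from bs4 import BeautifulSoup

# Same pattern the parser uses for the report type
REPORT_TYPE = re.compile(r"(분기보고서|반기보고서|사업보고서)")


def report_type_from_html(html):
    """Hypothetical helper: find the report type in the page text, ignoring layout."""
    text = BeautifulSoup(html, "html.parser").get_text()
    # Remove spaces and tabs so a spaced-out title like "사 업 보 고 서" still matches
    text = re.sub(r"[ \t]+", "", text)
    match = REPORT_TYPE.search(text)
    return match.group(1) if match else "N/A"


# Two invented layouts for the same title: a table cell and nested div/p/b tags
layout_table = "<table><tr><td>사 업 보 고 서</td></tr></table>"
layout_div = "<div><p><b>사업보고서</b></p><p>(제 56 기)</p></div>"

print(report_type_from_html(layout_table))
print(report_type_from_html(layout_div))
```

An XPath written for the table layout would find nothing in the div layout, whereas the text-based lookup does not care which tags wrap the title.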

import re

import requests
from bs4 import BeautifulSoup


def parse_report_info(url):
    # Fetch the report page; fail fast on HTTP errors or a hung connection
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text()

    # Trim each line and drop empty lines
    lines = text.splitlines()
    cleaned_lines = [line.strip() for line in lines if line.strip() != ""]
    cleaned_text = "\n".join(cleaned_lines)

    # Keep line breaks, but remove spaces and tabs
    cleaned_text = re.sub(r"[ \t]+", "", cleaned_text)

    # Report type (quarterly, half-year, or annual report)
    report_match = re.search(r"(분기보고서|반기보고서|사업보고서)", cleaned_text)
    report_type = report_match.group(1) if report_match else "N/A"

    # Fiscal year: the first "20XX년" that appears in the text
    year_match = re.search(r"(20[0-9]{2})년", cleaned_text)
    bsns_year = year_match.group(1) if year_match else "N/A"

    # Company name: text after "회사명:" up to "대표" or the end of that line.
    # re.MULTILINE lets "$" match at each line end, not only at the end of the text.
    corp_match = re.search(
        r"회\s*사\s*명\s*:\s*([가-힣㈜\w\s]+?)(?=\s*대\s*표|$)",
        cleaned_text,
        re.MULTILINE,
    )
    corp_name = corp_match.group(1).strip() if corp_match else "N/A"

    return {
        "corp_name": corp_name,
        "bsns_year": bsns_year,
        "report_type": report_type,
    }

※ This code is only an example; the regular expressions will need adjusting depending on the full body text and the structure of each report :)
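Since the patterns need tuning per report, one possible approach is to try them against sample text offline, without hitting the network each time. The sketch below is only an illustration: `extract_fields`, the sample text, and the company name 가나다전자㈜ are all invented, and the patterns are the ones from the parser above with `re.MULTILINE` added to the company-name search.

```python
import re


def extract_fields(raw_text):
    """Hypothetical helper: run the cleaning step and regexes on already-extracted text."""
    # Trim each line, drop empty ones, then remove spaces and tabs (line breaks stay)
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    cleaned = re.sub(r"[ \t]+", "", "\n".join(lines))

    report = re.search(r"(분기보고서|반기보고서|사업보고서)", cleaned)
    year = re.search(r"(20[0-9]{2})년", cleaned)
    corp = re.search(
        r"회\s*사\s*명\s*:\s*([가-힣㈜\w\s]+?)(?=\s*대\s*표|$)",
        cleaned,
        re.MULTILINE,
    )
    return {
        "corp_name": corp.group(1).strip() if corp else "N/A",
        "bsns_year": year.group(1) if year else "N/A",
        "report_type": report.group(1) if report else "N/A",
    }


# Invented cover-page text, roughly in the shape get_text() might return
SAMPLE = """
    사 업 보 고 서
    (제 56 기)
    사업연도  2024년 01월 01일 부터
              2024년 12월 31일 까지
    회 사 명 :  가나다전자㈜
    대 표 이 사 :  홍길동
"""

print(extract_fields(SAMPLE))
```

Keeping the regex logic in a function that takes plain text makes it easy to paste in a snippet from a report that parses badly and adjust one pattern at a time.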
