Environmental Big Data Analysis and Service Development Ⅲ (환경 빅데이터 분석 및 서비스 개발 Ⅲ)

강성원

Ⅰ. Background and Aims of Research

□ We apply big data analysis methodology to environmental policy research.
□ We use machine learning to build an ‘Environmental Policy Monitoring System’ that periodically identifies environmental policy needs and assesses the timeliness and effectiveness of environmental policy.
  ㅇ Identifying needs: pollution prediction, policy consumer text analysis, environment-related social media sentiment analysis, and issue-based data analysis
    - Pollution prediction: early detection of areas that need policy intervention
    - Policy consumer text analysis: detect policy areas that draw consumers’ attention
    - Social media sentiment analysis: detect environmental issues associated with negative sentiment
    - Issue-based data analysis: link preselected environmental issues to simple data analyses and identify the issues whose analysis outcomes are unfavorable to the environment
  ㅇ Timeliness assessment: analyze text produced by the policy provider to check whether its keywords and topics match the policy needs at hand
  ㅇ Effectiveness assessment: check for improvement in pollution, in social media sentiment, and in environmental issues
    - Pollution improvement: compare pollution estimates before and after policy intervention
    - Social media sentiment: compare social media sentiment before and after policy intervention
    - Environmental issue improvement: re-check the issues whose data analysis outcomes were unfavorable to the environment before policy intervention
□ We begin to construct the three components of the ‘Environmental Policy Monitoring System’: the ‘Deep Learning-Based Pollution Prediction algorithm’, the ‘Environmental Text Analysis algorithm’, and the ‘Issue-Based Database’.
  ㅇ Deep Learning-Based Pollution Prediction algorithm: an air pollution prediction algorithm that estimates the concentrations of 6 air pollutants, and a water pollution prediction algorithm that estimates chlorophyll-a pollution
  ㅇ Environmental Text Analysis algorithm: a text mining algorithm for text produced by policy consumers and the policy provider, and a sentiment analysis algorithm for climate change-related social media
  ㅇ Issue-Based Database: an environmental issue network and a data analysis for each issue in the network
□ We also carry out two studies that cannot be integrated into the ‘Environmental Policy Monitoring System’.
  ㅇ Deep Learning-Based Wind Power Generation Prediction: estimate wind power generation using climate data
  ㅇ Deep Learning-Based Death Risk Estimation of Korean Senior COPD Patients: estimate the effect of PM pollution on the death risk of COPD patients using NHI (National Health Insurance) data

Ⅱ. Environmental Policy Monitoring System

1. Deep Learning-Based Pollution Prediction algorithm

□ Air pollution prediction algorithm: a CNN algorithm that estimates the concentrations of 6 air pollutants (PM10, PM2.5, O3, CO, SO2, NO2)
  ㅇ Estimates air pollution for each cell of a 10 km x 10 km grid covering South Korea, 1 to 24 hours in advance
  ㅇ Reduces the RMSE of the CO, PM10, and PM2.5 predictions to 14.8-44.0% of the sample standard deviation
  ㅇ Predicts ‘high concentrations’ of PM10 with 90.4% accuracy and ‘high concentrations’ of PM2.5 with 92.2% accuracy
□ Chlorophyll-a pollution prediction algorithm: a CNN algorithm that estimates chlorophyll-a pollution at 29 water quality monitoring stations on the 4 major rivers one day in advance
  ㅇ Reduces the RMSE of the chlorophyll-a prediction to 30.3% of the sample standard deviation

2. Environmental Text Analysis algorithm

□ Environmental text mining algorithm: periodically collects text produced by policy consumers and the policy provider, and performs topic derivation and keyword analysis
  ㅇ Naver environmental news serves as the text produced by policy consumers, and press releases from the Ministry of Environment serve as the text produced by the policy provider
  ㅇ Performs keyword frequency counts, keyword network extraction, automatic text summarization, keyword group extraction, and keyword group composition calculation
  ㅇ Accumulates data and updates results twice a day
□ Climate change-related social media sentiment analysis algorithm: automates pre-processing and combines four sentiment analysis algorithms into an ensemble
  ㅇ Four sub-algorithms: 2 token-based algorithms, 1 syllable-based algorithm, 1 character-based algorithm
    - Each token-based algorithm uses a different tokenizer: MECAB and TWITTER
  ㅇ Classifies the sentiment of social media text as ‘positive’ or ‘negative’ with 79.9% accuracy and an average precision of 0.846

3. Issue-Based Database

□ Applies the principle of ‘identify the issue first, then provide data analysis related to the issue’ to environmental policy analysis
  ㅇ Compensates for the rigidity of the current approach of ‘identify the data, then analyze the issues related to that data’
  ㅇ Identifying issues: 18 issues selected from text mining results for National Assembly minutes and newspaper articles
    - Sources: Bill Subcommittee minutes of the Environment and Labor Committee of the 18th and 19th National Assemblies, Environment Subcommittee minutes of the Environment and Labor Committee of the 20th National Assembly, and 176,663 newspaper articles on environmental issues from 2008 to 2018
    - The issue of fine particles dominated all text sources
  ㅇ The 18 issues are organized into a three-level hierarchical network, and an issue-specific data analysis is linked to each issue
  ㅇ Data collection and data analysis are automated: new data are collected and results are updated in real time

Ⅲ. Separate Research

□ Deep Learning-Based Wind Power Generation Prediction: predict the wind power generation of Korea Southern Power Co.’s Hankyung 1 and Hankyung 2 plants in Jeju from climate data recorded at the Gosan Weather Station, using RNN and LSTM
  ㅇ Inputs: wind speed, wind direction, temperature, rainfall, humidity, and air pressure
  ㅇ Compared with linear regression, the RNN and LSTM algorithms reduced the RMSE of the 1-day-ahead prediction by 11.6%, of the 12-hour-ahead prediction by 43.9%, and of the 6-hour-ahead prediction by 56.9%
□ Deep Learning-Based Death Risk Estimation of Korean Senior COPD Patients: estimate the effect of short-term PM exposure (within 1 month) on the death risk of COPD patients aged 65 and older
  ㅇ Survival analysis using the Cox proportional hazards model
  ㅇ National Health Insurance medical data for COPD patients aged 65 and older in Seoul from 2006 to 2015 are combined with air pollution and climate data
  ㅇ Exposure variables: dummy variables indicating the number of days of ‘high concentrations’ of PM in the month before death
  ㅇ Compared with 0 days of exposure, patients exposed to ‘high concentrations’ of PM10 for 6 days or more had a hazard ratio of 2 or more
  ㅇ Compared with 0 days of exposure, patients exposed to ‘high concentrations’ of PM2.5 for 9 days or more had a hazard ratio of 2 or more

Ⅳ. Conclusion and Suggestions

□ The ‘Environmental Policy Monitoring System’ can be used in practice: the algorithms developed up to 2019 can be used to monitor fine particle policy
  ㅇ Deep Learning-Based Pollution Prediction algorithm (needs): detect areas likely to have ‘high concentrations’ of PM at 10 km x 10 km resolution a day in advance
  ㅇ Environmental text mining algorithm (timeliness): check whether keywords from Ministry of Environment press releases were related to fine particles at the times when keywords from Naver news were mostly related to fine particles
  ㅇ Deep Learning-Based Pollution Prediction algorithm (effectiveness): compare PM pollution predictions made before policy implementation with the actual level of pollution after policy implementation
    - Comparisons can be made by region and at different time lags
  ㅇ Issue-Based Database (effectiveness): compare the data analysis results for the 18 issues before and after policy implementation
□ RNN and LSTM models can be used in constructing a smart grid: predict wind power generation and keep alternative generators ready for predicted power shortages
□ Fine particle policy should take into account the health risks of senior COPD patients: exposure to ‘high concentrations’ of PM for more than a week could be critical for COPD patients aged 65 and older, so active policy involvement is needed.
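
The accuracy figures in this abstract are stated either as RMSE relative to the sample standard deviation or as a percentage reduction in RMSE against a baseline model. As a minimal sketch of how such ratios can be computed (the function names and the sample numbers are illustrative assumptions, not values from the study):

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_share_of_std(actual, predicted):
    """RMSE divided by the sample standard deviation of the observations."""
    return rmse(actual, predicted) / statistics.stdev(actual)


def rmse_reduction(baseline_rmse, model_rmse):
    """Fractional RMSE reduction of a model relative to a baseline."""
    return 1.0 - model_rmse / baseline_rmse


# Made-up observations and forecasts, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
forecast = [12.0, 18.0, 33.0, 39.0, 47.0]

share = rmse_share_of_std(observed, forecast)
print(f"RMSE is {share:.1%} of the sample standard deviation")
```

A ratio well below 100% means the forecast errors are small compared with the natural spread of the observations, which is how the 14.8-44.0% and 30.3% figures can be read.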

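The abstract describes the sentiment classifier as an ensemble of four sub-algorithms but does not say how their outputs are combined. One common option is soft voting, which averages the positive-class probabilities of the sub-models. The sketch below assumes that rule; the sub-model names and scores are hypothetical.

```python
def soft_vote(positive_probs, threshold=0.5):
    """Average the sub-models' positive-class probabilities and pick a label.

    Soft voting is an assumed combination rule, not one stated in the abstract.
    """
    if not positive_probs:
        raise ValueError("at least one sub-model probability is required")
    mean_prob = sum(positive_probs) / len(positive_probs)
    label = "positive" if mean_prob >= threshold else "negative"
    return label, mean_prob


def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical scores from four sub-models for a single post.
scores = {
    "token_mecab": 0.9,
    "token_twitter": 0.8,
    "syllable": 0.4,
    "character": 0.7,
}
label, mean_prob = soft_vote(list(scores.values()))
print(label, round(mean_prob, 3))
```

Averaging lets a confident sub-model outweigh a hesitant one, whereas a simple majority vote would treat every sub-model equally.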
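Ⅰ. Research Background and Objectives

□ Explore the potential for developing environmental policy by applying big data analysis methodology to environmental research
□ Build a ‘(tentatively named) Environmental Policy Monitoring System’ that uses the accuracy and reproducibility of big data analysis methodology to periodically identify policy demand and evaluate policy timeliness and policy effectiveness
  ㅇ Identifying policy demand: pollution prediction, analysis of text generated by environmental policy consumers, sentiment analysis of environment-related social media, and environmental issue-based data analysis
    - Pollution prediction: identify in advance the environmental policy areas that need policy intervention
    - Consumer-generated text analysis: identify the environmental areas on which consumers’ attention is concentrated
    - Environment-related social media sentiment analysis: identify environmental issues that cause public anxiety
    - Environmental issue-based data analysis: select issues of public interest, link a data analysis to each issue, and identify the issues whose analysis results are negative for the environment
  ㅇ Evaluating policy timeliness: analyze text generated by the policy provider to assess whether it corresponds to environmental demand at that point in time
  ㅇ Evaluating policy effectiveness: diagnose whether pollution, environmental social media sentiment, and environmental issues have improved
    - Pollution improvement: compare predictions made before policy implementation with measurements taken afterwards
    - Social media sentiment improvement: compare social media sentiment before and after policy implementation
    - Environmental issue improvement: check whether analysis results that were negative for the environment have improved
□ Begin building the components of the ‘(tentatively named) Environmental Policy Monitoring System’: the ‘Deep Learning-Based Integrated Pollution Prediction algorithm’, the ‘Real-Time Environmental Text Analysis algorithm’, and the ‘Question-Centered Database’
  ㅇ Deep Learning-Based Integrated Pollution Prediction algorithm: consists of an air pollution prediction model that predicts the levels of 6 air pollutants and a water pollution prediction model that predicts chlorophyll-a pollution
  ㅇ Real-Time Environmental Text Analysis algorithm: consists of an ‘environmental text information extraction’ algorithm that extracts topics and keywords from text generated by environmental policy consumers and by the environmental policy provider, and a ‘climate change sentiment analyzer’ that classifies the sentiment of climate change-related social media
  ㅇ Question-Centered Database: consists of a network of major environmental issues and a data analysis related to each issue
□ Carry out 2 separate studies that are difficult to fit into the ‘(tentatively named) Environmental Policy Monitoring System’
  ㅇ Deep learning-based wind power generation prediction: predict wind power generation using weather data
  ㅇ Deep learning-based estimation of COPD mortality in the Korean elderly population: use National Health Insurance cohort data to estimate the effect of fine particle pollution on the death risk of patients with chronic obstructive pulmonary disease

Ⅱ. Implementation of the ‘(tentatively named) Environmental Policy Monitoring System’

1. Deep Learning-Based Integrated Pollution Prediction

□ Air pollution prediction algorithm: predicts the levels of 6 air pollutants (PM10, PM2.5, O3, CO, SO2, NO2) using a CNN algorithm
  ㅇ Divides the country into 10 km × 10 km cells and predicts the hourly pollution level of each cell 1 to 24 hours ahead
  ㅇ Reduces the root mean squared error of the carbon monoxide (CO), PM10, and PM2.5 predictions to 14.8-44.0% of the sample standard deviation
  ㅇ Predicts situations in which PM10 is classified as ‘bad’ or worse with 90.4% accuracy, and situations in which PM2.5 is classified as ‘bad’ or worse with 92.2% accuracy
□ Algal bloom prediction algorithm: predicts chlorophyll-a pollution at 29 monitoring stations in the 4 major river basins 1 day ahead using a CNN algorithm
  ㅇ Reduces the root mean squared error of the predictions to 30.3% of the sample standard deviation

2. Real-Time Environmental Text Analysis algorithm

□ Environmental text information extraction algorithm: periodically collects text generated by environmental policy consumers and by the environmental policy provider, and performs topic and keyword analysis
  ㅇ Naver news is used as consumer-generated text; Ministry of Environment press releases and the Ministry of Environment e-News are used as provider-generated text
  ㅇ Keyword frequency counts, keyword network extraction, document summarization, keyword group extraction, and extraction of keyword group composition ratios
  ㅇ Collection and information extraction are repeated twice a day and the results are accumulated
□ Climate change sentiment classifier: automates the pre-processing of climate change-related social media text and improves classification accuracy with an ensemble model of 4 sentiment classification algorithms
  ㅇ 4 algorithms: 2 morpheme-level sentiment classification algorithms, 1 syllable-level algorithm, and 1 jamo (letter)-level algorithm
    - Morpheme level: two morphological analyzers are used, MECAB and TWITTER
  ㅇ Classifies the sentiment of social media text as ‘positive’ or ‘negative’ with 79.9% accuracy and achieves an average precision of 0.846

3. Question-Centered Database

□ Applies question-based data use, which ‘identifies the important issues and provides data analysis related to those issues’, to the environmental field
  ㅇ Compensates for a weakness of the existing approach to data research, which ‘analyzes the issues that can be analyzed with the data that exist’ and therefore has difficulty responding to dynamic changes in policy issues
  ㅇ Major environmental policy issues are derived from National Assembly minutes and newspaper articles
    - Sources: Bill Subcommittee minutes of the Environment and Labor Committee of the 18th and 19th National Assemblies, Environment Subcommittee minutes of the Environment and Labor Committee of the 20th National Assembly, and 176,663 environment-related articles from 13 newspapers published in 2008-2018
    - Fine particle-related issues account for an overwhelming share
  ㅇ The 18 selected issues are organized into a 3-level network, and a related data analysis is linked to each issue
  ㅇ Data collection and analysis are automated: analysis results can be viewed in real time and are updated as new data arrive

Ⅲ. Separate Research

□ ‘Deep learning-based wind power generation prediction’: develops RNN and LSTM algorithms that predict the wind power generation of Korea Southern Power’s Jeju Hankyung units 1 and 2 from weather data recorded at the Gosan weather station
  ㅇ Predicts generation 1 hour and 1 day ahead from wind speed, wind direction, temperature, precipitation, humidity, and air pressure data
  ㅇ Compared with simple regression analysis, reduces the root mean squared error of the 1-day-ahead prediction by 11.6%, of the 12-hour-ahead prediction by 43.9%, and of the 6-hour-ahead prediction by 56.9%
□ ‘Deep learning-based estimation of COPD death risk in the Korean elderly population’: estimates the effect of short-term fine particle exposure (less than 1 month) on the death risk of chronic obstructive pulmonary disease patients aged 65 and older
  ㅇ Uses survival analysis with the Cox proportional hazards model
  ㅇ Combines 2006-2015 National Health Insurance data for chronic obstructive pulmonary disease patients aged 65 and older living in Seoul with weather and air pollution data for each patient’s district of residence
  ㅇ Exposure to fine particles: dummy variables indicating that the number of days, in the month before death, on which the daily mean fine particle level was ‘bad’ or worse is 1 to 14
  ㅇ Compared with 0 days of exposure, the hazard ratio of death rises to 2 or more at 6 or more days of exposure for PM10 and at 9 or more days of exposure for PM2.5

Ⅳ. Conclusions and Policy Suggestions

□ The ‘(tentatively named) Environmental Policy Monitoring System’ is confirmed to be operable: with the algorithms developed in 2019, fine particle-related policy demand can be diagnosed and policy timeliness and policy effectiveness can be evaluated
  ㅇ Deep Learning-Based Integrated Pollution Prediction algorithm: identifies, in 10 km × 10 km units, the areas where fine particle pollution is predicted to be ‘bad’ or worse 1 day later
  ㅇ ‘Environmental text information extraction’ algorithm: diagnoses policy timeliness by comparing the time at which fine particle-related keywords appear in consumer-generated documents (Naver news) with the keywords of provider-generated documents (Ministry of Environment press releases and e-News) at the same time
  ㅇ Deep Learning-Based Integrated Pollution Prediction algorithm: compares fine particle predictions made before policy intervention with measurements taken after policy intervention
    - The comparison can be made by region and at various time lags
  ㅇ ‘Question-Centered Database’: the outcome of policy intervention can be diagnosed for the 18 issues by comparing the data analysis results for each issue before and after the intervention
□ Deep learning models can be applied to smart grid construction: predict changes in unstable wind power generation and, when a shortfall is predicted, bring in alternative power sources with short start-up times
□ The death risk of elderly chronic obstructive pulmonary disease patients needs to be considered in fine particle risk management
  ㅇ Patients aged 65 and older with chronic obstructive pulmonary disease are exposed to serious risk when the daily mean concentration stays at ‘bad’ or worse for 1 week or longer, so active intervention is needed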
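
The sentiment classifier described above feeds the same text to its sub-algorithms at different granularities: morphemes, syllables, and jamo (the individual Hangul letters). The morpheme level needs an external analyzer, but the syllable and jamo levels follow directly from the Unicode layout of precomposed Hangul syllables. The sketch below illustrates that decomposition; the function names are illustrative and this is not the study's own pre-processing code.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable, "가"
HANGUL_LAST = 0xD7A3   # last precomposed syllable, "힣"

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"         # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none


def to_syllables(text):
    """Syllable-level units: one unit per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def to_jamo(text):
    """Jamo-level units: split each Hangul syllable into its letters."""
    units = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            units.append(CHOSEONG[offset // (21 * 28)])
            units.append(JUNGSEONG[(offset % (21 * 28)) // 28])
            final = JONGSEONG[offset % 28]
            if final:
                units.append(final)
        elif not ch.isspace():
            units.append(ch)  # keep non-Hangul characters unchanged
    return units


print(to_syllables("미세 먼지"))
print(to_jamo("먼지"))
```

Jamo-level input keeps the vocabulary very small, which can help with the misspellings and coinages common in social media text; whether that motivated the design here is not stated in the abstract.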

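The exposure variables in the COPD study are described as dummy variables for the number of ‘bad’-or-worse days in the month before death, with 0 days as the comparison group. The sketch below shows one way such variables could be built from daily mean concentrations. The threshold value, the handling of counts above 14 days, and all names are assumptions for illustration; the abstract does not specify them.

```python
def count_high_days(daily_means, threshold):
    """Number of days whose daily mean concentration is at or above threshold."""
    return sum(1 for value in daily_means if value >= threshold)


def exposure_dummies(n_high_days, max_days=14):
    """One 0/1 indicator per day count from 1 to max_days.

    Zero days is the reference category (all indicators are 0).
    Counts above max_days are folded into the top category, which is an
    assumption made for this sketch.
    """
    capped = min(n_high_days, max_days)
    return {f"high_days_{k}": int(capped == k) for k in range(1, max_days + 1)}


# Hypothetical daily mean PM10 values (µg/m³) for the 30 days before an event.
# The 81 µg/m³ cut-off for 'bad' is an assumed value, passed as a parameter.
pm10_last_month = [45.0] * 24 + [95.0, 102.0, 88.0, 81.0, 120.0, 84.0]
n_days = count_high_days(pm10_last_month, threshold=81.0)
dummies = exposure_dummies(n_days)
print(n_days, [name for name, flag in dummies.items() if flag])
```

Indicators of this kind would enter a Cox proportional hazards model as covariates, so each estimated hazard ratio compares that exposure level with the 0-day reference group.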