ES-Hadoop(1)

Ways to connect Hadoop and Elasticsearch

1. Elasticsearch on YARN

=> Runs Elasticsearch inside a Hadoop YARN cluster

(It doesn't fit my current purpose and it's still a beta version, so I'm skipping it.)

2. Elasticsearch for Apache Hadoop

=> Connects a standalone Elasticsearch cluster and a Hadoop cluster through a driver

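About four years ago I briefly used mongo-hadoop.

As I remember it, the heavy I/O and network load meant I never got much use out of it.

Then again, back then I was doing something pretty dumb against a single MongoDB instance..

(Architecture diagram taken from the Elasticsearch site. This is roughly what the structure looks like!)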

First, go to the Elasticsearch site and download the es-hadoop files.

wget http://download.elastic.co/hadoop/elasticsearch-hadoop-2.3.3.zip

Then unzip it and put the files anywhere you like.

That's the basic preparation done.

I'll access Elasticsearch through Hive.

Creating a linked table is simple.

CREATE EXTERNAL TABLE elastic_table(
  value BIGINT,
  result BIGINT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.nodes' = 'ela ip', 'es.resource' = 'test/log');

That's all it takes to create the table.
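If several tables like this are needed, the DDL can be generated from a dictionary of es-hadoop settings instead of being typed by hand. Below is a minimal Python sketch of that idea. The helper name render_es_table_ddl is made up for this example (it is not part of Hive or es-hadoop), and 'ela ip' is still only a placeholder for the Elasticsearch address.

```python
def render_es_table_ddl(table, columns, properties):
    """Return a Hive CREATE EXTERNAL TABLE statement backed by EsStorageHandler.

    table:      Hive table name
    columns:    list of (name, hive_type) pairs
    properties: dict of es-hadoop settings such as es.nodes and es.resource
    Note: values are inserted as-is, so they must not contain single quotes.
    """
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    props = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return (
        f"CREATE EXTERNAL TABLE {table}(\n{cols})\n"
        "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        f"TBLPROPERTIES({props});"
    )


# Same table as in the post; 'ela ip' is a placeholder, not a real address.
ddl = render_es_table_ddl(
    "elastic_table",
    [("value", "BIGINT"), ("result", "BIGINT")],
    {"es.nodes": "ela ip", "es.resource": "test/log"},
)
print(ddl)
```

The printed statement can be pasted into the hive shell as it is.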

Now let's try an insert.

insert overwrite table elastic_table select 10,1 from temptable limit 1;

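And then an error occurs.

It says it can't find a class...

I left something out....

Hive has to be able to read the elasticsearch-hadoop-2.3.3 files saved on disk.

As a temporary fix, in the hive shell you can run

add jar /path/elasticsearch-hadoop-2.3.3.jar;

and that will do.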

But if that's a hassle..

you can set up hive-site.xml so that it loads the external jar:

<property>
  <name>hive.aux.jars.path</name>
  <value>/path/to/elasticsearch-hadoop.jar</value>
</property>

But... on the HDP 2.4 version I'm currently using, this doesn't take effect.. I have no idea what's going on...

Whether I'm doing it wrong... or it just doesn't work..
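One thing worth ruling out when the setting seems to be ignored is whether the hive-site.xml that Hive actually reads contains the property at all. This is only a guess at a possible cause, not something confirmed here. The Python sketch below assumes the property in question is hive.aux.jars.path and that the file uses the usual Hadoop layout of property elements with name and value children. The function name get_hive_property is made up for this example.

```python
import xml.etree.ElementTree as ET


def get_hive_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Inline sample standing in for the real hive-site.xml contents.
sample = """<configuration>
  <property>
    <name>hive.aux.jars.path</name>
    <value>/path/to/elasticsearch-hadoop.jar</value>
  </property>
</configuration>"""

print(get_hive_property(sample, "hive.aux.jars.path"))
```

To check a real cluster, read the contents of the hive-site.xml in use and pass them in place of sample.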

Anyway, I ran add jar /path/elasticsearch-hadoop-2.3.3.jar and retried.

A second error occurred.

The jar file involved is in the lib folder, yet it wasn't being picked up.

In the end it has to be forced to load the same way.

Oh! Also, elasticsearch-hadoop-2.3.3.jar has to be placed in the hadoop/lib/ folder on every node.
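Copying the jar to every node by hand gets tedious, so here is a small Python sketch that only builds the scp commands. The host names node1 and node2 and the path /path/to/hadoop/lib are hypothetical stand-ins, since the real node list and lib location depend on the cluster.

```python
def copy_commands(jar_path, nodes, lib_dir):
    """Build one scp command (as an argument list) per node."""
    return [["scp", jar_path, f"{node}:{lib_dir}/"] for node in nodes]


# Hypothetical host names and lib path; replace with the real cluster values.
commands = copy_commands(
    "/path/elasticsearch-hadoop-2.3.3.jar",
    ["node1", "node2"],
    "/path/to/hadoop/lib",
)
for cmd in commands:
    # Printed only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

Printing first makes it easy to review the commands before anything is copied.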

Once more, run add jar /path/commons-httpclient-3.0.1.jar;

Run the insert again.

Success!

select works fine too!

Checking Elasticsearch confirms that the data went in properly.
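The check on the Elasticsearch side can be done over its HTTP search endpoint. The Python sketch below assumes the index and type from es.resource ('test/log'), the default port 9200, and a node reachable as localhost. The host is an assumption, so swap in the real Elasticsearch address. The function names search_url and fetch_hits are made up for this example.

```python
import json
from urllib.request import urlopen


def search_url(node, resource, port=9200):
    """Build the _search URL for an <index>/<type> resource."""
    return f"http://{node}:{port}/{resource}/_search"


def fetch_hits(node, resource, port=9200, timeout=10):
    """Return the list of hits from a plain match-all search."""
    with urlopen(search_url(node, resource, port), timeout=timeout) as resp:
        body = json.load(resp)
    return body["hits"]["hits"]


# Only the URL is printed here; fetch_hits needs a running cluster.
print(search_url("localhost", "test/log"))
```

With a cluster running, fetch_hits("localhost", "test/log") should return the documents written by the insert, each with its fields under "_source".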

Also, someone has put together a nice summary of the es-hadoop configuration parameters.

1) es.resource (ex: es.resource=<index>/<type>)

es.resource.read

es.resource.write

2) es.nodes (ex: es.nodes=localhost)

3) es.port (ex: es.port=9200)

4) es.query (uri / query dsl / external resource)

5) es.input.json (default : false)

6) es.write.operation (how to handle a document that already exists: index/create/update/upsert)

7) es.update.script, es.update.script.lang

8) es.update.script.params, es.update.script.params.json

9) es.batch.size.bytes (bulk api writing size, default: 1mb)

10) es.batch.size.entries (maximum number of entries, default: 1000)

11) es.batch.write.refresh (default: true)

12) es.batch.write.retry.count (default: 3; a negative value retries indefinitely)

--- mapping-related config starts at 13

13) es.mapping.id (es.mapping.id=id maps the id field to _id in ES)

14) es.mapping.parent (default: none)

15) es.mapping.version (maps to _version in ES)

16) es.mapping.version.type (internal/external/external_gt/external_gte/force)

17) es.mapping.routing (maps to _routing in ES)

18) es.mapping.ttl (maps to _ttl in ES)

19) es.mapping.timestamp (maps to _timestamp in ES)

20) es.mapping.date.rich (default: true)

21) es.mapping.include (comma-separated list of fields; only these fields are used)

22) es.mapping.exclude (comma-separated list of fields; these fields are not used)

--- index-related config starts at 23

23) es.index.auto.create (default: yes; creates the index automatically if it doesn't exist)

24) es.index.read.missing.as.empty (default: no; throws an exception if the index doesn't exist)

25) es.field.read.empty.as.null (default: yes; whether to treat empty fields as null)

26) es.field.read.validate.presence (default: warn; ignore/warn/strict; how to react when a missing field is found)

--- network-related config starts at 27

27) es.nodes.discovery (default: true; whether to discover other nodes in the cluster)

28) es.nodes.client.only (default: false; if true, all requests go through client nodes)

29) es.http.timeout (default: 1m; timeout for HTTP connections to ES)

30) es.http.retries (default: 3; number of retries when an HTTP request fails)

31) es.scroll.keepalive (default: 10m; timeout for scroll queries)

32) es.scroll.size (default: 50; number of documents returned per scroll request)

33) es.action.heart.beat.lead (default: 15s)

--- authentication-related config starts at 34

34) es.net.http.auth.user

35) es.net.http.auth.pass

Source: http://semode.tistory.com/m/post/24
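With this many parameters, a typo in a key or value is easy to make, so a quick sanity check before pasting settings into TBLPROPERTIES can help. The Python sketch below only covers a few rules taken from the list above: a resource must be set, es.write.operation must be one of the four listed values, and the batch counts must be integers. The function name check_es_settings is made up for this example, and the checks are illustrative rather than a full validation of es-hadoop's behavior.

```python
WRITE_OPERATIONS = {"index", "create", "update", "upsert"}
RESOURCE_KEYS = ("es.resource", "es.resource.read", "es.resource.write")
INTEGER_KEYS = ("es.batch.size.entries", "es.batch.write.retry.count")


def check_es_settings(settings):
    """Return a list of problems found in a dict of es-hadoop settings."""
    problems = []
    if not any(key in settings for key in RESOURCE_KEYS):
        problems.append("no es.resource (or es.resource.read / es.resource.write) set")
    operation = settings.get("es.write.operation")
    if operation is not None and operation not in WRITE_OPERATIONS:
        problems.append(f"unknown es.write.operation: {operation!r}")
    for key in INTEGER_KEYS:
        if key in settings:
            try:
                int(settings[key])
            except (TypeError, ValueError):
                problems.append(f"{key} should be an integer, got {settings[key]!r}")
    return problems


settings = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "test/log",
    "es.write.operation": "upsert",
    "es.batch.size.entries": "1000",
}
print(check_es_settings(settings))  # prints []
```

An empty list means none of these particular checks failed; it does not guarantee that the job itself will run.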

Conversely, you can also pull existing Elasticsearch data into Hive and analyze it.

But that's for next time.

License: Attribution-NonCommercial-ShareAlike
