Python Hadoop Streaming #2

Previous post: Python Hadoop Streaming #1

References
[1] : https://github.com/jhofman/icwsm2010_tutorial/blob/master/hstream.py
[2] : http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
[3] : http://jakehofman.com/icwsm2010

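This post introduces HStream, a Python class (from hstream.py [1]) that makes Hadoop streaming jobs easier to write in Python: you subclass it, define a mapper and a reducer, and run the same script with -m or -r.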
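To give a feel for what such a helper class does, here is a minimal sketch. It is not the actual hstream.py; the class name `MiniStream`, its constructor arguments, and the `run` method are hypothetical and exist only for illustration. It assumes tab-delimited records whose first field is the key, and it relies on reducer input arriving sorted by key.

```python
import sys
from itertools import groupby


class MiniStream:
    """Hypothetical stand-in for a streaming helper; NOT the real hstream.py."""

    delim = "\t"  # assumption: tab-delimited fields, first field is the key

    def __init__(self, argv=None, infile=None, outfile=None):
        self.argv = sys.argv[1:] if argv is None else argv
        self.infile = sys.stdin if infile is None else infile
        self.outfile = sys.stdout if outfile is None else outfile

    def run(self):
        """Dispatch to mapper or reducer depending on the -m / -r flag."""
        if "-m" in self.argv:
            for record in self.read_records():
                self.mapper(record)
        elif "-r" in self.argv:
            # Reducer input is sorted by key, so equal keys are adjacent
            # and groupby can collect them without buffering everything.
            for key, group in groupby(self.read_records(), key=lambda r: r[0]):
                self.reducer(key, group)

    def read_records(self):
        """Yield each input line as a list of fields."""
        for line in self.infile:
            yield line.rstrip("\n").split(self.delim)

    def write_output(self, fields):
        """Write one output record, joining the fields with the delimiter."""
        self.outfile.write(self.delim.join(str(f) for f in fields) + "\n")

    def mapper(self, record):
        raise NotImplementedError

    def reducer(self, key, records):
        raise NotImplementedError
```

A subclass only has to override `mapper` and `reducer`, which is the same shape as the WordCount class in this post.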

# Running

./bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file .../wordcount.py \
    -file .../hstream.py \
    -mapper '.../wordcount.py -m' \
    -reducer '.../wordcount.py -r' \
    -input input_data \
    -output output_data
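Before submitting a job it can help to check the word-count logic without a cluster. The sketch below imitates the map, sort, reduce flow in plain Python on an in-memory list of lines. The function names (`map_words`, `reduce_counts`, `run_locally`) are made up for this example and are not part of Hadoop or hstream.py; the in-memory sort only stands in for Hadoop's shuffle/sort step.

```python
from itertools import groupby


def map_words(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, pairs):
    """Sum the counts of all pairs that share the same word."""
    return word, sum(count for _, count in pairs)


def run_locally(lines):
    """Imitate the map -> sort -> reduce flow on a list of input lines."""
    mapped = [pair for line in lines for pair in map_words(line)]
    mapped.sort(key=lambda pair: pair[0])  # stands in for the shuffle/sort
    return [reduce_counts(word, group)
            for word, group in groupby(mapped, key=lambda pair: pair[0])]


if __name__ == "__main__":
    for word, total in run_locally(["to be or not", "to be"]):
        print(f"{word}\t{total}")
```

The key point is that the reducer only sees correct groups because the mapped pairs were sorted by key first; skipping the sort would split a word's counts across several groups.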

# wordcount.py

#!/usr/bin/env python

from hstream import HStream
import sys
import re
from collections import defaultdict


class WordCount(HStream):

    def mapper(self, record):
        # record is a sequence of fields; rejoin them and emit (word, 1) per token
        for word in " ".join(record).split():
            self.write_output((word, 1))

    def reducer(self, key, records):
        # records holds every (word, count) record that shares this key
        total = 0
        for record in records:
            word, count = record
            total += int(count)
        self.write_output((word, total))


if __name__ == '__main__':
    WordCount()
