Saturday, July 13, 2013

Number of Column Families in HBase

Creating two, three, or more column families in an HBase table hurts performance because of inefficient compaction.

I should analyze the exact reason later. (A known contributing factor: flushing and compaction happen per region, so when one family's MemStore fills up, the neighboring families in the same region get flushed too, producing many small files and extra compaction work.)
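The practical takeaway is to keep the schema to a single column family. As a minimal sketch using the HBase 0.94-era Java client (the table and family names here are made up, not from this note):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateSingleFamilyTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // One column family only; distinguish data with qualifiers
            // ("cf:rank", "cf:adjacency", ...) instead of extra families.
            HTableDescriptor table = new HTableDescriptor("graph");
            table.addFamily(new HColumnDescriptor("cf"));
            admin.createTable(table);
            admin.close();
        }
    }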

Why does map skew occur in PageRank?
When a map task reads data, the input is cut into fixed-size blocks and handed out. Given that, even if line lengths differ, skew should not occur as long as the work per line is similar, because a split assigned long lines will simply be assigned fewer of them.

But skew does occur. The likely reason is the output step: the mapper must emit one record per out-link, each carrying the same source id. So although the input looks proportional by file size, at write time some splits have to emit far more keys than others, and that extra key output seems to be what causes the map skew.

Example: the key "pagerank" is 8 bytes and each " <digit>" list entry is 2 bytes.

  pagerank 1 2 3 4 5 6    8 + 12 = 20 bytes    emits 6 records
  pagerank 1              8 +  2 = 10 bytes    emits 1 record
  pagerank 4              8 +  2 = 10 bytes    emits 1 record

Splits are cut at equal byte sizes, so lines 2 and 3 together (20 bytes) are treated as equal to line 1 (20 bytes). In terms of emitted records, though, 2 < 6: line 1 has to write three times as much.

In other words, if every adjacency list were enormous, the per-key overhead would wash out and the difference would hardly matter. But in a real graph most nodes' lists are small and only a few are enormous, so I suspect that is why the skew occurs.
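A minimal mapper sketch makes the imbalance concrete (Hadoop's Java MapReduce API; the exact line format and class name are assumptions, not the job from this note). The emit loop runs once per out-link, so a byte-balanced split that happens to contain high-degree nodes does proportionally more write work:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PageRankMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed line format: "<nodeKey> <neighbor1> <neighbor2> ..."
            String[] parts = value.toString().split("\\s+");
            String nodeKey = parts[0];
            // One emit per out-link, all carrying the same source id:
            // the 6-neighbor line above writes 6 records while each
            // 1-neighbor line writes 1, even though the splits holding
            // them contain the same number of input bytes.
            for (int i = 1; i < parts.length; i++) {
                context.write(new Text(parts[i]), new Text(nodeKey));
            }
        }
    }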



http://lintool.github.io/Cloud9/docs/content/Lin_Schatz_MLG2010.pdf

Friday, July 12, 2013


When is the earliest point at which the reduce method of a given Reducer can be called?
  • As soon as at least one mapper has finished processing its input split.
  • As soon as a mapper has emitted at least one record.
  • Not until all mappers have finished processing all records.
  • It depends on the InputFormat used for the job.

Answer

  • Not until all mappers have finished processing all records.

Explanation

In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

Note: the reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which helps if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data. Another job that starts later and would actually use those reduce slots can't use them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis.

Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way a job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "When are the reducers started in a MapReduce job?"
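As a sketch of the job-by-job override mentioned above (the job name and the 0.90 value are just examples), using the classic Hadoop Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hold the reducers back until 90% of the maps are done, so
            // this job doesn't hog reduce slots while only copying data.
            conf.set("mapred.reduce.slowstart.completed.maps", "0.90");
            Job job = new Job(conf, "slowstart-demo");
            // ... configure mapper, reducer, and input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }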


http://certificationpath.com/view/ccd-410--cloudera-certified-developer-for-apache-hadoopccdh/questions/when-is-the-earliest-point-at-which-the-reduce-method-of-a-q69077