As a Hadoop platform developer, my job is to maintain the Hadoop platform, fix potential bugs, and develop new features to meet my company's needs. I don't get involved in specific MR job development. But today one of my colleagues asked me if I know how to set the number of mappers in a MapReduce job, and it made me search for a while.
We all know we can set the number of reducers by calling setNumReduceTasks. However, setNumMapTasks does not work for mappers, because that configuration is just a hint: the InputFormat simply ignores it and creates the splits for the job itself.
So I checked the code in FileInputFormat.class and found how to set the number of mappers in a MapReduce job.
1. For files whose size is larger than the block size
e.g. the block size is 128MB and the file size is several GB. We can set the mapreduce.input.fileinputformat.split.minsize parameter on the Configuration object:
Configuration conf = getConf();
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1073741824L);
This makes the split size 1GB.
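FileInputFormat picks the split size as Math.max(minSize, Math.min(maxSize, blockSize)), which is why raising split.minsize above the block size shrinks the mapper count. Here is a small plain-Java sketch of that math (no Hadoop dependency; the class and file sizes are illustrative, not from the real source):

```java
public class SplitMath {
    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize)
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Roughly how many splits (and thus mappers) one large file produces
    static long splitCount(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;     // 128MB HDFS block
        long maxSize = Long.MAX_VALUE;           // split.maxsize left at its default
        long tenGb = 10L * 1024 * 1024 * 1024;   // a hypothetical 10GB input file

        // Default minSize = 1: split size equals the block size
        long defaultSplit = computeSplitSize(blockSize, 1L, maxSize);
        System.out.println(splitCount(tenGb, defaultSplit)); // 80 mappers

        // split.minsize = 1GB: split size grows to 1GB
        long oneGb = 1024L * 1024 * 1024;
        long bigSplit = computeSplitSize(blockSize, oneGb, maxSize);
        System.out.println(splitCount(tenGb, bigSplit));     // 10 mappers
    }
}
```

(The real FileInputFormat also allows the last split to be up to 10% oversized, but for exact multiples like these the counts come out the same.)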
2. For small files whose size is smaller than the block size
e.g. the file size is several MB. We have to use CombineTextInputFormat in place of FileInputFormat, and use mapreduce.input.fileinputformat.split.maxsize together with that input format to set the number of mappers (for CombineTextInputFormat it is the max split size, not split.minsize, that caps each combined split):
job.setInputFormatClass(CombineTextInputFormat.class);
Configuration conf = getConf();
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 134217728L); // 128MB per combined split
Then you can control the number of mappers in your MR program.
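The idea behind CombineTextInputFormat is that it packs many small files into each split until the configured max split size is reached, so the mapper count drops from one-per-file to roughly totalBytes / maxSplitSize. A simplified plain-Java sketch of that packing (not the real Hadoop code, which also groups blocks by node and rack; file counts and sizes are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinePackingSketch {
    // Greedily pack file sizes into combined splits capped at maxSplitSize,
    // ignoring the node/rack locality logic of the real CombineFileInputFormat.
    static List<Long> packSplits(long[] fileSizes, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            if (current + size > maxSplitSize && current > 0) {
                splits.add(current); // close the current combined split
                current = 0;
            }
            current += size;
        }
        if (current > 0) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long[] files = new long[1000];   // 1000 small files of 2MB each
        Arrays.fill(files, 2 * mb);

        // With plain FileInputFormat each file is its own split: 1000 mappers.
        // With a 128MB max combined split size: ~2000MB / 128MB -> 16 mappers.
        System.out.println(packSplits(files, 128 * mb).size()); // 16
    }
}
```

This is why the combine approach matters for the small-files case: without it, a thousand tiny files launch a thousand mappers, each doing almost no work.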