This blog will give the reader a decent start at writing a Spring-based application that writes to MongoDB, retrieves data via queries, and finally runs a simple MapReduce job, all using Spring Data MongoDB support.
The business case we will work with is this: look through the 2012 Presidential contribution data set and get a total count of transactions per candidate. While trying to find a decently large data set, I stumbled upon the campaign contribution data set at http://fec.gov/disclosurep/PDownload.do. The data set is not really that large, but I could not resist the temptation to actually play with it. At the end of this article I have attached the Maven project and also a modified data set file. The file is in Excel CSV format. To import the data into the database, this project includes a loader, but it depends on another library of mine to convert the CSV rows to POJOs. You will find instructions on that jar at the end of the article.
For starters, please install MongoDB. The instructions can be found at http://www.mongodb.org/display/DOCS/Quickstart and, as you can see, they are pretty straightforward. I used my Windows machine (apologies to my Mac). Start MongoDB.
Next, run the mongo.exe command to connect to the database via the shell. Here are some basic commands I used while writing this code.
- show dbs (lists all the current databases).
- use <databasename> (database name I used is contributionsdb).
- show collections (collection name I use is contribution).
- db.contribution.findOne() (finds one record and displays it).
- db.contribution.find({candNm:"name here"}) (this will return many rows, so use carefully or add more criteria).
- db.contribution.getIndexes() (returns all the current index names).
- db.contribution.dropIndex({candNm:1}) (drop existing index).
- db.contribution.ensureIndex({candNm:1}) (create index).
- db.contribution.count() (get count of records in the collection).
- db.contribution.remove() (removes all documents from the collection; use db.contribution.drop() to remove the collection itself).
Please note that most of the commands above work against a collection named 'contribution'.
Now let's move to the Spring part. This is a Maven project. First, the spring-config.xml in the resources folder.
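The original listing of spring-config.xml did not survive the move to this page, so here is a minimal sketch of what it could look like. It assumes a MongoDB server on localhost, the contributionsdb database used in the shell commands above, and component scanning of the com.bigdata.mongodb package; check the attached project for the real file.
[codesyntax lang="xml" title="spring-config.xml (sketch)"]
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:mongo="http://www.springframework.org/schema/data/mongo"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/context
           http://www.springframework.org/schema/context/spring-context.xsd
           http://www.springframework.org/schema/data/mongo
           http://www.springframework.org/schema/data/mongo/spring-mongo.xsd">

    <!-- picks up @Component/@Autowired annotated beans such as DataMinerImpl -->
    <context:component-scan base-package="com.bigdata.mongodb" />

    <!-- connection to the local MongoDB server -->
    <mongo:mongo id="mongo" host="localhost" port="27017" />

    <!-- MongoTemplate wired to the contributionsdb database -->
    <bean id="mongoTemplate" class="org.springframework.data.mongodb.core.MongoTemplate">
        <constructor-arg ref="mongo" />
        <constructor-arg value="contributionsdb" />
    </bean>
</beans>
[/codesyntax]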
Please review the code in DataMinerImpl.java for the actual code that loads the data and queries the database. I am not going to repeat all of it here, except for some key points noted below (a rough sketch of the class follows the list).
- Using @Component to register the bean.
- Using @Autowired to inject the Spring Data MongoTemplate instance into the class.
- All data operations are performed on the MongoTemplate.
- Save: mongoTemplate.save(object);
- Query: retrieve the count of records by candidate name:
mongoTemplate.count(new Query(Criteria.where("candNm").is("Bachmann, Michelle")), Contribution.class);
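Since the full listing is only in the attached project, here is a rough sketch of the shape of DataMinerImpl. The method names match the callers shown later in this post; the CSV parsing detail is omitted, and the mapping of Contribution to the 'contribution' collection (as well as the exact field names) is an assumption, so treat this as an outline rather than the real class.
[codesyntax lang="java" title="DataMinerImpl.java (sketch)"]
package com.bigdata.mongodb.miner;

import java.io.File;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.mapreduce.MapReduceResults;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.stereotype.Component;

import com.bigdata.mongodb.domain.CandidateSummaryResult;
import com.bigdata.mongodb.domain.Contribution;

// Contribution is a plain POJO mapped to the 'contribution' collection
// (for example with @Document(collection = "contribution")), carrying fields
// such as candNm and the contribution amount.
@Component
public class DataMinerImpl implements DataMiner {

    // Spring Data's MongoTemplate does all of the MongoDB heavy lifting
    @Autowired
    private MongoTemplate mongoTemplate;

    public void loadData(File csvFile) throws Exception {
        // parse the CSV into Contribution POJOs (using the flatfilereader
        // library mentioned at the end of the post) and save each one:
        // for (Contribution c : parse(csvFile)) { mongoTemplate.save(c); }
    }

    public long getTotalCount() {
        // count every document in the contribution collection
        return mongoTemplate.count(new Query(), Contribution.class);
    }

    public long getTotalCount(String candidateName) {
        // count only the documents for a single candidate
        return mongoTemplate.count(
                new Query(Criteria.where("candNm").is(candidateName)),
                Contribution.class);
    }

    public MapReduceResults<CandidateSummaryResult> getContributions() {
        // run the JavaScript map/reduce functions (shown in the MapReduce
        // section below) over the contribution collection
        return mongoTemplate.mapReduce("contribution",
                "classpath:map_by_candidate.js",
                "classpath:reduce_by_candidate.js",
                CandidateSummaryResult.class);
    }
}
[/codesyntax]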
To load the data into the database, execute the program LoadDataIntoDB. This class is located in the Maven test folder. This could take a few minutes since there are over 800,000 rows to be inserted and we are running on a single machine.
[codesyntax lang="java" lines_start="1" tab_width="4" title="LoadDataIntoDB.java" blockstate="expanded"]
package com.bigdata.mongodb;

import java.io.File;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

import com.bigdata.mongodb.miner.DataMiner;

/**
 * This should be a one time action. So its a separate class for now and not a
 * JUnit test.
 */
public class LoadDataIntoDB {

    public static void main(String[] args) throws Exception {
        ApplicationContext ctx = new ClassPathXmlApplicationContext(
                "spring-config.xml");
        DataMiner data = ctx.getBean(DataMiner.class);

        // --------------------------------
        // load data
        // --------------------------------
        data.loadData(new File("/home/mathew/temp/P00000001-ALL.csv"));

        // --------------------------------
        // print total count for verification
        // --------------------------------
        System.out
                .println("Total Count of Documents = " + data.getTotalCount());
    }
}
[/codesyntax]
Once the data is loaded, execute the JUnit test MongoDBSpringDataTest. The class is noted below…
[codesyntax lang="java" lines_start="1" tab_width="4" title="MongoDBSpringDataTest.java" blockstate="expanded"]
package com.bigdata.mongodb;

import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.mongodb.core.mapreduce.MapReduceResults;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import com.bigdata.mongodb.domain.CandidateSummaryResult;
import com.bigdata.mongodb.miner.DataMiner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = "classpath:/spring-config.xml")
public class MongoDBSpringDataTest {

    @Autowired
    private DataMiner dataMiner;

    @Test
    public void testGetTotalCount() {
        long count = dataMiner.getTotalCount();
        System.out.println();
        System.out.format("Total Record Count %d ", count);
        System.out.println();
        System.out.println();
        Assert.assertTrue(count > 0);
    }

    @Test
    public void testGetCountByCandidateName() {
        String name = "Bachmann, Michelle";
        long count = dataMiner.getTotalCount(name);
        System.out.println();
        System.out.format("Count for %s is %d", name, count);
        System.out.println();
        System.out.println();
        Assert.assertTrue(count > 0);
    }

    @Test
    public void getCandidateSummary() {
        MapReduceResults<CandidateSummaryResult> results = dataMiner
                .getContributions();
        for (CandidateSummaryResult result : results) {
            System.out.println(result);
        }
        Assert.assertTrue(results != null);
        Assert.assertTrue(results.getCounts().getOutputCount() > 0);
    }

    public DataMiner getDataMiner() {
        return dataMiner;
    }

    public void setDataMiner(DataMiner dataMiner) {
        this.dataMiner = dataMiner;
    }
}
[/codesyntax]
MapReduce. If you don't understand what MapReduce is, please check the web for an introduction; I assume here that you know it.
MongoDB MapReduce scripts are written in JavaScript. You provide a Map function and then a Reduce function, both in JavaScript. Here is the Map function from the source file map_by_candidate.js:
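The original script was lost when this post was migrated, so below is a minimal sketch of what map_by_candidate.js could look like. The candNm field matches the shell queries earlier in the post, but the amount field name (contbReceiptAmt) is an assumption; the real script is in the attached project.
[codesyntax lang="javascript" title="map_by_candidate.js (sketch)"]
// emit one (candidate name, {count, amt}) pair per contribution document
function () {
    emit(this.candNm, { count: 1, amt: this.contbReceiptAmt });
}
[/codesyntax]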
Here is the Reduce function from the source file reduce_by_candidate.js:
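Again, this is only a sketch of what reduce_by_candidate.js could look like given the description below; it folds the emitted values into one count/amount pair per candidate (field names are the same assumptions as in the map sketch).
[codesyntax lang="javascript" title="reduce_by_candidate.js (sketch)"]
// sum up the per-candidate counts and dollar amounts emitted by the map
function (key, values) {
    var result = { count: 0, amt: 0 };
    values.forEach(function (value) {
        result.count += value.count;
        result.amt += value.amt;
    });
    return result;
}
[/codesyntax]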
The Map function receives all the records on that node, and you decide whether or not you care about each one. For the records you do care about, use the emit function to select them. Once the Map function has executed, the emitted data is grouped by key and sent to the Reduce function as a key plus a list of values. In the Reduce function we iterate over the values and add up the count and the dollar amounts.
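Spring Data converts each reduced document into a CandidateSummaryResult (see the getContributions() sketch earlier). The real domain class is in the attached project; purely to illustrate the shape of the MapReduce output, it could look roughly like the following, where the field names and the nested value object are assumptions based on the output shown next.
[codesyntax lang="java" title="CandidateSummaryResult.java (sketch)"]
package com.bigdata.mongodb.domain;

// Sketch only: each MapReduce output document looks like
// { _id: "<candidate name>", value: { count: ..., amt: ... } } and Spring
// Data maps it onto an instance of this class.
public class CandidateSummaryResult {

    private String id;            // candidate name (the MapReduce key)
    private CandidateValue value; // the reduced totals

    public static class CandidateValue {
        long count;   // number of contributions
        double amt;   // total dollars contributed
    }

    @Override
    public String toString() {
        // simplified; the attached project formats the amount with commas
        return "CandidateSummaryResult [id=" + id + "[count=" + value.count
                + ", amt=" + value.amt + "]]";
    }
}
[/codesyntax]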
The output will look something like this…
Total Record Count 852519

Count for Bachmann, Michelle is 13140

CandidateSummaryResult [id=Bachmann, Michelle[count=13140, amt=2,677,435.18]]
CandidateSummaryResult [id=Cain, Herman[count=20107, amt=7,047,264.89]]
CandidateSummaryResult [id=Gingrich, Newt[count=42636, amt=11,599,660.38]]
CandidateSummaryResult [id=Huntsman, Jon[count=4156, amt=3,204,350.48]]
CandidateSummaryResult [id=Johnson, Gary Earl[count=1110, amt=526,884.16]]
CandidateSummaryResult [id=McCotter, Thaddeus G[count=74, amt=37,030]]
CandidateSummaryResult [id=Obama, Barack[count=485720, amt=117,835,898.72]]
CandidateSummaryResult [id=Paul, Ron[count=130793, amt=19,391,710.76]]
CandidateSummaryResult [id=Pawlenty, Timothy[count=4555, amt=4,255,054.09]]
CandidateSummaryResult [id=Perry, Rick[count=13538, amt=18,477,336.91]]
CandidateSummaryResult [id=Roemer, Charles E. 'Buddy' III[count=5845, amt=364,211.42]]
CandidateSummaryResult [id=Romney, Mitt[count=90154, amt=76,375,927.11]]
CandidateSummaryResult [id=Santorum, Rick[count=40691, amt=10,087,623.54]]
To run the code…
- Install MongoDB. Please refer to the MongoDB site for instructions.
- Download my CSV reader flatfilereader from GitHub: https://github.com/thomasma/flatfilereader.
- Run "mvn -Dmaven.test.skip=true clean package" in that project to create the jar file.
- Use mvn install:install-file to install the library into your local Maven repository: mvn install:install-file -Dfile=flatfilereader-0.8.jar -DgroupId=com.aver -DartifactId=flatfilereader -Dversion=0.8 -Dpackaging=jar
- You can now build the main mongo project since it has a dependency on this library (see the pom.xml snippet after these steps).
- Download the code for this blog from GitHub: https://github.com/thomasma/mongo_campaign_finance. Open the project in Eclipse and run.
- Download the latest data file (ALL.zip) directly from fec.gov: http://fec.gov/disclosurep/PDownload.do.
- In Eclipse select the class LoadDataIntoDB and change the location to the downloaded data zip file. Ensure MongoDB is running, then run LoadDataIntoDB. If all works, this will load the records into the database.
- Run MongoDBSpringDataTest to test some functions.
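For reference, the coordinates used in the install:install-file step above are what the project's pom.xml points at; the dependency entry would look roughly like this (the attached project already contains the real one):
[codesyntax lang="xml" title="pom.xml (dependency snippet)"]
<dependency>
    <groupId>com.aver</groupId>
    <artifactId>flatfilereader</artifactId>
    <version>0.8</version>
</dependency>
[/codesyntax]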
One final note: my example runs on a single instance of the MongoDB server. The data set is not large enough to require sharding/partitioning, though, time permitting, I will give that a shot sometime soon. If you did have sharding turned on, the example should work exactly as-is.
If someone wants to import the data file (http://54.89.99.169/?attachment_id=513) into Mongo, here are the steps.
- Save the file as data.txt.
- Convert the pipe delimiters to tabs: perl -p -e "s/\|/\t/g" data.txt > new.tsv
- Add this line (tab-separated) at the top of new.tsv to supply the field names: cid pid company name source state phone status1 status2 cost joindate t1 t2 t3 t4 t5
- Finally, import: mongoimport --db scratch --collection emp --type tsv --headerline --file new.tsv
Hope it helps someone.