Sep 13, 2013

Category:Consciousness - PageRanking Wikipedia on your Laptop (Java)



In Theory

What are the most relevant pages in Wikipedia? This question leads to a nice exercise in Java memory management and practical experience with Big Data right on your own laptop. Can Wikipedia's link graph fit in memory (ignoring almost all text)? This would allow us to compute PageRank on Wikipedia. How much memory do we need for this? Let's start by taking a look at how PageRank works.


PageRank
PageRank is an algorithm created by Larry Page and Sergey Brin [1]. In essence it is a simple idea: You simulate random web surfers and look where most of them are most of the time. The algorithm goes like this:
  1. Initialize all pages with a PageRank value of 1.
  2. Select a random web page.
  3. With a probability of d you are bored and randomly jump to any web page.
  4. If not bored, you choose one of the links on the page and follow it. Before following it, you take some of your own PageRank value (e.g. 15% of it) and spread it among all edges (i.e. outgoing links).
    If there are no links (the page is a dead-end), you jump to any random page. Before jumping, you spread 15% of your PageRank value across all pages.
  5. Repeat from step 3.
We use a value of 0.15 for d. If you do the math, this means the average number of pages a surfer sees before jumping away is 1/d ≈ 6.7.
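As a minimal sketch of one such surfer step (assuming the int[][] adjacency arrays and the per-vertex value array introduced further below; the class and method names are illustrative, and the concurrent version used later differs in details):

import java.util.Random;

/** Minimal random-surfer sketch: rank[v] holds the current PageRank "juice" of page v. */
final class SurferSketch {
    static final double D = 0.15; // probability of getting bored

    /** One step starting at page v; returns the page the surfer is on afterwards. */
    static int step(int v, int[][] graph, double[] rank, Random rnd) {
        int[] out = graph[v];
        if (out.length == 0) {
            // dead-end: spread 15% of our juice over (a sample of) all pages, then teleport
            double share = rank[v] * D;
            rank[v] -= share;
            int spread = Math.min(100, graph.length); // the text uses all pages, later 10,000
            for (int i = 0; i < spread; i++) {
                rank[rnd.nextInt(graph.length)] += share / spread;
            }
            return rnd.nextInt(graph.length);
        }
        if (rnd.nextDouble() < D) {
            return rnd.nextInt(graph.length); // bored: teleport without passing juice along
        }
        // not bored: spread 15% of our juice evenly over the outgoing links, then follow one
        double share = rank[v] * D;
        rank[v] -= share;
        for (int target : out) {
            rank[target] += share / out.length;
        }
        return out[rnd.nextInt(out.length)];
    }
}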
To perform PageRank, we must fit in memory a double (or float) value for each vertex. Additionally, we must fit the graph in memory.
Let’s now take a look at the data we have.


Source Data
Wikipedia provides dumps of their database at regular intervals. We use the dump from June 2012. But instead of processing it ourselves, we rely on DBpedia and their version of the dump [4]. DBpedia processes the Wikipedia dumps and extracts structured data. DBpedia provides N-Triples files which have a simple format: each line represents one link. The different files represent different kinds of links, but we just use them all together and ignore the semantics of the links. Taken together, these files represent a graph whose vertices are Wikipedia pages, including categories, normal articles, disambiguation pages and redirects. In particular, we use the following files:
  • article_categories_en.nt (2.3 GB); Lines: 15,115,486 [5]; contains links from articles to categories
  • disambiguations_en.nt (0.2 GB); Lines: 1,157,484 [6]; contains links from disambiguation pages to other pages
  • page_links_en.nt (24.1 GB); Lines: 158,373,972 [7]; contains links from articles to other articles
  • redirects_en.nt (0.8 GB); Lines: 5,521,403 [8]; contains links from redirect pages to the redirect target
  • skos_categories_en.nt (0.6 GB); Lines: 3,458,049 [9]; contains links from categories to super-categories
  • Sum: 28 GB; Lines: 183,626,394


First we counted the lines of these files to estimate how hard it will be to put everything in memory. We find 183 million lines = 183 million links, with a yet unknown number of distinct vertices. The number of vertices is pretty important for the memory requirements.


Java Memory Footprints
The next step is inspecting the number of distinct vertices. A simple approach is a HashSet that stores all vertex labels (i.e. page titles) in memory. Let's use this to take a closer look at Java memory costs.


A String is a Java object that has a reference to a char array. The memory costs for Java data types on the Oracle 64-bit JVM are as follows:
  • byte: 1 byte
  • char: 2 bytes
  • int: 4 bytes
  • double: 8 bytes
  • object reference: 8 bytes (4 bytes on 32-bit JVMs)
  • overhead for an object: 16 bytes
  • overhead for an array: 16 bytes
Let’s guess the average Wikipedia title has a length of 20 characters. As a String this costs: 16 (object overhead) + 8 (reference to the char array) + 16 (overhead of the char array) + 20 * 2 (the chars themselves) = 80 bytes. A HashSet stores references, which cost another 8 bytes per String. So to store n Strings in a HashSet, we need 88n bytes. For PageRank, we need at runtime another 8n bytes (one double per vertex). Now, how many distinct labels do we have?
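A tiny helper to reproduce this estimate (a sketch; the constants are the approximations listed above, not measured values, and the HashSet's own table and entry overhead is ignored):

/** Back-of-the-envelope memory estimate for storing page titles as Strings in a HashSet. */
final class MemoryEstimate {

    static long hashSetOfStrings(long n, int avgTitleLength) {
        long perString = 16                    // String object overhead
                       + 8                     // reference to the char[]
                       + 16                    // char[] overhead
                       + 2L * avgTitleLength;  // the chars themselves
        return n * (perString + 8);            // + 8 bytes for the HashSet's reference per String
    }

    // MemoryEstimate.hashSetOfStrings(17466171L, 20) is roughly 1.5 * 10^9 bytes,
    // i.e. the ca. 1.4 GB mentioned below.
}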


Looking at the Labels (Step 1)
While naively reading in the Wikipedia files, we get an out of memory error and die. That was for a run with 1 GB heap memory. Instead of just using more memory, let’s see how we can work with 1 GB of RAM.


So how can we save memory? We use a nice trick I found on StackOverflow: We get the bytes from the String in UTF-8 encoding, instead of using chars. UTF-8 maps all ASCII characters to a single byte, and since Wikipedia titles are URL-encoded anyway in our dataset, they are all in the ASCII range. The resulting byte array takes half the space per character. Since we also shift from String objects to byte arrays, we get rid of some overhead. For a label of length 20 we are now down to 16 (overhead of the byte array) + 20 (the characters as bytes) = 36 bytes (instead of 88). However, upon trying to index Wikipedia, we still die with an out-of-memory error. So how can we save even more space?
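For illustration only: a plain HashSet of byte[] would not deduplicate, because arrays compare by identity rather than content. One simple workaround is wrapping each array in a ByteBuffer, whose equals/hashCode look at the bytes; the wrapper object adds some overhead back, so the actual trick from StackOverflow presumably used a leaner structure:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/** Stores labels as UTF-8 byte arrays instead of Strings to cut the per-character cost. */
final class LabelSet {
    private final Set<ByteBuffer> labels = new HashSet<>();

    /** @return true if this label has not been seen before */
    boolean add(String label) {
        byte[] utf8 = label.getBytes(StandardCharsets.UTF_8); // 1 byte per char for ASCII titles
        return labels.add(ByteBuffer.wrap(utf8));              // ByteBuffer compares by content
    }

    int size() {
        return labels.size();
    }
}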
We try another approach: We stream in the Wikipedia files and after every 1,000,000 unique labels collected in a HashSet, we let Java sort them and write them to a file. We end up with 8 files of ca. 100 MB each. Now we use MergeSort [2] recursively to create a single, duplicate-free, alphabetically sorted file. Since there is no out-of-the-box merge sort for files in Java, we write it ourselves. Result: a 400 MB file in ISO-8859-1 encoding with all labels. Note how this takes less space than the computed in-memory representation in Java, which was ca. 1.4 GB. A file on disk simply has less overhead.
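A sketch of the duplicate-dropping two-way merge behind that file-based MergeSort (file handling simplified; encoding as in the resulting label file; not the exact code used here):

import java.io.*;
import java.nio.charset.StandardCharsets;

/** Merges two alphabetically sorted label files into one sorted, duplicate-free file. */
final class SortedFileMerger {

    static void merge(File a, File b, File out) throws IOException {
        try (BufferedReader ra = open(a);
             BufferedReader rb = open(b);
             BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(out), StandardCharsets.ISO_8859_1))) {
            String la = ra.readLine();
            String lb = rb.readLine();
            String last = null; // last line written, used to drop duplicates
            while (la != null || lb != null) {
                String next;
                if (lb == null || (la != null && la.compareTo(lb) <= 0)) {
                    next = la;
                    la = ra.readLine();
                } else {
                    next = lb;
                    lb = rb.readLine();
                }
                if (!next.equals(last)) { // skips duplicates within and across the two files
                    w.write(next);
                    w.newLine();
                    last = next;
                }
            }
        }
    }

    private static BufferedReader open(File f) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new FileInputStream(f), StandardCharsets.ISO_8859_1));
    }
}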


Next we count the lines in the resulting file and get 17,331,857.  That is our vertex count V. Of course, we could also have used a database to store and sort all labels, but that would have been less fun. As a result, we now also have a unique number for each vertex label, which is its line number in the sorted file.


Graph Representation
A graph can be represented as a set of vertices where each vertex has a list of vertices to which it links. This is called an adjacency list representation. It suffices for directed graphs with unweighted edges – just like our Wikipedia link graph. And it is great for sparse graphs, as it saves space compared to a full matrix representation.


How much space do we need to represent our graph and have a double for each vertex? The naive way in Java would be a class Vertex with a double and a List<Vertex>. Costs? 16 (overhead per Vertex instance) + 8 (the double) + 8 (reference to the list) + 16 (overhead for the list object) + e * 8 (for each edge, a reference to another Vertex). Internally, an ArrayList instance has another reference (8) to an array (16). Total cost: 72 bytes + 8e. What is e? On average, e is the number of all edges (184 million) divided by the number of all vertices (17 million) ≈ 10.6. Let's say 11. So each Vertex would cost on average 160 bytes, and the whole graph would take ca. 2.5 GB.


What is the lower bound, i.e. the minimally required space? Let's call V the vertex count of the graph and E the edge count. A double takes 8 bytes in a JVM, so we need at least 8*V bytes to represent the current values of the PageRank algorithm – or of any algorithm that assigns values to graph vertices. That is ca. 0.13 GB. As our algorithm operates on a graph, we also need to represent the graph so that we can look up its structure, i.e. retrieve the list of connected vertices. To represent an adjacency list, we can exploit the vertex numbering obtained in the last section. Numbers in the range 0 to 17 million need 24.04 bits, so we can fit each in a Java integer, which has 32 bits. We are wasting almost 8 bits per vertex ID, but it will run fast.*


Costs? For the double values assigned to each vertex we can use one gigantic double array of length V. That costs us 16 (overhead) + V*8. The graph boils down to a gigantic int[][] array, that is 16 (overhead) + V * (16 + e * 4). For our values of e and V this is ca. 0.99 GB. So together we need in theory just ca. 1.12 GB to represent the bare data. And a jump from one vertex to another can still be done in O(1).


*We could squeeze out another space reduction of 24.85%, but at the cost of having to pack/unpack bit arrays all the time – probably ruining our attempt to run PageRank fast.


Graph Building
Now we have a nice set of ingredients: We have a defined vertex ID for each label that occurs, which we can compute by looking up the line number of the label in our sorted vertex label file. And we know how to store all data in memory. The vertex label file was estimated to take ca. 1.4 GB in memory, so we can load it. We also create an in-memory HashMap from label strings to integer IDs.
Next we stream the N-Triples files from DBpedia, replace each triple by the two vertex IDs for its subject and object labels, and write out the resulting file. Note that if we encounter s1-p1-o1 and s1-p2-o1, we simply boil this down to s1-o1 and s1-o1, i.e. we keep the redundancy. This way we don't need to maintain a weight for each graph edge, which would cost more memory than the few redundant links. However, the resulting algorithm behaves the same as if we carried the edge weights around.
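A sketch of that replacement pass; the N-Triples handling is deliberately simplistic (it assumes well-formed lines and that a label is simply the local name of the URI, as in the label file), and labelToId is the HashMap mentioned above:

import java.io.*;
import java.util.Map;

/** Streams an N-Triples file and writes one "subjectId objectId" pair per line. */
final class TripleToIdPairs {

    static void convert(BufferedReader in, BufferedWriter out,
                        Map<String, Integer> labelToId) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split(" ");      // "<subject> <predicate> <object> ."
            if (parts.length < 3) continue;
            Integer s = labelToId.get(localName(parts[0]));
            Integer o = labelToId.get(localName(parts[2]));
            if (s == null || o == null) continue;  // e.g. literals or external URIs
            out.write(s + " " + o);
            out.newLine();
        }
    }

    /** "<http://dbpedia.org/resource/Foo>" becomes "Foo" */
    private static String localName(String uri) {
        return uri.substring(uri.lastIndexOf('/') + 1, uri.length() - 1);
    }
}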
Then we use our file-based MergeSort again and sort the integer-represented tuples. Ca. two hours later, we stream the two-integers-per-line file again and compact it: every line that starts with the same integer as the previous one is combined into a single line. So from
0 3
0 8
0 4
0 4
0 11
2 5
2 8
2 4
3 7
5 11
6 11
we transform to
0 3 8 4 4 11
2 5 8 4
3 7
5 11
6 11
Note how we keep the first integer, to make debugging easier - and handle very sparse graphs even better. Note also how we simply keep duplicate links as duplicate entries (here: from 0 to 4).
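A sketch of this compaction pass over the sorted two-integer file; it needs O(1) memory because only the current line and the previous source ID are kept (class name illustrative):

import java.io.*;

/** Combines consecutive "src dst" lines that share the same src into a single line. */
final class EdgeListCompactor {

    static void compact(BufferedReader in, BufferedWriter out) throws IOException {
        String line;
        int currentSrc = -1;
        while ((line = in.readLine()) != null) {
            int space = line.indexOf(' ');
            int src = Integer.parseInt(line.substring(0, space));
            String dst = line.substring(space + 1);
            if (src != currentSrc) {
                if (currentSrc != -1) out.newLine(); // close the previous vertex's line
                out.write(Integer.toString(src));    // keep the source ID as the first token
                currentSrc = src;
            }
            out.write(" " + dst);
        }
        if (currentSrc != -1) out.newLine();
    }
}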
The resulting file can easily be streamed in as the graph representation. Each line becomes an int[] array in the int[][] graph. E.g. the first line turns into graph[0] = new int[] { 3, 8, 4, 4, 11 }. Now we can start to implement and debug PageRank quickly, as we have a pretty small, streamable file to load our graph from. Additionally, streaming in this file results in zero memory fragmentation, as we can always directly allocate the exact amount of memory we need - no resizing of arrays, yay!
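Loading that file into the int[][] graph is then a single streaming pass. A sketch, with vertexCount being the V determined earlier:

import java.io.*;

/** Reads the compacted graph file into adjacency arrays. */
final class GraphLoader {

    static int[][] load(BufferedReader in, int vertexCount) throws IOException {
        int[][] graph = new int[vertexCount][];
        String line;
        while ((line = in.readLine()) != null) {
            String[] tokens = line.split(" ");
            int src = Integer.parseInt(tokens[0]);
            int[] targets = new int[tokens.length - 1]; // exact size: no resizing, no fragmentation
            for (int i = 1; i < tokens.length; i++) {
                targets[i - 1] = Integer.parseInt(tokens[i]);
            }
            graph[src] = targets;
        }
        for (int v = 0; v < vertexCount; v++) {
            if (graph[v] == null) graph[v] = new int[0]; // dead-ends never appear as a source
        }
        return graph;
    }
}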


Concurrent PageRank
PageRank assigns a number to each page (i.e. vertex in the graph), which is known as “juice” in the SEO community.
In practice, we can implement PageRank as a Java thread, moving happily from vertex to vertex. However, according to the PageRank paper, when we hit a dead-end we have to spread our juice to all other vertices. This would require updating the whole double[] array of all vertices, which takes a lot of time and hampers concurrency. In practice, it seems to suffice to spread to 10,000 other vertices for our dataset of 17 million vertices. This tweak allows us to run multiple threads in parallel. But wait. What about the usual concurrency problems? We use AtomicDoubles and first addAndGet(-x) on the source vertex, then addAndGet(+x) on the target vertex (or vertices, if we are processing a dead-end). This makes sure that the total amount of juice remains equal to the number of vertices, as we initialize each vertex with juice = 1. No further thread synchronization is required. An additional control thread is used to start/stop the workers and let them sleep while we write out the PageRank values.
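A condensed sketch of such a worker thread, here using Guava's AtomicDoubleArray to hold the per-vertex juice (initialized to 1.0 elsewhere); the actual code differs in details such as the control thread, reporting and how the dead-end targets are sampled:

import com.google.common.util.concurrent.AtomicDoubleArray;
import java.util.Random;

/** One PageRank worker; several run in parallel on the shared graph and juice array. */
final class PageRankWorker implements Runnable {
    private static final double D = 0.15;
    private static final int DEAD_END_SPREAD = 10000; // spread dead-end juice to 10,000 vertices

    private final int[][] graph;
    private final AtomicDoubleArray juice;
    private final Random rnd = new Random();
    private volatile boolean running = true;

    PageRankWorker(int[][] graph, AtomicDoubleArray juice) {
        this.graph = graph;
        this.juice = juice;
    }

    @Override
    public void run() {
        int v = rnd.nextInt(graph.length);
        while (running) {
            int[] out = graph[v];
            if (out.length == 0) {
                // dead-end: take 15% of our juice and spread it over 10,000 random vertices
                double share = juice.get(v) * D;
                juice.addAndGet(v, -share);
                double part = share / DEAD_END_SPREAD;
                for (int i = 0; i < DEAD_END_SPREAD; i++) {
                    juice.addAndGet(rnd.nextInt(graph.length), part);
                }
                v = rnd.nextInt(graph.length);
            } else if (rnd.nextDouble() < D) {
                v = rnd.nextInt(graph.length); // bored: teleport without moving juice
            } else {
                // spread 15% of our juice over the outgoing links, then follow one of them
                double share = juice.get(v) * D;
                juice.addAndGet(v, -share);
                double part = share / out.length;
                for (int target : out) {
                    juice.addAndGet(target, part);
                }
                v = out[rnd.nextInt(out.length)];
            }
        }
    }

    void requestStop() {
        running = false;
    }
}

Every subtraction is matched by additions of the same total amount, so the sum of all juice stays at V, as described above.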

In Practice


Test Environment
All tests have been performed on a MacBook Pro with a 2.4 GHz Intel Core i7 (4 cores, 8 hardware threads) and 8 GB of 1333 MHz DDR3 RAM. All files have been handled on the Apple SSD drive. The source data files have been read directly from their bz2 archives, as this preserves precious SSD space and decompressing on the fly may even be faster than reading ca. 20 times as many bytes from disk. Java 7 Update 25 has been used. Inspections have been done with VisualVM 1.3.6 [3].


Step 1: Extract page (vertex) labels
Input: 5 .nt.bz2 files (total 1.4 GB on disk, decompressed over 20 GB)
Output: 7 .label_fragment files (total 0.8 GB on disk)
Time: 39 min.
Stats: Parsed (non-unique) labels: 32,435,145; Average label length: 23; Total characters read: 759,568,242.
Memory required: The more, the better, as a large HashSet can directly eliminate some initial redundancy. We wrote a sorted file to disk whenever the set held ca. 100 M characters. Memory consumption peaked at ca. 2.5 GB. Most CPU time was spent on bz2 decoding.


Step 2: De-dupe and sort vertex labels
Input: 7 .label_fragment files (total 0.8 GB on disk)
Output: Temporary .mrg fragments and 1 resulting .labels file (0.4 GB)
Time: 65 sec.
Stats: 17,466,171 distinct labels = lines in the file
The resulting label file is available for download from https://www.dropbox.com/s/f0bmyaz1o4p79e5/dbpedia_en.labels.bz2


The top lines look like this:
!
!!
!!!
!=
!?
!Amerika!
!Deladap
!Donnerwetter!
!Garib_Transfrontier_Park
!Huff
!Ich_kann
!K7
!Kai!_Garib
!Kheis
!Kung
!SING_–_DAY_OF_SONG
!Xoo
!Xóõ
!_!
!uinida
!wir
!wir_für_Deutsch-Wagram
"
""
".join(re.compile(param['postproc'][1],_re.S
"25_maisons_écologique
"8zehn48"

Step 3: Create integer ID graph files
Input: 1 .labels file (0.4 GB) + 5 .nt.bz2 files (total 1.4 GB on disk)
Output: 1 .igraph file with 2 integers per line to represent an edge (3 GB)
Time: 71 min.
Memory required: O( V * average label length); This can be further reduced a little bit.*
Stats: 181,900,732 edges processed.


*The idea is to exploit the fact that we stream a sorted list of labels and are later only asked, given a label, to return its unique vertex ID. Hence we can prune characters at the end of each label and build a sorted prefix-like list, keeping just enough characters to distinguish each string from the one before and after it. Later we use binary search on this list and use the position in the list as the vertex ID. This saves somewhere between ca. 20% and 80% of the memory, depending on the label distribution. Details are left out here for a later post…


Step 4: Optimizing integer graph file
Input: 1 .igraph file with 2 integers per line (3 GB)
Output: 1 .igraph-optimized file with n integers per line (1.7 GB)
Time: 3 min
Memory required: O(1), i.e. one line of text + the last vertex ID used. Everything else can be streamed on the fly.
The resulting file is available for download from https://www.dropbox.com/s/p95783ie64hy5yt/dbpedia_en.igraph-optimized.bz2


Step 5: Restart JVM, Load .igraph, PageRank initialization
Input: 1 .igraph-optimized file with n integers per line (1.7 GB)
Output: None.
Time: 70 sec.


Step 6: Concurrent PageRank
Input: None, all data in memory (yay!)
Output: .pagerank file (0.47 GB) containing 1 double per line
Time: Initial results were written to disk after an arbitrary 44 min, on manual request.
Memory required: O( V * double + V * e * ID) where e is the average number of outgoing edges. The ID length depends on the number of vertices.
Stats: Running with 8 threads, but getting only 650–700% of the theoretically available 800% – Flash-based ads in a Chrome tab take their share.
The experiments have been performed with “-Xmx3G”. Used heap memory during PageRank updates was just 1.36 GB.


As there is minimal coordination among threads, there are no centralized stats. Each thread self-reports to the console every 1 million processed vertices. Using the last available sample of each thread (manually sorted) we get this lower bound for the amount of processed data:

Thread(0) processed 2 Mio nodes
Thread(1) processed 3 Mio nodes
Thread(2) processed 2 Mio nodes
Thread(3) processed 3 Mio nodes
Thread(4) processed 2 Mio nodes
Thread(5) processed 3 Mio nodes
Thread(6) processed 3 Mio nodes
Thread(7) processed 3 Mio nodes

Sum: 21 million vertices. That is ca. 477 K vertices per minute = 7.9 K vertices per second. Roughly, it takes 37 minutes to visit each vertex once. To get some convergence we should probably visit each vertex 20 or 100 times. So we would have to run PageRank for about 61 hours to have each vertex visited 100 times, statistically. Update: In 14 hours each vertex was visited on average 68 times, so things seem to run faster over night.


Ok, now we have the final results, don’t we? They look like this:

0 0.7883971489161026
1 1.2450609137778839
2 2.9971633411076817
3 1.0771804778485874
4 0.9292089671519788
5 0.9181427990126756
6 0.8232967447449192
7 0.9169842030519191
8 1.131098799705966
9 0.7919052069638721

Gnarf, we need to sort and label the result in order to know which page in the English Wikipedia really is the most important one.


Step 7: Sort .pagerank file
Input: 1 .pagerank file (0.47 GB)
Output: 1 .pagerank-sorted file (same size)
Time: 9 min.
Memory required: O(1), i.e. just two lines of text.


Result (sample):

2975527 88015.30035905691
2871935 64327.277553365326
3071981 55474.79234409293
3088792 45951.642045781715
3128721 44029.60646771444
3331989 35440.596115563996
3531429 31417.359109596753
3393882 20089.214609567432
2941101 16711.39554202121
3187846 13390.22251453764


Step 8: Add labels to .pagerank file
Input: .label file (0.4 GB) + .pagerank-sorted file (0.47 GB)
Output: 1 .pagerank-labeled file
Time: 14 min for all 1.7 million labels.
Memory required: k labels as Strings, where k is the number of top-ranked vertices to be labeled. Here again memory can be saved with a little trick. First we read the first k vertex IDs from the sorted pagerank file and put them in a HashMap. Second, we stream the .labels file and keep only the requested labels in memory. Third, we stream the first k lines of the pagerank file again, this time replacing the vertex numbers with the vertex labels. In extreme cases, one could use a sliding-window technique to reduce memory consumption even further, i.e. first handle the first x percent of the k top-ranked vertices, then repeat for the next window.
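A sketch of the first two of these passes (hypothetical helper class; the label file's line number serves as the vertex ID, as established earlier):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Collects the labels for the k top-ranked vertex IDs in two streaming passes. */
final class TopKLabeler {

    static Map<Integer, String> labelsForTopK(File sortedPageRank, File labelFile, int k)
            throws IOException {
        Map<Integer, String> wanted = new HashMap<>(2 * k);
        // pass 1: the vertex IDs of the first k lines ("<id> <rank>") of the sorted pagerank file
        try (BufferedReader in = open(sortedPageRank)) {
            String line;
            for (int i = 0; i < k && (line = in.readLine()) != null; i++) {
                wanted.put(Integer.valueOf(line.substring(0, line.indexOf(' '))), null);
            }
        }
        // pass 2: stream the label file; the line number is the vertex ID
        try (BufferedReader in = open(labelFile)) {
            String label;
            int id = 0;
            while ((label = in.readLine()) != null) {
                if (wanted.containsKey(id)) wanted.put(id, label);
                id++;
            }
        }
        return wanted; // pass 3, rewriting the pagerank lines with labels, is omitted here
    }

    private static BufferedReader open(File f) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new FileInputStream(f), StandardCharsets.ISO_8859_1));
    }
}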


Result (after running 1.5 h):

2975527 88015.30035905691 Category:Contents
2871935 64327.277553365326 Category:Articles
3071981 55474.79234409293 Category:Fundamental_categories
3088792 45951.642045781715 Category:Glossaries
3128721 44029.60646771444 Category:Indexes_of_topics
3331989 35440.596115563996 Category:Portals
3531429 31417.359109596753 Category:Wikipedia_categories
3393882 20089.214609567432 Category:Scientific_disciplines
2941101 16711.39554202121 Category:Categories_by_parameter
3187846 13390.222514537643 Category:Main_topic_classifications
3410610 12598.013619535843 Category:Society
3530926 12098.623945616266 Category:Wikipedia_adminship
3157301 11845.949560711737 Category:Knowledge
3530914 10949.627242225948 Category:Wikipedia_administrators
3410348 10488.44546322527 Category:Social_systems
3495278 10487.592480528032 Category:Universe
2835844 10482.739027547617 Category:Academic_disciplines
3420399 10398.35772395089 Category:Space
3238348 10262.655312316994 Category:Natural_sciences
3315875 10240.11541557626 Category:Places
3320530 9897.208412376885 Category:Political_geography
3360579 9827.093816212142 Category:Recreation
3171194 9792.17229627507 Category:Life
3269187 9791.021230436309 Category:Outlines


Result (top 100 vertices, after running 62 h):

NodeID PageRank Label
2780207 2108723.6242608824 Category:Articles
2883799 1307863.6947257747 Category:Contents
3036993 1081081.7188695967 Category:Indexes_of_topics
2997064 934209.7160333664 Category:Glossaries
2980253 712770.8801865928 Category:Fundamental_categories
3177459 419846.90270063916 Category:Outlines
3240261 412163.87238747487 Category:Portals
3096118 175693.82863426438 Category:Main_topic_classifications
3403550 174172.61872773524 Category:Universe
3110597 168023.8962877334 Category:Mental_content
3096318 155330.5160364421 Category:Maintenance_categories
3439186 147136.6643203586 Category:Wikipedia_administrators
3439198 131078.93354266914 Category:Wikipedia_adminship
3222175 126926.17194982321 Category:Philosophy_of_mind
3079466 124910.81448503835 Category:Life
2744116 111849.98393772673 Category:Academic_disciplines
3437065 111623.11715770181 Category:WikiProject_Philosophy
3146727 108137.71011140377 Category:Nature
3042125 107687.78167120714 Category:Intention
3302154 104531.64815135897 Category:Scientific_disciplines
3441805 93590.36093727 Category:Wikipedia_philosophy_maintenance
3439701 89978.86464022714 Category:Wikipedia_categories
3103729 89383.42334018843 Category:Matter
2818558 83404.77045781519 Category:Branches_of_philosophy
2880844 80889.37543366145 Category:Concepts
3358392 72350.52522479826 Category:Systems
3119915 70872.1040243315 Category:Mind
2849373 70674.27663519335 Category:Categories_by_parameter
2882746 68041.90831119369 Category:Consciousness
3438697 66265.19140006597 Category:Wikipedia
2880852 65549.74946779195 Category:Concepts_in_metaphysics
3171915 64966.64967255687 Category:Ontology
3222185 61227.97770005321 Category:Philosophy_of_psychology
3373442 57716.25467949584 Category:Thought
3029656 50671.17051489544 Category:Humanities_WikiProjects
3350140 49300.59773344442 Category:Subfields_by_academic_discipline
2869978 44958.53199893712 Category:Cognition
3439179 44484.85083396721 Category:Wikipedia_administration
3110621 43859.02045045872 Category:Mental_processes
2980255 43565.96282541698 Category:Fundamental_physics_concepts
3222146 43335.721761853674 Category:Philosophy_by_field
3222121 42326.13778250622 Category:Philosophy
3065573 41827.839353669944 Category:Knowledge
2784337 40220.338375211504 Category:Astronomical_dynamical_systems
3318953 37331.80314073502 Category:Sociological_theories
3146620 34690.9623576251 Category:Natural_sciences
2907770 33300.40379630048 Category:Dimension
2745493 33177.90620493102 Category:Action
3267072 33156.63971786569 Category:Reality
3441605 33068.965815206735 Category:Wikipedia_help
3347308 32943.34859192643 Category:Structure
2869992 32001.40684975351 Category:Cognitive_science
3318620 31009.72544626944 Category:Social_systems
3301557 30126.710741762057 Category:Science
3328671 29424.58965449175 Category:Space
3012823 29197.537021256754 Category:Help
2880849 28774.454279538248 Category:Concepts_in_epistemology
3222754 27416.655264457786 Category:Physical_sciences
3111445 27371.09666000233 Category:Metaphysics
2936802 26832.47364175895 Category:Epistemology
2992165 26544.0247451408 Category:Geometric_measurement
3222094 26347.66302652885 Category:Philosophical_concepts
2849403 25249.013139105606 Category:Categories_by_topic
2972594 24980.308520029124 Category:Form
3060015 24015.872989594434 Category:Justification
3175994 23969.624611874016 Category:Origins
3021941 23695.602474492614 Category:History_of_Wikipedia
3318882 23220.695758861857 Category:Society
3438689 23167.79102155423 Category:Wikimedia_projects
2880846 22356.276966528687 Category:Concepts_by_field
3044287 22292.798004987188 Category:Introductory_physics
3328904 21977.881941643973 Category:Spacetime
3440171 21972.295142919567 Category:Wikipedia_categorization
3221256 21671.39925891586 Category:Phenomena
3041065 21626.50979617279 Category:Information
2885285 21054.214256034218 Category:Corporate_groups
3302124 20420.063985522265 Category:Science_studies
2976993 20198.30650916638 Category:Free_will
2976784 19978.7972533183 Category:Free_encyclopedias
2773231 19938.66237846059 Category:Applied_disciplines
3440174 19509.32987565837 Category:Wikipedia_category_cleanup
2804970 19021.496310405073 Category:Behavior
3127940 18741.74935797963 Category:Motion
3222156 18629.309499195057 Category:Philosophy_maintenance_categories
3042235 18138.194650387893 Category:Interdisciplinary_fields
2779035 17872.755071764765 Category:Art_WikiProjects
3441821 17412.74471237936 Category:Wikipedia_project_help
2885305 16420.166860901958 Category:Corporatism
3446345 16416.92937260458 Category:Wikipedians_by_Wikipedia_status
2784667 16304.681701776566 Category:Astronomical_sub-disciplines
3318950 15960.418504374156 Category:Sociological_paradigms
3438645 15662.069680599265 Category:WikiProjects
3029668 15646.135571567413 Category:Humans
3440169 15445.868254746832 Category:Wikipedia_categories_that_should_not_contain_articles
2923073 15201.925331620165 Category:Education
2898870 14984.738439080706 Category:Data
3065588 14945.782299766714 Category:Knowledge_sharing
2806443 14940.130723693777 Category:Belief
3441613 14612.720339250349 Category:Wikipedia_history
2804983 14578.308647302452 Category:Behavioural_sciences


Figure 1: Top 100 pages (all categories) as a tag cloud scaled according to PageRank value

Figure 2: Top 100 pages (all categories) as a tag cloud scaled according to sqrt(PageRank value); the term "Category:" has been stripped from all labels

The complete top 100,000 pages are available for download from [10].

More Statistics
During the 62 hours we ran PageRank with 8 threads, we processed 1,195 million atomic steps, of which 441 million were handling a dead-end. Each dead-end handling requires updating 10,000 doubles, which takes a little time. Overall we can crunch ca. 19 million vertices per hour on the MBP. Each vertex has been visited on average 68 times, which is not that much. To further increase performance, ways to speed up dead-end handling should be investigated. Maybe generating random numbers is too slow? The VisualVM CPU profiler was of no big help here.

Memory Scalability Analysis
Step 1: O(1), > 0.5 GB
Step 2: O(1)
Step 3: O(V), depends on the labels’ “shape”
Step 4: O(1)
Step 5: O(1)
Step 6: O(V + E)
Step 7: O(1)
Step 8: O(k), k <= V


The critical steps are those dealing with labels and the PageRank computation itself. Assuming the labels are not the problem, how large a graph can we run PageRank on with a MacBook Pro with 8 GB? Let's say we can run a JVM with 6 GB. The cost formula is V*8 (the PageRank values as doubles) + V*8 (one array reference per vertex for its outgoing edges) + E*b (the outgoing edges themselves, where b is log_2(V) bits rounded up to whole bytes). Here are some interesting values for V and E to get a feeling for what fits in 6 GB of RAM (a small helper to reproduce these numbers follows the table):

Vertices | Bits theoretically required per vertex ID | Bytes used per vertex ID | Edges in total | Edges per vertex | Memory required (GB)
17,466,171 | 24.04 | 4 | 183,626,394 | 10.51 | 0.94
10,000,000 | 23.25 | 3 | 2,094,150,315 | 209.41 | 6.00
100,000,000 | 26.57 | 4 | 1,210,612,736 | 12.10 | 6.00
200,000,000 | 27.57 | 4 | 810,612,736 | 4.05 | 6.00
400,000,000 | 28.57 | 4 | 10,612,736 | 0.02 | 6.00
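The rows above can be reproduced with this small helper (a sketch; GB here means 2^30 bytes, and array headers and other constant overheads are ignored, as they are negligible at this scale):

/** Memory needed for the PageRank doubles plus the int[][] adjacency graph. */
final class GraphMemory {

    static long requiredBytes(long vertices, long edges) {
        int bitsPerId = 64 - Long.numberOfLeadingZeros(vertices - 1); // ceil(log2(V))
        int bytesPerId = (bitsPerId + 7) / 8;                         // rounded up to whole bytes
        return vertices * 8        // one double per vertex (the PageRank values)
             + vertices * 8        // one array reference per vertex in the int[][]
             + edges * bytesPerId; // the outgoing edges themselves
    }

    // GraphMemory.requiredBytes(17466171L, 183626394L) / Math.pow(2, 30) ≈ 0.94,
    // matching the first row of the table.
}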

So a Wikipedia-like dataset with 5 times more vertices would still fit on my MacBook.


Conclusion
The goal of this article was to explore how much “big data” can be crunched on a single machine. Actually doing it was a lot of fun last Christmas*. Now I have finally found the time to write it up properly. I hope you enjoyed it.

*especially at the moment when a test run top-ranked Category:Consciousness :-)

References

[4] http://wiki.dbpedia.org/Downloads38 - Datasets were extracted from Wikipedia dumps generated in late May / early June 2012.
[10] https://dl.dropboxusercontent.com/u/3460852/pagerank/dbpedia_en.pagerank-labeled.bz2


1 comment:

  1. A slightly related article about assigning articles to macro-categories: http://airlab.elet.polimi.it/images/3/3e/Macro-categories.pdf
