I ran a phpBB3 forum for several years and activity has been slowing down step by step. Now I thought I'd get rid of that old, security-impaired (read seldomly updated) phpBB3 forum and mirror it as plain HTML files. After Googling a bit I found an entry at superuser.com stating that I could use wget which I thought of as "brilliant"!


wget -k -m -E -p -np -R memberlist.php*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*sid*,*view=print*,*start=0* -o log.txt 'http://www.example.com/forum/'

I tried to fetch the whole forum while excluding many of the useless pages (reporting posts, sending private messages, ...).

Soon I noticed that I got nearly *no* pages - all were deleted because of the reject list. Checking the logfile it turned out that I had a session id (sid) in all the URLs and they are rejected as per the -R parameter above.
Also it did not even try to fetch many pages as they were for registered users only.
It took me quite some time and some more Googling to find that I needed to have my wget register as a bot as these would not receive a session ID. Using the cookie mechanism in wget did not work. I tried but failed miserably as I soon got a new session id as the downloading took some time.

So next step was to protect the forum via .htaccess against unwanted visitors. Next was adding a new UserAgent string to my phpBB3.

phpbb3 - add bot

As in the picture above you need to go to the administrative panel and choose "Spiders/Robots" and add your own. I chose "FabCopy".

Also I chose to extend the access rights of the "Bots" group to allow access to all forums and the attachments of posts (embedded images and so on). You need to do the same both on the "group-specific forum rights" and the group rights in the first section "global permissions". For me "allow downloading attachments" was the main thing.


phpbb3 - allowing bots to access more forums phpbb3 - allowing bots to download attachments

After some more tuning I got to this command to mirror the whole phpBB3 as HTML:

screen wget -U 'FabCopy' -k -m -E -p -np -R memberlist.php*mode=email*,faq.php*,viewtopic.php*p=*,posting.php*,search.php*,ucp.php*,viewonline.php*,*view=print*,*start=0*,mcp.php*,report.php* --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt --user=username --password=**** -o log.txt 'http://www.example.com/forum/index.php'

So a bit of insight on this one:

  • screen = as it may take some time I've put the command in a screen. This will keep it going even if I loose the connection to my shell.

  • -U 'FabCopy' = The user agent. It should match what you chose for your bot in phpBB3.

  • -R = The reject list. I wanted to include the profile pages (memberlist.php) but not sent e-mails to people - so I excluded the email mode. mcp.php is used to sent messages to people - I also did not want that.

  • --user -- password = This is for the .htaccess protection.

  • -o log.txt = As the screen will die once the command is finished I would not see any log messages. Sending the log to disk allows you to check on the progress easily and debug any errors.

Hope this helps others facing a similar issue.

I've been working with PDI (aka. Kettle) for some time and came across the need to URLEncode a link.

Here's how I did:

var url = new org.apache.commons.httpclient.URI(link).toString();
if(new org.apache.commons.httpclient.URI(link).getScheme() == null) {
url = 'http://'+url;

Just put this code in a JavaScript-Step - it will convert the link-field to an url-field which has been URL-Encoded. If you're missing the schema in your link it will be set to "http". It uses some Java-Code for that.

Es ist offiziell! Nach dem großen Erfolg des Remake von Monkey Island 1 wird nun auch der zweite Teile aufgehübscht.

Link zur offiziellen Monkey Island Website

Ich find das natürlich Klasse. Ich hab den ersten Teil gekauft und werde mir auch den zweiten zulegen um mich diebisch über dieses tolle Spiel in aktueller Grafik zu freuen.

Alright... I've shown you how to build a PDI executable from SVN. Now we should use this, shall we?

Why create a date dimension?

This question was discussed many times and Google will get you an insight. I had to discuss this myself at more than one occasion both in my projects and at conferences like TDWI's European Conference at Amsterdam.
To cut this short: I think that a date dimension can offer you much more value than the database or your reporting tool can offer. Both know nothing about the sales seasons your business follows. Neither of them know about the fiscal year cycle you have to follow. And I doubt that many tools know when Easter Sunday is.
All this and almost anything else you can think of will be handled by a date dimension.

How does a date dimension look like?

Like any good dimension our date dimension will have a key (ID) and lots and lots of attributes. Each row in the dimension will represent exactly one day. We will add columns for different formats of a date - this is not necessary for many of you as your tools know how to format a date, but it will make it easier to enforce a standard upon all reports from all systems. Plus, some tools are not aware of special stuff like a week's year - yeah, weeks do have their own year.
Also we want to add stuff that SQL can't tell us. In this example I'll show you how to add the Christian holidays like Easter which have no fixed date. You can easily add your own flavour of holiday or season.

ID? I heard of surrogate keys and smart keys...

Yeah... another old discussion. We'll go with smart keys as they do come in handy. And no, we'll not use the date itself as a key. Too tempting and too fragile if you add in stuff like 'n/a', 'unknown' or 'malformed in the input'.

The set up

Okay... you'll need PDI for this. Go to Pentaho and download it. Or build your very own version like I told you before.
At the end of the article I'll link the transformation - if you're lazy and know your way around data warehousing and Pentaho's kettle tool, skip ahead and miss all the fun.
Still reading? Good - I'll walk you through this and explain stuff.


This is how the transformation looks like in Spoon. There are basically two streams - on the left side single days are flying by and they are enhanced with more and more information about the date they represent.
On the right hand side we have single years and we're adding information about special dates (like Easter Sunday) to each of the years.

Generating days

We use a "Generate rows" step to do this. All we need is a date field that contains the date of the very first day of our date dimension. Make sure you're looking far enough in the past. Think of people's birthday in your CRM system, think of company's create date in your customer database, think of "When was the last time I was home early?" - see... you'll need to go far into the past for this to work!
Remember to set the format - that's important.

We'll add a sequence to our date. It will start with the number 0 and we'll add one for each row that passes this step. This will allow us to have different dates than just the first of January 1900 we've set in the "Generate rows" step.

More attributes, please!

As I wrote above we'll add some nice formatting and we start with 'month in year', 'ISO week number' and stuff like that. PDI offers the "Calculator" step for many date based calculations and we're using it now.
06 HowTo Create a Datedimension in PDI

The most important column we're adding to our date row is the one called "iso_year_of_week' - calculating the year a week falls into is not trivial and you'll need it most likely.
I didn't set the type for these columns - it's safer to set it but I was lazy.

The next step is to add our date-type date in different formats to our row. This is a heavily-debated topic. I wrote some words about this above and I'll stick with: I think it's a good idea.
So what we do is: we add a "Select values" step and use it's 'Metadata' tab to change the format of our columns. We're using Java-syntax for this - and be aware this whole thing it locale aware. So while I'm getting "Januar" you might get "January" or "Janvier" or something.
07 HowTo Create a Datedimension in PDI

This is where I tricked you! Have you noticed that I'm changing the format of columns that were never created? Well... I created them on-the-fly in the 'Select & Alter' tab of this step. You can specify a source column multiple times if you remember to rename it. Now you get copies of that column and you can alter their meta-data on the right-most tab.

Do I really need to work!?

Now this is the nicest part of the date dimension - weekends! I love them and therefore they must be included. What I'm using is probably not the most straight forward way but it came to my mind.
So what we're doing is this: we create two rows that contain information about weekends - one row for Saturday and one for Sunday. They need some kind of key so we can join the information with what we have and all the attributes we like about weekends (I've only put is_weekend there).
I did this with two separate "Generate rows" steps that get pushed in a "Dummy" step - this adds two streams in a random way (there is also a step to add them one after the other).
08 HowTo Create a Datedimension in PDI

Now we need to join these two rows with the hundreds and thousands of lines we're getting from our initial "Generate rows" step that produced all these single day rows.
We're using a "Stream Lookup" step for this as it's perfect for this. It stores the two rows coming from the step indicated in the 'Lookup step' dropdown box. We need to specify the join key (day_of_week is the same as day_in_week in our example) and we can cherry-pick the attributes we'd like on our thousands of rows. Just add them where I have the 'is_weekend' column.
09 HowTo Create a Datedimension in PDI

Can I have a look?

We've been doing a lot so far. We created all those single days, added attributes with the "Calculator" step, added the date in different formats and we've put the weekend flag in there. It's time to have a look, ey?
Right-Click the last step (Stream Lookup in this example) and choose to preview. It should look like this:
10 HowTo Create a Datedimension in PDI

Eastern - the base for many holy days

Some dates are not based on their numeric value but have religious roots. In the Christian religion they are based on the date of the Easter Sunday. In order to have a nice date dimension we will add this special day and some of the days we can derive from it.
They way we're doing this should work for other calendars as well (Orthodox Christians have their own way to calculate theses days, other religions like Islam have a completely different set of calandaric calculations).
First of all we create rows representing single years now. We then want to add the date of Easter Sunday and it's derivatives. We also use these single years to add some fixed holidays in there as well - like New Year's Eve.
Have a look at the overview:
11 HowTo Create a Datedimension in PDI

We're doing the same thing as before - create rows with a base date, add a sequence to it creating single years and then the fancy stuff.

Calculating Easter Sunday

13 HowTo Create a Datedimension in PDI

We're using a well known formula to calculate the date of Easter Sunday. As performance is not of the essence in this one-time executed transformation we're using the "Java Script" step which is generally slow but allows you to do a lot of stuff. Mind you: if you find a combination of other steps it will most likely be faster to use these instead of the JavaScript.

Now we have single year rows that have the date of Easter Sunday as an attribute.

Adding the derivatives

Again we're using the "Calculator" step. We create temporary columns with the number of days a derivative is 'off' Easter Sunday and use these to calculate the proper date in the year we're handling.
14 HowTo Create a Datedimension in PDI

Make it joinable

We're getting to a difficult step now - how do we join our single-year-rows with our single-day-rows based on a number of columns that contain the holiday's date?
Well... we don't. What we need to do now is transform each row into multiple rows of which each represents one day (one of our holidays). Thus each year-row is split into many day-rows that can be easily joined.
For this we use the "Row Normalizer" step:
15 HowTo Create a Datedimension in PDI

The trick is to give the same 'new field' to all of our dates - thus we create just one field with many different values. The 'Type' column will help us identify our holiday later on.

Now let's have a look at before and after:
16 HowTo Create a Datedimension in PDI

17 HowTo Create a Datedimension in PDI

Again we use the "Stream Lookup" to join our many-single-days stream with our few-special-days stream.

Ordinary holidays

As mentioned before there are also ordinary holidays that have a fixed day. Like New Year's Eve on the 31st of December. We also create those.
To do this we copy all our single-year-rows before we calculate Easter Sunday and put these lines into a "Calculator" step.
Again we add temporary columns, this time strings having the day and month part of our date. Adding these to our year column we get a string representing the proper date. We just have to set a format (yyyy-MM-dd) and we can specify the Date type of the columns.
19 HowTo Create a Datedimension in PDI

Do I still have to work!?

Now... with the weekend flag we have a nice indicator telling us that we're facing a day off or two - with the religious holidays and our ordinary holidays added we can also add a flag 'holiday' to our columns. I decided to add some more logic to it. Each day can be a moveable holiday, a fixed one, both or nothing of this.
Again we're using the "Java Script" step to calculate this logic:
20 HowTo Create a Datedimension in PDI

One last look at it...

21 HowTo Create a Datedimension in PDI

So this is what we're getting. A stream of single days with lots of indicators, lots of formats and attributes. The stream knows about movable and fixed holidays.
You can now start adding stuff specific to your use case - add fiscal dates, add sales seasons, add important sports events if you're selling beer. You should now have the tools at hand and a rough idea on how to use them.
Please leave me a note in the comments if you find worthwhile attributes you'd like to share or that you're having a hard time implementing - maybe I can help.

The transformation

All the lazy people: welcome back, you missed a lot!
All the others: You earned it. Get your copy now.

Example transformation to create a date dimension