Tag Archives: talend

Talend: Processing Large Data when Using tUniqRow

In Talend, tUniqRow is a useful component that allows you to filter out only the distinct rows from a set of data. While using it, you may run into a Java heap space issue (java.lang.OutOfMemoryError) if the data to be processed is very large (millions of rows).

One possible solution is to increase the JVM heap size so you can process more records.

Figure: JVM setting

However, there is a limit to this setting depending on the physical memory available in the system, and setting it too high will cause performance issues for the system.
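For reference, the heap size is controlled through the job's JVM arguments (Run view, Advanced settings, "Use specific JVM arguments"); a minimal sketch, with purely illustrative values that should be adjusted to the memory available on your machine:

-Xms256M
-Xmx4096M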

Another way is to use the "Use of disk" option in the tUniqRow advanced settings. With this option, data will be stored temporarily on the local disk drive and Talend will process the data using those files instead.

Figure: tUniqRow advanced settings

This approach is more memory-efficient and prevents Talend from using excessive system memory to process the data.


Talend tSalesforceOutputBulkExec Component

Salesforce supports cascading update and insert (upsert) operations through a reference key. In Talend, we can perform a cascading UPSERT by using the tSalesforceOutputBulkExec or tSalesforceOutput components. The benefit of using this function is that it reduces the number of API calls from Talend to Salesforce; imagine the number of calls needed if there are millions of records in a data migration. In order to use this function, we need an External Id.
For example, when we upsert a Contact record and want it assigned to an Account record, all we need to do is pass in the Account External Id. This will be used to find the related Account record and assign it to the Contact via the lookup field.
Now, let’s try to upsert some sample Contact records into Salesforce.
Contact.csv
Figure: sample Contact.csv data
As you can see, there is a column called accountId, which holds the Account's External Id. Later in the Talend process we will use this value to perform the cascading upsert.
Account.csv
Figure: sample Account.csv data
Before running the following process, the Account.csv data has already been preloaded. Below is the sample upsert job in Talend.
Figure: sample upsert job in Talend
Below is the mapping result:
Figure: tMap mapping result
How to Configure:
In the tSalesforceOutputBulkExec component, select the 'Advanced settings' tab and focus on the 'Relationship mapping for upsert' section.
Figure: Relationship mapping for upsert
In 'Column name of Talend Schema', select the external id value from the map; in our case, the AccountId from the contact flow (from the mapping) will be the external id. Secondly, the 'Lookup field name' is the lookup field between the Contact and Account objects in Salesforce, so it will be the Account field on the Contact object. Do not be confused by the 'External id name': it indicates the external id field on the Account object, not the Contact object. Below is the complete configuration:
Figure: complete relationship mapping configuration
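Under the hood, this relationship mapping makes the bulk data that Talend uploads reference the parent Account through its external ID. Conceptually, the Salesforce Bulk API expresses a lookup-by-external-ID as RelationshipName.ExternalIdField in the column header; a rough sketch, where the field name Account_Ext_Id__c and the data values are purely illustrative:

LastName,Account.Account_Ext_Id__c
Smith,ACC-001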
We are finally done. Check out the results!
Figure: upsert results

Talend tContextLoad

In Talend, you can use tContextLoad to load values from a properties file into context variables. The concept is the same as a Java properties file. It is extremely handy in situations such as:

  1. Usernames and passwords
  2. Configurable flags
  3. File paths for reading and writing data
Below is an example of how you can use tContextLoad to store the username and password for Salesforce.
A developer needs to create three Talend jobs to upsert data into the Salesforce Account, Contact and Opportunity objects. Instead of keeping the Salesforce credentials in each of these jobs separately, you can use tContextLoad to store them in one properties file.
Step 1: Create the properties file
Create a configuration folder in your Talend folder, name it "conf", and create the properties file inside it.
Inside the properties file, put in the username, password and security token:
sf_username=insert your Salesforce Username here
sf_password=insert your Salesforce Password here
sf_token=insert your Salesforce Security Token here

* The variable names need to match your Talend context variable names.
Step 2: Get properties file data
Now we need to read the data back from the properties file. How can we do it? We can use the tFileInputProperties component to load all the data from the .properties file.
Below is the setting in the tFileInputProperties component:
File Format: choose the file format that you want to read (.properties in this case).
File Name: specify the path and name of the properties file.
Step 3: Set the Context value
First, you need to create context variables whose names match the keys in the properties file.
Finally, use tContextLoad to set all the values from the properties file into the context variables at run time. Below is the final result in your Talend process.
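Once the job runs, the loaded values are available through the context object in any component expression. As a quick (purely illustrative) check, a tJava component placed after tContextLoad could print one of them:

System.out.println("Salesforce user: " + context.sf_username);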

Note: This is just an example of how tContextLoad works. Storing username and password in a clear text file is not recommended in an actual production environment.

Talend – How to use tFileOutputMSDelimited (multi-schema output)

This is a simple guide showing how you can use Talend to generate a complex delimited file (e.g., a combination of parent and related child records) by using the tFileOutputMSDelimited component. In this scenario, the sample output that will be generated is:

Parent Row 1,P1,This is sample description,2013-08-01
Child Row 1,C1,P1
Child Row 2,C2,P1
Parent Row 2,P2,This is sample description,2013-09-01
Child Row 3,C3,P2
Child Row 4,C4,P2

Step 1:
Before we use the tFileOutputMSDelimited component to generate the designated output, we will need to prepare 2 files (we assume that you already know how to do this in Talend 😀 ):

One that contains the unique parent rows, for example:

Parent Name,Parent ID,Description,Date
Parent Row 1,P1,This is sample description,2013-08-01
Parent Row 2,P2,This is sample description,2013-09-01

And one that contains all the related child rows with parent id, for example:

Child Name,Child ID,Parent Ref ID
Child Row 1,C1,P1
Child Row 2,C2,P1
Child Row 3,C3,P2
Child Row 4,C4,P2

Please note that the Parent Ref ID in the child file is very important, as it will be used to match the parent row.

Step 2:
Create a job to read the Parent file and Child file by using tFileInputDelimited components as shown below.

These are the schemas that we configured in the parentFile and childFile tFileInputDelimited components respectively:
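Roughly, the two schemas just mirror the CSV headers; a sketch, where the exact column names and types are illustrative:

parentFile schema: parentName, parentId, description, date (all String for simplicity)
childFile schema:  childName, childId, parentRefId (all String)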

Step 3:
Add a tFileOutputMSDelimited component to the job and link the main outputs of the parentFile and childFile components to it.

Step 4:
Now, let’s configure the tFileOutputMSDelimited component to look like this:

The first schema, which appears as "tFileOutputMSDelimited_1", refers to the parentFile schema. Below is an explanation of the configuration:

  • Parent row – leave this empty because this schema holds the parent data
  • Parent key column – leave this empty because this schema holds the parent data
  • Key column – this is the Parent ID column name specified in the parentFile schema shown in Step 3. This is important as it will be used as the unique identifier to match the child rows.

The second schema, which appears as "row2", is the childFile schema. Below is an explanation of the configuration:

  • Parent row – set it to row1; this tells Talend to group all the related child records under the parent record from row1
  • Parent key column – this is the same as the key column of the parent row
  • Key column – this should be the Parent Ref ID field in the child schema, not the Child ID. The reason is that Talend uses this as the key to group all the related children under their parent.

Once you have done the configuration above, you are good to go.

Talend: Trigger a Job using Socket Listener

We talked about triggering a Talend job by using a file listener previously. Today, let's proceed to the next component, which is tWaitForSocket.

This component uses the ServerSocket class in Java to act as a server that listens on a port and triggers the integration process when there is activity. As you can see in Figure 1, this component does not have much configuration compared to the tWaitForFile component. You just need to specify the following:

  1. the port to listen on
  2. the response message for the client
  3. the action: whether to continue listening on the port or close the socket

Figure 1: tWaitForSocket

Once the configuration is done, you can run the job to have it start listening on the port that you have specified. To test the configuration, you can create a simple Java client as shown:

import java.io.*;
import java.net.*;

public class TestConnection {

    public static void main(String[] args) throws IOException {

        DataOutputStream dos = null;
        BufferedReader in = null;

        int timeout = 2000; // two seconds

        String host = args[0];
        String port = args[1];
        String data = args[2];

        System.out.printf("Triggering Talend job: %s port: %s ....\n", host, port);

        Socket socket = new Socket();
        InetSocketAddress endPoint = new InetSocketAddress(host, Integer.parseInt(port));

        try {

            socket.connect(endPoint, timeout);
            System.out.printf("Success: %s \n", endPoint);

            // Send the trigger message to the Talend job
            dos = new DataOutputStream(socket.getOutputStream());
            dos.writeBytes(data);

            // Read the response message configured in the tWaitForSocket component
            in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            System.out.println("Response from Talend: " + in.readLine());

        } catch (IOException ioe) {

            System.out.printf("Failure: %s message: %s - %s \n", endPoint, ioe.getClass().getSimpleName(), ioe.getMessage());

        } finally {
            // Close the streams and socket; guard against a connection that never opened
            if (dos != null) dos.close();
            if (in != null) in.close();
            socket.close();
        }

    }

}
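To try it out, compile the class and point it at the host and port your job is listening on (the port 8888 and the message below are just examples):

javac TestConnection.java
java TestConnection localhost 8888 "trigger"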

Important note:

This component is useful if you want to trigger a Talend job remotely; however, it does not take or process the data that is sent to the socket.

Talend: Validating XML file with XML Schema (XSD)

We all know that generating XML data in an integration process is easy; however, generating an XML file that complies with the standard of a target application is hard, as the data from the source application may differ from what the target application expects. The best practice to prevent sending invalid XML data to the target application is to validate it against an XML Schema (XSD).

In Talend, you can use the tXSDValidator component to do this for you. As you can see in Figure 1 below, you just need to specify the XML Schema to use and the XML data file to be validated.

Figure 1: tXSDValidator configuration

Here is a sample scenario. I have created XML data that contains "Australia" as the country:

<Contact>
    ...
    <Address>
        <Street>Allan Road</Street>
        <City>Sydney</City>
        <Postcode>5415</Postcode>
        <Country>Australia</Country>
    </Address>
    ...
</Contact>

However, the XML Schema only allows a fixed set of values such as "AU" and "NZ".

<xs:simpleType name="Country">
    <xs:restriction base="xs:string">
        <xs:enumeration value="AU">
            <xs:annotation>
                <xs:documentation>AUSTRALIA</xs:documentation>
            </xs:annotation>
        </xs:enumeration>
        <xs:enumeration value="NZ">
            <xs:annotation>
                <xs:documentation>NEW ZEALAND</xs:documentation>
            </xs:annotation>
        </xs:enumeration>
        ...
        ...
    </xs:restriction>
</xs:simpleType>

When I run the job, I see the following error.

Figure 2: Invalid XML

Once I fix the errors at lines 15 and 26 as shown in the error message, I get a successful validation.

<Contact>
    ...
    <Address>
        <Street>Allan Road</Street>
        <City>Sydney</City>
        <Postcode>5415</Postcode>
        <Country>AU</Country>
    </Address>
    ...
</Contact>


Figure 3: Valid XML

Talend: Returning data from Child to Parent process

Sometimes, I prefer to split a very large and complicated process into multiple sub-processes and have them executed in sequence. This way, I can manage the process better and make my testing as well as troubleshooting easier. Ever wondered how data in such an organised process is passed from one sub-process to another?

In Talend, we can use the tBufferOutput component to achieve this. This component acts as a temporary memory pool that keeps all the data in a child process and returns it to the parent process to continue processing. Let's take a quick look at the simple demo below.

Figure 1: Child process

In the child process shown in Figure 1, I output the data from a tFixedFlowInput component into a tBufferOutput component. By doing this, Talend will keep the data in memory temporarily.

Now, what I need to do is call the child process from a parent process by using the tRunJob component and connect its output row to the next component as shown in Figure 2. In this demo, I will use a tLogRow component to print the data that I get from the child process.


Figure 2: tRunJob component in Parent process

Once the configuration is done, I can run the Parent process and Talend will print out the data from the Child process automatically.


Figure 3: Result

Talend: Trigger a Job using File Listener

In Talend, there are three useful listener components that we can use to implement an integration process that is executed/triggered on demand. The first listener component that we discuss here is tWaitForFile. This component allows you to listen on a file directory and execute an integration process whenever a file is dropped into the directory, updated, or even deleted from it.

Figure 1: tWaitForFile configuration

In Figure 1, you can see that you can configure how frequently the file listener should run, which directory to scan, the type of file to be processed, as well as the action on a file (created, updated or deleted). Another highlight of this listener is that it will ignore files that have already been processed in the directory even if they are not deleted. This is very handy as you don't need to cater for file deletion logic in the process, and you can keep those files for future reference.

Hopefully this information gives you a quick kick start on how to use the file listener.

Talend: use tSchemaComplianceCheck to validate data

Inserting data from one system into another can be quite troublesome at times, especially due to differences in field lengths between the two systems. However, this can be easily overcome in Talend by utilising the tSchemaComplianceCheck component. This component helps validate the data against the schema that you have defined, and it also allows you to catch the problematic data.

Here is a simple demo on how to use the component. I have the following data in a CSV file:

Name,UserID
Lean,WDCI0001
Deborah,WDCI0000000002

In my database, the length of the UserID field is 8 characters, and this is the schema that I defined in Talend. Please note that the Nullable, Date Pattern and Length attributes are important, as they will be used by the tSchemaComplianceCheck component to validate the data. In this example, I want to validate the length of all the fields that I have, so I have defined the allowed field lengths in the Database component schema accordingly.
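A minimal sketch of that schema, where only the UserID length of 8 comes from the scenario above and the Name length is an assumed value:

Column | Type   | Length
Name   | String | 50
UserID | String | 8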

After that, I map the output row from the tMap component to the Database component through tSchemaComplianceCheck. In the tSchemaComplianceCheck settings, I set the mode to "Check all columns from schema". Once everything is set, I should see the second row of data from my CSV file being rejected by tSchemaComplianceCheck because the UserID exceeds the maximum allowed length, as shown in the screenshot below.

Talend Tips: User Defined Map Function

We have discussed a lot of Talend Integration Studio's components previously. Today, let's move on to another interesting tip where we define our own custom function and use it in a map (a user defined map function).
To create a user defined map function, first you will need to create a custom Routine in Talend as shown in Figure 1 below. In this demo, we named the Routine DemoFunction.

Figure 1: Routine in Talend
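Since Routines are plain Java classes with static methods, the body of DemoFunction might look something like the sketch below; the method body and the Talend-style Javadoc tags here are illustrative, reconstructed from the result described later in this post:

public class DemoFunction {

    /**
     * udfExample: appends a fixed sentence to the given text.
     *
     * {talendTypes} String
     * {Category} User Defined
     * {param} string("Talend Demo") message: the input text
     * {example} udfExample("Talend Demo")
     */
    public static String udfExample(String message) {
        // Return the input text followed by a fixed sentence
        return message + ". This is the \"User Defined Function\"";
    }
}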

Please note that it is important to provide an informative comment in the Routine function, as this will help us understand the usage of the function. Now that we have created the Routine, let's take a look at how to call it in a map.

In the tMap component, you can call the user defined function that you have just created by clicking on the Expression Builder, choosing *User Defined from the Categories, and double-clicking on the function "udfExample"; it will then appear in the Expression Builder.
Figure 2: Expression Builder in tMap

Once you have added the user defined function, you just need to pass in a parameter and it will return the result. For example, if you pass in the String "Talend Demo", it will return "Talend Demo. This is the "User Defined Function"" as the result.
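In the expression itself, the call is simply the routine class followed by the method, for example (the input row and column names are illustrative):

DemoFunction.udfExample(row1.Name)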

Hopefully this gives you an idea on how to start building your own user defined map function.