
How to do logging in PowerShell without 'file is locked' exceptions

While I was working on a simple script that calls a REST API to pause and resume the monitoring of devices, we ran into trouble because the scripts were executed during the startup and shutdown events of the various workstations. So I came up with the idea of logging the script execution to a file. This was fine, but my customer reminded me that a mass of workstations will start up in parallel and we might get the exception ‘The process cannot access the file ‘xxx’ because it is being used by another process.’.

I had two problems:
1. How to test parallel file access
2. How to solve the 'file is being used by another process' exception, should it turn out to be a problem.

First, I developed my simple Log4PowerShell function, which writes log entries to a CSV file:

function Write-Log {
	[CmdletBinding()]
	param(
		[Parameter()]
		[ValidateNotNullOrEmpty()]
		[string]$Message,

		[Parameter()]
		[ValidateNotNullOrEmpty()]
		[ValidateSet('DEBUG','INFO','WARN','ERROR')]
		[string]$Severity = 'INFO'
	)
	[pscustomobject]@{
		Date = (Get-Date -Format "dd.MM.yyyy")
		Time = (Get-Date -Format "HH:mm:ss.fff")
		Severity = $Severity
		Message = $Message
	} | Export-Csv -Path "C:\Temp\PowerShell-Log.csv" -useCulture -Append -NoTypeInformation
}
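
A quick call just to see the shape of a log row (the message text is my own example; the folder C:\Temp must already exist, because Export-Csv creates the file but not the directory):

Write-Log -Message "Monitoring paused" -Severity WARN
# With -UseCulture, the delimiter follows the current culture,
# e.g. ';' on a German/Swiss system and ',' on an en-US system.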

Next, I was thinking about how to test this. The Write-Log function writes entries to the file C:\Temp\PowerShell-Log.csv, and I now want to force parallel access to that file. For that, the Start-Job cmdlet is the right choice. The Start-Job cmdlet starts a PowerShell background job on the local computer. That’s exactly what I want, but not only one job: I want to start 20 jobs in parallel.

One of the simplest calls of Start-Job is: Start-Job [-ScriptBlock] <scriptblock> [-ArgumentList <object[]>].
So we will try that with:

Start-Job -ArgumentList 10 -ScriptBlock {param ($i) Write-Output $i}

As a result, we will get output like this:

Id Name PSJobTypeName State HasMoreData Location Command
-- ---- ------------- ----- ----------- -------- -------
123 Job123 BackgroundJob Running True localhost param ($i) Write-Outpu...

To get the output of the job, we have to use the Receive-Job cmdlet with the Id of the job:

Receive-Job -Id 123

How do we run this 20 times in parallel? We could use a simple loop, but then we would get a list of job Ids and would have to wait until they are finished and then call Receive-Job for each of them separately. It is better to collect the jobs in an array and wait for them all using Wait-Job -Job and Receive-Job -Job:

$jobs = @()
(1..20) | %{$jobs += Start-Job -ArgumentList $_ -ScriptBlock {param ($i) Write-Output $i}}
Wait-Job -Job $jobs | Out-Null
Receive-Job -Job $jobs

The result will be a simple output from 1 to 20.

Now that we have solved that, let’s try out the logging function Write-Log. For that, we create a $ScriptBlock = {...} variable which also contains the Write-Log function:

$ScriptBlock = {
    param ($init)
    # ---------------------------------------------------------------
    # Log4PowerShell function
    # ---------------------------------------------------------------
    function Write-Log {
        [CmdletBinding()]
        param(
            [Parameter()]
            [ValidateNotNullOrEmpty()]
            [string]$Message,

            [Parameter()]
            [ValidateNotNullOrEmpty()]
            [ValidateSet('DEBUG','INFO','WARN','ERROR')]
            [string]$Severity = 'INFO'
        )
        [pscustomobject]@{
                Date = (Get-Date -Format "dd.MM.yyyy")
                Time = (Get-Date -Format "HH:mm:ss.fff")
                Severity = $Severity
                Message = $Message
        } | Export-Csv -Path "C:\Temp\PowerShell-Log.csv" -useCulture -Append -NoTypeInformation
    }
    $thread = $init
    $start = Get-Date
    (1..30) | % { Start-Sleep -Seconds 1; $init +=1 ; Write-Log -Message "Thread: $($thread) Step: $($_)." -Severity INFO}
    $stop = Get-Date
    Write-Output "Counted from $($init - 30) until $init in $($stop - $start)."
}
$jobs = @()
(1..20) | %{$jobs += Start-Job -ArgumentList $_ -ScriptBlock $ScriptBlock}
Wait-Job -Job $jobs | Out-Null
Receive-Job -Job $jobs

When we execute this in a PowerShell console window, we will get a mass of exceptions like this:

The process cannot access the file 'C:\Temp\PowerShell-Log.csv' because it is being used by another process.
    + CategoryInfo          : OpenError: (:) [Export-Csv], IOException
    + FullyQualifiedErrorId : FileOpenFailure,Microsoft.PowerShell.Commands.ExportCsvCommand
    + PSComputerName        : localhost

To solve this, I simply separated the creation of the row that should be written to the CSV file from the export itself, and wrapped the export in a try/catch inside a bounded retry loop (up to 1,000 attempts with a 10 ms pause between retries, so it gives up after roughly ten seconds of constant contention). So I just substituted the Write-Log function in the above code with:

$ScriptBlock = {
    param ($init)
    # ---------------------------------------------------------------
    # Log4PowerShell function
    # ---------------------------------------------------------------
    function Write-Log {
        [CmdletBinding()]
        param(
            [Parameter()]
            [ValidateNotNullOrEmpty()]
            [string]$Message,
    
            [Parameter()]
            [ValidateNotNullOrEmpty()]
            [ValidateSet('DEBUG','INFO','WARN','ERROR')]
            [string]$Severity = 'INFO'
        )
        $data = [pscustomobject]@{
                Date = (Get-Date -Format "dd.MM.yyyy")
                Time = (Get-Date -Format "HH:mm:ss.fff")
                Severity = $Severity
                Message = $Message
        }
        $done = $false    
        $loops = 1
        While(-Not $done -and $loops -lt 1000) {
            try {
                $data | Export-Csv -Path "C:\Temp\PowerShell-Log.csv" -useCulture -Append -NoTypeInformation
                $done = $true
            } catch {
                Start-Sleep -Milliseconds 10
                $loops += 1
            }
        }
    }
    $thread = $init
    $start = Get-Date
    (1..30) | % { Start-Sleep -Seconds 1; $init +=1 ; Write-Log -Message "Thread: $($thread) Step: $($_)." -Severity INFO}
    $stop = Get-Date
    Write-Output "Counted from $($init - 30) until $init in $($stop - $start)."
}
$jobs = @()
(1..20) | %{$jobs += Start-Job -ArgumentList $_ -ScriptBlock $ScriptBlock}
Wait-Job -Job $jobs | Out-Null
Receive-Job -Job $jobs

Now check the CSV file. I recommend adding an additional column ‘Number’ with the values 1..600. When you then sort ascending by the ‘Time’ column, you will see that the ‘Number’ column is no longer in sequence, and you will also see many identical times.
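
Here is a minimal sketch of how such a check could look (the column names match the Write-Log function above; the 600 rows come from 20 jobs writing 30 entries each):

# Number the rows in the order they were appended, then sort by 'Time'
$n = 0
Import-Csv -Path "C:\Temp\PowerShell-Log.csv" -UseCulture |
    ForEach-Object { $n += 1; $_ | Add-Member -MemberType NoteProperty -Name Number -Value $n -PassThru } |
    Sort-Object Time |
    Select-Object -Property Number, Time, Severity, Message -First 40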

I hope this helps one or the other of you.


List latest files from all directories in a given path using PowerShell

PowerShell is mostly used to execute scripts. But as everyone knows, PowerShell is also great for interactive use. For that, the best choice is the PowerShell Integrated Scripting Environment (PowerShell ISE).

But what brought me to this idea? I just wanted to know how much activity there is in certain directories, by listing all the new files. To work with files, two of the relevant PowerShell cmdlets are Get-Item and Get-ChildItem. By piping the result to other cmdlets (or aliases), the job can be done.

Here it is:

# List the newest file from each directory in path
Get-ChildItem -Path <Drive & Path where to start> -Recurse | group directory | foreach {$_.group | sort LastWriteTime -Descending | select -First 1}

When you look over the result, you will probably notice that some files are already years old. That means that for years there have been no new files and no activity in those folders. To address that, it’s possible to filter the returned files by date. If you don’t expect hundreds of new files, you can remove the | select -First 1 at the end of the statement or raise the value to maybe 5 or 10. Have a look at this:

# List the newest file from each directory in path
Get-ChildItem -Path <Drive & Path where to start> -Recurse | group directory | foreach {$_.group | sort LastWriteTime -Descending | where LastWriteTime -GE (Get-Date).AddDays(-7) | select -First 5}

When the result is spawned over many folders, then maybe the result is better returned as a complete list of files with the full name. Have a look at this:

# List the newest file from each directory in path
Get-ChildItem -Path <Drive & Path where to start> -Recurse | group directory | foreach {$_.group | where LastWriteTime -GE (Get-Date).AddDays(-7) | sort LastWriteTime -Descending | select FullName, LastWriteTime -First 5}

The samples above work fine in PowerShell 5.1. The simplified Where-Object syntax they use requires at least PowerShell 3.0; if you use an older version of PowerShell, you have to change the samples to:

# List the newest file from each directory in path
Get-ChildItem -Path <Drive & Path where to start> -Recurse | group directory | foreach {$_.group | where {$_.LastWriteTime -GE (Get-Date).AddDays(-7)} | sort LastWriteTime -Descending | select FullName, LastWriteTime -First 5}

The Data Science, Big Data, Data Analytics, Artificial Intelligence and Machine Learning Hype

Not only in Gartner’s Hype Cycle for Emerging Technologies but in nearly every blog and newsletter, the topics Data Science, Data Analytics, Big Data, Artificial Intelligence (AI) and Advanced Machine Learning (ML) have been number one for some months. The hype about these technologies is at its peak. Smart Factory (Industry 4.0) also contributes to this, because one of the four pillars of Smart Factory (Industry 4.0) is Data Analytics and Big Data.

But how do all these relate to each other?
The basis for all the listed topics is data, which is first created and collected from various sources (sensors on machines, user behavior on websites, applications and computers, and many more), then archived and finally analyzed to answer specific questions, to find patterns or to reveal special constellations.

Data is the golden asset of a company in the future, and it’s very important to save and archive the data now. It’s absolutely worthless to tell everybody that we could have all the data, e.g. for transactions, customer behavior, machine processes and application logs, if we don’t activate or install the necessary sensors and don’t store and archive this data. Only when as much data as possible is saved, from the start to the end of a process and including the data of the final result, can a person called a Data Scientist use this data and try to answer questions which cannot be answered otherwise. This leads EMC to the prediction that “the amount of stored data is growing faster than ever before and experts states that by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet” [1].

But what is the difference between Data Science, Big Data, Data Analytics, Artificial Intelligence and Machine Learning?
With the recent boom around these topics, a lot of confusion about the terms has also arisen. First of all: there is no single clear definition. Many companies and universities define these terms differently, but most describe Data Science as the overall umbrella over the Data Analytics, Big Data, Artificial Intelligence and Machine Learning topics. Most also use the terms Data Analytics and Data Analysis synonymously.

Big Data refers to large and complex data sets (volume & variety) that are much larger than traditional data sets and are processed at a higher speed (velocity). Volume, variety and velocity (the ‘3Vs’) are the three defining dimensions of big data. For more information about traditional data sets, you might also have a look at Do we still need an Enterprise Data Warehouse?.

When we think about the traditional “3Vs” explained above and widely accepted in the industry as a definition, we recognize that enterprises have been handling them for more than a decade now, without problems. So there must be another definition for Big Data.

I will stay with three Vs, but will emphasize the value we generate for the business out of the analysis of the data. That’s the difference to simply dealing with volume, variety and velocity. So I think we are better served with ‘business value’ as the first ‘V’. Besides that, an important factor for success is to combine your analytic capabilities, your source data and your business needs; with that, our second ‘V’ should be the vision of what is required to fulfill that. The complexity of every very large enterprise today requires our new third ‘V’, virtualization, to simplify and accelerate the efforts of the first two ‘Vs’.

To explain the remaining three terms I will write separate posts, because otherwise this post would get too voluminous. So, stay tuned for the next posts.

Bibliography

  1. EMC: IDC Digital Universe Study: Big Data, Bigger Digital Shadows and Biggest Growth in the Far East 2011.
    Retrieved: 14.06.2017.

Original Post: https://www.redtoo.com/ch/blog/the-data-science-big-data-data-analytics-artificial-intelligence-and-machine-learning-hype/


Do we still need an Enterprise Data Warehouse?

While studying for a Microsoft data warehouse exam, I was asking myself whether a traditional enterprise data warehouse is still needed today and whether the time I’m spending on my studies is worth it. I think there is no question that data has become more and more important and is nowadays a strategic asset for companies to transform their businesses and uncover new insights. But does a traditional data warehouse fit into that?

A data warehouse that is categorized as “traditional”, which is what my studies are about, has the main goal of being a central repository for all historical information in a company, with the assumption that the data is captured now but analyzed later. For this, data from transactional systems like ERP, CRM and LOB applications is extracted, transformed and loaded (ETL), normally first into a staging area, then cleansed and enriched, and afterwards transferred into tables, that is, a relational schema, in the data warehouse. The resulting data warehouse becomes the main source of information, a central version of the truth, for report generation, analysis and presentation through ad hoc reports, portals and dashboards.

What insiders have recognized is that the data warehouse described above is undergoing a transformation. Virtualization and moving resources to the cloud is one reason. Another reason is that organizations try to incorporate insights from data that doesn’t fit the traditional relational database model, and that the velocity at which that data is captured, processed and used is increasing. Companies now use real-time data to change, build or optimize their businesses as well as to sell, transact and engage in dynamic, event-driven processes like market trading. The traditional data warehouse simply was not architected to support near real-time transactions or event processing, resulting in decreased performance and slower time-to-value.

A modern data warehouse has to support workloads of relational and non-relational data, whether they are on-premises or in the cloud and whether they use on-premises solutions or solutions and services in the cloud. The so-called “Logical Data Warehouse” (LDW) or “Modern Data Warehouse” uses repositories, virtualization and distributed processing in combination. Instead of working through the requirements-based model of the traditional data warehouse, where the schema and the data collected are defined upfront, advanced analytics and data science use an experimentation approach of exploring answers to ill-formed or nonexistent questions. This requires examining the data before it is curated into a schema, allowing the data itself to drive insight.

So the recommendation, and the answer to the question opened above, is that companies should use both approaches, and that established data warehouse teams should collaborate with this new breed of data scientists as part of a move towards the logical or modern data warehouse.

Original Post: https://www.redtoo.com/blog/do-we-still-need-an-enterprise-data-warehouse/


Use Microsoft Azure for Ad-Hoc Testing

Microsoft Azure provides a rich set of features that can be set up and used very easily and quickly. That makes it a recommended way to do ad-hoc tests and to quickly try things out. In this post I will show how to use Microsoft Azure SQL Database to quickly test some Transact-SQL statements.

All interaction with a relational database is done in SQL (Structured Query Language). SQL is a standard of both the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI). Microsoft’s dialect of the SQL standard, which is used to interact with Microsoft SQL Server and Microsoft Azure SQL Database, is called Transact-SQL (T-SQL).

T-SQL is the main language used to manage and manipulate data in Microsoft’s main relational database management system, SQL Server, whether on-premises or in the cloud (Microsoft Azure SQL Database).

If you don’t have a Microsoft Azure subscription yet, you can make use of a CHF 250 voucher for a business subscription of Microsoft Azure. Have a look at Trial Offer for Microsoft Azure for more information.

Create an Azure SQL Database

Now that you hopefully have an Azure subscription, you can create an Azure SQL Database instance to use for this post.
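
If you prefer scripting over clicking, here is a rough sketch of the same setup in PowerShell. It is only an illustrative equivalent of the portal steps below, assuming the Az modules are installed (Install-Module Az) and you are signed in with Connect-AzAccount; the server name, location and credentials are placeholders you have to choose yourself:

# Resource group, logical server and sample database (sketch only)
$cred = Get-Credential                       # server admin login and password
New-AzResourceGroup -Name "TSQL_Quick_Try" -Location "West Europe"
New-AzSqlServer -ResourceGroupName "TSQL_Quick_Try" -ServerName "<your_server_name>" `
    -Location "West Europe" -SqlAdministratorCredentials $cred
New-AzSqlDatabase -ResourceGroupName "TSQL_Quick_Try" -ServerName "<your_server_name>" `
    -DatabaseName "AdventureWorksLT" -RequestedServiceObjectiveName "Basic" `
    -SampleName "AdventureWorksLT"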

    1. Browse to http://portal.azure.com. If you are prompted to sign in, do so with the Microsoft account that is associated with your Azure subscription.
    2. At the bottom of the Hub menu (the vertical bar on the left), click New (represented by a + symbol if the menu is minimized), and then in the New blade that appears, click Databases, and then click SQL Database.
Create Azure SQL Database

  1. In the SQL Database blade:
      1. Enter the name AdventureWorksLT
      2. In the Subscription box, ensure that your subscription is listed.
      3. In the Resource group section, ensure that New is selected, and enter TSQL_Quick_Try as the new resource group name.
      4. In the Select Source list, select Sample.
      5. In the Select sample section, ensure that AdventureWorksLT[V12] is selected.
      6. Click Server. Then click Create a new server and enter the following details and click OK.
        • A unique Server name for your server (a red exclamation mark will be displayed if the name you have entered is invalid or already in use, otherwise a green tick is shown).
        • A user name you want to assign to the Server admin login. This can be your
          name or some other name you’ll remember easily – however, you cannot use
          “Administrator”.
        • A Password for your server administrator account. This must meet the password
          complexity rules for Azure SQL Database, so for example it cannot be blank or
          “password”.
        • The Location where your server should be hosted. Choose the location nearest
          to you.
        • Leave the option to allow Azure services to access the server selected (this
          opens an internal firewall port in the Azure datacenter to allow other Azure
          services to use the database).

        New SQL Server

      7. In the Pricing Tier section, select Basic.
      8. Ensure that your selections are similar to those below, and click Create.

    SQL Server Pricing Tier

  2. After a short time, your SQL Database will be created, and a notification is displayed on the
    dashboard. To view the blade for the database, click Resource Groups and then click on the TSQL_Quick_Try Resource Group.

    TSQL_Quick_Try Resource Group Essentials Blade

Configure Firewall Rules for your Azure SQL Database Server

  1. In the TSQL_Quick_Try blade, under Essentials, click the server name for your database
    server (which should be in the format server_name.database.windows.net). In my case that is tsqlquicktry042.database.windows.net

    Azure SQL Server Show Firewall Settings

  2. In the blade for your SQL server, under Essentials, click Show firewall settings.
  3. In the Firewall settings blade, click the Add client IP icon to create a firewall rule for your client
    computer, and then click Save.

    Azure SQL Server Firewall Add Client IP

Note: Azure SQL Database uses firewall rules to control access to
your database. If your computer’s public-facing IP address
changes (or you want to use a different computer), you’ll need
to repeat this step to allow access. Alternatively, you can modify
the firewall settings for your Azure SQL Database server
to allow a range of IP addresses – see the Azure SQL Database
documentation for details of how to do this.
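
As a rough sketch of that alternative (assuming the Az PowerShell modules and an active Connect-AzAccount session; the resource group is the one from this post, the server name and the IP range are placeholders):

# Allow a whole address range instead of a single client IP (sketch only)
New-AzSqlServerFirewallRule -ResourceGroupName "TSQL_Quick_Try" `
    -ServerName "<your_server_name>" `
    -FirewallRuleName "OfficeRange" `
    -StartIpAddress "203.0.113.0" -EndIpAddress "203.0.113.255"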

Installing and Connecting from a Client Tool

SQL Server Management Studio is the primary management tool for Microsoft SQL Server, and you can also use it to manage and query Azure SQL Database. If you do not already have SQL Server Management Studio installed, you can download it from Download SQL Server Management Studio (16.5). When the download is complete, run the executable file to install SQL Server Management Studio.

After installing SQL Server Management Studio, you can start it and connect to your Azure SQL Database server by selecting the option to use SQL Server authentication, specifying the fully-qualified name of your Azure SQL Database server (<your_server_name>.database.windows.net), and entering your user name in the format <your_user_name>@<your_server_name> and password, as shown here:

Connect to Azure SQL Database

After connecting, you can create a new query and run it by clicking Execute, and you can save and open Transact-SQL scripts. Be sure to select the AdventureWorksLT database when running your queries as shown here:

Run Query in MS SQL Server Management Studio

Here is also the T-SQL Statement I tried. You can copy it and try it in your Azure SQL Database:
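
As a stand-in rather than the exact statement from the original post, here is a sketch of how a simple query against the AdventureWorksLT sample could be run, in this case from PowerShell with Invoke-Sqlcmd (assuming the SqlServer module is installed; server name, login and password are the placeholders used earlier):

# Sketch only: query the sample database from PowerShell (Install-Module SqlServer first)
Invoke-Sqlcmd -ServerInstance "<your_server_name>.database.windows.net" `
    -Database "AdventureWorksLT" `
    -Username "<your_user_name>" -Password "<your_password>" `
    -Query "SELECT TOP 10 ProductID, Name, ListPrice FROM SalesLT.Product ORDER BY ListPrice DESC;"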

Original Post: https://www.redtoo.com/ch/blog/use-microsoft-azure-for-ad-hoc-testing/


Trial Offer for Microsoft Azure

Just in time for the Swiss national holiday on August 1, there is a new opportunity for Swiss companies to try out Microsoft Azure. The Swiss company I’m working for, redtoo, located in Basel, offers a 30-day trial account for first-time subscribers of Microsoft Azure.

To find out more, have a look at: https://www.redtoo.com/ch/de/azure-computing/


Key Concepts of International Web Design

The primary activity of the World Wide Web Consortium (W3C) is to develop protocols and guidelines that ensure long-term growth for the Web. The widely adopted Web standards define key parts of what actually makes the World Wide Web work.

A fundamental concern and goal of the W3C since the beginning has been the access to the Web for all. It is easy to overlook the needs of people from cultures different to your own, or who use different languages or writing systems, but you have to ensure that any content or application that you design or develop is ready to support the international features that they will need [1].

The following 7 quick tips summarize some important concepts of international Web design:

  1. Encoding: use the UTF-8 (Unicode) character encoding for content, databases, etc. Always declare the encoding.
  2. Language: declare the language of documents and indicate internal language changes.
  3. Navigation: on each page include clearly visible navigation to localized pages or sites, using the target language.
  4. Escapes: use characters rather than escapes (e.g. &#xE1; &#225; or &aacute;) whenever you can.
  5. Forms: use UTF-8 on both form and server. Support local formats of names/addresses, times/dates, etc.
  6. Localizable styling: use CSS styling for the presentational aspects of your page. So that it’s easy to adapt content to suit the typographic needs of the audience, keep a clear separation between styling and semantic content, and don’t use ‘presentational’ markup.
  7. Images, animations & examples: if your content will be seen by people from diverse cultures, check for translatability and inappropriate cultural bias.

Bibliography

  1. Richard Ishida, W3C.: Internationalization Quick Tips for the Web 2015-04-01 13:22.
    Retrieved: 29.06.2016.

Microsoft released Version 1.0 of .NET Core and ASP.NET Core

Today, June 27, Microsoft announced the release of .NET Core 1.0, ASP.NET Core 1.0 and Entity Framework Core 1.0, available on Windows, OS X and Linux, 18 months after they first announced the project. Microsoft says that more than 18’000 developers from 1’300 companies have contributed to the cross-platform, open source and modular .NET platform for creating modern web apps, microservices, libraries and console applications. The platform is maintained by Microsoft and the .NET community on GitHub.

The characteristics of .NET Core are:

  • Flexible deployment: Can be included in your app or installed side-by-side user- or machine-wide.
  • Cross-platform: Runs on Windows, macOS and Linux; can be ported to other OS’s. The supported Operating Systems (OS), CPUs and application scenarios will grow over time, provided by Microsoft, other companies, and individuals.
  • Command-line tools: All product scenarios can be exercised at the command-line.
  • Compatible: .NET Core is compatible with .NET Framework, Xamarin and Mono, via the .NET Standard Library.
  • Open source: The .NET Core platform is open source, using MIT and Apache 2 licenses. Documentation is licensed under CC-BY. .NET Core is a .NET Foundation project.
  • Supported by Microsoft: .NET Core is supported by Microsoft, per .NET Core Support

To get more information about .NET Core 1.0 see https://www.microsoft.com/net/core#windows


The future of programming is functional

Some years ago it became public that Professor Robert Harper of Carnegie Mellon University in Pittsburgh had removed the freshman course on Object Oriented Programming (OOP) from the curriculum. His reason was that OOP is by nature anti-modular and anti-parallel and therefore not suitable for a modern Computer Science curriculum.

I thought: “Wow”. For generations the thinking of software programmers, analysts, architects and the whole industry was in “objects”, and now it’s simply dropped. But what is true is true. With the limits on raising processor clock speed to make computers faster, the trend is going towards parallel work instead. With Windows Azure as an example, it’s easy to scale the processor count used by an application from one to thousands, but the application must be able to make use of it. So, let’s start thinking functionally by using Visual F#. Microsoft defines F# as “a multiparadigm language that supports functional programming in addition to traditional object-oriented programming and .NET concepts. It’s a first class member of the .NET Framework languages”.

There is no universally accepted definition of functional programming, but one of the most agreed-upon attributes of the functional programming paradigm is that programming is done with expressions or declarations instead of statements. As a reminder, a statement does something like assigning a value to a variable, jumping to another line of the code or calling a subroutine, while an expression is any section of the code that evaluates to a value.

So, let’s start by giving names to values. In F# this is done with the let keyword. It’s used to declare identifiers and bind them to values:

// Integer and string.
let num = 10
let str = "F#"
// Storing integers and strings in a list.
let integerList = [ 1; 2; 3; 4; 5; 6; 7 ]
let stringList = [ "one"; "two"; "three" ]
// Perform a simple calculation and bind intSquared to the result.
let intSquared = num * num

Besides simplifying the implementation of parallel and asynchronous tasks, the greatest benefit for enterprises using functional programming is that algorithms can be expressed more concisely and simply. That lowers the maintenance costs of applications significantly.

For a sample algorithm we will use a daily-life example: we search on our laptop for documents, in our intranet for information related to our customers, or in Google for news and other information. This is called information retrieval, and algorithms for it are mainly based on a calculation of similarity, which is determined by measuring the distance between the searched terms and the terms existing in the documents. One of many such algorithms is the Jaccard similarity coefficient.

The Jaccard similarity coefficient, also known as the Jaccard index, is a statistical method used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets.

J(A, B) = |A ∩ B| / |A ∪ B|   (Jaccard similarity coefficient)

One point which should be mentioned is the difference between symmetric and asymmetric binary variables inside the data set.

A binary variable has two states, 0 and 1, and is called symmetric when there is no preference for which outcome of the binary variable should be coded as 0 and which as 1. For example, the binary variable “gender” for a human has the possible states “female” and “male”. Both are equally valuable and carry the same weight when a proximity measure is computed.

On the other side, a binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. Usually the rarer outcome is coded as 1 (HIV positive) and the other as 0 (HIV negative).

Let’s suppose that we have a bunch of sheets defining professional skills:

// A list of skills
// This employee is a system and data integration expert
let employee1Skills = [".NET"; "C#"; "WCF"; "WF"; "BizTalk"; "SOA"; "BPMN"; "EAI"; "ESB"]
// This employee is a project manager
let employee2Skills = ["PM"; "IPMA"; "PMI"; "PMP"; "PRINCE2"]
// This employee is a system engineer
let employee3Skills = ["SCCM"; "AD"; "DNS"; "DHCP"; "GPO"; "SAN"; "LAN"; "WAN"]

and, further, a set of skills desired for a vacant job at a company:

// A company's vacant job for a Network Project Manager
let vacantJob = ["BPMN"; "WF"; "DNS"; "DHCP"; "LAN"; "PM"; "PMP"; "PRINCE2"]

The Jaccard coefficient is now a useful measure of the overlap between the attributes of one of the employee skill sets (which we will call ‘A’) and the vacant job skill set (which we will call ‘B’).

// Prepare the common set operations needed by the Jaccard Coefficent function
let jaccardCoeff (first : string list) (second : string list) =
    let all = Set.union (first |> Set.ofList) (second |> Set.ofList) |> Set.toList
    let firstMatches = all |> List.map (fun t -> first |> List.contains t)
    let secondMatches = all |> List.map (fun t -> second |> List.contains t)
    // Next, the total number of each combination will be calculated

Each attribute of A and B can be either present (which we will encode as ‘1’) or absent (‘0’). The total number of each combination of attributes for both A and B is specified as follows:

M11 represents the total number of attributes where A and B both have a value of 1.
M01 represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
M10 represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
M00 represents the total number of attributes where A and B both have a value of 0.

   // Now, calculate the total number of each combination
   let zipped = List.zip firstMatches secondMatches
   let M11 = zipped |> List.filter (fun t -> fst t = true && snd t = true ) |> List.length
   let M01 = zipped |> List.filter (fun t -> fst t = false && snd t = true ) |> List.length
   let M10 = zipped |> List.filter (fun t -> fst t = true && snd t = false ) |> List.length
   let M00 = zipped |> List.filter (fun t -> fst t = false && snd t = false ) |> List.length

Each attribute must fall into one of these four categories, meaning that
M11 + M01 + M10 + M00 = n,
where n is the total number of attributes considered. (In this implementation M00 is always 0, because all is built as the union of both skill lists, but the asymmetric Jaccard coefficient ignores M00 anyway.)

The Jaccard similarity coefficient, J, is given as:

J = M11 / (M01 + M10 + M11)   (Jaccard similarity coefficient for asymmetric binary attributes)

   // Calculate Jaccard Similarity Coefficient
   let J = float M11 / float (M01 + M10 + M11)
   // Return the calculated value
   J

To use the jaccardCoeff function:

[<EntryPoint>]
let main argv = 
    let simJobEmp1 = jaccardCoeff employee1Skills vacantJob
    let simJobEmp2 = jaccardCoeff employee2Skills vacantJob
    let simJobEmp3 = jaccardCoeff employee3Skills vacantJob
    printfn "Emp1: %f\nEmp2: %f\nEmp3: %f" simJobEmp1 simJobEmp2 simJobEmp3
    0 // Exitcode as int

The result shows that the employee working as a System Engineer has the highest similarity, followed by the employee working as a Project Manager:

Emp1: 0.125000
Emp2: 0.272727
Emp3: 0.307692

Original Post: https://www.redtoo.com/ch/blog/the-future-of-programming-is-functional/

Do you have problems with C# Tuple class because items are read only?

While I was working on a small Windows Forms tool in C# that should help me save and load parameters for a command-line application, I ran into the problem that not the complete TextBox controls should be saved, but only the name and the text content. So, I had the requirement to save only a subset of the form fields. After thinking a while, I came up with the idea of using LINQ in combination with the Tuple class, and I did some research about Tuples in general.

In mathematics, a finite sequence of elements is called a tuple, and tuples rose in popularity with the implementation of functional programming languages. As in mathematics, in functional programming a certain variable has a single value at all times. So, Tuples are immutable, or in other words “read only”, by design, and implementations like the one in C# follow this design.

Besides other usages of Tuples in programming languages, they are commonly used to return a subset of data. Think about the following use case: there is a list of territory definitions with fields like ZIP code, city, region, population etc., and this list must be filtered so that only a subset of the territory fields is returned.

Here is a code snippet for the above-mentioned class definition:

// Define a class to store data
class Territory { public int zip; public string city; public string region; };

With LINQ we could do the filtering pretty easily, but we would have difficulties returning only a subset of the fields. Here is a sample of how it could be done:

// Define a sample list of objects to work with
Territory[] myTerritories = new Territory[] {
                                new Territory{zip=4153,city="Reinach",region="BL"},
                                new Territory{zip=8304,city="Wallisellen",region="ZH"},
                                new Territory{zip=3018,city="Bern",region="BE"}};

// Select a complete Territory subset variant 1
var subset1 = from Territory t in myTerritories
                   where t.region.StartsWith("B")
                         select t;
// Select a complete Territory subset variant 2
var subset2 = myTerritories.Where(t => t.region.StartsWith("B"));

So, here comes the usage of Tuples. To return only a subset of the territory fields, these fields must be stored in new instances of the Tuple class. The result will be a list of Tuples instead of a list of Territories.

Here is a code snippet that returns a subset of the territory structure as a list of Tuples after the filtering:

// Select only a subset of territory fields as a list of Tuples
var subset3 = myTerritories.Where(t => t.region.StartsWith("B"))
                           .Select(t => new Tuple<int, string>(t.zip, t.city));

The problem now is that Tuples are immutable / read only, and the following code would not compile:

subset3.ElementAt(0).Item2 = "Reinach BL";

So, the best solution is to define a new class similar to the Tuple class with a proper constructor and use this class instead of the Tuple class. Here is the complete sample code:

class Program
{
    // Define a class to store data
    class Territory { public int zip; public string city; public string region; };

    static void Main(string[] args)
    {
        // Define a sample list of objects to work with
        Territory[] myTerritories = new Territory[] {
                                      new Territory{zip=4153,city="Reinach",region="BL"},
                                      new Territory{zip=8304,city="Wallisellen",region="ZH"},
                                      new Territory{zip=3018,city="Bern",region="BE"}};

        // Select a complete Territory subset variant 1
        var subset1 = from Territory t in myTerritories
                           where t.region.StartsWith("B")
                                 select t;
        // Select a complete Territory subset variant 2
        var subset2 = myTerritories.Where(t => t.region.StartsWith("B"));

        // Select only a subset of territory fields as a list of Tuples
        var subset3 = myTerritories.Where(t => t.region.StartsWith("B"))
                                   .Select(t => new Tuple<int, string>(t.zip, t.city));

        // because Tuples are immutable / read only, the following
        // will not compile and is therefore commented out:
        /* subset3.ElementAt(0).Item2 = "Reinach BL"; */

        // Select only a subset of territory fields as a list of entities
        // (ToList materializes the query, so the object modified below is the one that is kept)
        var subset4 = myTerritories.Where(t => t.region.StartsWith("B"))
                                   .Select(t => new Entity<int, string>(t.zip, t.city))
                                   .ToList();

        // because Entity is mutable, this will work:
        subset4.ElementAt(0).Item2 = "Reinach BL";
    }
}
public class Entity<T1, T2>
{
    public Entity(T1 t1, T2 t2)
    {
        Item1 = t1;
        Item2 = t2;
    }
    public T1 Item1 { get; set; }
    public T2 Item2 { get; set; }
}
public class Entity<T1, T2, T3>
{
    public Entity(T1 t1, T2 t2, T3 t3)
    {
        Item1 = t1;
        Item2 = t2;
        Item3 = t3;
    }
    public T1 Item1 { get; set; }
    public T2 Item2 { get; set; }
    public T3 Item3 { get; set; }
}

So, I hope you had fun and please leave a comment.

Original Post: https://www.redtoo.com/ch/blog/do-you-have-problems-with-c-tuple-class-because-items-are-read-only/