Analysis with Julia

After you have Julia installed (see Julia Quick Start), you might want to analyze some data with it. Here’s a collection of examples that can help you do that, organized around a common workflow:

  1. Import data
  2. Plot data. Many 2D and 3D plot examples have been collected in Beautiful Makie, and it’s worth browsing that collection to understand ways to plot your data (see the sketch just after this list).
  3. Fit data
  4. Determine significance
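
For step 2, here is a minimal GLMakie sketch to start from (the data here is made up purely for illustration):

using GLMakie

x = 0:0.1:10
y = sin.(x)

fig = Figure(size = (600, 400))
ax = Axis(fig[1, 1], xlabel = "x", ylabel = "sin(x)")
lines!(ax, x, y, linewidth = 2)
scatter!(ax, x[1:10:end], y[1:10:end], color = :black)

fig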

Importing data

Here are some strategies to import and save data.

Importing data from a csv file

Use the CSV.jl and DataFrames.jl packages:

using CSV, DataFrames

n = 10 #Let's generate 10 random numbers
x = sort(rand(-10:100, n))
y = 5/9 .* (x .- 32) #A conversion from Fahrenheit to Celsius

data = DataFrame(F=x,C=y) #Turn it into a neat little data frame

CSV.write("data.csv", data) #write the data to a CSV
data_new = CSV.read("data.csv", DataFrame) #read the data from a CSV

x_values = data_new.F # Access columns by name
y_values = data_new.C

Importing a tiff image

Use TiffImages.jl for TIFF loading:

using TiffImages

img = TiffImages.load("image.tif")
height, width = size(img)
#Julia uses specialized element types for images (e.g. Gray{N0f16}).
#I often convert to a plain numeric array for downstream math.
img_float = Float32.(img)

Importing data from a multidimensional tiff image

For stacks, time series, and multi-channel images:

using TiffImages

img_stack = TiffImages.load("stack.tif")

n_frames = size(img_stack, 3) # For 3D data (height × width × frames)

CHANNELS = 4 #As an example; here the channels are interleaved along the 3rd dimension
channel_1 = img_stack[:, :, 1:CHANNELS:end]
channel_2 = img_stack[:, :, 2:CHANNELS:end]
channel_3 = img_stack[:, :, 3:CHANNELS:end]
channel_4 = img_stack[:, :, 4:CHANNELS:end]

Saving your Julia workspace with JLD2.jl

Save and load Julia variables and workspaces as .jld2 files, much like you can in Matlab with .mat files.

using JLD2

@save "workspace.jld2" # Save everything

# Save specific variables
@save "workspace.jld2" time_points measurements all_data

# Load variables back
@load "workspace.jld2" time_points measurements all_data

# Or load everything into a dictionary keyed by variable name
data = load("workspace.jld2")
loaded_time = data["time_points"]

Importing data from an xml file

Use EzXML.jl for XML parsing. This is more involved and usually requires browsing the XML file for a bit to understand its structure.

using EzXML

doc = parsexml(read("data.xml", String))

root = doc.root # Navigate to specific elements
for element in eachelement(root)
    value = element["attribute_name"] # Extract an attribute by name
    text = nodecontent(element)       # Extract the element's text content
    println(value, ": ", text)
end
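
If you already know which elements you need, XPath queries via findall can be more direct than walking the tree by hand. A minimal sketch, assuming a hypothetical file containing <measurement id="..."> elements with numeric text:

using EzXML

doc = readxml("data.xml")

#Hypothetical structure: <measurement id="m1">3.14</measurement>
for node in findall("//measurement", doc)
    id = node["id"]
    value = parse(Float64, nodecontent(node))
    println("measurement ", id, " = ", value)
end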

Fitting data

Use LsqFit.jl for curve fitting:

Exponential decay fit

using GLMakie, LsqFit

t_data = 0:9
y_data = [95.2, 74.1, 58.3, 46.2, 37.1, 30.2, 25.1, 21.5, 18.9, 17.1]

#= Exponential decay: y = A * exp(-t/τ) + baseline
We create a function we call exp_decay of that form
But this could be any function you want to fit to!=#
exp_decay(t, p) = p[1] .* exp.(-t ./ p[2]) .+ p[3]

p_guess = [100.0, 3.0, 10.0] # Initial parameter guesses

fit_result = curve_fit(exp_decay, t_data, y_data, p_guess)
fit_params = coef(fit_result)

#Evaluate the fitted curve on a finer grid of 100 points
x_fit_range = range(t_data[1], t_data[end], 100)
y_fit = exp_decay(x_fit_range, fit_params)

#Plot
fig = Figure(size = (600, 400))
ax = Axis(fig[1, 1])
scatter!(ax, t_data, y_data, markersize = 10, color=:black)
lines!(ax, x_fit_range, y_fit, linewidth = 3, color=:red)

fig

Determining Significance

Significance is a genuinely helpful tool for thinking about your data critically. The greater scientific community, though, often demands the simplicity of a presented p-value, usually in the form of dazzling and sometimes deceitful stars: ***.

Here I cover the bread-and-butter significance tests, plus a few specific cases where I find significance calculations a little more sticky.

Determining significance between two sets of numbers

Use HypothesisTests.jl for statistical tests. Here we show a t-test for unpaired data:

using GLMakie, HypothesisTests

#Two independent groups
group1 = [120, 125, 130, 128, 135]
group2 = [100, 105, 110, 108, 115] #AKA Group W

#Perform unpaired t-test (Welch's t-test)
p_val = pvalue(UnequalVarianceTTest(group1, group2))

#a little funtion to get those dazzling stars
stars(p) = p < 0.001 ? "***" : p < 0.01 ? "**" : p < 0.05 ? "*" : "ns"

#Plot
fig = Figure(size = (600, 600))
ax = Axis(fig[1, 1], xticks = ([1, 2], ["Group 1", "Group W"]), ylabel = "Measurement")

#(x -> 1).(v) broadcasts a micro-function over v, mapping every element to 1;
#it's a quick way to build constant x-positions for each group
boxplot!(ax, (x -> 1).(group1), group1, color = :lightblue, width = 0.4)
boxplot!(ax, (x -> 2).(group2), group2, color = :lightcoral, width = 0.4)
scatter!(ax, (x -> 1).(group1), group1, color = :blue)
scatter!(ax, (x -> 2).(group2), group2, color = :red)

#Significance Bracket
y_max = max(maximum(group1), maximum(group2)) * 1.1
bracket_points = [(1, y_max - 2), (1, y_max), (2, y_max), (2, y_max - 2)]
lines!(ax, bracket_points, color = :black)
text!(ax, 1.5, y_max, text=stars(p_val), align=(:center, :bottom),fontsize=20)

#Adjust y-axis limits to ensure everything is visible
ylims!(ax, nothing, y_max * 1.05)

fig

Determining if a slope is significantly different from 0

When fitting a line, we test whether a slope like the one we observed could arise just by chance when the true slope is 0. The null hypothesis is that the slope is 0, so we ask how many standard errors the estimated slope is from zero. Pop quiz for the inverse case: how would you show that a slope is significantly close to 0? (See the section after this one.)

using GLMakie, GLM

x_data = 1:10
y_data = [2, 2, 3, 3, 3, 3, 4, 4, 4, 5]
X = hcat((x -> 1).(x_data), x_data) #Design matrix: a column of ones (intercept) and x

fit = lm(X, y_data)
coeftable(fit) #The coefficient table has p-values in its 4th column
p_val = coeftable(fit).cols[4][2] #Row 2 is the slope

x_fit = range(x_data[1] - 1, x_data[end] + 1, 100)
X_fit = hcat((x -> 1).(x_fit), x_fit)

#Predict the fit line and the 95% conf. int.
preds = predict(fit, X_fit, interval = :confidence)

#Plot
fig = Figure(size=(600, 600))
ax = Axis(fig[1, 1], xlabel="x", ylabel="y")

scatter!(ax, x_data, y_data, color = :black)
lines!(ax, x_fit, preds.prediction, color = :blue, linewidth=2)
band!(ax, x_fit, preds.lower, preds.upper, color=(:skyblue, 0.2))

#p-value text
p_text = p_val < 0.0001 ? "p < 0.0001" : "p = $(round(p_val, digits=4))"
text!(ax, 0.05, 0.95, text=p_text, space=:relative, align=(:left, :top), fontsize=16)

fig

Determining if a slope is significantly 0

Ok so here is the tricky bit. What if it looks like there really is no relation, and we want to show that this absence is significant? Well… it’s hard. One thing you can do to make it tractable is define a small range around zero where you would consider the slope functionally zero for your specific problem. Then you can test the hypothesis that the slope lies inside that range using two one-sided t-tests (TOST).

using GLM, Distributions

x_data = 1:20
y_data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 9.9, 10.1,
          9.8, 10.2, 10.0, 9.9, 10.1, 10.0, 9.8, 10.2, 9.9, 10.0]

#Let any slope between -0.1 and 0.1 be definitionally zero.
delta = 0.1

X = hcat(ones(length(x_data)), x_data)
fit = lm(X, y_data)

slope = coeftable(fit).cols[1][2]   
std_err = coeftable(fit).cols[2][2]
dof = dof_residual(fit)   # Degrees of freedom

#Perform the Two One-Sided T-Tests, TOST (requires Distributions.jl)
t_dist = TDist(dof) # Create a t-distribution object for our fit

#Is the slope significantly GREATER than the lower bound?
t_lower = (slope - (-delta)) / std_err
p_lower = 1 - cdf(t_dist, t_lower)

# Is the slope significantly LESS than the upper bound?
t_upper = (slope - delta) / std_err
p_upper = cdf(t_dist, t_upper)

#The final TOST p-value is the larger of the two
tost_p_value = max(p_lower, p_upper)

println("Equivalence Test Results (bounds = ±$delta):")
println("Slope: $(round(slope, digits=4)), SE: $(round(std_err, digits=4))")
println("TOST p-value: $(round(tost_p_value, digits=4))")

if tost_p_value < 0.05
    println("Conclusion: The slope is statistically equivalent to zero within delta=$delta.")
else
    println("Conclusion: We cannot conclude the slope is equivalent to zero within delta=$delta.")
end

Determining significance between two slopes

ANCOVA tests whether regression lines from different groups have different slopes. This is a complex idea, so it’s worth spelling out here.

ANOVA tests if the means of two or more groups are statistically different.

The model is: \(Y_{ij} = \mu_i + \epsilon_{ij}\)

  • \(Y_{ij}\) is the measured value for the j-th data point in the i-th group.
  • \(\mu_i\) is the mean of group i.
  • \(\epsilon_{ij}\) is the residual: the difference between a data point and its group mean.

The null hypothesis (\(H_0\)) is that all group means are equal: \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\)

This is tested statistically using the F-statistic:

\(F = \frac{\text{Variance between groups}}{\text{Variance within groups}}\)

A large F-statistic indicates the variation between groups (the signal) is much larger than the random variation within groups (the noise). A p-value then determines if this F-statistic is large enough to reject the null hypothesis.
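
As a quick concrete example, here is a one-way ANOVA sketch using HypothesisTests.jl (assuming a reasonably recent version, which provides OneWayANOVATest; the group values are made up):

using HypothesisTests

group1 = [5.1, 4.9, 5.3, 5.0, 5.2]
group2 = [5.6, 5.8, 5.5, 5.9, 5.7]
group3 = [5.0, 5.2, 4.8, 5.1, 5.3]

#Test the null hypothesis that all three group means are equal
result = OneWayANOVATest(group1, group2, group3)
println("ANOVA p-value: ", round(pvalue(result), digits = 4))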

ANCOVA extends this by testing if the group means are different after controlling for a continuous variable, the Covariate.

The model becomes: \(Y_{ij} = \mu_i + \beta(X_{ij} - \bar{X}) + \epsilon_{ij}\)

  • \(\mu_i\) represents the adjusted mean of group i.
  • \(\beta\) is the coefficient quantifying the covariate’s effect.
  • \(X_{ij}\) is the covariate value for the j-th data point in the i-th group.
  • \(\bar{X}\) is the average of all covariate scores across all groups.

ANCOVA uses the same F-test logic to test two primary hypotheses:

  • Null Hypothesis (\(H_0\)): All adjusted group means are equal. \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\)

  • Null Hypothesis (\(H_0\)): The regression coefficient for the covariate is zero (i.e., there is no linear relationship). \(H_0: \beta = 0\)

In code, this ANCOVA test is:

using GLM, DataFrames, GLMakie, StatsModels

#Define the data for each group separately for clarity
A_X = [1.538, 6.994, 1.019, 6.603, 7.603, 4.914, 8.773, 4.021, 7.186, 6.695]
A_Y = [9.424, 17.639, 6.949, 14.459, 21.308, 14.703, 25.158, 12.910, 23.893, 19.834]

B_X = [3.746, 5.175, 0.720, 9.524, 4.982, 6.769, 7.224, 9.479, 1.495, 1.249]
B_Y = [14.650, 19.031, 9.555, 29.480, 20.433, 23.441, 22.306, 25.264, 6.952, 12.501]

C_X = [0.560, 4.536, 8.414, 5.485, 2.953, 8.107, 2.994, 7.255, 5.984, 3.541]
C_Y = [3.650, 22.332, 41.558, 26.173, 15.339, 38.155, 16.305, 25.017, 27.204, 18.539]

#Combine the separate group data into a single DataFrame for analysis
data = vcat(
    DataFrame(X = A_X, Y = A_Y, Group = "A"),
    DataFrame(X = B_X, Y = B_Y, Group = "B"),
    DataFrame(X = C_X, Y = C_Y, Group = "C")
)

#Little function to get stars
stars(p) = p < 0.001 ? "***" : p < 0.01 ? "**" : p < 0.05 ? "*" : "ns"

#Fit models with different reference (base) groups so every pair gets a direct comparison.
#Note: reordering the DataFrame rows does NOT change the reference level; by default
#DummyCoding uses the first level in sorted order, so we set the base explicitly.
ancova_A = lm(@formula(Y ~ 1 + X + Group + X & Group), data) #Base level "A" (the default)
ancova_B = lm(@formula(Y ~ 1 + X + Group + X & Group), data,
              contrasts = Dict(:Group => DummyCoding(base = "B"))) #Base level "B"

#Pull the p-value of a named coefficient from the coefficient table
get_p(model, term) = coeftable(model).cols[4][findfirst(==(term), coeftable(model).rownms)]
p_AB = get_p(ancova_A, "X & Group: B") #Do the slopes of A and B differ?
p_AC = get_p(ancova_A, "X & Group: C") #Do the slopes of A and C differ?
p_BC = get_p(ancova_B, "X & Group: C") #Do the slopes of B and C differ?

fig = Figure(size = (600, 600))
ax = Axis(fig[1, 1])

colors = Dict("A" => :blue, "B" => :orange, "C" => :green)
for group in ["A", "B", "C"]
    subset = filter(row -> row.Group == group, data)
    scatter!(ax, subset.X, subset.Y, color = colors[group])
    
    x_range = range(extrema(subset.X)..., length=100)
    predict_df = DataFrame(X = x_range, Group = group)
    y_pred = predict(ancova_A, predict_df)
    y_pred_float = [y for y in y_pred] #Collect to strip any Missing from the eltype

    lines!(ax, x_range, y_pred_float, color = colors[group], linewidth = 2.5)
    text!(ax, x_range[end], y_pred_float[end], text = " $group", align = (:left, :center), color = colors[group])
end

legend_text = """
Slope Significance
A vs B: $(stars(p_AB))
A vs C: $(stars(p_AC))
B vs C: $(stars(p_BC))
"""

# Create a container for the legend
legend_layout = GridLayout(tellwidth = false, tellheight = false, halign = 0.98, valign = 0.05)
fig[1, 1] = legend_layout

Box(legend_layout[1, 1], color = (:white, 0.85), strokecolor = :black, strokewidth = 1)
Label(legend_layout[1, 1], legend_text, padding = (10, 10, 8, 8))

fig